Difference between revisions of "Gaucher Disease: Task 06 - Protein structure prediction from evolutionary sequence variation"

From Bioinformatikpedia
(Calculate Structural Models)
(Calculate Structural Models)
 
(47 intermediate revisions by 2 users not shown)
Line 2: Line 2:
   
 
==Calculate and analyze correlated mutations==
 
==Calculate and analyze correlated mutations==
Not all predicted contacts are needed to predict structure from sequence. Residues that ly close to each other in their primary structure, have automatically contact due to their direct neigbourhood in the sequence. Such a contact does not give any information, but rather leads to noise in the results. We are interessted in the contacts that apear because of the secondary and tertiary structure of the proteins. This information, we get from residues that have a greater distance in the sequence, but should be in contact in space acording to their distance.
+
Not all predicted contacts are needed to predict structure from sequence. Residues that lie close to each other in their primary structure, have automatically contact due to their direct neighborhood in the sequence. Such a contact does not give us any information, but rather leads to noise in the results. We are only interested in the contacts that appear because of the secondary and tertiary structure of the proteins. We get this information from residues that have a greater distance in the sequence, but should be in contact in space according to their distance.
   
   
Line 26: Line 26:
 
</figtable>
 
</figtable>
   
  +
  +
<figure id="graph_ras" >
 
<gallery caption="" widths="400px" heights="400px" perrow="2">
 
<gallery caption="" widths="400px" heights="400px" perrow="2">
 
File:ras_dist.jpeg|Histogram of the distribution of CN scores for HRas. Only CN of residues with at least a distance of 5 residues were considered.
 
File:ras_dist.jpeg|Histogram of the distribution of CN scores for HRas. Only CN of residues with at least a distance of 5 residues were considered.
 
File:ras_box.jpeg|Boxplot of the CN and states (TP/FP) of high scoring pairs of HRas (Pearson: 0.15).
 
File:ras_box.jpeg|Boxplot of the CN and states (TP/FP) of high scoring pairs of HRas (Pearson: 0.15).
 
</gallery>
 
</gallery>
  +
<small>'''<caption>''' Statistics of HRas</caption></small>
  +
</figure>
   
  +
[[Image:ras_map.jpeg|thumb|right|400px|Contact map of predicted contacts based on freecontact and the real contacts documented in the 121p.pdb (blue). Contrary to the pdb contact, the calculated TP (darkblue) and the FP (red) only consider residues with a sequence distance of at least 5 residues]]
 
  +
<figure id="contact_ras" >
  +
[[Image:ras_map.jpeg|thumb|right|400px|'''<caption>''' Contact map of predicted contacts based on freecontact and the real contacts documented in the 121p.pdb (blue). Contrary to the pdb contact, the calculated TP (darkblue) and the FP (red) only consider residues with a sequence distance of at least 5 residues</caption>]]
  +
</figure>
   
 
Residue pairs with a CN>1 are defined as high scoring pairs. These pairs are predicted to be in contact. Only nearly 5% of the residue pairs have a score high enough to be seen as contacts. In the filtered set this applies to less than 1%.
 
Residue pairs with a CN>1 are defined as high scoring pairs. These pairs are predicted to be in contact. Only nearly 5% of the residue pairs have a score high enough to be seen as contacts. In the filtered set this applies to less than 1%.
   
   
The high scoring pairs were compared to contacts of the HRas documented in pdb. The 65 predicted contacts have a TP-rate of 84.6% and could be classified into
+
The high scoring pairs were compared to contacts of the HRas documented in pdb. The 65 predicted contacts (<xr id="graph_ras"/>, 2) have a TP-rate of 84.6% and could be classified into
 
*TP: 55
 
*TP: 55
 
*FP: 10
 
*FP: 10
   
Although more than half of FP predicted contacts have a lower score, there can be seen no correlation between FP/TP and the CN score. The Pearson correlation leads only to a non-significanz of 0.15.
+
Although more than half of FP predicted contacts have a lower score, there can be seen no correlation between FP/TP and the CN score. The Pearson correlation leads only to a non-significance of 0.15.
 
On the contact map, a significant pattern for a domain identification could not be observed.
 
   
  +
On the contact map in <xr id="contact_ras"/>, a significant pattern for a domain identification could not be observed.
[[Image:ras_hots.png|thumb|right|400px|The 10 hot spots of HRas are colored in orange and pink in the 121p structure (blue).]]
 
   
  +
<figure id="hot_ras" >
  +
[[Image:ras_hots.png|thumb|right|400px|'''<caption>''' The 10 hot spots of HRas are colored in orange and pink in the 121p structure (blue).</caption>]]
  +
</figure>
   
 
'''Hot Spots'''
 
'''Hot Spots'''
The 10 residues with the best top score were defined as hot spots. A mutation at this residues will have an great influence on the 3D structure of the protein.
+
The 10 residues with the best top scores were defined as hot spots. A mutation at this residues will have a great influence on the 3D structure of the protein.
The hot spots in <xr id="hot"/> were marked on the pdb structure on the pymol visualisation on the left. The residue 40 (pink) which is part of a beta sheet stabilises its sheet to a parallel lying alpha helix. The remainig nine residues (orange) ly close to each other in their secondary structure and hold three beta sheets together.
+
The hot spots in <xr id="hot"/> were marked on the pdb structure on the Pymol visualization in <xr id="hot_ras"/>. The residue 40 (pink) which is part of a beta sheet stabilizes its sheet to a parallel lying alpha helix. The remaining nine residues (orange) lie close to each other in their secondary structure and hold three beta sheets together.
   
 
<figtable id="hot">
 
<figtable id="hot">
Line 83: Line 91:
   
   
The 50 residues with the best CN were compared to 50 hot spots calculated by EVcouplings. These hot spots have an overlap of only 44%. The overlaping hotspots were also ranked very differently of both programs.
+
The 50 residues with the best CN were compared to 50 hot spots calculated by EVcouplings. These hot spots have an overlap of only 44%. The overlapping hotspots were also ranked very differently by both programs.
   
 
===Glucocerebrosidase===
 
===Glucocerebrosidase===
The range of all CN scores lies between -0.66 and 4.36. By excluding contacts between residues that have an sequence distance of less than 5 residues, the highest CN score drops to 4.0. The Histogram of the CN score frequencey distribution has a very high gradient due to the great number of contacts with an CN score of 0.
+
The range of all CN scores lies between -0.66 and 4.36. By excluding contacts between residues that have an sequence distance of less than 5 residues, the highest CN score drops to 4.0. The histogram of the CN score frequency distribution has a very high gradient due to the great number of contacts with an CN score of 0.
 
<figtable id="gluco_dist">
 
<figtable id="gluco_dist">
 
{| border="1" cellpadding="5" cellspacing="0" align="center"
 
{| border="1" cellpadding="5" cellspacing="0" align="center"
Line 101: Line 109:
 
|-
 
|-
 
|}
 
|}
<center><small>'''<caption>''' Range and score distribution of all predicted Glucocerebrosidase contacts and Glucocerebrosidase contacts between residues with a sequence distance of >5 residues (filtered).</caption></small></center>
+
<center><small>'''<caption>''' Range and score distribution of all predicted glucocerebrosidase contacts and glucocerebrosidase contacts between residues with a sequence distance of >5 residues (filtered).</caption></small></center>
 
</figtable>
 
</figtable>
   
  +
  +
<figure id="graphic_gluco" >
 
<gallery caption="" widths="400px" heights="400px" perrow="2">
 
<gallery caption="" widths="400px" heights="400px" perrow="2">
File:gluco_dist.jpeg|Histogram of the distribution of CN scores for Glucocerebrosidase. Only CN of residues with at least a distance of 5 residues were considered.
+
File:gluco_dist.jpeg|Histogram of the distribution of CN scores for glucocerebrosidase. Only CN of residues with at least a distance of 5 residues were considered.
File:glucocerebrosidase_box.jpeg|Boxplot of the CN and states (TP/FP) of high scoring pairs of Glucocerebrosidase (Pearson: 0.20).
+
File:glucocerebrosidase_box.jpeg|Boxplot of the CN and states (TP/FP) of high scoring pairs of glucocerebrosidase (Pearson: 0.20).
 
</gallery>
 
</gallery>
  +
<small>'''<caption>''' Statistics of glucocerebrosidase. </caption></small>
  +
</figure>
   
  +
<figure id="contact_gluco" >
[[Image:glucocerebrosidase_map.jpeg|thumb|right|400px|Contact map of predicted contacts based on freecontact and the real contacts documented in the 1OGS.pdb (light and darkblue). Contrary to the pdb contact, the calculated TP (darkblue) and the FP (red) only consider residues with a sequence distance of at least 5 residues]]
 
  +
[[Image:glucocerebrosidase_map.jpeg|thumb|right|400px|'''<caption>''' Contact map of predicted contacts based on freecontact and the real contacts documented in the 1OGS.pdb (light and darkblue). Contrary to the PDB contact, the calculated TP (darkblue) and the FP (red) only consider residues with a sequence distance of at least 5 residues.</caption>]]
  +
</figure>
   
[[Image:gluco_mut.png|thumb|right|400px| Pymol visualisation of 1OGS.pdb with coloured top 10 hot spots. The best hot spot is marked darkblue. Hot spots that are classified as disease causing mutations of HGMD or dbSNP are coloured red. The remaining hot spots are highlighted in rose.]]
 
   
  +
<figure id="hot_gluco" >
After filtering out the neighbour contacts, only 0.2% predicted contacts have an CN>1 and could be seen as high scoring pairs. These 247 pairs show a TP rate of 39% :
 
  +
[[Image:gluco_mut.png|thumb|right|400px|'''<caption>''' Pymol visualization of 1OGS.pdb with colored top 10 hot spots. The best hot spot is marked darkblue. Hot spots that are classified as disease causing mutations of HGMD or dbSNP are colored red. The remaining hot spots are highlighted in rose.</caption>]]
  +
</figure>
  +
  +
After filtering out the neighbor contacts, only 0.2% predicted contacts have an CN>1 and could be seen as high scoring pairs (<xr id="graphic_gluco"/>, 2). These 247 pairs show a TP rate of 39% :
 
*TP: 97
 
*TP: 97
 
*FP: 150
 
*FP: 150
The Pearson correlation (0.2)cannot distinguish an correlation between the prediction state (TP/FP) and the CN score.
+
The Pearson correlation (0.2) cannot distinguish a correlation between the prediction state (TP/FP) and the CN score.
 
   
   
 
'''Hot Spots'''
 
'''Hot Spots'''
The 10 best hot spots of Glucocrebrosidase predicted by free contact are listed in <xr id="hot2"/> and visualised on the image on the left. All of these hot spots are located in a high conserved area (8 residues) or in striking distance (2 residues) of the [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/images/e/e2/Set1_coffe.png T-Coffe alignment 1 ](alignment position 128-177). According to HGMD, three of these hot spots are disease causing. Especially the third hot spot which has position 15 in 1OGS.pdb (or 54 at the uniprot sequence), is delicate for a mutation. At this position, the mutation of valine to methionine or leucine causes in both cases the Gaucher's disease. This valine as well as its direct neighbour cysteine seems to be very important to the structure and funktion of the protein. Together with the other non disease causing hot spots they form an densely packed hot spot area.
+
The 10 best hot spots of glucocrebrosidase predicted by free contact are listed in <xr id="hot2"/> and visualized in <xr id="hot_gluco"/>. All of these hot spots are located in a high conserved area (8 residues) or in striking distance (2 residues) of the [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/images/e/e2/Set1_coffe.png T-Coffee alignment 1](alignment position 128-177). According to HGMD, three of these hot spots are disease causing. Especially the third hot spot which has position 15 in 1OGS.pdb (or 54 at the UniProt sequence) is delicate for a mutation. At this position, the mutation of valine to methionine or leucine causes in both cases the Gaucher disease. This valine as well as its direct neighbor cysteine seems to be very important to the structure and function of the protein. Together with the other non disease causing hot spots they form a densely packed hot spot area.
For another residue phenylalanine (pdb position 37) the two databases HGMD and dbSNP partly agree with each other. While HGMD identifies this point mutation only as disease causing, dbSNP also documents a synonymous SNP here. Together with the disease causing mutation of methionine (pdb position 53), both residues stabilizes two beta sheets. The two residues are each located in one strand and ly very close in the secondary structure. This two beta sheets play a significant role, as a different arangement of them causes Gaucher disease.
+
For another residue phenylalanine (PDB position 37) the two databases HGMD and dbSNP partly agree with each other. While HGMD identifies this point mutation only as disease causing, dbSNP also documents a synonymous SNP here. Together with the disease causing mutation of methionine (PDB position 53), both residues stabilize two beta sheets. The two residues are each located in one strand and lie very close in the secondary structure. These two beta sheets play a significant role, as a different arrangement of them causes Gaucher disease.
   
 
<figtable id="hot2">
 
<figtable id="hot2">
Line 152: Line 168:
 
|-
 
|-
 
|}
 
|}
<center><small>'''<caption>''' Top scores of 10 Hot Spots calculated from the 497 (L) best scoring pairs. Information about mutation is taken from <span style="color:#FF0040">'''HGMD'''</span> as well as <span style="color:#04B431">'''dbSNP'''</span> and differs between disease causing (dc) and non disease causing (ndc). The positions refer to the pdb positions. The position differs from the position of the uniprot sequence for 39 residues. One point mutation can be classified as synonymous and non-synonymous SNP.</caption></small></center>
+
<center><small>'''<caption>''' Top scores of 10 Hot Spots calculated from the 497 (L) best scoring pairs. Information about mutation is taken from <span style="color:#FF0040">'''HGMD'''</span> as well as <span style="color:#04B431">'''dbSNP'''</span> and differs between disease causing (dc) and non disease causing (ndc). The positions refer to the PDB positions. The position differs from the position of the UniProt sequence by 39 residues. One point mutation can be classified as synonymous and non-synonymous SNP.</caption></small></center>
 
</figtable>
 
</figtable>
 
 
   
Compared to the of EVcouplings prediction, both methods have only 14% overlaping residues in their best 50 hot spots. The main difference between these 50 predicted hot spots of EV coupling and freecontact are the position range of the hot spots. While EVcouplings finds a lot of hot spots in the higher positions, freecontact has most of its hot spots located within the residues on positions <100. Only 7 hot spots of freecontact have positions >430. These few residues correspond exactly to the overlaping hot spots of both programs.
+
Compared to the of EVcouplings prediction, both methods have only 14% overlapping residues in their best 50 hot spots. The main difference between these 50 predicted hot spots of EV coupling and freecontact are the position range of the hot spots. While EVcouplings finds a lot of hot spots in the higher positions, freecontact has most of its hot spots located within the residues on positions <100. Only 7 hot spots of freecontact have positions >430. These few residues correspond exactly to the overlapping hot spots of both programs.
   
 
==Calculate Structural Models==
 
==Calculate Structural Models==
   
  +
We generated structural models of HRas (P01112) by using Evfold. We run Evfold with different scores (DI, PLM). Furthermore, we considered various numbers of contacts. Dependent on the sequence length, we instructed the program to concentrate on 20/40/60/80/100% of contacts. According to this the structures differ in their similarity to the reference structure (121p). In <xr id="const"/> the maps of the predicted contacts can be seen.
  +
  +
  +
<figure id="const" >
 
<gallery caption="" widths="300px" heights="300px" perrow="5">
 
<gallery caption="" widths="300px" heights="300px" perrow="5">
File:DI_20.png| blub
+
File:DI_20.png| 32 constrains based on DI score
File:DI_40.png| blub
+
File:DI_40.png| 64 constrains based on DI score
File:DI_60.png| blub
+
File:DI_60.png|97 constrains based on DI score
File:DI_80.png| blub
+
File:DI_80.png| 129 constrains based on DI score
File:DI_100.png| blub
+
File:DI_100.png| 161 constrains based on DI score
File:PLM20.png| blub
+
File:PLM20.png| 32 constrains based on PLM score
File:PLM40.png| blub
+
File:PLM40.png| 64 constrains based on PLM score
File:PLM60.png| blub
+
File:PLM60.png| 97 constrains based on PLM score
File:PLM80.png| blub
+
File:PLM80.png| 129 constrains based on PLM score
File:PLM100.png| blub
+
File:PLM100.png| 161 constrains based on PLM score
 
</gallery>
 
</gallery>
  +
<small>'''<caption>''' Contact maps of HRas created by Evfold. </caption></small>
  +
</figure>
   
  +
  +
<figure id="box" >
  +
[[Image:boxplot-DI-PLM.jpg|thumb|right|500px|'''<caption>'''Comparison of all predictions of Evfold. For each score and number of constraints, 5 models were grouped in a box of the plot. Models based on the DI score are colored orange. The remaining ones (blue) were calculated with the PLM score. For each score, the models with the same constraint number, which had the best RMSD values, were darkened.</caption> ]]
  +
</figure>
  +
  +
  +
We compared different predicted structures to find the best prediction of Evfold. A great number of RMSDs calculated from different predicted models were visualized in <xr id="box"/>. It shows that, except for 32 contacts (~20%), all RMSD values of PLM models are lower than those from DI models. The average of the RMSD values dependent on the scores and constrains are listed in <xr id="rmsd"/>. As the average RMSD values for models with 32 constraints are the worst RMSD of each score, we did not consider them anymore. A look into an alignment of those models with 121p affirmed our decision as the helices were arranged completely differently. While PLM models with 129 constraints show the best similarity to the reference structure, the best aligned models with DI score consider only 97 constrains. However, both scores produce worse predictions by using 100% of contacts. In general, it is recommendable to use ~70% of contacts to predict a structure most similar to the original sequence.
  +
  +
We also looked at the alignment of the reference structure to PLM model 129_4 and DI model 97_4 (in <xr id="align"/>). Both predicted structures have the best RMSD compared to the remaining models based on the same score. Not only from the RMSD value (2.62), but on the first sight it is obvious that the PLM score produced the better structure (<xr id="align"/>). While the PLM model (darkblue) predicts most helices and sheets, the DI model (red) cannot identify sheets but predicts them as loops. This is best observable for the three sheets at the bottom right in the two alignment pictures. Also the helices predicted by the DI model are more deviating from the reference structure (turquoise) than the ones of the PLM model. As the average RMSD value for the PLM models with 129 constrains (3.03) is better than even the best RMSD of all DI models (3.96), we will use the PLM score for future predictions.
   
   
 
<figtable id="rmsd">
 
<figtable id="rmsd">
{| border="1" cellpadding="3" cellspacing="0" align="center"
+
{| border="1" cellpadding="6" cellspacing="0" align="center"
 
|-
 
|-
 
!colspan="1" style="background:#adceff;" |
 
!colspan="1" style="background:#adceff;" |
!colspan="2" style="background:#adceff;" | Contacts
+
!colspan="5" style="background:#adceff;" | Constraints
 
|-
 
|-
 
! colspan="1" style="background:#adceff;" | Score
 
! colspan="1" style="background:#adceff;" | Score
Line 188: Line 219:
 
! colspan="1" style="background:#adceff;" | 100%
 
! colspan="1" style="background:#adceff;" | 100%
 
|-
 
|-
| DI || |||| || ||
+
| DI ||8.43 ||6.84|| 4.36 ||5.10|| 7.06
 
|-
 
|-
| PLM || || || || ||
+
| PLM ||9.92|| 4.23 ||3.24|| 3.03 ||3.42
 
|-
 
|-
 
|}
 
|}
<center><small>'''<caption>''' RMSD of C alpha atoms of models generated by EV couplings. The models differ in their score (DI/PLM) and their number of contacts (40%/60/100% of L).</caption></small></center>
+
<center><small>'''<caption>''' Average RMSD (C-alpha) of five models with the same score and number of constraints, generated by EVfold. </caption></small></center>
 
</figtable>
 
</figtable>
  +
  +
  +
<figure id="align" >
  +
<gallery caption="" widths="500px" heights="300px" perrow="2">
  +
File:align_DI.png| HRas (turquoise) aligned to best predictions based on DI score (red). Their RMSD is 3.96.
  +
File:align_PLM.png| HRas (turquoise) aligned to best predictions based on PLM score (blue). Their RMSD is 2.62.
  +
</gallery>
  +
<small>'''<caption>''' Alignment of the original HRas structure with the best predictions of Evfold.</caption></small>
  +
</figure>

Latest revision as of 00:05, 6 September 2013

Lab journal

Calculate and analyze correlated mutations

Not all predicted contacts are needed to predict structure from sequence. Residues that lie close to each other in their primary structure, have automatically contact due to their direct neighborhood in the sequence. Such a contact does not give us any information, but rather leads to noise in the results. We are only interested in the contacts that appear because of the secondary and tertiary structure of the proteins. We get this information from residues that have a greater distance in the sequence, but should be in contact in space according to their distance.


HRas

The CN scores between all residues range between -0.65 and 6 (<xr id="rhas_dist"/>). By only looking at contacts between residues with a sequence distance of at least 5 residues, the upper range of the CN scores decreases to 3.4. The score distributions of the described sets:

<figtable id="rhas_dist">

residues minimum lower quartile median upper quartile maximum
all -0.65 -0.23 -0.11 0.04 6.00
filtered -0.65 -0.24 -0.13 0.01 3.40
Range and score distribution of all predicted HRas contacts and HRas contacts between residues with a sequence distance of >5 residues.

</figtable>


<figure id="graph_ras" >

Statistics of HRas </figure>


<figure id="contact_ras" >

Contact map of predicted contacts based on freecontact and the real contacts documented in the 121p.pdb (blue). Contrary to the pdb contact, the calculated TP (darkblue) and the FP (red) only consider residues with a sequence distance of at least 5 residues

</figure>

Residue pairs with a CN>1 are defined as high scoring pairs. These pairs are predicted to be in contact. Only nearly 5% of the residue pairs have a score high enough to be seen as contacts. In the filtered set this applies to less than 1%.


The high scoring pairs were compared to contacts of the HRas documented in pdb. The 65 predicted contacts (<xr id="graph_ras"/>, 2) have a TP-rate of 84.6% and could be classified into

  • TP: 55
  • FP: 10

Although more than half of FP predicted contacts have a lower score, there can be seen no correlation between FP/TP and the CN score. The Pearson correlation leads only to a non-significance of 0.15.

On the contact map in <xr id="contact_ras"/>, a significant pattern for a domain identification could not be observed.

<figure id="hot_ras" >

The 10 hot spots of HRas are colored in orange and pink in the 121p structure (blue).

</figure>

Hot Spots The 10 residues with the best top scores were defined as hot spots. A mutation at this residues will have a great influence on the 3D structure of the protein. The hot spots in <xr id="hot"/> were marked on the pdb structure on the Pymol visualization in <xr id="hot_ras"/>. The residue 40 (pink) which is part of a beta sheet stabilizes its sheet to a parallel lying alpha helix. The remaining nine residues (orange) lie close to each other in their secondary structure and hold three beta sheets together.

<figtable id="hot">

Position Residue Top Score
82 F 9.91
81 V 7.361
141 Y 6.67
143 E 6.52
115 G 6.51
40 Y 5.50
84 I 5.40
145 S 5.25
116 N 5.01
144 T 4.90
Top scores of 10 Hot Spots calculated from the 164 (L) best scoring pairs.

</figtable>


The 50 residues with the best CN were compared to 50 hot spots calculated by EVcouplings. These hot spots have an overlap of only 44%. The overlapping hotspots were also ranked very differently by both programs.

Glucocerebrosidase

The range of all CN scores lies between -0.66 and 4.36. By excluding contacts between residues that have an sequence distance of less than 5 residues, the highest CN score drops to 4.0. The histogram of the CN score frequency distribution has a very high gradient due to the great number of contacts with an CN score of 0. <figtable id="gluco_dist">

residues minimum lower quartile median upper quartile maximum
all -0.66 -0.15 -0.04 0.09 4.36
filtered -0.66 -0.15 -0.04 0.08 4.00
Range and score distribution of all predicted glucocerebrosidase contacts and glucocerebrosidase contacts between residues with a sequence distance of >5 residues (filtered).

</figtable>


<figure id="graphic_gluco" >

Statistics of glucocerebrosidase. </figure>

<figure id="contact_gluco" >

Contact map of predicted contacts based on freecontact and the real contacts documented in the 1OGS.pdb (light and darkblue). Contrary to the PDB contact, the calculated TP (darkblue) and the FP (red) only consider residues with a sequence distance of at least 5 residues.

</figure>


<figure id="hot_gluco" >

Pymol visualization of 1OGS.pdb with colored top 10 hot spots. The best hot spot is marked darkblue. Hot spots that are classified as disease causing mutations of HGMD or dbSNP are colored red. The remaining hot spots are highlighted in rose.

</figure>

After filtering out the neighbor contacts, only 0.2% predicted contacts have an CN>1 and could be seen as high scoring pairs (<xr id="graphic_gluco"/>, 2). These 247 pairs show a TP rate of 39% :

  • TP: 97
  • FP: 150

The Pearson correlation (0.2) cannot distinguish a correlation between the prediction state (TP/FP) and the CN score.


Hot Spots The 10 best hot spots of glucocrebrosidase predicted by free contact are listed in <xr id="hot2"/> and visualized in <xr id="hot_gluco"/>. All of these hot spots are located in a high conserved area (8 residues) or in striking distance (2 residues) of the T-Coffee alignment 1(alignment position 128-177). According to HGMD, three of these hot spots are disease causing. Especially the third hot spot which has position 15 in 1OGS.pdb (or 54 at the UniProt sequence) is delicate for a mutation. At this position, the mutation of valine to methionine or leucine causes in both cases the Gaucher disease. This valine as well as its direct neighbor cysteine seems to be very important to the structure and function of the protein. Together with the other non disease causing hot spots they form a densely packed hot spot area. For another residue phenylalanine (PDB position 37) the two databases HGMD and dbSNP partly agree with each other. While HGMD identifies this point mutation only as disease causing, dbSNP also documents a synonymous SNP here. Together with the disease causing mutation of methionine (PDB position 53), both residues stabilize two beta sheets. The two residues are each located in one strand and lie very close in the secondary structure. These two beta sheets play a significant role, as a different arrangement of them causes Gaucher disease.

<figtable id="hot2">

Position Residue Top Score Mutation
25 S 12.49 -
50 E 11.98 -
15 V 11.64 dc
14 V 11.47 -
4 C 11.45 -
37 F 11.38 dc/ndc
53 M 10.66 dc
26 F 10.62 -
16 C 10.55 dc
22 Y 10.33 -
Top scores of 10 Hot Spots calculated from the 497 (L) best scoring pairs. Information about mutation is taken from HGMD as well as dbSNP and differs between disease causing (dc) and non disease causing (ndc). The positions refer to the PDB positions. The position differs from the position of the UniProt sequence by 39 residues. One point mutation can be classified as synonymous and non-synonymous SNP.

</figtable>


Compared to the of EVcouplings prediction, both methods have only 14% overlapping residues in their best 50 hot spots. The main difference between these 50 predicted hot spots of EV coupling and freecontact are the position range of the hot spots. While EVcouplings finds a lot of hot spots in the higher positions, freecontact has most of its hot spots located within the residues on positions <100. Only 7 hot spots of freecontact have positions >430. These few residues correspond exactly to the overlapping hot spots of both programs.

Calculate Structural Models

We generated structural models of HRas (P01112) by using Evfold. We run Evfold with different scores (DI, PLM). Furthermore, we considered various numbers of contacts. Dependent on the sequence length, we instructed the program to concentrate on 20/40/60/80/100% of contacts. According to this the structures differ in their similarity to the reference structure (121p). In <xr id="const"/> the maps of the predicted contacts can be seen.


<figure id="const" >

Contact maps of HRas created by Evfold. </figure>


<figure id="box" >

Comparison of all predictions of Evfold. For each score and number of constraints, 5 models were grouped in a box of the plot. Models based on the DI score are colored orange. The remaining ones (blue) were calculated with the PLM score. For each score, the models with the same constraint number, which had the best RMSD values, were darkened.

</figure>


We compared different predicted structures to find the best prediction of Evfold. A great number of RMSDs calculated from different predicted models were visualized in <xr id="box"/>. It shows that, except for 32 contacts (~20%), all RMSD values of PLM models are lower than those from DI models. The average of the RMSD values dependent on the scores and constrains are listed in <xr id="rmsd"/>. As the average RMSD values for models with 32 constraints are the worst RMSD of each score, we did not consider them anymore. A look into an alignment of those models with 121p affirmed our decision as the helices were arranged completely differently. While PLM models with 129 constraints show the best similarity to the reference structure, the best aligned models with DI score consider only 97 constrains. However, both scores produce worse predictions by using 100% of contacts. In general, it is recommendable to use ~70% of contacts to predict a structure most similar to the original sequence.

We also looked at the alignment of the reference structure to PLM model 129_4 and DI model 97_4 (in <xr id="align"/>). Both predicted structures have the best RMSD compared to the remaining models based on the same score. Not only from the RMSD value (2.62), but on the first sight it is obvious that the PLM score produced the better structure (<xr id="align"/>). While the PLM model (darkblue) predicts most helices and sheets, the DI model (red) cannot identify sheets but predicts them as loops. This is best observable for the three sheets at the bottom right in the two alignment pictures. Also the helices predicted by the DI model are more deviating from the reference structure (turquoise) than the ones of the PLM model. As the average RMSD value for the PLM models with 129 constrains (3.03) is better than even the best RMSD of all DI models (3.96), we will use the PLM score for future predictions.


<figtable id="rmsd">

Constraints
Score 20% 40% 60% 80% 100%
DI 8.43 6.84 4.36 5.10 7.06
PLM 9.92 4.23 3.24 3.03 3.42
Average RMSD (C-alpha) of five models with the same score and number of constraints, generated by EVfold.

</figtable>


<figure id="align" >

Alignment of the original HRas structure with the best predictions of Evfold. </figure>