Gaucher Disease: Task 06 - Protein structure prediction from evolutionary sequence variation

From Bioinformatikpedia

Lab journal

Calculate and analyze correlated mutations

Not all predicted contacts are needed to predict structure from sequence. Residues that lie close to each other in their primary structure, have automatically contact due to their direct neighborhood in the sequence. Such a contact does not give us any information, but rather leads to noise in the results. We are only interested in the contacts that appear because of the secondary and tertiary structure of the proteins. We get this information from residues that have a greater distance in the sequence, but should be in contact in space according to their distance.


The CN scores between all residues range between -0.65 and 6 (<xr id="rhas_dist"/>). By only looking at contacts between residues with a sequence distance of at least 5 residues, the upper range of the CN scores decreases to 3.4. The score distributions of the described sets:

<figtable id="rhas_dist">

residues minimum lower quartile median upper quartile maximum
all -0.65 -0.23 -0.11 0.04 6.00
filtered -0.65 -0.24 -0.13 0.01 3.40
Range and score distribution of all predicted HRas contacts and HRas contacts between residues with a sequence distance of >5 residues.


<figure id="graph_ras" >

Statistics of HRas </figure>

<figure id="contact_ras" >

Contact map of predicted contacts based on freecontact and the real contacts documented in the 121p.pdb (blue). Contrary to the pdb contact, the calculated TP (darkblue) and the FP (red) only consider residues with a sequence distance of at least 5 residues


Residue pairs with a CN>1 are defined as high scoring pairs. These pairs are predicted to be in contact. Only nearly 5% of the residue pairs have a score high enough to be seen as contacts. In the filtered set this applies to less than 1%.

The high scoring pairs were compared to contacts of the HRas documented in pdb. The 65 predicted contacts (<xr id="graph_ras"/>, 2) have a TP-rate of 84.6% and could be classified into

  • TP: 55
  • FP: 10

Although more than half of FP predicted contacts have a lower score, there can be seen no correlation between FP/TP and the CN score. The Pearson correlation leads only to a non-significance of 0.15.

On the contact map in <xr id="contact_ras"/>, a significant pattern for a domain identification could not be observed.

<figure id="hot_ras" >

The 10 hot spots of HRas are colored in orange and pink in the 121p structure (blue).


Hot Spots The 10 residues with the best top scores were defined as hot spots. A mutation at this residues will have a great influence on the 3D structure of the protein. The hot spots in <xr id="hot"/> were marked on the pdb structure on the Pymol visualization in <xr id="hot_ras"/>. The residue 40 (pink) which is part of a beta sheet stabilizes its sheet to a parallel lying alpha helix. The remaining nine residues (orange) lie close to each other in their secondary structure and hold three beta sheets together.

<figtable id="hot">

Position Residue Top Score
82 F 9.91
81 V 7.361
141 Y 6.67
143 E 6.52
115 G 6.51
40 Y 5.50
84 I 5.40
145 S 5.25
116 N 5.01
144 T 4.90
Top scores of 10 Hot Spots calculated from the 164 (L) best scoring pairs.


The 50 residues with the best CN were compared to 50 hot spots calculated by EVcouplings. These hot spots have an overlap of only 44%. The overlapping hotspots were also ranked very differently by both programs.


The range of all CN scores lies between -0.66 and 4.36. By excluding contacts between residues that have an sequence distance of less than 5 residues, the highest CN score drops to 4.0. The histogram of the CN score frequency distribution has a very high gradient due to the great number of contacts with an CN score of 0. <figtable id="gluco_dist">

residues minimum lower quartile median upper quartile maximum
all -0.66 -0.15 -0.04 0.09 4.36
filtered -0.66 -0.15 -0.04 0.08 4.00
Range and score distribution of all predicted glucocerebrosidase contacts and glucocerebrosidase contacts between residues with a sequence distance of >5 residues (filtered).


<figure id="graphic_gluco" >

Statistics of glucocerebrosidase. </figure>

<figure id="contact_gluco" >

Contact map of predicted contacts based on freecontact and the real contacts documented in the 1OGS.pdb (light and darkblue). Contrary to the PDB contact, the calculated TP (darkblue) and the FP (red) only consider residues with a sequence distance of at least 5 residues.


<figure id="hot_gluco" >

Pymol visualization of 1OGS.pdb with colored top 10 hot spots. The best hot spot is marked darkblue. Hot spots that are classified as disease causing mutations of HGMD or dbSNP are colored red. The remaining hot spots are highlighted in rose.


After filtering out the neighbor contacts, only 0.2% predicted contacts have an CN>1 and could be seen as high scoring pairs (<xr id="graphic_gluco"/>, 2). These 247 pairs show a TP rate of 39% :

  • TP: 97
  • FP: 150

The Pearson correlation (0.2) cannot distinguish a correlation between the prediction state (TP/FP) and the CN score.

Hot Spots The 10 best hot spots of glucocrebrosidase predicted by free contact are listed in <xr id="hot2"/> and visualized in <xr id="hot_gluco"/>. All of these hot spots are located in a high conserved area (8 residues) or in striking distance (2 residues) of the T-Coffee alignment 1(alignment position 128-177). According to HGMD, three of these hot spots are disease causing. Especially the third hot spot which has position 15 in 1OGS.pdb (or 54 at the UniProt sequence) is delicate for a mutation. At this position, the mutation of valine to methionine or leucine causes in both cases the Gaucher disease. This valine as well as its direct neighbor cysteine seems to be very important to the structure and function of the protein. Together with the other non disease causing hot spots they form a densely packed hot spot area. For another residue phenylalanine (PDB position 37) the two databases HGMD and dbSNP partly agree with each other. While HGMD identifies this point mutation only as disease causing, dbSNP also documents a synonymous SNP here. Together with the disease causing mutation of methionine (PDB position 53), both residues stabilize two beta sheets. The two residues are each located in one strand and lie very close in the secondary structure. These two beta sheets play a significant role, as a different arrangement of them causes Gaucher disease.

<figtable id="hot2">

Position Residue Top Score Mutation
25 S 12.49 -
50 E 11.98 -
15 V 11.64 dc
14 V 11.47 -
4 C 11.45 -
37 F 11.38 dc/ndc
53 M 10.66 dc
26 F 10.62 -
16 C 10.55 dc
22 Y 10.33 -
Top scores of 10 Hot Spots calculated from the 497 (L) best scoring pairs. Information about mutation is taken from HGMD as well as dbSNP and differs between disease causing (dc) and non disease causing (ndc). The positions refer to the PDB positions. The position differs from the position of the UniProt sequence by 39 residues. One point mutation can be classified as synonymous and non-synonymous SNP.


Compared to the of EVcouplings prediction, both methods have only 14% overlapping residues in their best 50 hot spots. The main difference between these 50 predicted hot spots of EV coupling and freecontact are the position range of the hot spots. While EVcouplings finds a lot of hot spots in the higher positions, freecontact has most of its hot spots located within the residues on positions <100. Only 7 hot spots of freecontact have positions >430. These few residues correspond exactly to the overlapping hot spots of both programs.

Calculate Structural Models

We generated structural models of HRas (P01112) by using Evfold. We run Evfold with different scores (DI, PLM). Furthermore, we considered various numbers of contacts. Dependent on the sequence length, we instructed the program to concentrate on 20/40/60/80/100% of contacts. According to this the structures differ in their similarity to the reference structure (121p). In <xr id="const"/> the maps of the predicted contacts can be seen.

<figure id="const" >

Contact maps of HRas created by Evfold. </figure>

<figure id="box" >

Comparison of all predictions of Evfold. For each score and number of constraints, 5 models were grouped in a box of the plot. Models based on the DI score are colored orange. The remaining ones (blue) were calculated with the PLM score. For each score, the models with the same constraint number, which had the best RMSD values, were darkened.


We compared different predicted structures to find the best prediction of Evfold. A great number of RMSDs calculated from different predicted models were visualized in <xr id="box"/>. It shows that, except for 32 contacts (~20%), all RMSD values of PLM models are lower than those from DI models. The average of the RMSD values dependent on the scores and constrains are listed in <xr id="rmsd"/>. As the average RMSD values for models with 32 constraints are the worst RMSD of each score, we did not consider them anymore. A look into an alignment of those models with 121p affirmed our decision as the helices were arranged completely differently. While PLM models with 129 constraints show the best similarity to the reference structure, the best aligned models with DI score consider only 97 constrains. However, both scores produce worse predictions by using 100% of contacts. In general, it is recommendable to use ~70% of contacts to predict a structure most similar to the original sequence.

We also looked at the alignment of the reference structure to PLM model 129_4 and DI model 97_4 (in <xr id="align"/>). Both predicted structures have the best RMSD compared to the remaining models based on the same score. Not only from the RMSD value (2.62), but on the first sight it is obvious that the PLM score produced the better structure (<xr id="align"/>). While the PLM model (darkblue) predicts most helices and sheets, the DI model (red) cannot identify sheets but predicts them as loops. This is best observable for the three sheets at the bottom right in the two alignment pictures. Also the helices predicted by the DI model are more deviating from the reference structure (turquoise) than the ones of the PLM model. As the average RMSD value for the PLM models with 129 constrains (3.03) is better than even the best RMSD of all DI models (3.96), we will use the PLM score for future predictions.

<figtable id="rmsd">

Score 20% 40% 60% 80% 100%
DI 8.43 6.84 4.36 5.10 7.06
PLM 9.92 4.23 3.24 3.03 3.42
Average RMSD (C-alpha) of five models with the same score and number of constraints, generated by EVfold.


<figure id="align" >

Alignment of the original HRas structure with the best predictions of Evfold. </figure>