Gaucher Disease: Task 06 - Protein structure prediction from evolutionary sequence variation

From Bioinformatikpedia
Revision as of 22:01, 4 September 2013 by Gerkej (talk | contribs) (Calculate Structural Models)

Lab journal

Calculate and analyze correlated mutations

Not all predicted contacts are needed to predict structure from sequence. Residues that ly close to each other in their primary structure, have automatically contact due to their direct neigbourhood in the sequence. Such a contact does not give any information, but rather leads to noise in the results. We are interessted in the contacts that apear because of the secondary and tertiary structure of the proteins. This information, we get from residues that have a greater distance in the sequence, but should be in contact in space acording to their distance.


The CN scores between all residues range between -0.65 and 6 (<xr id="rhas_dist"/>). By only looking at contacts between residues with a sequence distance of at least 5 residues, the upper range of the CN scores decreases to 3.4. The score distributions of the described sets:

<figtable id="rhas_dist">

residues minimum lower quartile median upper quartile maximum
all -0.65 -0.23 -0.11 0.04 6.00
filtered -0.65 -0.24 -0.13 0.01 3.40
Range and score distribution of all predicted HRas contacts and HRas contacts between residues with a sequence distance of >5 residues.


<figure id="graph_ras" >

Statistics of HRas </figure>

<figure id="contact_ras" >

Contact map of predicted contacts based on freecontact and the real contacts documented in the 121p.pdb (blue). Contrary to the pdb contact, the calculated TP (darkblue) and the FP (red) only consider residues with a sequence distance of at least 5 residues


Residue pairs with a CN>1 are defined as high scoring pairs. These pairs are predicted to be in contact. Only nearly 5% of the residue pairs have a score high enough to be seen as contacts. In the filtered set this applies to less than 1%.

The high scoring pairs were compared to contacts of the HRas documented in pdb. The 65 predicted contacts (<xr id="graph_ras"/>, 2) have a TP-rate of 84.6% and could be classified into

  • TP: 55
  • FP: 10

Although more than half of FP predicted contacts have a lower score, there can be seen no correlation between FP/TP and the CN score. The Pearson correlation leads only to a non-significanz of 0.15.

On the contact map in <xr id="contact_ras"/>, a significant pattern for a domain identification could not be observed.

<figure id="hot_ras" >

The 10 hot spots of HRas are colored in orange and pink in the 121p structure (blue).


Hot Spots The 10 residues with the best top score were defined as hot spots. A mutation at this residues will have an great influence on the 3D structure of the protein. The hot spots in <xr id="hot"/> were marked on the pdb structure on the pymol visualisation in <xr id="hot_ras"/>. The residue 40 (pink) which is part of a beta sheet stabilises its sheet to a parallel lying alpha helix. The remainig nine residues (orange) ly close to each other in their secondary structure and hold three beta sheets together.

<figtable id="hot">

Position Residue Top Score
82 F 9.91
81 V 7.361
141 Y 6.67
143 E 6.52
115 G 6.51
40 Y 5.50
84 I 5.40
145 S 5.25
116 N 5.01
144 T 4.90
Top scores of 10 Hot Spots calculated from the 164 (L) best scoring pairs.


The 50 residues with the best CN were compared to 50 hot spots calculated by EVcouplings. These hot spots have an overlap of only 44%. The overlaping hotspots were also ranked very differently of both programs.


The range of all CN scores lies between -0.66 and 4.36. By excluding contacts between residues that have an sequence distance of less than 5 residues, the highest CN score drops to 4.0. The Histogram of the CN score frequencey distribution has a very high gradient due to the great number of contacts with an CN score of 0. <figtable id="gluco_dist">

residues minimum lower quartile median upper quartile maximum
all -0.66 -0.15 -0.04 0.09 4.36
filtered -0.66 -0.15 -0.04 0.08 4.00
Range and score distribution of all predicted Glucocerebrosidase contacts and Glucocerebrosidase contacts between residues with a sequence distance of >5 residues (filtered).


<figure id="graphic_gluco" >

Statistics of glucocerebrosidase. </figure>

<figure id="contact_gluco" >

Contact map of predicted contacts based on freecontact and the real contacts documented in the 1OGS.pdb (light and darkblue). Contrary to the pdb contact, the calculated TP (darkblue) and the FP (red) only consider residues with a sequence distance of at least 5 residues.


<figure id="hot_gluco" >

Pymol visualisation of 1OGS.pdb with coloured top 10 hot spots. The best hot spot is marked darkblue. Hot spots that are classified as disease causing mutations of HGMD or dbSNP are coloured red. The remaining hot spots are highlighted in rose.


After filtering out the neighbour contacts, only 0.2% predicted contacts have an CN>1 and could be seen as high scoring pairs (<xr id="graphic_gluco"/>, 2). These 247 pairs show a TP rate of 39% :

  • TP: 97
  • FP: 150

The Pearson correlation (0.2)cannot distinguish an correlation between the prediction state (TP/FP) and the CN score.

Hot Spots The 10 best hot spots of Glucocrebrosidase predicted by free contact are listed in <xr id="hot2"/> and visualised in <xr id="hot_gluco"/>. All of these hot spots are located in a high conserved area (8 residues) or in striking distance (2 residues) of the T-Coffe alignment 1 (alignment position 128-177). According to HGMD, three of these hot spots are disease causing. Especially the third hot spot which has position 15 in 1OGS.pdb (or 54 at the uniprot sequence), is delicate for a mutation. At this position, the mutation of valine to methionine or leucine causes in both cases the Gaucher's disease. This valine as well as its direct neighbour cysteine seems to be very important to the structure and funktion of the protein. Together with the other non disease causing hot spots they form an densely packed hot spot area. For another residue phenylalanine (pdb position 37) the two databases HGMD and dbSNP partly agree with each other. While HGMD identifies this point mutation only as disease causing, dbSNP also documents a synonymous SNP here. Together with the disease causing mutation of methionine (pdb position 53), both residues stabilizes two beta sheets. The two residues are each located in one strand and ly very close in the secondary structure. This two beta sheets play a significant role, as a different arangement of them causes Gaucher disease.

<figtable id="hot2">

Position Residue Top Score Mutation
25 S 12.49 -
50 E 11.98 -
15 V 11.64 dc
14 V 11.47 -
4 C 11.45 -
37 F 11.38 dc/ndc
53 M 10.66 dc
26 F 10.62 -
16 C 10.55 dc
22 Y 10.33 -
Top scores of 10 Hot Spots calculated from the 497 (L) best scoring pairs. Information about mutation is taken from HGMD as well as dbSNP and differs between disease causing (dc) and non disease causing (ndc). The positions refer to the pdb positions. The position differs from the position of the uniprot sequence for 39 residues. One point mutation can be classified as synonymous and non-synonymous SNP.


Compared to the of EVcouplings prediction, both methods have only 14% overlaping residues in their best 50 hot spots. The main difference between these 50 predicted hot spots of EV coupling and freecontact are the position range of the hot spots. While EVcouplings finds a lot of hot spots in the higher positions, freecontact has most of its hot spots located within the residues on positions <100. Only 7 hot spots of freecontact have positions >430. These few residues correspond exactly to the overlaping hot spots of both programs.

Calculate Structural Models

We generated structural models of HRas (P01112) by using Evfold. We run Evfold with different scores (DI, PLM). Furthermore we considered various numbers of contacts. Dependent on the sequence length, we instructed the programm to concentrate on 20/40/60/80/100% of contacts. According to this the structures differ in their similarity to the reference structure (121p). In <xr id="const"/> the maps of the predicted contacts can be seen.

<figure id="const" >

Contact maps of HRas created by Evfold. </figure>

<figure id="box" >

Comparison of all predictions of Evfold. For each score and number of constraints, 5 models were grouped in a box of the plot. Models based on the DI score are colored orange. The remaining ones (blue) were calculated with the PLM score. For each score, the models with the same constraint number, wich had the best rmsds, were darkened.


We compared different predicted structures to find the best prediction of Evfold. A great number of rmsds calculated from different predicted models were visualized in the picture on the right. <xr id="box"/> shows that, except for 32 contacts (~20%), all rmsd of PLM models are lower than those from DI models. An average of rmsd values dependent on the scores and constrains are listed in <xr id="rmsd"/>. As the average rmsd values for models with 32 contraints, are the worst rmsd of each score, we did not consider them anymore. A look into an alignment of those models with 121p affirmed our decision as the helices were arranged completly different. While PLM models with 129 constraints show the best similarity to the reference structer, the best aligned models with DI score consider only 97 constrains. But, both scores produce worse predictions by using 100% of contacts. In general, it is recommendable to use ~70% of contacts to predict a structure most similar to the original sequence.

We also looked at the alignment of the refernce structure to PLM model 129_4 and DI model 97_4 (in <xr id="align"/>). Both predicted structures have the best rmsd compared to the remaining models based on the same score. Not only from the rmsd value (2.62), but on the first sight it is obvious that the PLM score produced the better structure (see gallery below). While the PLM model (darkblue) predicts most helices and sheets, the DI model (red) cannot identify sheets but predicts them as loops. This is best observable for the three sheets at the bottom right in the two alignment pictures. Also the helices are more deviating from the reference structure (turquoise) than the ones of the PLM model. As the average rmsd value for the PLM models with 129 constrains (3.03) is better than even the best rmsd of all DI models (3.96), we will use the PLM score for future predictions.

<figtable id="rmsd">

Score 20% 40% 60% 80% 100%
DI 8.43 6.84 4.36 5.10 7.06
PLM 9.92 4.23 3.24 3.03 3.42
Average RMSD (C alpha) of five models with the same score and number of constraints, generated by EVfold.


<figure id="align" >

Alignment of the original HRas structure with the best predictions of Evfold. </figure>