Difference between revisions of "Gaucher Disease: Task 06 - Protein structure prediction from evolutionary sequence variation"

Calculate and analyze correlated mutations

Not all predicted contacts are needed to predict structure from sequence. Residues that ly close to each other in their primary structure, have automatically contact due to their direct neigbourhood in the sequence. Such a contact does not give any information, but rather leads to noise in the results. We are interessted in the contacts that apear because of the secondary and tertiary structure of the proteins. This information, we get from residues that have a greater distance in the sequence, but should be in contact in space acording to their distance.

HRas

The CN scores between all residues range between -0.65 and 6 (<xr id="rhas_dist"/>). By only looking at contacts between residues with a sequence distance of at least 5 residues, the upper range of the CN scores decreases to 3.4. The score distributions of the described sets:

<figtable id="rhas_dist">

residues minimum lower quartile median upper quartile maximum
all -0.65 -0.23 -0.11 0.04 6.00
filtered -0.65 -0.24 -0.13 0.01 3.40
Range and score distribution of all predicted HRas contacts and HRas contacts between residues with a sequence distance of >5 residues.

</figtable>

Contact map of predicted contacts based on freecontact and the real contacts documented in the 121p.pdb (blue). Contrary to the pdb contact, the calculated TP (darkblue) and the FP (red) only consider residues with a sequence distance of at least 5 residues

Residue pairs with a CN>1 are defined as high scoring pairs. These pairs are predicted to be in contact. Only nearly 5% of the residue pairs have a score high enough to be seen as contacts. In the filtered set this applies to less than 1%.

The high scoring pairs were compared to contacts of the HRas documented in pdb. The 65 predicted contacts have a TP-rate of 84.6% and could be classified into

• TP: 55
• FP: 10

Although more than half of FP predicted contacts have a lower score, there can be seen no correlation between FP/TP and the CN score. The Pearson correlation leads only to a non-significanz of 0.15.

On the contact map, a significant pattern for a domain identification could not be observed.

The 10 hot spots of HRas are colored in orange and pink in the 121p structure (blue).

Hot Spots The 10 residues with the best top score were defined as hot spots. A mutation at this residues will have an great influence on the 3D structure of the protein. The hot spots in <xr id="hot"/> were marked on the pdb structure on the pymol visualisation on the left. The residue 40 (pink) which is part of a beta sheet stabilises its sheet to a parallel lying alpha helix. The remainig nine residues (orange) ly close to each other in their secondary structure and hold three beta sheets together.

<figtable id="hot">

Position Residue Top Score
82 F 9.91
81 V 7.361
141 Y 6.67
143 E 6.52
115 G 6.51
40 Y 5.50
84 I 5.40
145 S 5.25
116 N 5.01
144 T 4.90
Top scores of 10 Hot Spots calculated from the 164 (L) best scoring pairs.

</figtable>

The 50 residues with the best CN were compared to 50 hot spots calculated by EVcouplings. These hot spots have an overlap of only 44%. The overlaping hotspots were also ranked very differently of both programs.

Glucocerebrosidase

The range of all CN scores lies between -0.66 and 4.36. By excluding contacts between residues that have an sequence distance of less than 5 residues, the highest CN score drops to 4.0. The Histogram of the CN score frequencey distribution has a very high gradient due to the great number of contacts with an CN score of 0. <figtable id="gluco_dist">

residues minimum lower quartile median upper quartile maximum
all -0.66 -0.15 -0.04 0.09 4.36
filtered -0.66 -0.15 -0.04 0.08 4.00
Range and score distribution of all predicted Glucocerebrosidase contacts and Glucocerebrosidase contacts between residues with a sequence distance of >5 residues (filtered).

</figtable>

Contact map of predicted contacts based on freecontact and the real contacts documented in the 1OGS.pdb (light and darkblue). Contrary to the pdb contact, the calculated TP (darkblue) and the FP (red) only consider residues with a sequence distance of at least 5 residues
Pymol visualisation of 1OGS.pdb with coloured top 10 hot spots. The best hot spot is marked darkblue. Hot spots that are classified as disease causing mutations of HGMD or dbSNP are coloured red. The remaining hot spots are highlighted in rose.

After filtering out the neighbour contacts, only 0.2% predicted contacts have an CN>1 and could be seen as high scoring pairs. These 247 pairs show a TP rate of 39% :

• TP: 97
• FP: 150

The Pearson correlation (0.2)cannot distinguish an correlation between the prediction state (TP/FP) and the CN score.

Hot Spots The 10 best hot spots of Glucocrebrosidase predicted by free contact are listed in <xr id="hot2"/> and visualised on the image on the left. All of these hot spots are located in a high conserved area (8 residues) or in striking distance (2 residues) of the T-Coffe alignment 1 (alignment position 128-177). According to HGMD, three of these hot spots are disease causing. Especially the third hot spot which has position 15 in 1OGS.pdb (or 54 at the uniprot sequence), is delicate for a mutation. At this position, the mutation of valine to methionine or leucine causes in both cases the Gaucher's disease. This valine as well as its direct neighbour cysteine seems to be very important to the structure and funktion of the protein. Together with the other non disease causing hot spots they form an densely packed hot spot area. For another residue phenylalanine (pdb position 37) the two databases HGMD and dbSNP partly agree with each other. While HGMD identifies this point mutation only as disease causing, dbSNP also documents a synonymous SNP here. Together with the disease causing mutation of methionine (pdb position 53), both residues stabilizes two beta sheets. The two residues are each located in one strand and ly very close in the secondary structure. This two beta sheets play a significant role, as a different arangement of them causes Gaucher disease.

<figtable id="hot2">

Position Residue Top Score Mutation
25 S 12.49 -
50 E 11.98 -
15 V 11.64 dc
14 V 11.47 -
4 C 11.45 -
37 F 11.38 dc/ndc
53 M 10.66 dc
26 F 10.62 -
16 C 10.55 dc
22 Y 10.33 -
Top scores of 10 Hot Spots calculated from the 497 (L) best scoring pairs. Information about mutation is taken from HGMD as well as dbSNP and differs between disease causing (dc) and non disease causing (ndc). The positions refer to the pdb positions. The position differs from the position of the uniprot sequence for 39 residues. One point mutation can be classified as synonymous and non-synonymous SNP.

</figtable>

Compared to the of EVcouplings prediction, both methods have only 14% overlaping residues in their best 50 hot spots. The main difference between these 50 predicted hot spots of EV coupling and freecontact are the position range of the hot spots. While EVcouplings finds a lot of hot spots in the higher positions, freecontact has most of its hot spots located within the residues on positions <100. Only 7 hot spots of freecontact have positions >430. These few residues correspond exactly to the overlaping hot spots of both programs.

Calculate Structural Models

We generated structural models of HRas (P01112) by using Evfold. We run Evfold with different scores (DI, PLM). Furthermore we considered various numbers of contacts. Dependent on the sequence length, we instructed the programm to concentrate on 20/40/60/80/100% of contacts. According to this the structures differ in their similarity to the reference structure (121p). In the gallery below the maps of the predicted contacts can be seen

To generate structural models, use EVfold from the evolutionary couplings server. For this task consider different numbers of contacts. Typically optimal models are generated with number of contacts approximately 60-70% of L (L = length of aligned protein sequence). Try varying numbers of contacts, including:

```   the optimal 60-70% of L
40% of L
100% of L
```

When using the EVfold server (EVcouplings and EVfold), i.e. to calculate correlated mutations and generate models, use in addition to the DI score also PLM for couplings scoring.

To check the quality of the predicted models, calculate the Ca-RMSD to the crystal structure, e.g. using Pymol, or analyze the corresponding output from the EV server.

Comparison of all predictions of Evfold. For each score and number of constraints, 5 models were grouped in a box of the plot. Models based on the DI score are colored orange. The remaining ones (blue) were calculated with the PLM score. For each score, the models with the same constraint number, wich had the best rmsds, were darkened.

<figtable id="rmsd">

Constraints
Score 20% 40% 60% 80% 100%
DI 8.43 6.84 4.36 5.10 7.06
PLM 9.92 4.23 3.24 3.03 3.42
Average RMSD (C alpha) of five models with the same score and number of constraints, generated by EVfold.

</figtable>