Gaucher Disease: Task 06 - Protein structure prediction from evolutionary sequence variation
Not all predicted contacts are needed to predict structure from sequence. Residues that lie close to each other in the primary structure are automatically in contact because they are direct neighbours in the sequence. Such contacts carry no information and only add noise to the results. We are interested in the contacts that arise from the secondary and tertiary structure of the protein. This information comes from residues that are far apart in the sequence but close enough in space to be in contact.
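The filtering step described above can be sketched as follows. This is only a minimal illustration, not the pipeline that was actually used: the function name `filter_by_separation`, the `(i, j, score)` tuple layout and the toy scores are assumptions. The minimum sequence distance of 5 is the cutoff used in the sections below.

```python
def filter_by_separation(pairs, min_separation=5):
    """Keep only residue pairs that are at least `min_separation` apart in sequence.

    `pairs` is an iterable of (i, j, score) tuples (hypothetical layout).
    """
    return [(i, j, score) for i, j, score in pairs if abs(i - j) >= min_separation]


# Toy input with made-up scores (not taken from the analysis).
toy_pairs = [(1, 2, 6.0), (1, 4, 2.1), (3, 8, 1.4), (10, 40, 0.2)]
filtered = filter_by_separation(toy_pairs)
print(filtered)
```

The two pairs with a sequence distance below 5 are dropped regardless of their score, which is why the maximum score of the filtered set can be lower than that of the full set.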
HRas
The CN scores of all residue pairs range from -0.65 to 6.00 (<xr id="rhas_dist"/>). When only pairs with a sequence distance of at least 5 residues are considered, the maximum CN score drops to 3.40. The score distributions of both sets are:
<figtable id="rhas_dist">
residues | minimum | lower quartile | median | upper quartile | maximum |
---|---|---|---|---|---|
all | -0.65 | -0.23 | -0.11 | 0.04 | 6.00 |
filtered | -0.65 | -0.24 | -0.13 | 0.01 | 3.40 |
</figtable>
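A table like the one above can be produced with a five-number summary of the score list. The sketch below uses `statistics.quantiles` from the Python standard library. The function name and the toy values are hypothetical, and since the quartile rule behind the table is not stated, the `inclusive` method is an assumption.

```python
import statistics


def five_number_summary(scores):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # method="inclusive" is an assumption; the original analysis may have
    # used a different quartile rule.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return min(scores), q1, median, q3, max(scores)


toy_scores = [1, 2, 3, 4, 5]  # made-up values, not real CN scores
print(five_number_summary(toy_scores))
```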
Residue pairs with CN > 1 are defined as high-scoring pairs and are predicted to be in contact. Just under 5% of all residue pairs score high enough to count as contacts; in the filtered set, fewer than 1% do.
The high-scoring pairs were compared with the HRas contacts documented in the PDB. The 65 predicted contacts have a TP rate of 84.6% and were classified as follows:
- TP: 55
- FP: 10
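The classification above and the resulting rate can be sketched as follows. The function names and the toy pairs are hypothetical, and how the reference contacts were derived from the PDB structure is not shown here. Note that the "TP rate" used in this text is the fraction TP / (TP + FP) of the predicted contacts.

```python
def classify_predictions(predicted, reference):
    """Split predicted pairs into true and false positives.

    Pairs are compared independently of residue order, so (3, 8) matches (8, 3).
    """
    reference_set = {frozenset(pair) for pair in reference}
    tp = [pair for pair in predicted if frozenset(pair) in reference_set]
    fp = [pair for pair in predicted if frozenset(pair) not in reference_set]
    return tp, fp


def tp_rate(n_tp, n_fp):
    """Fraction of predicted contacts that are true positives."""
    return n_tp / (n_tp + n_fp)


# Toy example with made-up pairs.
toy_predicted = [(3, 8), (10, 40), (12, 30)]
toy_reference = [(8, 3), (10, 40)]
tp, fp = classify_predictions(toy_predicted, toy_reference)
print(len(tp), len(fp))

# Counts reported for HRas: 55 TP and 10 FP.
hras_rate = f"{tp_rate(55, 10):.1%}"
print(hras_rate)  # 84.6%
```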
Although more than half of the FP contacts have comparatively low scores, no correlation between prediction state (FP/TP) and CN score can be seen. The Pearson correlation of 0.15 is not significant.
The contact map shows no clear pattern that would allow domains to be identified.
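A Pearson correlation between a binary prediction state and a continuous score can be computed by encoding TP as 1 and FP as 0. The sketch below is a plain implementation of the Pearson coefficient. The encoding, the function name and the toy data are assumptions, not the original calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up example: prediction state (1 = TP, 0 = FP) and CN score per pair.
state = [0, 1, 0, 1]
cn_score = [1.0, 2.0, 1.0, 2.0]
print(pearson(state, cn_score))
```

A coefficient close to 0, as reported here, means that the CN score says little about whether a high-scoring pair is a true or a false positive.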
Hot Spots
The 10 residues with the highest top scores were defined as hot spots. A mutation at one of these residues is expected to have a great influence on the 3D structure of the protein. For some of these residues, SNPs are known. <xr id="hot"/> shows which mutations appear and whether they are disease causing.
The 50 residues with the best CN scores were compared with the 50 hot spots calculated by EVcouplings. The two sets overlap by only 44%, and the overlapping hot spots are ranked very differently by the two programs.
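The comparison of two hot spot rankings can be sketched as below. The dictionaries mapping residue position to score, the function names and the toy values are hypothetical. The overlap is the number of shared residues divided by the list length, and the rank difference shows how far apart the two programs place a shared residue.

```python
def top_n(scores, n):
    """Return the n residue positions with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def compare_hot_spots(scores_a, scores_b, n=50):
    """Return the overlap fraction and the rank difference of shared residues."""
    top_a, top_b = top_n(scores_a, n), top_n(scores_b, n)
    shared = set(top_a) & set(top_b)
    rank_diff = {r: abs(top_a.index(r) - top_b.index(r)) for r in shared}
    return len(shared) / n, rank_diff


# Toy scores for two hypothetical programs (position -> score).
program_a = {10: 9.0, 20: 7.0, 30: 5.0, 40: 1.0}
program_b = {20: 8.0, 50: 6.0, 10: 4.0, 30: 0.5}
fraction, rank_diff = compare_hot_spots(program_a, program_b, n=3)
print(fraction, rank_diff)
```

With lists of 50 residues, an overlap of 44% corresponds to 22 shared residues.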
<figtable id="hot">
Position | Residue | Top Score |
---|---|---|
82 | F | 9.91 |
81 | V | 7.361 |
141 | Y | 6.67 |
143 | E | 6.52 |
115 | G | 6.51 |
40 | Y | 5.50 |
84 | I | 5.40 |
145 | S | 5.25 |
116 | N | 5.01 |
144 | T | 4.90 |
</figtable>
Glucocerebrosidase
The CN scores of all residue pairs range from -0.66 to 4.36. When pairs with a sequence distance of fewer than 5 residues are excluded, the highest CN score drops to 4.00. The histogram of the CN score frequency distribution is very steep because of the large number of pairs with a CN score of 0.
<figtable id="gluco_dist">
residues | minimum | lower quartile | median | upper quartile | maximum |
---|---|---|---|---|---|
all | -0.66 | -0.15 | -0.04 | 0.09 | 4.36 |
filtered | -0.66 | -0.15 | -0.04 | 0.08 | 4.00 |
</figtable>
After filtering out the neighbour contacts, only 0.2% of the residue pairs have CN > 1 and count as high-scoring pairs. These 247 pairs have a TP rate of 39%:
- TP: 97
- FP: 150
The Pearson correlation (0.2) does not reveal a correlation between prediction state (TP/FP) and CN score.
Hot Spots
The 10 best hot spots of glucocerebrosidase predicted by freecontact are listed in <xr id="hot2"/>. All of them lie within (8 residues) or close to (2 residues) a highly conserved region of the T-Coffee alignment 1 (alignment positions 128-177). According to HGMD, three of these hot spots are disease causing. The third hot spot, at position 15 in 1OGS.pdb (position 54 in the UniProt sequence), is particularly sensitive to mutation: replacing this valine with either methionine or leucine causes Gaucher disease. This valine therefore seems to be very important for the structure and function of the protein.
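The two numbering schemes mentioned above differ by 54 - 15 = 39 for this residue. The sketch below converts between them under the assumption, not verified here, that this offset is constant along the whole chain. The constant and the function name are hypothetical.

```python
# Offset derived from the single example in the text (PDB 15 -> UniProt 54).
# Assumption: the offset is the same for every residue of the chain.
PDB_TO_UNIPROT_OFFSET = 54 - 15


def pdb_to_uniprot(pdb_position, offset=PDB_TO_UNIPROT_OFFSET):
    """Map a 1OGS.pdb residue number to the UniProt sequence position."""
    return pdb_position + offset


print(pdb_to_uniprot(15))  # 54
```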
<figtable id="hot2">
Position | Residue | Top Score | Mutation |
---|---|---|---|
25 | S | 12.49 | - |
50 | E | 11.98 | - |
15 | V | 11.64 | dc |
14 | V | 11.47 | - |
4 | C | 11.45 | - |
37 | F | 11.38 | dc |
53 | M | 10.66 | - |
26 | F | 10.62 | - |
16 | C | 10.55 | dc |
22 | Y | 10.33 | - |
</figtable>
Compared with the EVcouplings prediction, the two methods share only 14% of the residues in their best 50 hot spots. The main difference between the 50 hot spots predicted by EVcouplings and by freecontact is their position range. While EVcouplings finds many hot spots at higher positions, most freecontact hot spots lie at positions <100. Only 7 freecontact hot spots have positions >430, and these few residues are exactly the hot spots shared by both programs.