Gaucher Disease: Task 06 - Protein structure prediction from evolutionary sequence variation
Not all predicted contacts are needed to predict structure from sequence. Residues that lie close to each other in the primary structure are automatically in contact because they are direct neighbours in the sequence. Such contacts carry no information and only add noise to the results. We are interested in the contacts that arise from the secondary and tertiary structure of the protein. This information comes from residue pairs that are far apart in the sequence but are predicted to be in contact in space.
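The filtering step described here can be sketched as follows. This is a minimal illustration, not the script used for the analysis: the `ScoredPair` type, the function name and the example values are assumptions, and the threshold of 5 residues is the one used in the sections on HRas and glucocerebrosidase.

```python
from typing import NamedTuple


class ScoredPair(NamedTuple):
    i: int      # sequence position of the first residue
    j: int      # sequence position of the second residue
    cn: float   # CN score of the pair


def filter_by_separation(pairs, min_separation=5):
    """Keep only pairs whose residues are at least `min_separation` apart in sequence."""
    return [p for p in pairs if abs(p.i - p.j) >= min_separation]


# Made-up example values, not taken from the real score file.
example_pairs = [
    ScoredPair(10, 11, 6.0),  # direct neighbours: dropped
    ScoredPair(10, 14, 2.1),  # separation 4: dropped
    ScoredPair(10, 15, 1.3),  # separation 5: kept
    ScoredPair(20, 80, 0.9),  # long-range pair: kept
]
filtered_pairs = filter_by_separation(example_pairs)
print(filtered_pairs)
```

Pairs of sequence neighbours often carry the highest scores, which is why removing them lowers the maximum of the score distribution.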
HRas
The CN scores of all residue pairs range from -0.65 to 6.00 (<xr id="rhas_dist"/>). When only contacts between residues with a sequence distance of at least 5 residues are considered, the maximum CN score decreases to 3.40. The score distributions of both sets:
<figtable id="rhas_dist">
residues | minimum | lower quartile | median | upper quartile | maximum |
---|---|---|---|---|---|
all | -0.65 | -0.23 | -0.11 | 0.04 | 6.00 |
filtered | -0.65 | -0.24 | -0.13 | 0.01 | 3.40 |
</figtable>
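A five-number summary like the one in the table can be computed from a list of CN scores as sketched here. The quartile method is an assumption: the source does not state which tool or interpolation rule produced the table, so values computed this way may differ slightly. The function name and the example scores are hypothetical.

```python
import statistics


def five_number_summary(scores):
    """Return minimum, quartiles and maximum of a list of scores."""
    # method="inclusive" treats the data as the full population;
    # other tools may use a different interpolation rule.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "minimum": min(scores),
        "lower quartile": q1,
        "median": median,
        "upper quartile": q3,
        "maximum": max(scores),
    }


# Made-up example scores, not taken from the real score file.
example_scores = [-0.5, -0.2, -0.1, 0.0, 0.3, 1.2, 4.0]
summary = five_number_summary(example_scores)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```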
Residue pairs with CN > 1 are defined as high-scoring pairs and are predicted to be in contact. Only about 5% of all residue pairs score high enough to count as contacts. In the filtered set this applies to less than 1%.
The high-scoring pairs were compared to the contacts of HRas documented in the PDB. The 65 predicted contacts have a TP rate of 84.6% (55 of 65) and could be classified into
- TP: 55
- FP: 10
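The classification into TP and FP and the resulting rate can be sketched as below. This is an illustration under assumptions: contacts are represented as position pairs, the reference set stands for the contacts derived from the PDB structure, and all names and toy pairs are hypothetical. Only the counts 55 and 10 come from the results above.

```python
def classify_predictions(predicted, reference):
    """Split predicted contacts into true and false positives."""
    # Sort each pair so that (i, j) and (j, i) are treated as the same contact.
    reference_set = {tuple(sorted(pair)) for pair in reference}
    true_pos = [p for p in predicted if tuple(sorted(p)) in reference_set]
    false_pos = [p for p in predicted if tuple(sorted(p)) not in reference_set]
    return true_pos, false_pos


def tp_rate(n_tp, n_fp):
    """Fraction of predicted contacts that are confirmed by the reference."""
    return n_tp / (n_tp + n_fp)


# Toy pairs for illustration only.
toy_predicted = [(10, 15), (80, 20), (30, 90)]
toy_reference = [(10, 15), (20, 80)]
toy_tp, toy_fp = classify_predictions(toy_predicted, toy_reference)

# Counts reported for HRas.
hras_rate = tp_rate(55, 10)
print(f"{hras_rate:.1%}")  # 84.6%
```

The rate computed here is the share of predictions that are correct, which is what the 84.6% refers to.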
Although more than half of the FP contacts have comparatively low scores, no correlation between the TP/FP classification and the CN score can be seen. The Pearson correlation coefficient is only 0.15, which is not significant.
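One way to compute such a correlation is to encode each predicted contact as 1 (TP) or 0 (FP) and correlate this label with the CN score. The encoding is an assumption, since the source does not describe how the classification entered the calculation, and the example values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up CN scores and labels (1 = TP, 0 = FP), for illustration only.
cn_scores = [3.4, 2.8, 2.1, 1.6, 1.3, 1.1]
is_true_positive = [1, 1, 0, 1, 1, 0]
r = pearson(cn_scores, is_true_positive)
print(f"r = {r:.2f}")
```

A coefficient close to 0 means that a higher CN score does not make a predicted contact noticeably more likely to be a true positive.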
On the contact map, a significant pattern for a domain identification could not be observed.
Hot Spots
The 10 residues with the highest top score were defined as hot spots. A mutation at these residues is expected to have a great influence on the 3D structure of the protein. For some of these residues SNPs are known. <xr id="hot"/> shows which mutations occur and whether they are disease causing.
The 50 residues with the best CN scores were compared to the 50 hot spots calculated by EVcouplings. The two sets overlap by only 44%, and the overlapping hot spots were ranked very differently by the two programs.
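The comparison of two hot-spot rankings can be sketched as follows. The function name is hypothetical, the first list holds the five top-ranked positions from the HRas hot-spot table, and the second list is a made-up stand-in, not actual EVcouplings output.

```python
def compare_hotspots(ranked_a, ranked_b):
    """Return the overlap fraction and the rank pairs of the shared positions."""
    rank_a = {pos: rank for rank, pos in enumerate(ranked_a, start=1)}
    rank_b = {pos: rank for rank, pos in enumerate(ranked_b, start=1)}
    shared = sorted(set(rank_a) & set(rank_b))
    overlap = len(shared) / len(rank_a)
    # For every shared position: (rank in first list, rank in second list).
    ranks = {pos: (rank_a[pos], rank_b[pos]) for pos in shared}
    return overlap, ranks


own_ranking = [82, 81, 141, 143, 115]   # five top-ranked HRas positions
other_ranking = [143, 12, 82, 60, 117]  # made-up second ranking
overlap, ranks = compare_hotspots(own_ranking, other_ranking)
print(f"overlap: {overlap:.0%}")
print(ranks)
```

Returning the rank pairs next to the overlap makes it easy to see how differently the shared positions are ranked by the two methods.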
<figtable id="hot">
Position | Residue | Top Score | Mutation |
---|---|---|---|
82 | F | 9.91 | |
81 | V | 7.361 | |
141 | Y | 6.67 | |
143 | E | 6.52 | |
115 | G | 6.51 | |
40 | Y | 5.50 | |
84 | I | 5.40 | |
145 | S | 5.25 | |
116 | N | 5.01 | |
144 | T | 4.90 | |
</figtable>
Glucocerebrosidase
The CN scores of all residue pairs range from -0.66 to 4.36. Excluding contacts between residues with a sequence distance of less than 5 residues lowers the highest CN score to 4.00. The histogram of the CN score frequency distribution is very steep because of the large number of contacts with a CN score of 0.
<figtable id="gluco_dist">
residues | minimum | lower quartile | median | upper quartile | maximum |
---|---|---|---|---|---|
all | -0.66 | -0.15 | -0.04 | 0.09 | 4.36 |
filtered | -0.66 | -0.15 | -0.04 | 0.08 | 4.00 |
</figtable>
Af
Pearson correlation between TP/FP and CN score: 0.20 (dif 29)
The 247 predicted contacts could be classified into
- TP: 97
- FP: 150
Overlap of the hot spots with EVcouplings: 14%
<figtable id="hot2">
Position | Residue | Top Score | Mutation |
---|---|---|---|
64 | S | 12.492447 | |
89 | E | 11.975236 | |
54 | V | 11.643162 | |
53 | V | 11.468116 | |
43 | C | 11.451568 | |
76 | F | 11.379880 | |
92 | M | 10.660677 | |
65 | F | 10.616400 | |
55 | C | 10.553043 | |
61 | Y | 10.328963 | |
</figtable>