Task 6: Protein structure prediction from evolutionary sequence variation
Contact Prediction
Contact prediction tries to determine, from sequence alone, which residues are close in 3D space when the protein is folded. In this exercise, contacts were predicted for three domains using the freecontact tool (unpublished) with a statistical method developed by Marks et al. <ref name="EVfold_method" >Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. (2011) Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE 6(12): e28766. doi:10.1371/journal.pone.0028766 </ref>. Two proteins were considered: the human Ras protein, which has only one domain, and the human HFE protein, which has an MHC I domain and an Ig constant domain.
<css> table.colBasic2 { margin-left: auto; margin-right: auto; border: 1px solid black; border-collapse:collapse; width: 40%; }
.colBasic2 th,td { padding: 3px; border: 1px solid black; }
.colBasic2 td { text-align:left; }
/* for orange try #ff7f00 and #ffaa56 for blue try #005fbf and #aad4ff
maria's style blue: #adceff grey: #efefef
*/
.colBasic2 tr th { background-color:#efefef; color: black;} .colBasic2 tr:first-child th { background-color:#adceff; color:black;}
table.basic2 { margin-left: auto; margin-right: auto; border: 0px solid black; border-collapse:collapse; width: 40%; }
.basic2 th,td { padding: 3px; border: 1px solid black; }
.basic2 th { border-bottom: 2px solid black; background-color: #fff; }
.basic2 td { text-align:left; }
.basic2 tr:first-child th {
border-top: 0;
} .basic2 tr:last-child td {
border-bottom: 0;
} .basic2 tr td:first-child, .basic2 tr th:first-child {
border-left: 0;
} .basic2 tr td:last-child, .basic2 tr th:last-child {
border-right: 0;
}
.ftable { border-collapse:collapse } .ftable td { border-style: none }
.ftable2 { border-collapse:collapse } .ftable2 td {border-collapse: collapse } </css>
<figure id="hfe_score_dist" >
</figure>
Residue pairs that are close in sequence rank among the highest scoring, because residues that are close in sequence are necessarily also close in structure. Consequently, they are evolutionarily coupled and show covariation in the multiple sequence alignment. Distant pairs, i.e. those at least five residues apart in sequence, are more interesting for structure prediction because they carry more information about the overall topology of the protein: they reduce the space of possible conformations more than pairs that are close in sequence.
The corrected norm (CN) score distributions for all three domains are shown in <xr id="hfe_score_dist" />. For the MHC I domain, the values range from -0.94 to 2.57 with a mean of -0.07. The distribution resembles a slightly right-skewed normal distribution in which most values lie between -1 and 1; only 0.5% of the scores are above 1. The distributions for the other two domains are very similar, with a mean around 0 and most values between -1 and 1. Scores greater than 1 can therefore be considered high scoring.
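The selection described above (a sequence separation of at least five residues and a CN score above 1) can be sketched as follows. This is a minimal illustration, not the script used in this exercise: the `ScoredPair` type, the function name and the example scores are hypothetical, and the pairs are assumed to have already been parsed from the freecontact output.

```python
from typing import NamedTuple


class ScoredPair(NamedTuple):
    i: int        # sequence position of the first residue
    j: int        # sequence position of the second residue
    score: float  # corrected norm (CN) score of the pair


def high_scoring_pairs(pairs, min_separation=5, min_score=1.0):
    """Keep pairs with |i - j| >= min_separation and score > min_score, best first."""
    selected = [
        p for p in pairs
        if abs(p.i - p.j) >= min_separation and p.score > min_score
    ]
    return sorted(selected, key=lambda p: p.score, reverse=True)


# Made-up example values, for illustration only.
pairs = [
    ScoredPair(10, 12, 2.3),  # too close in sequence, dropped
    ScoredPair(10, 57, 1.4),  # kept
    ScoredPair(5, 80, 0.3),   # score not above 1, dropped
    ScoredPair(22, 27, 1.9),  # exactly five residues apart, kept
]

for p in high_scoring_pairs(pairs):
    print(p.i, p.j, p.score)
```

Both thresholds are exposed as parameters so that a different separation or score cutoff can be tried without changing the function.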
<figtable id="tab_domains">
domain | length | sequences in alignment | reference | HS pairs | TP | FP | TP-rate |
---|---|---|---|---|---|---|---|
Ras | 160 | 21151 | 5P21 | 65 | 53 | 12 | 0.82 |
MHC I | 174 | 25167 | 1A6Z_A | 69 | 29 | 40 | 0.42 |
Ig C1-set | 76 | 16509 | 1A6Z_A | 15 | 9 | 6 | 0.6 |
</figtable> <figure id="cm_hras">
</figure> <figure id="cm_1a6z">
</figure>
The TP-rate among the high-scoring pairs varies greatly, as shown in <xr id="tab_domains" />: for the three domains under consideration it ranges from 0.82 for Ras to 0.42 for the MHC I domain. The corresponding contact maps (<xr id="cm_hras" /> and <xr id="cm_1a6z" />) show that the MHC I domain has more FP predictions involving residues that are very distant in sequence than the Ras and Ig domains, which might cause problems for structure prediction. Since the Pfam alignments of all three domains contain more than 15 000 sequences, the TP-rate evidently depends not only on the number of sequences in the alignment but also on the particular sequence at hand. As discussed in <ref name="EVfold_method" />, possible confounding factors could be phylogenetic bias or functional constraints from interactions with other molecules.
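As a rough sketch of how the TP, FP and TP-rate columns can be derived, the function below compares a list of predicted pairs with a set of reference contacts. How the reference contacts were extracted from the crystal structures (atom choice, distance cutoff) is not stated here, so they are simply taken as input; the function name and the example pairs are hypothetical.

```python
def tp_rate(predicted, true_contacts):
    """Return (TP, FP, TP-rate) for predicted pairs against reference contacts."""
    # Treat (i, j) and (j, i) as the same residue pair.
    reference_set = {tuple(sorted(p)) for p in true_contacts}
    predicted_set = {tuple(sorted(p)) for p in predicted}
    tp = len(predicted_set & reference_set)
    fp = len(predicted_set - reference_set)
    rate = tp / (tp + fp) if predicted_set else 0.0
    return tp, fp, rate


# Made-up example: two of the four predictions are found in the reference.
predicted = [(3, 40), (10, 90), (60, 15), (20, 70)]
reference = [(40, 3), (15, 60), (5, 33)]

tp, fp, rate = tp_rate(predicted, reference)
print(f"TP={tp} FP={fp} TP-rate={rate:.2f}")
```

The TP-rate here is TP / (TP + FP), which matches the table: for Ras, 53 / (53 + 12) rounds to 0.82.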
<figure id="score_correlation">
</figure>
The CN score separates TP from FP contacts only weakly, as indicated by a Pearson correlation coefficient of 0.35. Although scores above 1.5 are exclusively true positives, the score distributions of TP and FP overlap considerably (<xr id="score_correlation" />).
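One way to obtain such a coefficient is to correlate the score of each pair with a 0/1 label that marks whether the pair is a true contact (a point-biserial correlation, which is Pearson's r with one binary variable). The sketch below assumes this setup; the exact set of pairs that entered the calculation above is not specified, and the example values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up CN scores and labels (1 = true positive, 0 = false positive).
scores = [2.1, 1.7, 1.3, 1.2, 1.1, 1.05]
is_tp = [1, 1, 0, 1, 0, 0]

print(f"r = {pearson(scores, is_tp):.2f}")
```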
Evolutionary Hotspots
<figtable id="ev_hotspot">
MHC I | | | | | Ig C1-set | | | | |
---|---|---|---|---|---|---|---|---|---|
AA | Pos. | Norm. Score | SNP | Conservation | AA | Pos. | Norm. Score | SNP | Conservation |
V | 59 | 9.97 | Val->Met (DC) | 2 | C | 225 | 9.09 | - | 10 |
C | 124 | 8.20 | - | 2 | W | 239 | 6.53 | - | 10 |
L | 33 | 7.08 | - | 2 | C | 282 | 6.04 | Cys->Tyr (DC) | 2 |
Y | 140 | 6.96 | - | 3 | H | 286 | 5.66 | - | 5 |
V | 120 | 5.96 | - | 0 | P | 232 | 5.16 | - | 10 |
S | 27 | 5.64 | - | 0 | I | 235 | 4.40 | - | 9 |
L | 91 | 5.54 | - | 0 | M | 237 | 3.94 | - | 9 |
A | 37 | 5.00 | Ala->Val (nDC) | 0 | I | 268 | 3.91 | - | 1 |
L | 44 | 4.99 | - | 0 | W | 267 | 3.51 | - | 0 |
G | 51 | 4.71 | - | 3 | D | 261 | 3.10 | - | 8 |
</figtable>
The ten highest scoring residues for the MHC I and the Ig domain are listed in <xr id="ev_hotspot" />. Only two of the top ten residues of the MHC I domain are associated with a SNP, but it is remarkable that the top scoring residue is associated with a disease-causing (DC) SNP. The conservation of the listed positions in this domain is also quite low.
For the Ig domain, it is likewise remarkable that the position of the most common hemochromatosis-causing SNP (C282Y) is among the top three sites. The cysteines at positions 225 and 282 form a disulfide bond that is essential for the stability of the domain, which explains the high conservation at position 225. The low conservation at position 282, on the other hand, might be caused by the high frequency of C282Y; alternatively, the binding partner of C225 may simply be at a different position in the other proteins of the family. Of course, this method is not exhaustive, i.e. it cannot identify all functionally essential residues: position 63, for example, the site of the second most common disease-causing mutation, is missing. It can nevertheless give a good hint as to which sites are functionally important.
Comparison of DI and CN scoring for contact prediction
When comparing the top 50 DI and CN scores for HRas, only 20 of the top 50 scoring pairs overlapped (40%). The two scoring methods thus yield very different results. Either the better scoring method has to be determined, or the two methods could be combined. For example, the top n/2 results of each method (where n is the desired number of contacts) could be used to obtain a sufficient number of contacts for structure prediction.
Unfortunately, EVcouplings and EVfold did not work for the HFE protein [[1]].
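The overlap calculation and the suggested combination of both rankings might look like the sketch below. It assumes that each method's output is available as a mapping from residue pair to score; all function names and example scores are hypothetical.

```python
def top_k(scores, k):
    """Return the k highest-scoring pairs from a {pair: score} mapping."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:k]]


def overlap_fraction(scores_a, scores_b, k):
    """Fraction of the top-k pairs shared by both scoring methods."""
    shared = set(top_k(scores_a, k)) & set(top_k(scores_b, k))
    return len(shared) / k


def combine_top_half(scores_a, scores_b, n):
    """Take the top n/2 pairs of each method, dropping duplicates."""
    half = n // 2
    combined, seen = [], set()
    for pair in top_k(scores_a, half) + top_k(scores_b, half):
        if pair not in seen:
            seen.add(pair)
            combined.append(pair)
    return combined


# Made-up scores for four residue pairs.
di = {(1, 20): 0.9, (5, 40): 0.8, (7, 60): 0.7, (9, 30): 0.1}
cn = {(1, 20): 2.0, (9, 30): 1.8, (5, 40): 0.2, (7, 60): 0.1}

print(overlap_fraction(di, cn, 2))
print(combine_top_half(di, cn, 4))
```

Because pairs shared by both top lists are counted only once, the combined list can contain fewer than n contacts; with a 40% overlap, as observed for HRas, this effect is not negligible.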
Structural Models
The Ras structures were calculated with three different numbers of constraints relative to the sequence length: 76 (40%), 123 (65%) and 189 (100%). In addition, two different methods were used to score the predicted contacts: direct information (DI) and pseudo-likelihood maximization (PLM).
<figure id="rmsd_evfold">
</figure>
<figure id="ras_fold_pymol">
</figure>
The distribution of the Cα-RMSDs for the different combinations of restraints and scoring method (<xr id="rmsd_evfold"/>) shows two things: first, PLM scoring produces better results than DI scoring, and second, the quality of the models declines as the number of restraints increases. Visual inspection of the two models with the best RMSD (<xr id="ras_fold_pymol" />) also shows that some elements are placed closer to the crystal structure with PLM than with DI scoring.
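For reference, a Cα-RMSD after optimal superposition can be computed with the Kabsch algorithm, as in the sketch below. This is not necessarily how the values above were obtained (the tool used for the RMSD calculation is not stated); it assumes NumPy is available and that the Cα coordinates of model and crystal structure have already been extracted and matched residue by residue. The function name and coordinates are hypothetical.

```python
import numpy as np


def kabsch_rmsd(model, reference):
    """RMSD between two (N, 3) coordinate arrays after optimal superposition."""
    P = np.asarray(model, dtype=float)
    Q = np.asarray(reference, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("both inputs must have shape (N, 3)")
    # Remove the translation by centring both coordinate sets.
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # The optimal rotation follows from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(P0.T @ Q0)
    # Flip one axis if needed so the result is a proper rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = P0 @ R.T - Q0
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))


# Made-up coordinates: the "reference" is the model rotated by 90 degrees
# about the z axis and shifted, so the RMSD should be numerically zero.
model = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0],
                  [0.0, 1.5, 1.0], [0.5, 0.2, 2.0]])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
reference = model @ rot_z.T + np.array([1.0, 2.0, 3.0])

print(kabsch_rmsd(model, reference) < 1e-8)
```

The reflection check matters for contact-based models in particular, since a set of distance restraints is satisfied equally well by a structure and its mirror image.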
<figtable id="ras_fold_cm">
 | 76 EC constraints | 123 EC constraints | 189 EC constraints |
---|---|---|---|
DI scoring | | | |
PLM scoring | | | |
CN scoring | | | |
</figtable>
The reason for this observation becomes clear from the corresponding contact maps (<xr id="ras_fold_cm"/>). The rank of the TP-rate for a given combination of constraint number and scoring method almost always matches the rank of the corresponding best-RMSD structure. The only exception is the combination of 76 ECs with DI scoring (76/DI), which yields a worse structure than 189 ECs with PLM scoring (189/PLM) although it has a slightly higher TP-rate. This can be explained by the different locations of the false positive contacts. For 189/PLM, the false contacts are mostly located at the beginning and the end of the sequence, and almost none of them involve residues far apart in sequence. For 76/DI, on the other hand, the FP contacts are distributed more evenly over the sequence and also involve more distant residue pairs. The only drawback of PLM scoring is that it requires considerably more computation time than DI scoring. The TP-rates for CN scoring are up to 10% better than those for PLM scoring, which promises even better structures.
In conclusion, the TP-rate of the predicted contacts is crucial for obtaining a high-quality model, and the number of FP contacts between residues far apart in sequence should be as low as possible. If computation time is not an issue, PLM scoring should be preferred over DI scoring.
References
<references />