Task 6: Protein structure prediction from evolutionary sequence variation
Contact Prediction
<css> table.colBasic2 { margin-left: auto; margin-right: auto; border: 1px solid black; border-collapse:collapse; width: 40%; }
.colBasic2 th,td { padding: 3px; border: 1px solid black; }
.colBasic2 td { text-align:left; }
/* for orange try #ff7f00 and #ffaa56 for blue try #005fbf and #aad4ff
maria's style blue: #adceff grey: #efefef
*/
.colBasic2 tr th { background-color:#efefef; color: black;} .colBasic2 tr:first-child th { background-color:#adceff; color:black;}
table.basic2 { margin-left: auto; margin-right: auto; border: 0px solid black; border-collapse:collapse; width: 40%; }
.basic2 th,td { padding: 3px; border: 1px solid black; }
.basic2 th { border-bottom: 2px solid black; background-color: #fff; }
.basic2 td { text-align:left; }
.basic2 tr:first-child th {
border-top: 0;
} .basic2 tr:last-child td {
border-bottom: 0;
} .basic2 tr td:first-child, .basic2 tr th:first-child {
border-left: 0;
} .basic2 tr td:last-child, .basic2 tr th:last-child {
border-right: 0;
} </css>
<figtable id="hfe_score_dist" >
</figtable>
domain | length | seq. in alignment | reference | HS pairs | TP | FP | TP rate |
---|---|---|---|---|---|---|---|
Ras | 160 | 21151 | 5P21 | 65 | 53 | 12 | 0.82 |
MHC I | 174 | 25167 | 1A6Z_A | 69 | 29 | 40 | 0.42 |
Ig C1-set | 76 | 16509 | 1A6Z_A | 15 | 9 | 6 | 0.60 |
<figure id="cm_1a6z">
</figure> <figure id="cm_hras">
</figure>
- Why are the scores of residues close in sequence amongst the highest? Why are the pairs distant in sequence (n>5) more interesting for structure prediction?
Residues that are close in sequence are necessarily close in structure as well, because they are linked through the backbone. Consequently, they are evolutionarily coupled and show covariation in the multiple sequence alignment. Pairs that are more than five residues apart in sequence are more interesting for structure prediction because they carry more information about the overall topology of the protein, i.e. they reduce the space of possible protein conformations more than pairs that are close in sequence.
- Look at the values, range and distribution of scores.
For the MHC I domain of HFE_HUMAN, the score distribution is shown in table <xr id="hfe_score_dist" />. The scores range from -0.94 to 2.57 with a mean of -0.07. The distribution resembles a slightly right-skewed normal distribution, with most values between -1 and 1. Only 0.5% of the scores exceed 1, so scores greater than 1 can be considered high scoring.
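The summary statistics above could be computed along the following lines. This is a minimal sketch: the function name `summarize_scores` and the toy score list are hypothetical and do not reproduce the actual HFE_HUMAN scores; only the threshold of 1 is taken from the text.

```python
from statistics import mean


def summarize_scores(scores, threshold=1.0):
    """Return min, max, mean and the fraction of scores above `threshold`."""
    n_high = sum(1 for s in scores if s > threshold)
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
        "frac_high": n_high / len(scores),
    }


# Toy values for illustration only (not the real score file).
toy_scores = [-0.94, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 2.57]
summary = summarize_scores(toy_scores)
print(summary)
```

On the real data, `frac_high` would be the share of pairs treated as high scoring (0.5% according to the text).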
- How many of the high-scoring pairs are true or false positives? Does this correlate with the value of the score? Visualize the predicted contacts together with the crystal structure contacts in a contact map plot.
Table <xr id="hs_table" /> shows that the TP rate ranges from 0.82 for Ras to 0.42 for the MHC I domain. Since the multiple sequence alignment contains more than 15 000 sequences for all three domains, the TP rate among the high-scoring pairs evidently depends not only on the number of sequences in the alignment but also on the protein at hand. As discussed in <ref name="EVfold_method"> Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. (2011) Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE 6(12): e28766. doi:10.1371/journal.pone.0028766 </ref>, possible confounding factors could be phylogenetic bias or functional constraints from interactions with other molecules.
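Counting true and false positives among the high-scoring pairs might look like the sketch below. It assumes the crystal structure contacts are already available as a set of residue index pairs; how those contacts are derived (e.g. the distance cutoff) is not stated here and is left out. The function name `evaluate_pairs` and the toy pairs are hypothetical; `min_separation=6` encodes the n>5 sequence separation from the question above.

```python
def evaluate_pairs(predicted, reference_contacts, min_separation=6):
    """Count true and false positives among predicted residue pairs.

    predicted: iterable of (i, j) residue index pairs
    reference_contacts: set of (i, j) pairs with i < j seen in the structure
    Pairs with |i - j| < min_separation are skipped.
    """
    tp = fp = 0
    for i, j in predicted:
        if abs(i - j) < min_separation:
            continue  # too close in sequence to be informative
        pair = (min(i, j), max(i, j))  # normalise the pair orientation
        if pair in reference_contacts:
            tp += 1
        else:
            fp += 1
    total = tp + fp
    rate = tp / total if total else 0.0
    return tp, fp, rate


# Toy data for illustration only.
reference = {(1, 10), (5, 20)}
predicted = [(10, 1), (5, 20), (3, 30), (2, 4)]
tp, fp, rate = evaluate_pairs(predicted, reference)
print(tp, fp, round(rate, 2))
```

The same ratio TP / (TP + FP) gives the rates in the table, e.g. 53 / (53 + 12) for Ras.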
<figure id="score_correlation">
</figure>
The CN score correlates only weakly with whether a contact is a true or a false positive, as indicated by a Pearson correlation coefficient of 0.354 and the overlap between the boxplots in figure <xr id="score_correlation" />.
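A Pearson correlation between a continuous score and a binary TP/FP label can be computed as sketched below. Encoding TP as 1 and FP as 0 is an assumption about how the coefficient was obtained; the helper `pearson` and the toy data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data for illustration only: scores and labels (1 = TP, 0 = FP).
toy_scores = [1.1, 1.2, 1.3, 1.4]
toy_labels = [0, 1, 0, 1]
r = pearson(toy_scores, toy_labels)
print(round(r, 3))
```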
- Can you determine evolutionary hot spots, i.e. functionally important residues? Compare to conserved sites in the MSA. Compare with your results from task 7 (when you are working on task 7, i.e. this is a task for the future).
MHC | | | | | Ig C1-set | | | | |
---|---|---|---|---|---|---|---|---|---|
AA | Pos. | Norm. Score | SNP | Conservation | AA | Pos. | Norm. Score | SNP | Conservation |
V | 59 | 9.97 | Val->Met (DC) | 2 | C | 225 | 9.09 | - | 10 |
C | 124 | 8.20 | - | 2 | W | 239 | 6.53 | - | 10 |
L | 33 | 7.08 | - | 2 | C | 282 | 6.04 | Cys->Tyr (DC) | 2 |
Y | 140 | 6.96 | - | 3 | H | 286 | 5.66 | - | 5 |
V | 120 | 5.96 | - | 0 | P | 232 | 5.16 | - | 10 |
S | 27 | 5.64 | - | 0 | I | 235 | 4.40 | - | 9 |
L | 91 | 5.54 | - | 0 | M | 237 | 3.94 | - | 9 |
A | 37 | 5.00 | Ala->Val (nDC) | 0 | I | 268 | 3.91 | - | 1 |
L | 44 | 4.99 | - | 0 | W | 267 | 3.51 | - | 0 |
G | 51 | 4.71 | - | 3 | D | 261 | 3.10 | - | 8 |
- Here, the DI score is given. Compare the top 50 DI and CN (from freecontact) scores. How large is the overlap (>80%)?
For RAS_HUMAN, only 20 (40%) of the top 50 scoring pairs overlapped.
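The overlap between two top-50 lists can be determined with a set intersection, as in the following sketch. It assumes both score sets are available as dictionaries keyed by residue pairs in the same orientation; the function name `top_n_overlap` and the toy scores are hypothetical.

```python
def top_n_overlap(scores_a, scores_b, n=50):
    """Return the number and fraction of pairs shared by both top-n lists.

    scores_a, scores_b: dicts mapping (i, j) residue pairs to a score.
    """
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:n])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:n])
    shared = top_a & top_b
    return len(shared), len(shared) / n


# Toy data for illustration only.
di = {(1, 10): 0.9, (2, 20): 0.8, (3, 30): 0.1}
cn = {(1, 10): 2.0, (3, 30): 1.5, (2, 20): 0.2}
n_shared, frac_shared = top_n_overlap(di, cn, n=2)
print(n_shared, frac_shared)
```

With n=50 and 20 shared pairs, the fraction is 20 / 50 = 0.4, matching the 40% reported for RAS_HUMAN.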
imm EVFOLD
Structural Models
The structural models for Ras were calculated with three different numbers of evolutionary constraints: 76 (40%), 123 (65%) and 189 (100%).
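The three constraint counts are consistent with taking fixed fractions of the 189 available constraints and rounding to the nearest integer. The sketch below illustrates this arithmetic; the rounding rule and the function name `constraint_counts` are assumptions, not a description of how EVfold selects constraints.

```python
def constraint_counts(n_total, fractions):
    """Number of top-ranked constraints kept for each fraction (rounded)."""
    return [round(n_total * f) for f in fractions]


counts = constraint_counts(189, [0.40, 0.65, 1.00])
print(counts)
```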
<figtable id="ras_fold_cm">
 | 76 EC constraints | 123 EC constraints | 189 EC constraints |
---|---|---|---|
DI scoring | | | |
PLM scoring | | | |
</figtable>
<figure id="rmsd_evfold">
</figure>
<figtable id="ras_fold_pymol">
</figtable>
- The quality of the structure depends strongly on the number of false positive contacts: more FP contacts mean a worse structure.
- PLM scoring yields fewer FPs than DI scoring but takes considerably more computation time.
Discussion
<references />