Difference between revisions of "Task 6: Protein structure prediction from evolutionary sequence variation"

From Bioinformatikpedia
Line 99: Line 99:
 
</figure>
 
</figure>
   
*Why are the scores of residues close in sequence amongst the highest? Why are the pairs distant in sequence (n>5) more interesting for structure prediction?
+
The residue pairs close in sequence rank among the highest scoring, because it lies in the nature of proteins, that residues that are close in sequence, are also close in structure. Consequently, they are evolutionary coupled and show covariation in the multiple sequence alignment.
  +
The distant pairs that are at least five residues apart in sequence, are more interesting for structure prediction, because they contain more information about the overall topology of the protein, i.e. they reduce the space of possible protein conformations more than pairs that are close in sequence.
It lies in the nature of proteins, that residues that are close in sequence, are also close in structure. Consequently, they are evolutionary coupled and show covariation in the multiple sequence alignment.
 
  +
The pairs that are at least five residues apart in sequence, are more interesting for structure prediction, because they contain more information about the overall topology of the protein, i.e. they reduce the space of possible protein conformations more than pairs that are close in sequence.
 
   
*Look at the values, range and distribution of scores.
 
 
For the MHC I domain of HFE_HUMAN, the score distribution is shown in table <xr id="hfe_score_dist" />. The values range from -0.94 to 2.57 with the mean at -0.07.
 
For the MHC I domain of HFE_HUMAN, the score distribution is shown in table <xr id="hfe_score_dist" />. The values range from -0.94 to 2.57 with the mean at -0.07.
 
The score distribution corresponds to a slightly right skewed normal distribution, where most values are in the range of -1 to 1. Only 0.5% of scores have a value above 1.
 
The score distribution corresponds to a slightly right skewed normal distribution, where most values are in the range of -1 to 1. Only 0.5% of scores have a value above 1.
Line 109: Line 108:
   
   
  +
*How many of the high-scoring pairs are true or false positives? Does this correlate with the value of the score? Visualize the predicted contacts together with the crystal structure contacts in a contact map plot.
 
Table <xr id="hs_table"> shows, that TP-rate can range from 0.8 for Ras to 0.4 for the MHC I domain. Since the number of sequences in the multiple alignment is above 15 000 for all three domains, the TP-rate among the high scoring pairs depends not only on the number of sequences in the alignment, but also on the actual sequence at hand.
+
For the high scoring pairs, the TP-rate can range from 0.8 for Ras to 0.4 for the MHC I domain. Since the number of sequences in the multiple alignment is above 15 000 for all three domains, the TP-rate depends not only on the number of sequences in the alignment, but also on the actual sequence at hand.
 
As discussed in <ref name="EVfold_method"> Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. (2011) Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE 6(12): e28766. doi:10.1371/journal.pone.0028766 </ref>
 
As discussed in <ref name="EVfold_method"> Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. (2011) Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE 6(12): e28766. doi:10.1371/journal.pone.0028766 </ref>
 
, possible confounding factors could be pyhlogenetic bias or functional constraints from interactions with other molecules.
 
, possible confounding factors could be pyhlogenetic bias or functional constraints from interactions with other molecules.
Line 118: Line 117:
 
</figure>
 
</figure>
   
The correlation between CN score and TP/FP contact is not very good as indicated by a Pearson correlation coefficient of 0.354 and the overlap between the boxplots in figure <xr id="score_correlation">.
+
The correlation between CN score and TP or FP contacts is not very good as indicated by a Pearson correlation coefficient of 0.35. Although the scores with a value above 1.5, are exclusively true positives, there is a considerable overlap between the score distribution of TP and FP (Figure 5).
 
   
  +
=== Evolutionarly Hotspots ===
   
 
*Can you determine evolutionary hot spots, i.e. functionally important residues? Compare to conserved sites in the MSA. Compare with your results from task 7 (when you are working on task 7, i.e. this is a task for the future).
 
*Can you determine evolutionary hot spots, i.e. functionally important residues? Compare to conserved sites in the MSA. Compare with your results from task 7 (when you are working on task 7, i.e. this is a task for the future).
Line 151: Line 150:
 
|}
 
|}
   
  +
The ten highest scoring residues for the MHC I and the Ig domain are listed in Table 2.
  +
Although only two of the top ten residues of the MHC I domain are associated with a SNP, it is remarkable, that the top scoring residue is associated with a disease causing SNP. The conservation for this domain is also quite low.
   
  +
For the Ig domain, it is also remarkable, that the position of the most common hemochromatosis causing SNP (C282Y) is in the top 3 sites. The cysteins at position 225 and 282 form a disulfide bond that is essential for the stability of the domain, which explains the high conservation at position 225. The low conservation at position 282 on the other hand might be caused by the high frequency of C282Y. Or maybe the binding partner of C225 is just at another position in the other proteins of the family.
  +
Of course, this method is not exhaustive, i.e. cannot identify all functionally essential residues. E.g. position 63, the site of the second most common disease causing mutation, is missing. But it can give a good hint about what the functionally important sites are.
   
*Here, the DI score is given. Compare the top 50 DI and CN (from freecontact) scores. How large is the overlap (>80%)?
 
   
For RAS_HUMAN, only 20(40%) of the top 50 scores overlaped.
 
   
   
  +
When comparing the top 50 DI and CN scores for HRas, only 20(40%) of the top 50 scores overlapped. This just shows that the two scoring methods are quite different and it might be useful to use both methods in order to identify different functionally important sites.
  +
  +
The scores for the HFE protein could unfortunately not be computed [[https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Talk:Task_6_-_Protein_structure_prediction_from_evolutionary_sequence_variation]].
   
   
imm EVFOLD
 
   
 
== Structural Models ==
 
== Structural Models ==
Line 173: Line 176:
 
| || 76 EC constraints || 123 EC constraints || 189 EC constraints
 
| || 76 EC constraints || 123 EC constraints || 189 EC constraints
 
|-
 
|-
| DI scoring || [[File:Ras_di_76.png|center|thumb|300px| 46/76 = 0.61 TP-rate]] || [[File:Ras_di_123.png|center|thumb|300px| 61/123 = 0.5 TP-rate ]] || [[File:Ras_di_189.png|center|thumb|300px| 80/189 = 0.42 TP-rate]]
+
| DI scoring || [[File:Ras_di_76.png|center|thumb|300px| 0.61 TP-rate]] || [[File:Ras_di_123.png|center|thumb|300px| 0.5 TP-rate ]] || [[File:Ras_di_189.png|center|thumb|300px| 0.42 TP-rate]]
 
|-
 
|-
| PLM scoring || [[File:Ras_plm_76.png|center|thumb|300px| 61/76 = 0.80 TP-rate]] || [[File:Ras_plm_123.png|center|thumb|300px| 71/123 = 0.71 TP-rate]] || [[File:Ras_plm_189.png|center|thumb|300px| 111/189 = 0.59 TP-rate]]
+
| PLM scoring || [[File:Ras_plm_76.png|center|thumb|300px| 0.80 TP-rate]] || [[File:Ras_plm_123.png|center|thumb|300px| 0.71 TP-rate]] || [[File:Ras_plm_189.png|center|thumb|300px| 0.59 TP-rate]]
 
|-
 
|-
  +
| CN Scoring || [[File:Ras_cn_76.png|center|thumb|300px| 61/76 = 0.91 TP-rate]] || [[File:Ras_cn_123.png|center|thumb|300px| 71/123 = 0.77 TP-rate]] || [[File:Ras_cn_189.png|center|thumb|300px| 111/189 = 0.63 TP-rate]]
|+ style="caption-side: bottom; text-align: left" |<font size=2>'''Figure 7:''' Contact maps from the EVfold predictions for Ras are shown. For each map, the rate of true positive contact predictions is shown.
 
  +
|-
  +
|+ style="caption-side: bottom; text-align: left" |<font size=2>'''Figure 7:''' Contact maps from the EVfold predictions for Ras are shown. For each map, the rate of true positive contact predictions with respect to the number of constraints used is shown.
 
|}
 
|}
 
</figtable>
 
</figtable>
Line 195: Line 200:
 
</figtable>
 
</figtable>
   
The Ras structures were calculated with three different numbers of constraints relative to the sequence length: 76(40%), 123(65%), 189(100%). Additionally, two different methods were used to score the predicted contacts: Direct Information(DI) and Pseudo-likelihood maximization(PLM). The distribution of the Ca-RMSDs for the different combinations of restraints and scoring method (Figure 6) shows two things. Firstly, PLM scoring produces better results than DI scoring and secondly, the quality of the models declines as the number of restraints increases. The reason for this observation can easily be found when looking at the corresponding contact maps (Figure 7). The rank of the TP-rate for one combination of constraint number and scoring method almost always corresponds to the rank of the corresponding best-RMSD structure. The only exception is the 76 EC and DI (76/DI) scoring combination that yields a worse structure than the 189 PLM (189/PLM) scoring combination although it has a slightly higher TP-rate.
+
The Ras structures were calculated with three different numbers of constraints relative to the sequence length: 76(40%), 123(65%), 189(100%). Additionally, two different methods were used to score the predicted contacts: Direct Information(DI) and Pseudo-likelihood maximization(PLM). The distribution of the Ca-RMSDs for the different combinations of restraints and scoring method (Figure 6) shows two things. Firstly, PLM scoring produces better results than DI scoring and secondly, the quality of the models declines as the number of restraints increases. Also, the visual inspection of the two models with best RMSD (Figure 8) shows, that some elements are placed closer to the crystal structure for PLM than for DI scoring.
  +
The reason for this observation can easily be found when looking at the corresponding contact maps (Figure 7). The rank of the TP-rate for one combination of constraint number and scoring method almost always corresponds to the rank of the corresponding best-RMSD structure. The only exception is the 76 EC and DI (76/DI) scoring combination that yields a worse structure than the 189 PLM (189/PLM) scoring combination although it has a slightly higher TP-rate.
 
This can be explained by the different location of the false positive contacts. For 189/PLM, the false contacts are mostly located at the beginning and the end of the sequence and there are also nearly no FP contacts that have a high distance in the sequence. For 76/DI on the other hand, the FP contacts are more evenly distributed over the sequence and also involve more distant residue pairs.
 
This can be explained by the different location of the false positive contacts. For 189/PLM, the false contacts are mostly located at the beginning and the end of the sequence and there are also nearly no FP contacts that have a high distance in the sequence. For 76/DI on the other hand, the FP contacts are more evenly distributed over the sequence and also involve more distant residue pairs.
The only drawback of PLM scoring is that it requires considerably more computation time than DI scoring.
+
The only drawback of PLM scoring is that it requires considerably more computation time than DI scoring.
  +
When comparing the CN to DI and PLM scoring, the TP-rates are up to 10% better than those for PLM scoring, which promises even better structures.
  +
   
 
In conclusion, in order to obtain a high quality model, the TP-rate of the predicted contacts is crucial and the number of FP contacts that have a high sequence distance should be as low as possible. Also, if time is not an issue, PLM scoring should be preferred over DI scoring.
 
In conclusion, in order to obtain a high quality model, the TP-rate of the predicted contacts is crucial and the number of FP contacts that have a high sequence distance should be as low as possible. Also, if time is not an issue, PLM scoring should be preferred over DI scoring.
TODO:
 
-compare to contact maps with CN scoring
 
76, 123, 189
 
0.9216867 0.8132530 0.6325301
 
   
   
  +
DONE:
 
-quality of model depends on TP-rate of contacts and if FP contacts are distant in sequence and evenly distributed.
 
- quality of structure is highly dependent on number of false positive contacts -> more FP contacts means worse structure
 
- PLM scoring yields less FP than DI scoring but takes considerably more computation time
 
== Discussion ==
 
   
 
<references />
 
<references />

Revision as of 11:43, 29 August 2013

lab journal


Contact Prediction

<css> table.colBasic2 { margin-left: auto; margin-right: auto; border: 1px solid black; border-collapse:collapse; width: 40%; }

.colBasic2 th,td { padding: 3px; border: 1px solid black; }

.colBasic2 td { text-align:left; }

/* for orange try #ff7f00 and #ffaa56 for blue try #005fbf and #aad4ff

maria's style blue: #adceff grey: #efefef

  • /

.colBasic2 tr th { background-color:#efefef; color: black;} .colBasic2 tr:first-child th { background-color:#adceff; color:black;}

table.basic2 { margin-left: auto; margin-right: auto; border: 0px solid black; border-collapse:collapse; width: 40%; }

.basic2 th,td { padding: 3px; border: 1px solid black; }

.basic2 th { border-bottom: 2px solid black; background-color: #fff; }

.basic2 td { text-align:left; }

.basic2 tr:first-child th {

 border-top: 0;

} .basic2 tr:last-child td {

 border-bottom: 0;

} .basic2 tr td:first-child, .basic2 tr th:first-child {

 border-left: 0;

} .basic2 tr td:last-child, .basic2 tr th:last-child {

 border-right: 0;

} </css>


<figtable id="hfe_score_dist" >

Ras
MHC I domain
Ig C1-set domain
Figure 1: CN-score distributions for all three domains.

</figtable>

domain length seq. in alignment reference HS pairs TP FP TP -rate
Ras 160 21151 5P21 65 53 12 0.82
MHC I 174 25167 1A6Z_A 69 29 40 0.42
Ig C1-set 76 16509 1A6Z_A 15 9 6 0.6
Table 1: Overview of the domains of interest.


<figure id="cm_1a6z">

Figure 4: Predicted (red) and reference(grey) contacts of the MHC I domain (left box) and the Ig domain (right box) of the hemochromatosis protein.

</figure> <figure id="cm_hras">

Figure 3: Predicted (red) and reference(grey) contacts of the ras protein.

</figure>

The residue pairs close in sequence rank among the highest scoring, because it lies in the nature of proteins, that residues that are close in sequence, are also close in structure. Consequently, they are evolutionary coupled and show covariation in the multiple sequence alignment. The distant pairs that are at least five residues apart in sequence, are more interesting for structure prediction, because they contain more information about the overall topology of the protein, i.e. they reduce the space of possible protein conformations more than pairs that are close in sequence.


For the MHC I domain of HFE_HUMAN, the score distribution is shown in table <xr id="hfe_score_dist" />. The values range from -0.94 to 2.57 with the mean at -0.07. The score distribution corresponds to a slightly right skewed normal distribution, where most values are in the range of -1 to 1. Only 0.5% of scores have a value above 1. Thus, scores with a value greater than one can be considered as high scoring.


For the high scoring pairs, the TP-rate can range from 0.8 for Ras to 0.4 for the MHC I domain. Since the number of sequences in the multiple alignment is above 15 000 for all three domains, the TP-rate depends not only on the number of sequences in the alignment, but also on the actual sequence at hand. As discussed in <ref name="EVfold_method"> Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. (2011) Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE 6(12): e28766. doi:10.1371/journal.pone.0028766 </ref> , possible confounding factors could be pyhlogenetic bias or functional constraints from interactions with other molecules.

<figure id="score_correlation">

Figure 5:' The distribution of scores for TP and FP predicted contact pairs of MHC I is shown. Note that only high scoring pairs are shown in this plot.

</figure>

The correlation between CN score and TP or FP contacts is not very good as indicated by a Pearson correlation coefficient of 0.35. Although the scores with a value above 1.5, are exclusively true positives, there is a considerable overlap between the score distribution of TP and FP (Figure 5).

Evolutionarly Hotspots

  • Can you determine evolutionary hot spots, i.e. functionally important residues? Compare to conserved sites in the MSA. Compare with your results from task 7 (when you are working on task 7, i.e. this is a task for the future).
MHC Ig C1-set
AA Pos. Norm. Score SNP Conservation AA Pos Norm. Score SNP Conservation
V 59 9.97 Val->Met (DC) 2 C 225 9.09 - 10
C 124 8.20 - 2 W 239 6.53 - 10
L 33 7.08 - 2 C 282 6.04 Cys -> Tyr (DC) 2
Y 140 6.96 - 3 H 286 5.66 - 5
V 120 5.96 - 0 P 232 5.16 - 10
S 27 5.64 - 0 I 235 4.40 - 9
L 91 5.54 - 0 M 237 3.94 - 9
A 37 5.00 Ala->Val (nDC) 0 I 268 3.91 - 1
L 44 4.99 - 0 W 267 3.51 - 0
G 51 4.71 - 3 D 261 3.10 - 8
Table 2: Residues with the top ten normalised scores for the MHC_I and Ig C1-set domains. For each position, it is shown whether there is a known SNP and if the SNP is known to be disease causing(DC) or not (nDC). The conservation in the Pfam alignment is indicated by a value from 0(not conserved) to 10(highly conserved).

The ten highest scoring residues for the MHC I and the Ig domain are listed in Table 2. Although only two of the top ten residues of the MHC I domain are associated with a SNP, it is remarkable, that the top scoring residue is associated with a disease causing SNP. The conservation for this domain is also quite low.

For the Ig domain, it is also remarkable, that the position of the most common hemochromatosis causing SNP (C282Y) is in the top 3 sites. The cysteins at position 225 and 282 form a disulfide bond that is essential for the stability of the domain, which explains the high conservation at position 225. The low conservation at position 282 on the other hand might be caused by the high frequency of C282Y. Or maybe the binding partner of C225 is just at another position in the other proteins of the family. Of course, this method is not exhaustive, i.e. cannot identify all functionally essential residues. E.g. position 63, the site of the second most common disease causing mutation, is missing. But it can give a good hint about what the functionally important sites are.



When comparing the top 50 DI and CN scores for HRas, only 20(40%) of the top 50 scores overlapped. This just shows that the two scoring methods are quite different and it might be useful to use both methods in order to identify different functionally important sites.

The scores for the HFE protein could unfortunately not be computed [[1]].


Structural Models

<figure id="rmsd_evfold">

Figure 6: Comparison of Ca-RMSD values distributions between the different numbers of ECs used. The DI scoring results are shown in green and the PLM scoring results in blue.

</figure>

<figtable id="ras_fold_cm">

76 EC constraints 123 EC constraints 189 EC constraints
DI scoring
0.61 TP-rate
0.5 TP-rate
0.42 TP-rate
PLM scoring
0.80 TP-rate
0.71 TP-rate
0.59 TP-rate
CN Scoring
61/76 = 0.91 TP-rate
71/123 = 0.77 TP-rate
111/189 = 0.63 TP-rate
Figure 7: Contact maps from the EVfold predictions for Ras are shown. For each map, the rate of true positive contact predictions with respect to the number of constraints used is shown.

</figtable>





<figtable id="ras_fold_pymol">

DI scoring, Ca-RMSD=3.75 A
PLM scoring, Ca-RMSD=2.54 A
Figure 8: The ras crystal structure(green,5p21) is shown in comparison to the EVfold models with best RMSD from DI scoring(cyan) and PLM scoring(purple).

</figtable>

The Ras structures were calculated with three different numbers of constraints relative to the sequence length: 76(40%), 123(65%), 189(100%). Additionally, two different methods were used to score the predicted contacts: Direct Information(DI) and Pseudo-likelihood maximization(PLM). The distribution of the Ca-RMSDs for the different combinations of restraints and scoring method (Figure 6) shows two things. Firstly, PLM scoring produces better results than DI scoring and secondly, the quality of the models declines as the number of restraints increases. Also, the visual inspection of the two models with best RMSD (Figure 8) shows, that some elements are placed closer to the crystal structure for PLM than for DI scoring. The reason for this observation can easily be found when looking at the corresponding contact maps (Figure 7). The rank of the TP-rate for one combination of constraint number and scoring method almost always corresponds to the rank of the corresponding best-RMSD structure. The only exception is the 76 EC and DI (76/DI) scoring combination that yields a worse structure than the 189 PLM (189/PLM) scoring combination although it has a slightly higher TP-rate. This can be explained by the different location of the false positive contacts. For 189/PLM, the false contacts are mostly located at the beginning and the end of the sequence and there are also nearly no FP contacts that have a high distance in the sequence. For 76/DI on the other hand, the FP contacts are more evenly distributed over the sequence and also involve more distant residue pairs. The only drawback of PLM scoring is that it requires considerably more computation time than DI scoring. When comparing the CN to DI and PLM scoring, the TP-rates are up to 10% better than those for PLM scoring, which promises even better structures.


In conclusion, in order to obtain a high quality model, the TP-rate of the predicted contacts is crucial and the number of FP contacts that have a high sequence distance should be as low as possible. Also, if time is not an issue, PLM scoring should be preferred over DI scoring.



<references />