Task 6 (MSUD)

From Bioinformatikpedia
Revision as of 15:32, 18 June 2013 by Schillerl


Lab journal



Distribution of CN scores of contacts predicted by freecontact over H-Ras protein family.

Corrected norm contact (CN) scores were collected for residue pairs with more than 5 residues between them. The scores follow a roughly normal distribution: most residue pairs have low scores, and only a few pairs have scores high enough to indicate a significant contact. For the H-Ras protein family, the CN scores range from -0.65 to 5.99.

After fitting a normal distribution to the CN scores (μ = 3.7×10⁻⁴, σ = 0.494), we chose the 98th percentile (≈1.0) as the cutoff for significant contacts. With the R script by Laura, 63 high-scoring contacts were found. Compared with the protein structure 121P, 54 of these contact pairs are classified as true positives. The ten contact pairs with the highest CN scores are:
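The cutoff can be reproduced from the fitted parameters. The following is a minimal Python sketch, not the R script mentioned above; the function name `percentile_cutoff` is hypothetical, and only the values of μ, σ and the percentile are taken from the text.

```python
from statistics import NormalDist, mean, stdev

def percentile_cutoff(scores, percentile=0.98):
    """Fit a normal distribution to the scores and return the score at the given percentile."""
    fitted = NormalDist(mu=mean(scores), sigma=stdev(scores))
    return fitted.inv_cdf(percentile)

# Using the fitted parameters reported above instead of the raw CN scores
reported = NormalDist(mu=3.7e-4, sigma=0.494)
cutoff = reported.inv_cdf(0.98)
print(round(cutoff, 2))  # close to the cutoff of about 1.0 used above
```

With the raw CN scores in a list, `percentile_cutoff(scores)` would perform the fit and the lookup in one step.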

Residue 1  AA 1  Residue 2  AA 2  MI score  CN score  TP/FP
11         A     92         D     0.32      3.40      TP
81         V     116        N     0.24      3.00      TP
87         T     129        Q     0.22      2.68      FP
19         L     81         V     0.15      2.55      TP
82         F     141        Y     0.25      2.53      TP
84         I     115        G     0.16      2.52      TP
82         F     115        G     0.14      2.42      TP
10         G     16         K     0.39      2.24      TP
130        A     141        Y     0.39      2.23      TP
123        R     143        E     0.27      2.20      TP

After computing and sorting the high-scoring residues, we found four hotspots that seem to be responsible for the structural stability of H-Ras: Phe 82, Val 81, Ala 11 and Tyr 40. The residues that contribute to structural stability are marked as follows:


Evolutionarily conserved residue pairs in HRas.

Top-10 residue pairs with high CN scores. In the figure on the right, the most evolutionarily conserved residue pairs are shown in red and orange.

Residue 1  AA 1  Residue 2  AA 2  MI score  CN score
10         G     16         K     0.18957   0.218422
13         G     21         I     0.676722  0.0991369
11         A     92         D     0.427692  0.0868354
117        K     145        S     0.125656  0.0831162
116        N     146        A     0.117421  0.0787835
81         V     116        N     0.150175  0.0769825
82         F     141        Y     0.316719  0.0711629
35         T     60         G     0.082395  0.0629404
130        A     141        Y     0.343109  0.0615978
114        V     155        A     0.254632  0.0612646


With default parameters, EVfold builds models for five different numbers of contacts: L/5 (20%), 2L/5 (40%), 3L/5 (60%), 4L/5 (80%) and L. For each number of constraints, 5 PDB structures are generated. The reliability of the models compared with the X-ray crystal structure of HRas (121P) is as follows:


With an increasing number of constraints, EVfold picks up more hotspots of evolutionarily conserved residue pairs. Compared with the contact map of HRas, however, it still predicts different contact hotspots. The most significant difference is in the region 60-120, where HRas has dense contacts.



From the freecontact output, only residue pairs separated by at least 5 residues were extracted. Pairs that are close in sequence are naturally coupled, but they are not informative for predicting the 3D structure, in which residues that are far apart in the 1D sequence interact. The CN (corrected norm contact) scores of the extracted pairs range from -0.87 to 5.28. The following diagram shows the distribution of the scores:
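The extraction step can be sketched in Python as follows. This is an illustration under assumptions: the rows are assumed to have the six whitespace-separated columns used in the tables on this page (position 1, amino acid 1, position 2, amino acid 2, MI score, CN score), "separated by at least 5 residues" is read as a difference in position of at least 5, and the function names as well as the second example row are hypothetical.

```python
def parse_pairs(lines):
    """Parse rows of the form: pos1 aa1 pos2 aa2 MI CN (whitespace-separated)."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed rows
        pos1, aa1, pos2, aa2, mi, cn = fields
        pairs.append((int(pos1), aa1, int(pos2), aa2, float(mi), float(cn)))
    return pairs

def filter_by_separation(pairs, min_separation=5):
    """Keep pairs whose sequence positions differ by at least min_separation (assumed reading)."""
    return [p for p in pairs if abs(p[2] - p[0]) >= min_separation]

rows = [
    "247 H 291 Y 0.69 5.28",  # taken from the table of high-scoring pairs
    "10 G 12 A 0.50 3.00",    # hypothetical pair that is close in sequence
]
long_range = filter_by_separation(parse_pairs(rows))
cn_scores = [p[5] for p in long_range]
print(long_range, min(cn_scores), max(cn_scores))
```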

MSUD BCKDHA freecontact CN score distribution.png

Most scores lie between -1 and 1, so pairs with a score greater than 1 are considered high scoring and used for further analysis. A high-scoring pair was regarded as a true positive (TP) if the distance between any pair of its atoms in the reference structure is below 5 Å. There are 94 TPs among 194 high-scoring pairs. TP status correlates with the CN score with a coefficient of 0.31, i.e. a pair with a higher CN score is more likely to be a TP. The ten highest-scoring pairs are:
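The TP criterion amounts to checking whether the smallest atom-atom distance between two residues is below 5 Å. A minimal Python sketch, assuming each residue is given as a list of (x, y, z) atom coordinates in Å; the function names and the toy coordinates are hypothetical, and reading the coordinates from the reference structure is not shown.

```python
from math import dist

def min_atom_distance(residue_a, residue_b):
    """Smallest distance between any atom of residue_a and any atom of residue_b."""
    return min(dist(a, b) for a in residue_a for b in residue_b)

def is_true_positive(residue_a, residue_b, cutoff=5.0):
    """A predicted pair counts as TP if any atom pair is closer than the cutoff (in Å)."""
    return min_atom_distance(residue_a, residue_b) < cutoff

# Toy coordinates (hypothetical, not taken from the reference structure)
res_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
res_b = [(5.5, 0.0, 0.0), (9.0, 0.0, 0.0)]
print(is_true_positive(res_a, res_b))  # closest atoms are 4.0 Å apart

# Fraction of TPs among the high-scoring pairs reported above
precision = 94 / 194
print(round(precision, 3))
```

The last two lines show that slightly less than half of the high-scoring pairs are true contacts.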

Residue 1  AA 1  Residue 2  AA 2  MI score  CN score  TP/FP
247        H     291        Y     0.69      5.28      TP
236        F     264        C     0.45      4.02      TP
266        N     327        E     0.31      3.67      TP
239        G     270        A     0.35      3.48      TP
296        I     312        E     0.44      3.37      FP
316        R     324        F     0.60      3.34      TP
144        S     235        Y     0.48      3.17      TP
300        G     330        T     0.27      3.13      TP
261        I     317        A     0.25      3.06      TP
301        N     333        I     0.87      2.77      FP

In a second step, the 300 highest-scoring couplings (300 = alignment length) were taken; the scores were summed for each residue and normalized by the average score of these couplings. The residues with the highest values (with a gap to the others) are Thr 338, Phe 130 and Tyr 158. Interestingly, according to UniProt, Tyr 158 is in the thiamine pyrophosphate binding region, and near position 338 there are some modified residues (phosphoserine).
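The per-residue scoring described above can be sketched in Python as follows. This is an illustration of the described steps, not the script that was actually used; the function name `residue_scores` and the toy couplings are hypothetical.

```python
from collections import defaultdict

def residue_scores(couplings, top_n):
    """couplings: list of (pos1, pos2, score) tuples.

    Takes the top_n highest-scoring couplings, sums the scores per residue
    and divides each sum by the average score of these couplings.
    """
    top = sorted(couplings, key=lambda c: c[2], reverse=True)[:top_n]
    if not top:
        return {}
    average = sum(score for _, _, score in top) / len(top)
    totals = defaultdict(float)
    for pos1, pos2, score in top:
        totals[pos1] += score
        totals[pos2] += score
    return {pos: total / average for pos, total in totals.items()}

# Toy input (hypothetical values); top_n would be 300 in the analysis above
toy = [(1, 10, 4.0), (1, 20, 2.0), (30, 40, 0.5)]
scores = residue_scores(toy, top_n=2)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # residues ordered from highest to lowest normalized score
```

Residues that take part in several strong couplings accumulate the highest values, which is why they stand out with a gap to the others.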


The following table shows the ten residue pairs with highest DI (direct information) scores:

Residue 1  AA 1  Residue 2  AA 2  MI score  DI score
209        L     243        E     0.81      0.22
266        N     327        E     0.64      0.17
247        H     291        Y     0.42      0.13
192        P     200        R     0.61      0.11
296        I     312        E     0.22      0.10
242        S     291        Y     0.34      0.09
190        Q     196        G     0.72      0.09
236        F     264        C     0.39      0.09
250        F     288        G     0.34      0.08
210        A     218        G     0.11      0.08

The pairs that are also in the top ten for freecontact are 266/327, 247/291, 296/312 and 236/264. The evolutionary constraint (EC) hotspots identified by EVcouplings with the highest EC strength (with a gap to the others) are residues Glu 243 and Leu 209. According to UniProt, there is a potassium binding region near residue 209, and according to PDB, 243 is a thiamine pyrophosphate binding residue. Note that the residues identified as important by freecontact are not included in the EVcouplings calculation (probably they are not part of the alignment it generates).
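The overlap between the two top-ten lists can be checked with a set intersection. The following Python sketch uses the residue pairs from the two tables on this page; the variable names are hypothetical.

```python
# Top-ten pairs by CN score (freecontact), from the table above
freecontact_top10 = {
    (247, 291), (236, 264), (266, 327), (239, 270), (296, 312),
    (316, 324), (144, 235), (300, 330), (261, 317), (301, 333),
}

# Top-ten pairs by DI score (EVcouplings), from the table above
evcouplings_top10 = {
    (209, 243), (266, 327), (247, 291), (192, 200), (296, 312),
    (242, 291), (190, 196), (236, 264), (250, 288), (210, 218),
}

shared = sorted(freecontact_top10 & evcouplings_top10)
print(shared)
```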


For every number of constraints, EVfold computes 5 different structure models. The RMSD values of these models compared with the reference structure 1U5B are summarized in the following boxplot (RMSD values taken from the EVfold output):

MSUD BCKDHA boxplot evfold RMSD.png

With 195 constraints (65 % of the alignment length), the models agree better with the known structure than with only 120 constraints (40 %). Using 300 constraints (100 %) does not improve structure prediction compared with 65 %; it even seems to work worse, although the RMSDs of the different models calculated with 300 constraints span a large range.

In general, the RMSD values are very high, indicating that the models are very different to the real structure.

The contact maps show the predicted contacts compared to the real contacts in 1U5B (the calculation of EVfold does not cover the whole sequence):

MSUD BCKDHA fold ContactMap 120.png MSUD BCKDHA fold ContactMap 195.png MSUD BCKDHA fold ContactMap 300.png

With the lowest number of constraints, most predicted contacts are true positives, but some regions that have contacts in the known structure are only covered when more constraints are used. The second and third plots show, however, that using more constraints also yields more false positives (predicted contacts where there is no contact in the real structure). Using 60-70 % of the constraints may therefore be a good compromise between recall and precision.
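The trade-off can be made concrete by computing precision and recall of a predicted contact set against the contacts of the reference structure. A minimal Python sketch with toy data; the function name `precision_recall` and all contact pairs below are hypothetical and not taken from the contact maps.

```python
def precision_recall(predicted, observed):
    """Precision and recall of predicted contacts against observed contacts.

    Pairs are treated as unordered, so (2, 20) and (20, 2) are the same contact.
    """
    predicted = {tuple(sorted(p)) for p in predicted}
    observed = {tuple(sorted(p)) for p in observed}
    true_positives = len(predicted & observed)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(observed) if observed else 0.0
    return precision, recall

# Toy contact sets (hypothetical)
observed = [(1, 10), (2, 20), (3, 30), (4, 40)]
few_constraints = [(1, 10), (2, 20)]
many_constraints = [(1, 10), (20, 2), (3, 30), (5, 50), (6, 60)]

print(precision_recall(few_constraints, observed))
print(precision_recall(many_constraints, observed))
```

In this toy case the smaller set has perfect precision but covers only half of the observed contacts, while the larger set covers more contacts at the cost of false positives.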


  • CN scores are high for residues that are close in sequence, because such residues are necessarily also close in the structure. In the 3D structure, however, residues that are far apart in sequence can also interact. Residue pairs that are separated in sequence but have a high CN score are therefore the ones that help to predict the structure.
  • Residues that are coupled to many other residues with a high score are called evolutionary hotspots. They have an important role for the function or the stabilization of the structure.
  • For the prediction of structure models from evolutionary couplings, it is crucial to use an appropriate number of contacts. If not enough contacts are used, it is not sufficient to define the whole structure, but if too many are used, there will be false positives that negatively influence the quality of the model.
  • Structure models created only with alignments and the analysis of evolutionary couplings are rather approximate and do not reach the quality of homology modelling.