Protein structure prediction from evolutionary sequence variation (Phenylketonuria)

From Bioinformatikpedia
Revision as of 17:26, 15 June 2013 by Waldraffs (talk | contribs) (H-Ras)

Summary

...

Multiple Sequence Alignment

Lab journal
The multiple alignment of Ras (PF00071) was downloaded from Pfam.

For our protein, PAH, two domains are included: the ACT domain and the biopterin domain.

Calculate and analyze correlated mutations

Lab journal

Results

Freecontact is a program that takes a multiple sequence alignment as input and calculates, for each pair of residue positions, the mutual information score (MI) and the corrected norm contact score (CN).

  • MI:...
  • CN:...
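As a rough illustration of what a mutual information score between two alignment columns measures, the following minimal sketch computes the plain, unweighted MI from observed residue frequencies. It is not freecontact's implementation, which may differ (for example in sequence weighting or pseudocounts). The toy alignment and the function name `mutual_information` are hypothetical and only serve the example.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Plain (unweighted) mutual information, in nats, between two alignment columns."""
    n = len(col_i)
    freq_i = Counter(col_i)               # residue counts in column i
    freq_j = Counter(col_j)               # residue counts in column j
    freq_ij = Counter(zip(col_i, col_j))  # joint counts of residue pairs
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


# Hypothetical toy alignment (not taken from Pfam): four sequences, three columns.
alignment = ["AKL", "AKL", "GRL", "GRV"]
columns = list(zip(*alignment))

mi_01 = mutual_information(columns[0], columns[1])  # perfectly coupled columns
mi_02 = mutual_information(columns[0], columns[2])  # weakly coupled columns
print(round(mi_01, 3), round(mi_02, 3))
```

Columns 0 and 1 always change together and therefore get a higher MI than columns 0 and 2, which is the kind of signal a correlated-mutation analysis looks for.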

H-Ras

P01112 <figtable id="cn-ras">

Ten best results with highest CN for all and for extracted pairs
All pairs
pos1 aa1 pos2 aa2 MI CN
6 L 7 V 0.50 6.01
162 E 163 I 0.47 5.87
83 A 85 N 0.35 5.56
15 G 16 K 0.44 4.86
159 L 160 V 0.47 4.82
9 V 10 G 0.46 4.80
161 R 162 E 0.46 4.63
124 T 125 V 0.34 4.58
10 G 11 A 0.46 4.52
16 K 17 S 0.45 4.44
Extracted pairs
pos1 aa1 pos2 aa2 MI CN
11 A 92 D 0.32 3.40
81 V 116 N 0.24 3.00
87 T 129 Q 0.22 2.69
82 F 141 Y 0.25 2.53
84 I 115 G 0.16 2.53
19 L 81 V 0.14 2.50
82 F 115 G 0.14 2.42
10 G 16 K 0.39 2.26
130 A 141 Y 0.39 2.25
123 R 143 E 0.27 2.21
EVcouplings-extracted
pos1 aa1 pos2 aa2 MI DI
10 G 16 K 0.19 0.22
13 G 21 I 0.68 0.10
11 A 92 D 0.43 0.09
117 K 145 S 0.13 0.08
116 N 146 A 0.12 0.08
81 V 116 N 0.15 0.08
82 F 141 Y 0.32 0.07
35 T 60 G 0.08 0.06
130 A 141 Y 0.34 0.06
114 V 155 A 0.25 0.06

The ten freecontact results with the highest CN score for H-Ras, first for all pairs and second only for pairs separated by more than five residues (extracted pairs). The third table lists the ten residue pairs with the highest DI score calculated with EVcouplings. In all three tables, columns one and two give the position and amino acid of the first residue, and columns three and four those of the second residue. The fifth column gives the mutual information score (MI); the last column gives the corrected norm contact score (CN) in the freecontact tables and the DI score in the EVcouplings table. </figtable>

<figure id="cn_distr_ras">

CN-score distribution of H-Ras calculated with freecontact for all (green) and extracted residue pairs with a distance of at least six amino acids (purple).

</figure>

Comparing the ten best CN scores for all pairs with those for residue pairs separated by more than five amino acids, it is striking that the ten best of all pairs are close neighbors in sequence (nine are directly adjacent and one, 83(A)-85(N), is two positions apart) and that their CN scores are almost twice as high, up to 6.01 compared with 3.40 (<xr id="cn-ras"/>). For the extracted residue pairs the CN score ranges from -0.65 to 3.40. The distribution is shown in <xr id="cn_distr_ras"/>, together with the distribution for all residue pairs. The two distributions are very similar, although the range for all pairs extends up to 6.01. The DI scores calculated with EVcouplings range from 0.00 to 0.22. Five of the ten best extracted residue pairs are shared between freecontact and EVcouplings: 10(G)-16(K), 11(A)-92(D), 81(V)-116(N), 82(F)-141(Y) and 130(A)-141(Y).
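The overlap between the two top-ten lists can be checked with a simple set intersection. The sketch below copies the residue positions from the "Extracted pairs" and "EVcouplings-extracted" tables for H-Ras; the variable names are chosen only for this example.

```python
# Top-ten extracted pairs (pos1, pos2) copied from the H-Ras tables.
freecontact_top10 = [
    (11, 92), (81, 116), (87, 129), (82, 141), (84, 115),
    (19, 81), (82, 115), (10, 16), (130, 141), (123, 143),
]
evcouplings_top10 = [
    (10, 16), (13, 21), (11, 92), (117, 145), (116, 146),
    (81, 116), (82, 141), (35, 60), (130, 141), (114, 155),
]

# Pairs that appear in both top-ten lists, sorted by position.
shared = sorted(set(freecontact_top10) & set(evcouplings_top10))
print(len(shared), shared)
```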


PAH

<figtable id="cn-pah">

Ten best results with highest CN for all and for extracted pairs
All pairs
pos1 aa1 pos2 aa2 MI CN
217 C 218 G 0.77 3.43
402 F 403 A 0.75 3.25
428 Q 430 L 0.81 3.03
216 Y 217 C 1.12 2.92
174 I 175 P 0.73 2.91
120 W 121 F 0.69 2.91
403 A 404 A 1.02 2.83
447 A 448 L 0.64 2.82
428 Q 429 Q 0.92 2.82
429 Q 430 L 0.81 2.78
Extracted pairs
pos1 aa1 pos2 aa2 MI CN
342 A 354 L 0.70 2.46
122 P 128 L 0.69 2.40
122 P 129 D 0.69 2.29
257 G 264 H 1.17 2.27
131 F 137 S 1.09 2.25
282 D 290 H 0.18 2.19
151 D 157 R 0.69 2.18
192 K 221 E 0.94 2.17
264 H 277 Y 0.63 2.10
120 W 128 L 0.67 2.09
EVcouplings-extracted
pos1 aa1 pos2 aa2 MI DI
257 G 264 H 1.15 0.25
342 A 354 L 0.81 0.22
352 G 382 F 0.54 0.20
192 K 221 E 0.74 0.18
282 D 290 H 0.24 0.17
174 I 218 G 0.45 0.16
347 L 354 L 0.48 0.15
365 L 385 L 0.80 0.14
326 W 377 Y 0.36 0.13
235 Q 241 R 0.56 0.13

The ten freecontact results with the highest CN score for the biopterin domain of PAH, first for all pairs and second only for pairs separated by more than five residues (extracted pairs). The third table lists the ten best DI scores, also for extracted pairs, calculated with EVcouplings. In all three tables, columns one and two give the position and amino acid of the first residue, and columns three and four those of the second residue. The fifth column gives the mutual information score (MI); the last column gives the corrected norm contact score (CN) in the freecontact tables and the DI score in the EVcouplings table. </figtable>

<figure id="cn_distr_pah">

CN-score distribution of the biopterin domain of PAH calculated with freecontact for all residue pairs (green) and for extracted residue pairs with a distance of at least six amino acids (purple).

</figure>


For the biopterin domain of PAH, the freecontact CN scores range from -1.04 to 3.44 for all residue pairs and from -1.04 to 2.46 for the extracted pairs; the DI scores calculated with EVcouplings range from 0.00 to 0.25. <xr id="cn_distr_pah"/> shows the distribution of the CN scores for all and for extracted residue pairs. As for H-Ras, the table of all residue pairs in <xr id="cn-pah"/> contains only close sequence neighbors (nine directly adjacent pairs and one, 428(Q)-430(L), two positions apart). In addition, the DI scores from EVcouplings are lower than the CN scores from freecontact, but they are never negative. Four residue pairs appear among the ten best extracted pairs of both freecontact and EVcouplings: 342(A)-354(L), 257(G)-264(H), 282(D)-290(H) and 192(K)-221(E).
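The extraction step, keeping only pairs that are far enough apart in sequence, can be sketched as a simple filter on the position difference. The threshold of six residues is an assumption taken from the figure caption ("a distance of at least six amino acids"), and the function names are hypothetical. The positions are copied from the PAH tables.

```python
def sequence_separation(pair):
    """Distance in sequence between the two positions of a residue pair."""
    pos1, pos2 = pair
    return abs(pos2 - pos1)


def extract_long_range(pairs, min_separation=6):
    """Keep only pairs at least min_separation residues apart (assumed threshold)."""
    return [p for p in pairs if sequence_separation(p) >= min_separation]


# Top-ten pairs copied from the PAH tables ("All pairs" and "Extracted pairs").
pah_all_top10 = [
    (217, 218), (402, 403), (428, 430), (216, 217), (174, 175),
    (120, 121), (403, 404), (447, 448), (428, 429), (429, 430),
]
pah_extracted_top10 = [
    (342, 354), (122, 128), (122, 129), (257, 264), (131, 137),
    (282, 290), (151, 157), (192, 221), (264, 277), (120, 128),
]

# Largest separation among the best-scoring pairs when no filter is applied.
print(max(sequence_separation(p) for p in pah_all_top10))
# Number of unfiltered top-ten pairs that would survive the filter.
print(len(extract_long_range(pah_all_top10)))
```

With this threshold none of the ten best unfiltered pairs survives, while every pair in the extracted table passes, which matches the observation that the unfiltered ranking is dominated by sequence neighbors.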

Calculate structural model

Lab journal

H-Ras

<figure id="ras_evfold">

a) Number of constraints: 64.
b) Number of constraints: 104.
c) Number of constraints: 160.
Contact maps of the structure predictions for H-Ras using different numbers of constraints.</figure>

<xr id="ras_evfold"/> shows the contact maps of the structure predictions. With 64 constraints the predicted contacts match well, but many regions are not covered. With 160 constraints more regions are covered, or at least more strongly indicated, but there are also many more false positives. Using about 65% of the protein length as the number of constraints (104 for this protein) seems to be a good compromise between the other two settings: more regions are covered than with 64 constraints, with fewer false positives than with 160.
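The constraint numbers used here correspond to 40%, 65% and 100% of the protein length. A minimal sketch of this arithmetic follows. It assumes that the length equals the number of constraints in the 100% run (160 for H-Ras, 346 for the PAH biopterin domain) and that values are rounded to the nearest integer; the function name is hypothetical.

```python
def constraint_counts(protein_length, percentages=(40, 65, 100)):
    """Number of constraints for each percentage of the protein length (rounded)."""
    return {pct: round(protein_length * pct / 100) for pct in percentages}


# Lengths assumed from the 100% runs reported for the two proteins.
print(constraint_counts(160))  # H-Ras
print(constraint_counts(346))  # PAH biopterin domain
```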

The RMSD values given in the EVfold output are high. Using 40% of the protein length as the number of constraints, the RMSD values range between 10.16 and 15.18; for 65% they lie between 12.16 and 13.81, and for 100% between 11.72 and 13.63. This can also be seen in figure... (boxplot ras_rmsd.png)
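For orientation, the RMSD between a predicted and a reference structure is the square root of the mean squared distance between corresponding atoms. The sketch below is only an illustration of that formula, not the EVfold procedure: it assumes the two coordinate sets are already superimposed and listed in the same atom order, and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumes both structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical coordinates: the second atom is displaced by 4 units along y.
predicted = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(round(rmsd(predicted, reference), 3))
```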

PAH

<figure id="biopterin_evfold">

a) Number of constraints: 138.
b) Number of constraints: 225.
c) Number of constraints: 346.
Contact maps of the structure predictions for the PAH biopterin domain using different numbers of constraints.

</figure> As for H-Ras, a structure prediction with 65% of the protein length as the number of constraints (b) seems to be a good tradeoff between too few (a) and too many (c) constraints to cover the true structure (<xr id="biopterin_evfold"/>). However, here there are already more false positives at 138 constraints, so without the reference structure underneath it would be very hard to see the true pattern. With 346 constraints the false positives are so numerous that they form false hotspots. Altogether, the structure prediction is not very good for this protein. This may be caused by the smaller protein family and therefore a smaller multiple alignment.

The RMSD values agree with the contact maps: with scores around 20-22 they are nearly twice as high as for the H-Ras structure predictions. With 40% of the protein length as the number of constraints (138) the RMSD values range between 20.24 and 24.10, with 65% (225) between 19.89 and 24.19, and with 100% (346) between 18.94 and 24.26. This can also be seen in figure... (boxplot pah_rmsd.png)
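The visual comparison of predicted constraints with the reference structure can also be expressed as counts of true and false positives. The following sketch shows one possible way to do this. The residue pairs are invented for illustration and are not taken from the EVfold output, and the function name `score_contacts` is hypothetical.

```python
def score_contacts(predicted_pairs, reference_contacts):
    """Return (true positives, false positives, precision) for predicted contacts."""
    # Normalise pair order so that (30, 3) and (3, 30) count as the same contact.
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    reference = {tuple(sorted(p)) for p in reference_contacts}
    true_pos = predicted & reference
    false_pos = predicted - reference
    precision = len(true_pos) / len(predicted) if predicted else 0.0
    return len(true_pos), len(false_pos), precision


# Hypothetical example data, not taken from the predictions discussed here.
predicted_pairs = [(1, 10), (2, 20), (30, 3), (4, 40)]
reference_contacts = [(1, 10), (3, 30), (5, 50)]
print(score_contacts(predicted_pairs, reference_contacts))
```

Under these assumptions, adding more constraints raises the number of covered reference contacts only as long as the additional pairs are not mostly false positives, which is the tradeoff described for the three constraint numbers.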
