Protein structure prediction from evolutionary sequence variation (Phenylketonuria)

From Bioinformatikpedia
Revision as of 21:52, 5 September 2013 by Worfk (talk | contribs) (Calculate and analyze correlated mutations)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Summary

According to the huge number of known sequences and the low quantity of known structures it is an important challenge to predict the structure of proteins from sequence information alone. Mostly the sequence of the interested protein is part of an evolutionarily related family of sequences, which are probable to have the same fold fundamentally. The evolutionary variation in sequences is limited by a number of necessities, like e.g. the maintenance of favourable interactions in direct residue-residue associations. To find out which residue pairs are lying in the 3D structure close to each other, one uses pair correlations in the multiple alignments. Then a subset of the predicted residue contact pairs is used to fold up any protein of the family into an approximated 3D structure. This is subsequently refined with different techniques to generate a finished structure for the protein of interest. <ref name="evol"> Debora S. Marks, Lucy J. Colwell, Robert Sheridan, Thomas A. Hopf, Andrea Pagnani, Riccardo Zecchina and Chris Sander (2011): "Protein 3D Structure Computed from Evolutionary Sequence Variation". PLoS ONE 6(12): e28766. doi:10.1371/journal.pone.0028766 </ref> So for the structure prediction based on the sequence, we first downloaded a multiple sequence alignment of Ras (PF00071) from the Pfam database and generated one for our protein PAH with PSI-BLAST and ClustalW. Then, we used freecontact and EVfold for the structure prediction. Subsequentely, we calculated the structural models for both proteins. Our results and further discussions are following in the sections below.

Multiple Sequence Alignment

Lab journal
The multiple alignment of Ras (PF00071) was downloaded from Pfam.

For our protein two domains are included (ACT and Biopterin domain).

Calculate and analyze correlated mutations

Lab journal

Results

Freecontact is a program which takes a multiple alignment to calculate the mutual information score and the corrected norm contact score. In one step, residue pairs that are separated by more than five residues are extracted. This is necessary because if they are to close they mostly get high CN-scores as they have local connections. However, for structure prediction major distant residues are more interesting as connections between distant amino acids since this might lead to the compact structure.

H-Ras

<figure id="cn_distr_ras">

CN-score distribution of H-Ras calculated with freecontact for all (green) and extracted residue pairs with a distance of at least six amino acids (purple).

</figure>

<figure id="tp_ras">

Rate of the true positives for the sorted and extracted residue pairs. Always 50 pairs are taken together. The y-axis represents, how many of the 50 pairs are true positives or false positives, where 0 means that all 50 pairs are tp and 1 that all 50 pairs are fp.

</figure> <figure id="cm_ras">

Contact map of the best 104 residue pairs of the freecontact prediction (red). In grey the contact map of the pdb structure 121p is shown.

</figure>

<figtable id="cn-ras">

Ten best results with highest CN for all and for extracted pairs
All pairs
pos1 aa1 pos2 aa2 MI CN
6 L 7 V 0.50 6.01
162 E 163 I 0.47 5.87
83 A 85 N 0.35 5.56
15 G 16 K 0.44 4.86
159 L 160 V 0.47 4.82
9 V 10 G 0.46 4.80
161 R 162 E 0.46 4.63
124 T 125 V 0.34 4.58
10 G 11 A 0.46 4.52
16 K 17 S 0.45 4.44
Extracted pairs
pos1 aa1 pos2 aa2 MI CN
11 A 92 D 0.32 3.40
81 V 116 N 0.24 3.00
87 T 129 Q 0.22 2.69
82 F 141 Y 0.25 2.53
84 I 115 G 0.16 2.53
19 L 81 V 0.14 2.50
82 F 115 G 0.14 2.42
10 G 16 K 0.39 2.26
130 A 141 Y 0.39 2.25
123 R 143 E 0.27 2.21
EVcouplings-extracted
pos1 aa1 pos2 aa2 MI DI
10 G 16 K 0.19 0.22
13 G 21 I 0.68 0.10
11 A 92 D 0.43 0.09
117 K 145 S 0.13 0.08
116 N 146 A 0.12 0.08
81 V 116 N 0.15 0.08
82 F 141 Y 0.32 0.07
35 T 60 G 0.08 0.06
130 A 141 Y 0.34 0.06
114 V 155 A 0.25 0.06

The ten best results of freecontact calculated for H-RAS with the highest CN-value first calculated for all pairs and second only for pairs with a distance of at least five residues. The third table represents the ten best residue pairs with DI score calculated with EVcouplings. All three tables show the position of the first residue in column one with its corresponding amino acid (column 2) and of the second residue in column three with its amino acid in column four. The next one represents the mutual information score (MI) and the last one the corrected norm contact score(CN) which is a newer form of the DI-score calculated by EVcouplings. </figtable>

Comparing the best ten CN-scores for all and for residue pairs with at least a distance of more than five amino acids. It is remarkable that for all pairs the ten best ones are directly neighboring amino acids and have almost twice of the CN-score (<xr id="cn-ras"/>) than the ten best of the extracted pairs, for which the CN-scores classify between -0.65 and 3.40. The range of all pairs, however, goes up to 6.01. The distribution can be viewed in <xr id="cn_distr_ras"/>. Additionally the distribution of all residue pairs is shown, which is very similar to the extracted one. For the DI-score calculation using EVcouplings a range between 0.00 and 0.22 was achieved. Five of the ten best results are in common for extracted residue pairs found with freecontact and EVcouplings, which are 10(G)-16(K), 11(A)-92(D), 81(V)-116(N), 82(F)-141(Y) and 130(A)-141(Y). Remarkably, there is a nucleotide binding at residue ten to 17 which can be the cause for the strong binding of guanine and lysine. The other four pairs seem to help connect β-strands with other β-strands or α-helices.

Another measuring is the L-score, for which the three best residues are 82 (Phenylalanine) with a score of 9.96, 81 (Valine) with a score of 7.4 and 141 (Tyrosine) with a score of 6.7. Noticeably all three are included in the five residue pairs mentioned in the section before. Nevertheless, for none of this three residues a mutation is reported in Uniprot. However, all three are included in β strands, which might make them important for the structure stability. Near the first two residues there is a mutagenesis (residue 83) observable.

In <xr id="tp_ras"/> the true positive rate for the combination of always 50 residue pairs is shown. Like expected the best CN-score is true positive in most cases. So, for the best ten results shown in <xr id="cn-ras"/> above only one, the third pair 87-129, is predicted wrong. The lower the CN-score the more false positives can be viewed which also can be shown with the pearson correlation as a coefficient of -0.78 was reached indicating a strong negative correlation. All sequences having a CN-score bigger than 1 are said to be high scoring. Those are 65 of 11786 for the extracted pairs which is less than 1%. Thereby, only ten pairs are more distant than five Å and therefore said to be true positives whereas the other 55 are false positives.

<xr id="cm_ras"/> represents the contact map of the freecontact prediction for the best 65% of the protein length, which are the first 104 pairs, in comparison with the mapping annotated in the pdb structure. As you can see, the prediction is very good like it already was seen in <xr id="tp_ras"/>.

We also compared the first 50 residue pairs of the extracted pair of freecontact and EVcouplings. Thereby only 19 pairs are found with both methods.

PAH

<figure id="cn_distr_pah">

CN-score distribution of PAH domain biopterin calculated with freecontact for all (green) and extracted residue pairs with a distance of at least six amino acids (purple).

</figure><figure id="tp_pah">

rightRate of the true positives for the sorted and extracted residue pairs. Always 200 pairs are taken together. The y-axis represents, how many of the 200 pairs are true positives or false positives, where 0 means that all 200 pairs are tp and 1 that all 200 pairs are fp.

</figure>

<figure id="cm_pah">

Contact map of the best 294 residue pairs of the freecontact prediction (red). In grey the contact map of the pdb structure 2pah is shown.

</figure>

<figtable id="cn-pah">

Ten best results with highest CN for all and for extracted pairs
All pairs
pos1 aa1 pos2 aa2 MI CN
31 Q 32 N 0.95 3.77
110 S 111 R 1.08 3.08
27 D 28 N 0.76 3.01
27 D 29 C 0.68 2.75
24 Y 25 I 0.65 2.66
16 S 17 D 0.55 2.55
27 D 30 N 0.61 2.48
17 D 18 F 0.50 2.47
29 C 30 N 0.66 2.47
22 T 23 S 0.72 2.46
Extracted pairs
pos1 aa1 pos2 aa2 MI CN
352 G 382 F 0.54 1.96
324 I 399 V 0.53 1.83
342 A 354 L 0.52 1.80
257 G 264 H 1.22 1.62
16 S 23 S 0.53 1.61
16 S 22 T 0.57 1.56
17 D 23 S 0.49 1.56
46 G 52 L 0.98 1.51
326 W 377 Y 0.61 1.48
347 L 354 L 0.42 1.47
EVcouplings-extracted
pos1 aa1 pos2 aa2 MI CN
257 G 264 H 1.15 0.25
342 A 354 L 0.81 0.22
352 G 382 F 0.54 0.20
192 K 221 E 0.74 0.18
282 D 290 H 0.24 0.17
174 I 218 G 0.45 0.16
347 L 354 L 0.48 0.15
365 L 385 L 0.80 0.14
326 W 377 Y 0.36 0.13
235 Q 241 R 0.56 0.13

The ten results of freecontact calculated for the biopterin domain of PAH with the highest CN-value first for all pairs and second only for pairs with a distance of at least five residues. The third table represents the best ten DI-scores also for extracted pairs calculated by EVcouplings. All three tables show the position of the first residue in column one with its corresponding amino acid (column 2) and of the second residue in column three with its amino acid in column four. The next one represents the mutual information score (MI) and the last one the corrected norm contact score(CN). </figtable>

For PAH the CN-scores of freecontact for all residue pairs are ranged between -0.79 and 3.77, for extracted pairs they are between -0.79 and 1.96 and the DI scores calculated with EVcouplings can be found in a range of 0.00 to 0.25. In <xr id="cn_distr_pah"/> the distribution of the CN-scores for all and extracted residue pairs can be viewed. Again in <xr id="cn-pah"/> the table with all residue pairs only includes pairs with a lower distant than one, however for this protein some have a distant of two or even three residues. Additionally the DI-scores of EVcouplings are lower than for freecontact, but never negative. At comparing the pairs, five results can be found which are included in the ten best results for the extracted residue pairs calculated with freecontact and with EVcouplings. Those residue pairs are 352(G)-382(F), 342(A)-354(L), 257(G)-264(H), 326(W)-377(Y) and 347(L)-354(L). Those pairs seem to stabilize the structure as they connect helices and strands with each other. Looking at the distribution of the true positives (<xr id="tp_pah"/>) you can see more false positives than for HRAS. Here, always 200 pairs are combined. Nevertheless, still a correlation between the CN-Score and the true positive rate can be viewed with a pearson correlation coefficient of about -0.70.

Remarkably the first two residues with the best L-Score, which are residue 19 (Glycine) with a score of 6.4 and residue 26 (Glutamine acid) with a score of 6, lay before the first domain (ACT). At residue 20, which is the neighbour of residue 19 with best L-score has a natural variant associated with Hyperphenylalaninemia annotated in Uniprot and dbSNP. Additionally for residue 165 (Alanine), which has the third best L-Score of 5.11, the neighboring positions 164 and 167 have natural variants annotated in Uniprot which are associated with PKU. In dbSNP, which we searched in Task7 (Lab Journal) we also found a SNP for residue 165 itself (rs199475626), but it is not yet tested. Furthermore there seems to be a conservation of alanine in the MSA of Task2.

In <xr id="cm_pah"/> the best 294 (=65% of the protein length) residue pairs are shown in a contact map (red) using 2PAH (grey) as reference contacts. As you can see they match pretty good, nevertheless there are some false positives. However, many of those false positives seem to be at least close to the reference structure. Again the overlap between the best 50 pairs found with freecontact and with EVcouplings only lies at 14 residue pairs. Nevertheless, in this case it can be caused by the fact that for freecontact the whole sequence was used, but for EVcouplings we started at residue 106 to get an alignment.

Calculate structural model

Lab journal

H-Ras

<figure id="ras_rmsd">

Boxplot of the three predictions with different number of constraints (64, 104 and 160) which is 40%, 65% and 100% of the protein length of H-Ras. The RMSD-values of the five models for each prediciton are used and calculated with LGA.

</figure> <figure id="ras_evfold">

a) Number constraints: 64.
b) Number constraints: 104.
c) Number constraints: 160.
Contact maps created at structure predictions using different number of constraints for H-Ras.

</figure>

In <xr id="ras_evfold"/> the contact maps of the structure predictions are shown. There you can see that for 64 constraints the prediction matches pretty well. Nevertheless many regions are not covered. In contrast for 160 there are more regions that are covered or at least stronger indicated. However, in this case there are much more false positives. Using about 65% of the protein length as number of constraints, for this protein 104, seems to be a good compromise between the other two chosen numbers. More regions as for the first are covered, but with less false positives than for the last one.

Looking at the RMSD values calculated for the Cα atomes using LGA for the models created with EVfold scores of about 3 Å can be seen, which shows the similarity between prediction and pdb structure. Using 40% of the protein length as constraint number the RMSD values range between 2.16 and 3.24, for 65% they lie between 3.03 and 3.50 and for 100% between 3.20 and 3.63. This can also be seen in <xr id="ras_rmsd"/>.

PAH

<figure id="pah_rmsd">

Boxplot of the three predictions with different number of constraints (138, 225 and 346) which is 40%, 65% and 100% of the protein length of PAH. The RMSD-values of the five models for each prediciton are used and calculated with LGA.

</figure>

<figure id="biopterin_evfold">

a) Number constraints: 138.
b) Number constraints: 225.
c) Number constraints: 346.
Contact maps created at structure predictions using different number of constraints for PAH-Biopterin.

</figure> Like for H-Ras structure prediction with 65% of the protein length (b) as constraint number seems to be a good tradeoff between to few (a) or to many (c) constraints to cover the true structure (<xr id="biopterin_evfold"/>). Nevertheless, in this case there are more false positives already found for 138 constraints and so without the reference structure beneath it would be very hard to see the true pattern. However, for 346 constraints the false positives are so many they build some false hotspots. Altogether the structure prediction is not so good for this protein. Maybe this is caused by the smaller protein family and therefore a smaller multiple alignment. Additionally the RMSD values show the same result as the contact map with scores around 3.3 and 3.6 which is worse than for the H-Ras structure predictions. At a constraint number of 40% of the protein length (138) the RMSD values range between 3.21 and 3.62, at a number of 65% of the protein length (225) they reach scores between 3.23 and 3.68 and for 100% they are between 3.16 and 3.56. Remarkably, highest RMSD values are found for the 65% constraint. This can also be seen in figure <xr id="pah_rmsd"/>.

References

<references/>