Protein structure prediction from evolutionary sequence variation (Phenylketonuria)
NOT FINISHED YET
According to the huge number of known sequences and the low quantity of known structures it is an important challenge to predict the structure of proteins from sequence information alone. Mostly the sequence of the interested protein is part of an evolutionarily related family of sequences, which are probable to have the same fold fundamentally. The evolutionary variation in sequences is limited by a number of necessities, like e.g. the maintenance of favourable interactions in direct residue-residue associations. To find out which residue pairs are lying in the 3D structure close to each other, one uses pair correlations in the multiple alignments. Then a subset of the predicted residue contact pairs is used to fold up any protein of the family into an approximated 3D structure. This is subsequently refined with different techniques to generate a finished structure for the protein of interest. <ref name="evol"> Debora S. Marks, Lucy J. Colwell, Robert Sheridan, Thomas A. Hopf, Andrea Pagnani, Riccardo Zecchina and Chris Sander (2011): "Protein 3D Structure Computed from Evolutionary Sequence Variation". PLoS ONE 6(12): e28766. doi:10.1371/journal.pone.0028766 </ref> So for the structure prediction based on the sequence, we first downloaded a multiple sequence alignment of Ras (PF00071) from the Pfam database and generated one for our protein PAH with PSI-BLAST and ClustalW. Then, we used freecontact and EVfold for the structure prediction. Subsequentely, we calculated the structural models for both proteins. Our results and further discussions are following in the sections below.
Multiple Sequence Alignment
For our protein two domains are included (ACT and Biopterin domain).
Freecontact is a program that takes a multiple alignment and can calculate the mutual information score and the corrected norm contact score. In one step residue pairs that are separated by more than five residues are extracted, because if they are to close they mostly get high CN-scores as they have local connections. However, for structure prediction more distant residues are more interesting as connections between distant amino acids might lead to its compact structure.
</figure> <figure id="cm_ras">
|Ten best results with highest CN for all and for extracted pairs|
The ten best results of
freecontact calculated for H-RAS with highest CN-value first for all pairs and second only for pairs with a distance of at least five residues. The third table represents the ten best residue pairs with DI score calculated with
EVcouplings. All three tables show the position of the first residue in column one with its corresponding amino acid (column 2) and of the second residue in column three with its amino acid in column four. The next one represents the mutual information score (MI) and the last one the corrected norm contact score(CN) which is a newer form of the DI-score calculated by EVcouplings.
Comparing the best ten CN-scores for all and for residue pairs with at least a distant of more than five amino acids it is remarkable that for all pairs the ten best all are directly neighboring amino acids and have almost twice of the CN-score (<xr id="cn-ras"/>) than the ten best of the extracted pairs, for which the CN-scores ranges between -0.65 and 3.40. The range of all pairs, however, goes up to 6.01. The distribution can be viewed in <xr id="cn_distr_ras"/>. Additionally the distribution of all residue pairs is shown, which is very similar to the extracted one. For the DI-score calculation using
EVcouplings a range between 0.00 and 0.22 was achieved. Five of the ten best results are in common for extracted residue pairs found with
EVcouplings, which are 10(G)-16(K), 11(A)-92(D), 81(V)-116(N), 82(F)-141(Y) and 130(A)-141(Y).
Another measuring is the L-score, for which the three best residues are 82 (Phenylalanine) with a score of 9.96, 81 (Valine) with a score of 7.4 and 141 (Tyrosine) with a score of 6.7. For none of this three residues a mutation is reported in Uniprot. However, all three are included in β strands, which might make them important for the structure stability. Near the first two residues there is a mutagenesis (residue 83).
In <xr id="tp_ras"/> the true positive rate for the combination of always 50 residue pairs is shown. Like expected the best CN-score are true positive in most cases. So for the best ten results shown in <xr id="cn-ras"/> above only one, the third pair 87/129, is predicted wrong. The lower the CN-score the more false positives can be viewed which also can be shown with the pearson correlation as a coefficient of -0.78 was reached indicating a strong negativ correlation. All sequences which have a CN-score bigger than 1 are said to be high scoring, which are only 65 of 11786 for the extracted pairs, of which only ten pairs are more distant than five Å and therefore false positives.
<xr id="cm_ras"/> represents the contact map of the
freecontact prediction for the best 65% of the protein length, which are the first 104 pairs, in comparison with the mapping annotated in the pdb structure. As you can see, the prediction is very good like it already was seen in <xr id="tp_ras"/>.
only 19 of 50 overlapping residue pairs (CN-DI)...
|Ten best results with highest CN for all and for extracted pairs|
The ten results of
freecontact calculated for the biopterin domain of PAH with highest CN-value first for all pairs and second only for pairs with a distance of at least five residues. The third table represents the best ten DI-scores also for extracted pairs calculated by EVcouplings. All three tables show the position of the first residue in column one with its corresponding amino acid (column 2) and of the second residue in column three with its amino acid in column four. The next one represents the mutual information score (MI) and the last one the corrected norm contact score(CN).
For PAH the CN-scores of
freecontact for all residue pairs are ranged between -0.79 and 3.77, for extracted pairs they are between -0.79 and 1.96 and the DI scores calculated with
EVcouplings can be found in a range between 0.00 and 0.25. In <xr id="cn_distr_pah"/> the distribution of the CN-scores for all and extracted residue pairs can be viewed. Again in <xr id="cn-pah"/> the table with all residue pairs only includes pairs with a lower distant than one, however for this protein some have a distant of two or even three residues. Additionally the DI-scores of
EVcouplings are lower than for
freecontact, but never are negative. At comparing the pairs, five results can be found which are included in the ten best results for the extracted residue pairs calculated with
freecontact and with
EVcouplings. Those residue pairs are 352(G)-382(F), 342(A)-354(L), 257(G)-264(H), 326(W)-377(Y) and 347(L)-354(L).
Looking at the distribution of the true positives (<xr id="tp_pah"/>) you can see more false positives than for HRAS. Here always 200 pairs are combined. Nevertheless, still a correlation between CN-Score and true positive rate can be viewed with a pearson correlation coefficient of about -0.70.
L-Score, three best results:
19 6.4 (Glycine) 26 6 (Glutamin acid) 165 5.11 (Alanine)
Remarkably the first two residues with best L-Score lay before the first domain (ACT). Beneath the best one - residue 19 - residue 20 has a natural variant associated with Hyperphenylalaninemia annotated in Uniprot. Additionally for residue 165, which has the third best L-Score, the neighboring positions 164 and 167 have natural variants annotated in Uniprot which are associated with PKU.
Calculate structural model
</figure> <figure id="ras_evfold">
In <xr id="ras_evfold"/> the contact maps of the structure predictions are shown. There you can see that for 64 constraints the prediction matches pretty well. Nevertheless many regions are not covered. In contrast for 160 there are more regions that are covered or at least stronger indicated. However, in this case there are much more false positives. Using about 65% of the protein length as number of constraints, for this protein 104, seems to be a good compromise between the other two chosen numbers. More regions as for the first are covered, but with less false positives than for the last one.
Looking at the RMSD values calculated for the Cα atomes using LGA for the models created with
EVfold scores of about 3 Å can be seen, which shows the similarity between prediction and pdb structure. Using 40% of the protein length as constraint number the RMSD values range between 2.16 and 3.24, for 65% they lie between 3.03 and 3.50 and for 100% between 3.20 and 3.63. This can also be seen in <xr id="ras_rmsd"/>.
</figure> Like for H-Ras structure prediction with 65% of the protein length (b) as constraint number seems to be a good tradeoff between to few (a) or to many (c) constraints to cover the true structure (<xr id="biopterin_evfold"/>. Nevertheless here there are more false positives already found for 138 constraints and so without the reference structure beneath it would be very hard to see the true pattern. However, for 346 constraints the false positives are so many they build some false hotspots. Altogether the structure prediction is not so good for this protein. Maybe this is caused by the smaller protein family and therefore a smaller multiple alignment. Additionally the RMSD values show the same result as the contact map with scores around 20-22 which is nearly twice the score than for the H-Ras structure predictions. At a constraint number of 40% of the protein length (138) the RMSD values range between 20.24 and 24.10, at a number of 65% of the protein length (225) they reach scores between 19.89 and 24.19 and for 100% they are between 18.94 and 24.26. This can also be seen in figure... (boxplot pah_rmsd.png)