Protein structure prediction from evolutionary sequence variation (Phenylketonuria)
According to the huge number of known sequences and the low quantity of known structures it is an important challenge to predict the structure of proteins from sequence information alone. Mostly the sequence of the interested protein is part of an evolutionarily related family of sequences, which are probable to have the same fold fundamentally. The evolutionary variation in sequences is limited by a number of necessities, like e.g. the maintenance of favourable interactions in direct residue-residue associations. To find out which residue pairs are lying in the 3D structure close to each other, one uses pair correlations in the multiple alignments. Then a subset of the predicted residue contact pairs is used to fold up any protein of the family into an approximated 3D structure. This is subsequently refined with different techniques to generate a finished structure for the protein of interest. <ref name="evol"> Debora S. Marks, Lucy J. Colwell, Robert Sheridan, Thomas A. Hopf, Andrea Pagnani, Riccardo Zecchina and Chris Sander (2011): "Protein 3D Structure Computed from Evolutionary Sequence Variation". PLoS ONE 6(12): e28766. doi:10.1371/journal.pone.0028766 </ref> So for the structure prediction based on the sequence, we first downloaded a multiple sequence alignment of Ras (PF00071) from the Pfam database and generated one for our protein PAH with PSI-BLAST and ClustalW. Then, we used freecontact and EVfold for the structure prediction. Subsequentely, we calculated the structural models for both proteins. Our results and further discussions are following in the sections below.
Multiple Sequence Alignment
For our protein two domains are included (ACT and Biopterin domain).
Freecontact is a program which takes a multiple alignment to calculate the mutual information score and the corrected norm contact score. In one step, residue pairs that are separated by more than five residues are extracted. This is necessary because if they are to close they mostly get high CN-scores as they have local connections. However, for structure prediction major distant residues are more interesting as connections between distant amino acids since this might lead to the compact structure.
</figure> <figure id="cm_ras">
|Ten best results with highest CN for all and for extracted pairs|
The ten best results of
freecontact calculated for H-RAS with the highest CN-value first calculated for all pairs and second only for pairs with a distance of at least five residues. The third table represents the ten best residue pairs with DI score calculated with
EVcouplings. All three tables show the position of the first residue in column one with its corresponding amino acid (column 2) and of the second residue in column three with its amino acid in column four. The next one represents the mutual information score (MI) and the last one the corrected norm contact score(CN) which is a newer form of the DI-score calculated by EVcouplings.
Comparing the best ten CN-scores for all and for residue pairs with at least a distance of more than five amino acids. It is remarkable that for all pairs the ten best ones are directly neighboring amino acids and have almost twice of the CN-score (<xr id="cn-ras"/>) than the ten best of the extracted pairs, for which the CN-scores classify between -0.65 and 3.40. The range of all pairs, however, goes up to 6.01. The distribution can be viewed in <xr id="cn_distr_ras"/>. Additionally the distribution of all residue pairs is shown, which is very similar to the extracted one. For the DI-score calculation using
EVcouplings a range between 0.00 and 0.22 was achieved. Five of the ten best results are in common for extracted residue pairs found with
EVcouplings, which are 10(G)-16(K), 11(A)-92(D), 81(V)-116(N), 82(F)-141(Y) and 130(A)-141(Y). Remarkably, there is a nucleotide binding at residue ten to 17 which can be the cause for the strong binding of guanine and lysine. The other four pairs seem to help connect β-strands with other β-strands or α-helices.
Another measuring is the L-score, for which the three best residues are 82 (Phenylalanine) with a score of 9.96, 81 (Valine) with a score of 7.4 and 141 (Tyrosine) with a score of 6.7. Noticeably all three are included in the five residue pairs mentioned in the section before. Nevertheless, for none of this three residues a mutation is reported in Uniprot. However, all three are included in β strands, which might make them important for the structure stability. Near the first two residues there is a mutagenesis (residue 83) observable.
In <xr id="tp_ras"/> the true positive rate for the combination of always 50 residue pairs is shown. Like expected the best CN-score is true positive in most cases. So, for the best ten results shown in <xr id="cn-ras"/> above only one, the third pair 87-129, is predicted wrong. The lower the CN-score the more false positives can be viewed which also can be shown with the pearson correlation as a coefficient of -0.78 was reached indicating a strong negative correlation. All sequences having a CN-score bigger than 1 are said to be high scoring. Those are 65 of 11786 for the extracted pairs which is less than 1%. Thereby, only ten pairs are more distant than five Å and therefore said to be true positives whereas the other 55 are false positives.
<xr id="cm_ras"/> represents the contact map of the
freecontact prediction for the best 65% of the protein length, which are the first 104 pairs, in comparison with the mapping annotated in the pdb structure. As you can see, the prediction is very good like it already was seen in <xr id="tp_ras"/>.
We also compared the first 50 residue pairs of the extracted pair of
EVcouplings. Thereby only 19 pairs are found with both methods.
|Ten best results with highest CN for all and for extracted pairs|
The ten results of
freecontact calculated for the biopterin domain of PAH with the highest CN-value first for all pairs and second only for pairs with a distance of at least five residues. The third table represents the best ten DI-scores also for extracted pairs calculated by EVcouplings. All three tables show the position of the first residue in column one with its corresponding amino acid (column 2) and of the second residue in column three with its amino acid in column four. The next one represents the mutual information score (MI) and the last one the corrected norm contact score(CN).
For PAH the CN-scores of
freecontact for all residue pairs are ranged between -0.79 and 3.77, for extracted pairs they are between -0.79 and 1.96 and the DI scores calculated with
EVcouplings can be found in a range of 0.00 to 0.25. In <xr id="cn_distr_pah"/> the distribution of the CN-scores for all and extracted residue pairs can be viewed. Again in <xr id="cn-pah"/> the table with all residue pairs only includes pairs with a lower distant than one, however for this protein some have a distant of two or even three residues. Additionally the DI-scores of
EVcouplings are lower than for
freecontact, but never negative. At comparing the pairs, five results can be found which are included in the ten best results for the extracted residue pairs calculated with
freecontact and with
EVcouplings. Those residue pairs are 352(G)-382(F), 342(A)-354(L), 257(G)-264(H), 326(W)-377(Y) and 347(L)-354(L). Those pairs seem to stabilize the structure as they connect helices and strands with each other.
Looking at the distribution of the true positives (<xr id="tp_pah"/>) you can see more false positives than for HRAS. Here, always 200 pairs are combined. Nevertheless, still a correlation between the CN-Score and the true positive rate can be viewed with a pearson correlation coefficient of about -0.70.
Remarkably the first two residues with the best L-Score, which are residue 19 (Glycine) with a score of 6.4 and residue 26 (Glutamine acid) with a score of 6, lay before the first domain (ACT). At residue 20, which is the neighbour of residue 19 with best L-score has a natural variant associated with Hyperphenylalaninemia annotated in Uniprot and dbSNP. Additionally for residue 165 (Alanine), which has the third best L-Score of 5.11, the neighboring positions 164 and 167 have natural variants annotated in Uniprot which are associated with PKU. In dbSNP, which we searched in Task7 (Lab Journal) we also found a SNP for residue 165 itself (rs199475626), but it is not yet tested. Furthermore there seems to be a conservation of alanine in the MSA of Task2.
In <xr id="cm_pah"/> the best 294 (=65% of the protein length) residue pairs are shown in a contact map (red) using 2PAH (grey) as reference contacts. As you can see they match pretty good, nevertheless there are some false positives. However, many of those false positives seem to be at least close to the reference structure. Again the overlap between the best 50 pairs found with
freecontact and with
EVcouplings only lies at 14 residue pairs. Nevertheless, in this case it can be caused by the fact that for
freecontact the whole sequence was used, but for
EVcouplings we started at residue 106 to get an alignment.
Calculate structural model
</figure> <figure id="ras_evfold">
In <xr id="ras_evfold"/> the contact maps of the structure predictions are shown. There you can see that for 64 constraints the prediction matches pretty well. Nevertheless many regions are not covered. In contrast for 160 there are more regions that are covered or at least stronger indicated. However, in this case there are much more false positives. Using about 65% of the protein length as number of constraints, for this protein 104, seems to be a good compromise between the other two chosen numbers. More regions as for the first are covered, but with less false positives than for the last one.
Looking at the RMSD values calculated for the Cα atomes using LGA for the models created with
EVfold scores of about 3 Å can be seen, which shows the similarity between prediction and pdb structure. Using 40% of the protein length as constraint number the RMSD values range between 2.16 and 3.24, for 65% they lie between 3.03 and 3.50 and for 100% between 3.20 and 3.63. This can also be seen in <xr id="ras_rmsd"/>.
</figure> Like for H-Ras structure prediction with 65% of the protein length (b) as constraint number seems to be a good tradeoff between to few (a) or to many (c) constraints to cover the true structure (<xr id="biopterin_evfold"/>). Nevertheless, in this case there are more false positives already found for 138 constraints and so without the reference structure beneath it would be very hard to see the true pattern. However, for 346 constraints the false positives are so many they build some false hotspots. Altogether the structure prediction is not so good for this protein. Maybe this is caused by the smaller protein family and therefore a smaller multiple alignment. Additionally the RMSD values show the same result as the contact map with scores around 3.3 and 3.6 which is worse than for the H-Ras structure predictions. At a constraint number of 40% of the protein length (138) the RMSD values range between 3.21 and 3.62, at a number of 65% of the protein length (225) they reach scores between 3.23 and 3.68 and for 100% they are between 3.16 and 3.56. Remarkably, highest RMSD values are found for the 65% constraint. This can also be seen in figure <xr id="pah_rmsd"/>.