Canavan Disease: Task 06 - Protein Structure Prediction

From Bioinformatikpedia
Revision as of 19:34, 19 August 2013 by Mahlich (talk | contribs) (HRAS)

LabJournal

Dataset

To obtain the HRas multiple sequence alignment (MSA), the instructions were followed: the full MSA provided by Pfam (PF00071) was downloaded and used for all further calculations and statistics. A search for an MSA for ASPA/ACY2 in Pfam showed that the two criteria for gaining meaningful insights from the calculations of freecontact, EVcouplings and EVfold are satisfied: the MSA contains more than 1000 sequences, and it covers large parts of the reference sequence. The MSA of the protein family containing ASPA (PF04952) includes 2822 sequences, and the aligned region of ASPA spans positions 10 to 301 of its 313 amino acids. Hence the Pfam MSA is regarded as viable input for the following calculations.
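The two criteria can be checked with a few lines of arithmetic. The following is a minimal sketch using the numbers quoted above; the function name `msa_coverage` is hypothetical, and counting both end positions of the aligned region as covered is an assumption.

```python
def msa_coverage(first_pos, last_pos, protein_length):
    """Fraction of the reference sequence covered by the aligned region.

    Assumes first_pos and last_pos are 1-based and both inclusive.
    """
    return (last_pos - first_pos + 1) / protein_length


# Values for ASPA in PF04952 as stated in the text.
n_sequences = 2822
coverage = msa_coverage(10, 301, 313)

print(n_sequences > 1000)   # True
print(round(coverage, 3))   # 0.933
```

The text does not give a numeric threshold for "large parts of the reference sequence", so the sketch only reports the coverage and leaves the judgment to the reader.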

HRAS

Freecontact searches a multiple sequence alignment for conserved regions and correlated mutations in order to predict pairs of residues that are in contact in a protein. Residues that are close to each other in sequence are expected to be close in three-dimensional space as well, as their contacts often define the secondary structure elements and the small-scale conformation of the protein. Therefore freecontact assigns high CN scores to residue pairs that are close in sequence. More meaningful for the overall conformation of the protein, however, are stabilizing contacts between residues that are more distant in sequence. This is why the predicted contacts are filtered to keep only residue pairs that are more than five positions apart in sequence. The distribution of the CN scores (<xr id="hras_cn_distribution">Figure </xr>) shows this effect as well.
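The filtering step can be sketched as follows. This is an illustration, not the script used for this page. It assumes that each row of the score file has six whitespace-separated columns (position i, amino acid i, position j, amino acid j, MI score, CN score); the names `Coupling`, `parse_couplings` and `filter_couplings` are hypothetical.

```python
from typing import NamedTuple


class Coupling(NamedTuple):
    pos_i: int
    aa_i: str
    pos_j: int
    aa_j: str
    cn: float


def parse_couplings(lines):
    """Parse rows of the assumed layout: pos_i aa_i pos_j aa_j MI CN."""
    couplings = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        couplings.append(
            Coupling(int(fields[0]), fields[1], int(fields[2]), fields[3],
                     float(fields[5]))
        )
    return couplings


def filter_couplings(couplings, min_separation=5, min_cn=1.0):
    """Keep pairs more than min_separation positions apart with CN > min_cn.

    The result is sorted by CN score, best first.
    """
    kept = [
        c for c in couplings
        if abs(c.pos_j - c.pos_i) > min_separation and c.cn > min_cn
    ]
    return sorted(kept, key=lambda c: c.cn, reverse=True)
```

Passing the lines of a score file through `parse_couplings` and then `filter_couplings` would yield the long-range, high-scoring pairs discussed in this section, provided the column layout assumption holds.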

<figure id="hras_cn_distribution">

The distribution of the CN scores for HRAS calculated by freecontact. The frequencies of the CN scores are displayed for all pairs of residues (orange) and for pairs of residues more than five positions apart in the sequence (blue). Pairs with a CN score above 1 are considered high scoring. Only a tiny fraction of the pairs is high scoring, and restricting the set to pairs with a sequence distance of more than five greatly reduces the number of high-scoring pairs.

</figure>

<figure id="hras_freecontact_contactmap">

Contact map of HRAS (121P) calculated from the crystal structure. Residue pairs with a distance of less than 5Å are displayed in grey. The contacts predicted by freecontact are displayed as red dots. They are restricted to residue pairs that have a high CN score (CN > 1) and are more than 5 residues apart in sequence (i & i+n, where n > 5). The dashed green rectangle marks the region for which freecontact calculated CN scores (residue 5 to 165).

</figure>

The first thing to note is that only a tiny fraction of the pairs (514 out of 12561 possible pairs) has a CN score > 1, which is considered high scoring. If the set is reduced to residue pairs with a sequence distance greater than five, this subset of high-scoring pairs immediately shrinks to 65 pairs. Secondly, the maximal CN score drops from 6.01 to 3.40. Reducing the set, however, has no great impact on the precision: the predicted high-scoring contacts of the original set contain 439 true positives and 75 false positives (precision of 0.854), while the reduced set contains 55 true positives among 65 predictions overall (precision of 0.846). The predicted contacts are visualized together with the actual contacts calculated from the crystal structure in <xr id="hras_freecontact_contactmap"> Figure </xr>. An overview of the top 10 predictions for HRAS is given in more detail in <xr id="top_20_hras"> Table </xr>.
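Precision here is the number of true positives divided by the number of all predictions, TP / (TP + FP). The sketch below shows one way to compute it from a set of predicted pairs and a set of contacts observed in the crystal structure. The function name `precision` and the representation of pairs as position tuples are assumptions made for illustration.

```python
def precision(predicted_pairs, true_contacts):
    """Fraction of predicted pairs that are observed contacts: TP / (TP + FP).

    Pairs are (pos_i, pos_j) tuples; residue order within a pair is ignored.
    """
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    observed = {tuple(sorted(p)) for p in true_contacts}
    if not predicted:
        raise ValueError("no predicted pairs")
    true_positives = len(predicted & observed)
    return true_positives / len(predicted)


# Reproducing the reported values from the TP/FP counts alone.
print(round(439 / (439 + 75), 3))  # all high-scoring pairs: 0.854
print(round(55 / 65, 3))           # pairs more than five apart: 0.846
```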

<figtable id="top_20_hras">

Predicted residue contacts for HRAS by freecontact
Position #1  Amino acid #1  Position #2  Amino acid #2  CN score  TP/FP
11           A              92           D              3.40454   TP
81           V              116          N              2.99937   TP
87           T              129          Q              2.68523   FP
82           F              141          Y              2.52755   TP
84           I              115          G              2.52502   TP
19           L              81           V              2.50464   TP
82           F              115          G              2.41709   TP
10           G              16           K              2.26384   TP
130          A              141          Y              2.24938   TP
123          R              143          E              2.21315   TP

Predicted residue contacts for HRAS by EVcouplings
Position #1  Amino acid #1  Position #2  Amino acid #2  DI score   TP/FP
10           G              16           K              0.2058040  TP
13           G              21           I              0.1126380  FP
11           A              92           D              0.0946234  TP
87           T              129          Q              0.0766384  FP
114          V              155          A              0.0760636  TP
117          K              145          S              0.0758184  TP
82           F              141          Y              0.0757755  TP
116          N              146          A              0.0755790  TP
35           T              60           G              0.0707763  TP
81           V              116          N              0.0665576  TP

Overview of the top 10 residue pairs predicted to be in contact by freecontact and EVcouplings that are more than 5 residues apart in the sequence (i & i+n, where n > 5). The residue pairs are ranked in descending order of their CN score (freecontact) or DI score (EVcouplings). Of the 10 residue pairs predicted by freecontact, only one (Thr 87 -> Gln 129) is not in contact in the crystal structure. Within the top 10 residue pairs predicted by EVcouplings, two are false positives: Gly 13 -> Ile 21 and, interestingly, Thr 87 -> Gln 129, the pair that freecontact mispredicts as well.

</figtable>

To search for evolutionary hotspots, the L highest-scoring residue couplings with a sequence distance greater than five were extracted, where L is the length of the aligned sequence in the multiple sequence alignment. In the case of HRAS, freecontact used a sequence segment of length 160 to create the couplings. The CN scores of these 160 couplings are then summed up for each residue. Normalizing these sums (dividing them by 160) gives a hint of how evolutionarily important each residue is. This procedure suggests that for HRAS Phe 82, Val 81, Tyr 141, Glu 143 and Gly 115 (in descending order) are the evolutionarily most important residues in terms of forming and stabilizing the protein.
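The hotspot procedure can be sketched as follows. This is an illustration of the steps described above, not the original analysis script. The function name `hotspot_scores` and the input format of (position i, position j, CN score) tuples are assumptions.

```python
from collections import defaultdict


def hotspot_scores(couplings, length, min_separation=5):
    """Sum the CN scores of the top `length` long-range couplings per residue.

    couplings: iterable of (pos_i, pos_j, cn) tuples (assumed input format).
    Returns {position: summed CN score divided by `length`}.
    """
    # Keep only pairs more than min_separation positions apart.
    long_range = [c for c in couplings if abs(c[1] - c[0]) > min_separation]
    # Take the `length` best couplings by CN score.
    top = sorted(long_range, key=lambda c: c[2], reverse=True)[:length]
    sums = defaultdict(float)
    for pos_i, pos_j, cn in top:
        # Each coupling contributes its score to both residues involved.
        sums[pos_i] += cn
        sums[pos_j] += cn
    return {pos: total / length for pos, total in sums.items()}
```

For HRAS, `length` would be 160. Sorting the returned dictionary by value, for example with `sorted(scores, key=scores.get, reverse=True)[:5]`, lists the five highest-scoring positions.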

EVcouplings offers a further way to predict contacts besides freecontact. Its results are filtered in the same way as the freecontact results: all scores for residue pairs that are five or fewer sequence positions apart are excluded. The remaining couplings are sorted by their DI score (a predecessor of the CN score). Comparing the top 50 DI scores from EVcouplings with the top 50 CN scores from freecontact reveals an overlap of 20 couplings. Within the 10 best pairs of each method, however, there is an overlap of five pairs, namely Gly 10 -> Lys 16, Ala 11 -> Asp 92, Thr 87 -> Gln 129, Phe 82 -> Tyr 141 and Val 81 -> Asn 116. Interestingly, one of the residue pairs predicted by both methods (Thr 87 -> Gln 129) is a false positive. A more detailed view of the top 10 ranked residue pairs calculated by EVcouplings can be seen in <xr id="top_20_hras"> Table </xr>.
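The overlap between the two methods can be checked with a set intersection. The sketch below uses the top 10 pairs and scores from the table on this page; the helper name `top_pairs` is hypothetical.

```python
def top_pairs(scored_pairs, n):
    """Return the n best (pos_i, pos_j) pairs as a set, ignoring residue order."""
    ranked = sorted(scored_pairs, key=lambda p: p[2], reverse=True)[:n]
    return {tuple(sorted((i, j))) for i, j, _ in ranked}


# (position 1, position 2, score) taken from the top 10 table for HRAS.
freecontact_top10 = [
    (11, 92, 3.40454), (81, 116, 2.99937), (87, 129, 2.68523),
    (82, 141, 2.52755), (84, 115, 2.52502), (19, 81, 2.50464),
    (82, 115, 2.41709), (10, 16, 2.26384), (130, 141, 2.24938),
    (123, 143, 2.21315),
]
evcouplings_top10 = [
    (10, 16, 0.2058040), (13, 21, 0.1126380), (11, 92, 0.0946234),
    (87, 129, 0.0766384), (114, 155, 0.0760636), (117, 145, 0.0758184),
    (82, 141, 0.0757755), (116, 146, 0.0755790), (35, 60, 0.0707763),
    (81, 116, 0.0665576),
]

shared = top_pairs(freecontact_top10, 10) & top_pairs(evcouplings_top10, 10)
print(sorted(shared))
# [(10, 16), (11, 92), (81, 116), (82, 141), (87, 129)]
```

The five shared pairs match those named in the text. The same helper with `n=50` applies to the top 50 comparison, given the full score lists.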

ASPA

Tasks