Difference between revisions of "Canavan Disease: Task 06 - Protein Structure Prediction"

From Bioinformatikpedia
(Calculation of structural models)
(Dataset)
 
(25 intermediate revisions by 2 users not shown)
Line 1: Line 1:
  +
'''Protein structure prediction from evolutionary sequence variation''' is another approach to finding a protein structure: it relies on evolutionary couplings, so a structure can be predicted without using any 3D information.
== [[Canavan_Disease:_Task_06_-_Journal|LabJournal]] ==
 
 
 
==Dataset==
 
==Dataset==
   
To gain the '''HRas''' multiple sequence alignment the instructions were followed and the full MSA provided by Pfam ([http://pfam.sanger.ac.uk/family/Ras '''PF00071''']) was downloaded and used for further calculations and statistics. Searching for a multiple sequence alignment for ASPA/ACY2 in Pfam revealed that the two criteria to gain meaningful insights out of the calculations of freecontact, evcouplings and evfold, namely over 1000 sequences in the MSA and large parts of the reference sequence are contained in the MSA, are satisfied. The multiple sequence alignment for the '''protein family containing ASPA''' ([http://pfam.sanger.ac.uk/family/PF04952 '''PF04952''']) includes '''2822 sequences''' and the region of ASPA that is used in the '''MSA spans from position 10 to 301''' with ASPA having a total length of 313 amino acids. Hence the Pfam MSA is regarded as viable input for the following calculations.
+
To obtain the '''HRAS''' multiple sequence alignment, the instructions were followed: the full MSA provided by Pfam ([http://pfam.sanger.ac.uk/family/Ras '''PF00071''']) was downloaded and used for further calculations and statistics. Two criteria must be met to gain meaningful insights from the calculations of '''freecontact''', '''[http://evfold.org/evfold-web/evfold.do EVcouplings]''' and '''[http://evfold.org/evfold-web/evfold.do EVfold]''': the MSA must contain more than 1000 sequences, and it must cover large parts of the reference sequence. A Pfam search for aspartoacylase (ASPA) showed that both criteria are satisfied. The multiple sequence alignment for the '''protein family containing ASPA''' ([http://pfam.sanger.ac.uk/family/PF04952 '''PF04952''']) includes '''2822 sequences''', and the region of ASPA used in the '''MSA spans positions 10 to 301''' of ASPA's total length of 313 amino acids. Hence the Pfam MSA is regarded as viable input for the following calculations.
   
 
==HRAS==
 
==HRAS==
   
=== Calculation and analysis of correlated mutations ===
+
=== Calculation and Analysis of Correlated Mutations ===
   
Freeconact is based upon searching conserved regions and correlated mutations in a multiple sequence alignment, to predict pairs of residues that are in contact in a protein. It is to be expected that residues that are close to each other in sequence are as well close in three dimensional space, as their contact often defines the secondary structure elements and the conformation of the protein on a small scale. Therefore residue pairs that are close in sequence are ranked with a high CN score by freecontact. However more meaning full for the overall conformation of the protein are stabilizing contacts between residues that are more distant in sequence space. This is the reason why filtering the predicted contacts to exclude residues that are distant more than five residues in sequence. Looking at the distribution of the CN scores ('''<xr id="hras_cn_distribution">Figure </xr>''') this gets visible as well.
+
freecontact searches a multiple sequence alignment for conserved regions and correlated mutations in order to predict pairs of residues that are in contact in a protein. Residues that are close to each other in sequence are expected to be close in three-dimensional space as well, since their contacts often define the secondary structure elements and the small-scale conformation of the protein. Residue pairs that are close in sequence are therefore ranked with a high CN score by freecontact. More meaningful for the overall conformation of the protein, however, are stabilizing contacts between residues that are more distant in sequence. For this reason the predicted contacts are filtered to exclude residue pairs that are less than five residues apart in sequence. The distribution of the CN scores ('''<xr id="hras_cn_distribution"></xr>''') shows this effect as well.
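The filtering step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the <code>Coupling</code> tuple, the function name <code>filter_couplings</code> and the toy scores are hypothetical and are not part of freecontact. The minimum separation is a parameter because the text excludes pairs less than five residues apart, while the figure captions use n > 5.

```python
from typing import List, NamedTuple, Optional


class Coupling(NamedTuple):
    """Hypothetical container for one scored residue pair."""
    i: int        # sequence position of the first residue
    j: int        # sequence position of the second residue
    score: float  # coupling score (e.g. a CN score)


def filter_couplings(couplings: List[Coupling],
                     min_separation: int = 5,
                     min_score: Optional[float] = None) -> List[Coupling]:
    """Keep pairs at least `min_separation` apart in sequence,
    optionally only those scoring above `min_score`,
    sorted by descending score."""
    kept = [
        c for c in couplings
        if abs(c.j - c.i) >= min_separation
        and (min_score is None or c.score > min_score)
    ]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Toy data (made up for illustration, not taken from the HRAS results).
toy = [
    Coupling(10, 12, 5.2),   # close in sequence, removed by the filter
    Coupling(81, 116, 1.5),
    Coupling(10, 16, 3.4),
    Coupling(20, 40, 0.3),   # far apart, but below the score threshold
]
high_scoring = filter_couplings(toy, min_separation=5, min_score=1.0)
```

Calling the function without <code>min_score</code> keeps all sufficiently separated pairs, which is the form needed when a fixed number of top couplings is selected later.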
   
 
{| border="0" cellpadding="5" cellspacing="0" align="center"
 
{| border="0" cellpadding="5" cellspacing="0" align="center"
Line 15: Line 14:
 
|align="center"|
 
|align="center"|
 
<figure id="hras_cn_distribution">
 
<figure id="hras_cn_distribution">
[[Image:HRAS cn-score frequency.png|centre|thumb|400px|'''<caption>'''The distribution of the CN scores for HRAS calculated by freecontact. The frequency of the CN scores are displayed for all pairs of residues (orange) and pairs of residues more than five positions apart in the sequence (blue). Pairs with a CN score of above 1 are considered high scoring. It is visible that only a tiny fraction of the pairs are high scoring, as well as the reduction of the set to pairs of a sequence distance of more than five has a huge impact of the amount of high scoring pairs.</caption>]]
+
[[Image:HRAS cn-score frequency.png|centre|thumb|400px|'''<caption>'''Distribution of the CN scores for HRAS calculated by freecontact. The frequencies of the CN scores are displayed for all pairs of residues (orange) and pairs of residues more than five positions apart in the sequence (blue). Pairs with a CN score above 1 are considered high scoring. Only a tiny fraction of the pairs is high scoring, and restricting the set to pairs with a sequence distance of more than five has a huge impact on the number of high-scoring pairs.</caption>]]
 
</figure>
 
</figure>
 
|align="center"|
 
|align="center"|
 
<figure id="hras_freecontact_contactmap">
 
<figure id="hras_freecontact_contactmap">
[[Image:HRAS contact map freecontact.png|centre|thumb|400px|'''<caption>'''Contact map of HRAS ([http://www.pdb.org/pdb/explore/explore.do?structureId=121p '''121P''']) calculated form the crystal structure. Displayed in grey are the residue pairs with a distance of less than 5Å. Displayed as red dots are the contacts predicted by freecontact. Those are reduced to residue pairs that have a high CN score (cn > 1) and are more than 5 residues apart in sequence (i & i+n, where n > 5). The dashed rectangle in green visualizes the borders in which freecontact calculated CN scores (residue 5 to 165).</caption>]]
+
[[Image:HRAS contact map freecontact.png|centre|thumb|400px|'''<caption>'''Contact map of HRAS ([http://www.pdb.org/pdb/explore/explore.do?structureId=121p '''121P''']) calculated from the crystal structure. Displayed in gray are the residue pairs with a distance of less than 5Å. Displayed as red dots are the contacts predicted by freecontact. Those are reduced to residue pairs that have a high CN score (cn > 1) and are more than 5 residues apart in sequence (i & i+n, where n > 5). The dashed rectangle in green visualizes the borders in which freecontact calculated CN scores (residue 5 to 165).</caption>]]
 
</figure>
 
</figure>
 
|-
 
|-
 
|}
 
|}
   
The first thing to be noted is that only a tiny fraction (514 out of 12561 possible pairs) has a CN score > 1, what is considered to be high scoring. If the set is reduced to residue pairs with a sequence distance greater five this subset of high scoring pairs is imideately reduced to 65 pairs. Secondly the maximal CN scores is reduced from 6.01 to 3.40. Reducing the set however has no great impact on the precision. The predicted high scoring contacts of the orginal set contain 439 true positives and 75 false positives (precision of 0.854) while the reduced set contains 55 true positive predictions out of 65 predictions over all (precision of 0.846). The predicted contacts are visualized together with the actual contacts calculated with the aid of the crystal structure in '''<xr id="hras_freecontact_contactmap"> Figure </xr>'''. A overview of the top 10 predictions for HRAS in more detail are displayed in '''<xr id="top_20_hras"> Table </xr>'''.
+
The first thing to note is that only a tiny fraction of the pairs (514 out of 12561 possible pairs) has a CN score > 1, which is considered high scoring. If the set is reduced to residue pairs with a sequence distance greater than five, this subset of high-scoring pairs immediately shrinks to 65 pairs. Secondly, the maximal CN score drops from 6.01 to 3.40. Reducing the set, however, has no great impact on the precision: the predicted high-scoring contacts of the original set contain 439 true positives and 75 false positives (precision of 0.854), while the reduced set contains 55 true positives out of 65 predictions overall (precision of 0.846). The predicted contacts are visualized together with the actual contacts calculated from the crystal structure in '''<xr id="hras_freecontact_contactmap"></xr>'''. A more detailed overview of the top 10 predictions for HRAS is displayed in '''<xr id="top_20_hras"></xr>'''.
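The precision values quoted above follow from precision = TP / (TP + FP). The sketch below reproduces the arithmetic and shows one way predicted pairs could be classified against a crystal structure using the 5Å cutoff from the contact map. The function names and the toy distance table are hypothetical; how the residue-residue distances are derived from the PDB file is not specified here and is left out.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of predicted contacts that are real contacts."""
    predicted = true_positives + false_positives
    if predicted == 0:
        raise ValueError("no predictions to evaluate")
    return true_positives / predicted


def count_hits(predicted_pairs, distances, cutoff=5.0):
    """Count true and false positives among predicted pairs.

    `distances` maps a residue pair (i, j) to its distance in the
    crystal structure (assumed to be precomputed, in Angstrom).
    """
    tp = sum(1 for pair in predicted_pairs if distances[pair] < cutoff)
    return tp, len(predicted_pairs) - tp


# Counts reported in the text for HRAS.
hras_all = precision(439, 75)        # all pairs with CN > 1
hras_filtered = precision(55, 10)    # 55 true positives out of 65

# Toy classification example with made-up distances.
toy_distances = {(10, 16): 4.1, (87, 129): 11.3, (81, 116): 3.8}
tp, fp = count_hits(list(toy_distances), toy_distances)
```

Rounded to three decimals, the two HRAS values are 0.854 and 0.846, matching the numbers in the text.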
   
 
<figtable id="top_20_hras">
 
<figtable id="top_20_hras">
Line 32: Line 31:
 
{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
 
{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
 
|-
 
|-
! colspan="6" style="background:#87CEFA;" | Predicted residue contacts for HRAS by freecontact
+
! colspan="6" style="background:#87CEFA;" | Predicted Residue Contacts for HRAS by freecontact
 
|-
 
|-
 
! colspan=2 style="background:#BFBFBF;"| Residue #1
 
! colspan=2 style="background:#BFBFBF;"| Residue #1
Line 68: Line 67:
 
{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
 
{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
 
|-
 
|-
! colspan="6" style="background:#87CEFA;" | Predicted residue contacts for HRAS by EVcouplings
+
! colspan="6" style="background:#87CEFA;" | Predicted Residue Contacts for HRAS by EVcouplings
 
|-
 
|-
 
! colspan=2 style="background:#BFBFBF;"| Residue #1
 
! colspan=2 style="background:#BFBFBF;"| Residue #1
Line 103: Line 102:
 
|-
 
|-
 
|}
 
|}
<center><small>'''<caption>''' Overview of the top 10 residue pairs predicted to be in contact by freecontact and EVcouplings that are apart more than 5 residues in the sequence (i & i+n, where n > 5). The residue pairs are ranked in descending order according to their CN/DI score calculated by freecontact/EVcouplings. Of the 10 residue pairs calculated by freecontact only one (Thr 87 -> Gln 129) has no actual contact when compared to the crystal structure. Within the top 10 residue pairs calculated by EVcouplings two are false positive. Gly 13 -> Ile 21 and interestingly Thr 87 -> Gln 129, the pair miss predicted as well by freecontact.</caption></small></center>
+
<center><small>'''<caption>''' Overview of the top 10 residue pairs predicted to be in contact by freecontact and EVcouplings that are more than 5 residues apart in the sequence (i & i+n, where n > 5).<br>The residue pairs are ranked in descending order according to their CN/DI score calculated by freecontact/EVcouplings. Of the 10 residue pairs calculated by freecontact,<br>only one (Thr 87 -> Gln 129) has no actual contact when compared to the crystal structure. Within the top 10 residue pairs calculated by EVcouplings, two are false positives:<br>Gly 13 -> Ile 21 and, interestingly, Thr 87 -> Gln 129, the pair that is mispredicted by freecontact as well.</caption></small></center>
 
</figtable>
 
</figtable>
   
Searching for evolutionary hotspots, the L best high-scoring residue couplings with a sequence distance greater than five, where L is the length of the aligned sequence in the multiple sequence alignment were extracted. In the case of HRAS freecontact used a sequence part of length 160 to create the couplings. The CN scores of these 160 couplings are then summed up for each amino acid. If these sums are normalized (dividing the sums by 160) they can give hints how evolutionary important the amino acid is. Performing this procedure resulted in the observation that for HRAS Phe 82, Val 81, Tyr 141, Glu 143 and Gly 115 (in descending order) seem to be the evolutionary most important residues in terms of forming and stabilizing the protein.
+
To search for evolutionary hotspots, the L best-scoring residue couplings with a sequence distance greater than five were extracted, where L is the length of the aligned sequence in the multiple sequence alignment. In the case of HRAS, freecontact used a sequence part of length 160 to create the couplings. The CN scores of these 160 couplings are then summed up for each amino acid. Normalizing these sums (dividing them by 160) gives a hint of the evolutionary importance of each amino acid. This procedure suggests that for HRAS Phe 82, Val 81, Tyr 141, Glu 143 and Gly 115 (in descending order) are the evolutionarily most important residues in terms of forming and stabilizing the protein.
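The hotspot procedure can be sketched as follows. This is a minimal illustration under assumptions: couplings are given as hypothetical <code>(i, j, score)</code> tuples that have already been filtered by sequence distance, and each coupling's score is credited to both of its residues, which is one reading of "summed up for each amino acid". The toy values are made up.

```python
from collections import defaultdict


def hotspot_scores(couplings, L):
    """Sum the scores of the L best couplings per residue, divided by L.

    `couplings` is an iterable of (i, j, score) tuples, assumed to be
    filtered by sequence distance beforehand.
    """
    top = sorted(couplings, key=lambda c: c[2], reverse=True)[:L]
    sums = defaultdict(float)
    for i, j, score in top:
        sums[i] += score  # credit the coupling to both partners
        sums[j] += score
    return {residue: total / L for residue, total in sums.items()}


# Toy example with L = 3 (the HRAS analysis in the text uses L = 160).
toy = [(82, 141, 3.0), (81, 116, 2.0), (82, 100, 1.0), (5, 50, 0.5)]
scores = hotspot_scores(toy, L=3)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In the toy data, residue 82 takes part in two of the three top couplings and therefore ranks first, while the pair (5, 50) falls outside the top L and contributes nothing.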
   
A further possibility to predict contacts apart from freecontact is using EVcouplings. The results EVcouplings delivers are filtered the same way as the freecontact results. All scores for residue pairs that are less than five sequential positions apart are excluded. The remaining couplings are sorted after their DI score (a former version of the CN score). Comparing the top 50 DI scores from EVcouplings and CN scores from freecontact an overlap of 20 couplings can be observed. However within the 10 best pairs of each method, there is an overlap of five pairs, namely Gly 10 -> Lys 16, Ala 11 -> Asp 92, Thr 87 -> Gln 129, Phe 82 -> Tyr 141, Val 81-> Asn 116. Interestingly one of the residue pairs (Thr 87 -> Gln 129), that is predicted by both methods is a false positive. A more detailed view of the top 10 ranked residue pairs calculated by EVcouplings can been seen in '''<xr id="top_20_hras"> Table </xr>'''.
+
A further possibility to predict contacts, apart from freecontact, is EVcouplings. The results delivered by EVcouplings are filtered in the same way as the freecontact results: all scores for residue pairs that are less than five sequence positions apart are excluded. The remaining couplings are sorted by their DI score (a former version of the CN score). Comparing the top 50 DI scores from EVcouplings with the top 50 CN scores from freecontact reveals an overlap of 20 couplings. Within the 10 best pairs of each method, however, there is an overlap of five pairs, namely Gly 10 -> Lys 16, Ala 11 -> Asp 92, Thr 87 -> Gln 129, Phe 82 -> Tyr 141 and Val 81 -> Asn 116. Interestingly, one of the residue pairs predicted by both methods (Thr 87 -> Gln 129) is a false positive. A more detailed view of the top 10 ranked residue pairs calculated by EVcouplings can be seen in '''<xr id="top_20_hras"></xr>'''.
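The overlap between the two rankings amounts to a set intersection of the top N pairs of each method. The sketch below assumes hypothetical <code>(i, j, score)</code> tuples for both tools; the scores are invented, and only the residue numbers of two pairs are borrowed from the text.

```python
def top_pairs(couplings, n):
    """Return the n best-scoring pairs as a set of ordered (i, j) tuples."""
    ranked = sorted(couplings, key=lambda c: c[2], reverse=True)[:n]
    # Order each pair so that (i, j) and (j, i) count as the same pair.
    return {(min(i, j), max(i, j)) for i, j, _ in ranked}


def overlap(couplings_a, couplings_b, n):
    """Pairs that appear in the top n of both rankings."""
    return top_pairs(couplings_a, n) & top_pairs(couplings_b, n)


# Toy rankings with made-up scores.
cn_scores = [(10, 16, 3.4), (87, 129, 2.9), (81, 116, 2.5), (11, 92, 0.9)]
di_scores = [(129, 87, 0.31), (13, 21, 0.28), (10, 16, 0.25), (81, 116, 0.02)]
shared = overlap(cn_scores, di_scores, n=3)
```

Because only the rank within each method matters, the different scales of CN and DI scores do not affect the comparison.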
   
=== Calculation of structural models ===
+
=== Calculation of Structural Models ===
   
In order to calculate structural models for HRAS the EVfold server was used. EVfold tries to create structural models for the given protein sequence based on residue couplings that are calculated by the process described in the section above. Experience show that in most cases the best structural model is created if the number of top couplings taken is are first 60 to 70% of the protein's sequence length. To demonstrate this fact three structural models were created with the aid of the 64 best (L = 40%), 104 best (L = 65%) and 160 best (L = 100%) couplings for HRAS. Taking a look at the contact maps with the visualized predicted couplings for each length it is visible very clearly that taking the L = 65% ('''<xr id="hras_evfold_40"> Figure </xr>''') best couplings is the best trade off between predicting not enough contacts to creating a meaning full model (L = 40%) ('''<xr id="hras_evfold_65"> Figure </xr>''') and predicting to much contacts resulting in to much false positives (L = 100%) ('''<xr id="hras_evfold_100"> Figure </xr>''').
+
In order to calculate structural models for HRAS, the EVfold server was used. EVfold tries to create structural models for the given protein sequence based on residue couplings that are calculated by the process described in the section above. Experience shows that in most cases the best structural model is created if the number of top couplings used corresponds to 60 to 70% of the protein's sequence length. To demonstrate this, three structural models were created with the aid of the 64 best (L = 40%), 104 best (L = 65%) and 160 best (L = 100%) couplings for HRAS. The contact maps with the predicted couplings for each length show very clearly that taking the best L = 65% couplings ('''<xr id="hras_evfold_65"></xr>''') is the best trade-off between predicting too few contacts to create a meaningful model (L = 40%, '''<xr id="hras_evfold_40"></xr>''') and predicting too many contacts, which results in too many false positives (L = 100%, '''<xr id="hras_evfold_100"></xr>''').
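The number of couplings for each setting is a percentage of the aligned length. A small sketch, assuming the count is rounded down (the rounding rule is not stated in the text, and the function name <code>coupling_cutoff</code> is hypothetical):

```python
def coupling_cutoff(aligned_length: int, percent: int) -> int:
    """Number of top couplings for a given percentage of the aligned length.

    Integer arithmetic avoids floating-point surprises; rounding down
    is an assumption made for this sketch.
    """
    return aligned_length * percent // 100


# Aligned length of 160 for HRAS, as stated in the text.
hras_cutoffs = [coupling_cutoff(160, p) for p in (40, 65, 100)]
```

For an aligned length of 160 this gives 64, 104 and 160 couplings, the values used for the three HRAS settings.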
   
 
{| border="0" cellpadding="5" cellspacing="0" align="center"
 
{| border="0" cellpadding="5" cellspacing="0" align="center"
Line 118: Line 117:
 
|align="center"|
 
|align="center"|
 
<figure id="hras_evfold_40">
 
<figure id="hras_evfold_40">
[[Image:HRAS contact map evfold 40.png|centre|thumb|300px|'''<caption>'''Visalization of the '''best 64 (L = 40 %) contacts predicted by EVfold''' as red dots. Displayed in grey are the residue pairs with a distance of less than 5Å in the HRAS crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 5 to 165).</caption>]]
+
[[Image:HRAS contact map evfold 40.png|centre|thumb|300px|'''<caption>'''Visualization of the '''best 64 (L = 40%) contacts predicted by EVfold''' as red dots. Displayed in gray are the residue pairs with a distance of less than 5Å in the HRAS crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 5 to 165).</caption>]]
 
</figure>
 
</figure>
 
|align="center"|
 
|align="center"|
 
<figure id="hras_evfold_65">
 
<figure id="hras_evfold_65">
[[Image:HRAS contact map evfold 65.png|centre|thumb|300px|'''<caption>'''Visalization of the '''best 104 (L = 65%) contacts predicted by EVfold''' as red dots. Displayed in grey are the residue pairs with a distance of less than 5Å in the HRAS crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 5 to 165).</caption>]]
+
[[Image:HRAS contact map evfold 65.png|centre|thumb|300px|'''<caption>'''Visualization of the '''best 104 (L = 65%) contacts predicted by EVfold''' as red dots. Displayed in gray are the residue pairs with a distance of less than 5Å in the HRAS crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 5 to 165).</caption>]]
 
</figure>
 
</figure>
 
|align="center"|
 
|align="center"|
 
<figure id="hras_evfold_100">
 
<figure id="hras_evfold_100">
[[Image:HRAS contact map evfold 100.png|centre|thumb|300px|'''<caption>'''Visalization of the '''best 160 (L = 100%) contacts predicted by EVfold''' as red dots. Displayed in grey are the residue pairs with a distance of less than 5Å in the HRAS crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 5 to 165).</caption>]]
+
[[Image:HRAS contact map evfold 100.png|centre|thumb|300px|'''<caption>'''Visualization of the '''best 160 (L = 100%) contacts predicted by EVfold''' as red dots. Displayed in gray are the residue pairs with a distance of less than 5Å in the HRAS crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 5 to 165).</caption>]]
 
</figure>
 
</figure>
 
|-
 
|-
 
|}
 
|}
   
Superimposing the five generated models for each parameter of L to 121P in pymol results in RSMDs ranging from 3. to 9.. However there is no clear correlation visible between RMSD and L. All RMSDs for all models are listed in '''<xr id="hras_rmsds"> Table </xr>'''.
+
Superimposing the five generated models for each value of L onto 121P in PyMOL results in RMSDs ranging from 10. to 16.. However, no clear correlation between RMSD and L is visible. The RMSDs of all models are listed in '''<xr id="hras_rmsds"></xr>'''.
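For reference, the RMSD itself is the square root of the mean squared distance between matched atoms. The sketch below only illustrates this formula for two coordinate lists that are assumed to be already superimposed and matched atom by atom; it does not reproduce the alignment and fitting that PyMOL performs, and the function name and toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists.

    Both lists hold (x, y, z) tuples for the same atoms in the same
    order and are assumed to be superimposed already.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy example: every atom is displaced by 2 units along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
toy_rmsd = rmsd(model, reference)
```

In the toy example every atom is shifted by the same distance, so the RMSD equals that distance.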
   
 
<figtable id="hras_rmsds">
 
<figtable id="hras_rmsds">
 
{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
 
{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
 
|-
 
|-
! colspan="4" style="background:#87CEFA;" width="500"| RMSDs calculated by superimposing models from EVfold to 121P
+
! colspan="4" style="background:#87CEFA;" width="500"| RMSDs Calculated by Superimposing Models from EVfold to 121P
 
|-
 
|-
 
! style="background:#BFBFBF;"| Model
 
! style="background:#BFBFBF;"| Model
Line 154: Line 153:
 
|-
 
|-
 
|}
 
|}
<center><small>'''<caption>''' List of RMSDs calculated by super imposing the five structural models for each L to 121P using pymol. <br> No clear correlation between the chosen L and the RMSD is visible. The models are listed according to their predicted quality.</caption></small></center>
+
<center><small>'''<caption>''' List of RMSDs calculated by superimposing the five structural models for each value of L onto 121P using PyMOL. <br> No clear correlation between the chosen L and the RMSD is visible. The models are listed according to their predicted quality.</caption></small></center>
 
</figtable>
 
</figtable>
   
Examining the predicted models in pymol more closely (not displayed) it can be concluded that the contact prediction precision for HRAS by EVfold is good. However using this information does not necessarily yield accurate structural models.
+
Examining the predicted models more closely in PyMOL (not displayed), it can be concluded that the contact prediction precision for HRAS by EVfold is good. However, using this information does not necessarily yield accurate structural models.
   
 
==ASPA==
 
==ASPA==
   
=== Calculation and analysis of correlated mutations ===
+
=== Calculation and Analysis of Correlated Mutations ===
   
For ASPA the same approach as for HRAS was performed. The CN scores were calculated by freecontact and the residue couplings with a sequences distance of less than 5 were excluded. As ASPA has approximately twice as much residues as HRAS the possible contacts increase and the amount of calculate CN scores rises as well. There are how ever some interesting finds if the distribution of CN scores for ASPA (see '''<xr id="aspa_cn_distribution">Figure </xr>''') is compared to the distribution of CN scores for HRAS ('''<xr id="hras_cn_distribution">Figure </xr>'''). The overall interval of scores, including the pairs with a sequence distance below five stays roughly the same ranging from approximately -1 to +7, but excluding those the maximal CN score drops to 2.3 compared to 3.4 at HRAS. Additionally the number of high scoring pairs stays on the same magnitude despite the existence of much more possible pairings.
+
For ASPA the same approach as for HRAS was followed. The CN scores were calculated by freecontact, and the residue couplings with a sequence distance of less than 5 were excluded. As ASPA has approximately twice as many residues as HRAS, the number of possible contacts increases and the number of calculated CN scores rises as well. However, there are some interesting findings when the distribution of CN scores for ASPA (see '''<xr id="aspa_cn_distribution"></xr>''') is compared to the distribution of CN scores for HRAS ('''<xr id="hras_cn_distribution"></xr>'''). The overall interval of scores, including the pairs with a sequence distance below five, stays roughly the same, ranging from approximately -1 to +7; excluding those pairs, however, the maximal CN score drops to 2.3, compared to 3.4 for HRAS. Additionally, the number of high-scoring pairs stays in the same order of magnitude despite the much larger number of possible pairings.
   
 
{| border="0" cellpadding="5" cellspacing="0" align="center"
 
{| border="0" cellpadding="5" cellspacing="0" align="center"
Line 173: Line 172:
 
|align="center"|
 
|align="center"|
 
<figure id="aspa_freecontact_contactmap">
 
<figure id="aspa_freecontact_contactmap">
[[Image:ASPA contact map freecontact.png|centre|thumb|400px|'''<caption>'''Contact map of ASPA ([http://www.pdb.org/pdb/explore/explore.do?structureId=2I3C '''2I3C''']) calculated form the crystal structure. Displayed in grey are the residue pairs with a distance of less than 5Å. Displayed as red dots are the contacts predicted by freecontact. Those are reduced to residue pairs that have a high CN score (cn > 1) and are more than 5 residues apart in sequence (i & i+n, where n > 5). The dashed rectangle in green visualizes the borders in which freecontact calculated CN scores (residue 10 to 301).</caption>]]
+
[[Image:ASPA contact map freecontact.png|centre|thumb|400px|'''<caption>'''Contact map of ASPA ([http://www.pdb.org/pdb/explore/explore.do?structureId=2I3C '''2I3C''']) calculated from the crystal structure. Displayed in gray are the residue pairs with a distance of less than 5Å. Displayed as red dots are the contacts predicted by freecontact. Those are reduced to residue pairs that have a high CN score (cn > 1) and are more than 5 residues apart in sequence (i & i+n, where n > 5). The dashed rectangle in green visualizes the borders in which freecontact calculated CN scores (residue 10 to 301).</caption>]]
 
</figure>
 
</figure>
 
|-
 
|-
 
|}
 
|}
   
Taking a closer look at the numbers 619 out of 39903 possible pairs have a CN score > 1. The filtered set contains 94 out of 38531 possible pairs. The maximal CN scores is reduced from 6.68 to 2.29. Contrary to the example of HRAS the reduction of the has a great impact on the precision. The original set contains 435 true positives and 184 false positives (precision of 0.703) while the reduced set contains 22 true positive predictions out of 94 predictions over all (precision of 0.234). This big decrease in precision could be explained with the fact that the overall prediction is not that good due to the fact that compared to the MSA of HRAS the used MSA for ASPA is ten times smaller. A visualization of the predicted contacts together with crystal structure is displayed in '''<xr id="aspa_freecontact_contactmap"> Figure </xr>'''. A overview of the top 10 predictions for ASPA in more detail is displayed in '''<xr id="top_10_aspa"> Table </xr>'''.
+
Taking a closer look at the numbers, 619 out of 39903 possible pairs have a CN score > 1. The filtered set contains 94 out of 38531 possible pairs. The maximal CN score drops from 6.68 to 2.29. Contrary to the example of HRAS, the reduction of the set has a great impact on the precision. The original set contains 435 true positives and 184 false positives (precision of 0.703), while the reduced set contains 22 true positives out of 94 predictions overall (precision of 0.234). This large decrease in precision could be explained by a weaker overall prediction, since the MSA used for ASPA is ten times smaller than the MSA of HRAS. A visualization of the predicted contacts together with the crystal structure is displayed in '''<xr id="aspa_freecontact_contactmap"></xr>'''. A more detailed overview of the top 10 predictions for ASPA is displayed in '''<xr id="top_10_aspa"></xr>'''.
   
 
<figtable id="top_10_aspa">
 
<figtable id="top_10_aspa">
Line 186: Line 185:
 
{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
 
{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
 
|-
 
|-
! colspan="6" style="background:#87CEFA;" | Predicted residue contacts for ASPA by freecontact
+
! colspan="6" style="background:#87CEFA;" | Predicted Residue Contacts for ASPA by freecontact
 
|-
 
|-
 
! colspan=2 style="background:#BFBFBF;"| Residue #1
 
! colspan=2 style="background:#BFBFBF;"| Residue #1
Line 222: Line 221:
 
{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
 
{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
 
|-
 
|-
! colspan="6" style="background:#87CEFA;" | Predicted residue contacts for ASPA by EVcouplings
+
! colspan="6" style="background:#87CEFA;" | Predicted Residue Contacts for ASPA by EVcouplings
 
|-
 
|-
 
! colspan=2 style="background:#BFBFBF;"| Residue #1
 
! colspan=2 style="background:#BFBFBF;"| Residue #1
Line 257: Line 256:
 
|-
 
|-
 
|}
 
|}
<center><small>'''<caption>''' Overview of the top 10 residue pairs predicted to be in contact by freecontact and EVcouplings that are apart more than 5 residues in the sequence (i & i+n, where n > 5). The residue pairs are ranked in descending order according to their CN/DI score calculated by freecontact/EVcouplings.</caption></small></center>
+
<center><small>'''<caption>''' Overview of the top 10 residue pairs predicted to be in contact by freecontact and EVcouplings that are apart more than 5 residues in the sequence (i & i+n, where n > 5).<br>The residue pairs are ranked in descending order according to their CN/DI score calculated by freecontact/EVcouplings.</caption></small></center>
 
</figtable>
 
</figtable>
   
  +
As in the search for evolutionary hot-spots in HRAS, the normalized CN scores for each residue in ASPA were calculated with L = 291. This resulted in the residues His 21, Gly 22, Asn 70, Ala 57 and Arg 63 (in descending order) being predicted as the top five evolutionary hot-spots. Comparing these positions to the SNPs extracted for Task 07 (found in the '''[[Canavan_Disease:_Task_07_-_Supplement|Supplement]]''') shows that His 21, Asn 70 and Ala 57 are residues with a non-synonymous SNP associated with Canavan Disease.
Searching for evolutionary hotspots, the L best high-scoring residue couplings with a sequence distance greater than five, where L is the length of the aligned sequence in the multiple sequence alignment were extracted. In the case of HRAS freecontact used a sequence part of length 160 to create the couplings. The CN scores of these 160 couplings are then summed up for each amino acid. If these sums are normalized (dividing the sums by 160) they can give hints how evolutionary important the amino acid is. Performing this procedure resulted in the observation that for HRAS Phe 82, Val 81, Tyr 141, Glu 143 and Gly 115 (in descending order) seem to be the evolutionary most important residues in terms of forming and stabilizing the protein.
 
   
A further possibility to predict contacts apart from freecontact is using EVcouplings. The results EVcouplings delivers are filtered the same way as the freecontact results. All scores for residue pairs that are less than five sequential positions apart are excluded. The remaining couplings are sorted after their DI score (a former version of the CN score). Comparing the top 50 DI scores from EVcouplings and CN scores from freecontact an overlap of 20 couplings can be observed. However within the 10 best pairs of each method, there is an overlap of five pairs, namely Gly 10 -> Lys 16, Ala 11 -> Asp 92, Thr 87 -> Gln 129, Phe 82 -> Tyr 141, Val 81-> Asn 116. Interestingly one of the residue pairs (Thr 87 -> Gln 129), that is predicted by both methods is a false positive. A more detailed view of the top 10 ranked residue pairs calculated by EVcouplings can been seen in '''<xr id="top_10_aspa"> Table </xr>'''.
+
Comparing the 50 top-scoring residue pairs calculated with EVcouplings to the 50 top-scoring ones calculated by freecontact yields an overlap of 10. The lower overlap compared to HRAS could again be attributed to the much smaller multiple sequence alignment in which ASPA is contained. A more detailed view of the top 10 ranked residue pairs calculated by EVcouplings can be seen in '''<xr id="top_10_aspa"></xr>'''.
   
=== Calculation of structural models ===
+
=== Calculation of Structural Models ===
   
  +
As in the EVfold calculations for HRAS, three different model sets were calculated for L = 40%, L = 65% and L = 100%. The corresponding cut-offs are 116, 189 and 291 couplings. For each of the three settings there seems to be a high number of false positive contact predictions (compare '''<xr id="aspa_evfold_40"></xr>''', '''<xr id="aspa_evfold_65"></xr>''' and '''<xr id="aspa_evfold_100"></xr>'''), and changing L does not change the overall precision.
In order to calculate structural models for HRAS the EVfold server was used. EVfold tries to create structural models for the given protein sequence based on residue couplings that are calculated by the process described in the section above. Experience show that in most cases the best structural model is created if the number of top couplings taken is are first 60 to 70% of the protein's sequence length. To demonstrate this fact three structural models were created with the aid of the 64 best (L = 40%), 104 best (L = 65%) and 160 best (L = 100%) couplings for HRAS. Taking a look at the contact maps with the visualized predicted couplings for each length it is visible very clearly that taking the L = 65% ('''<xr id="hras_evfold_40"> Figure </xr>''') best couplings is the best trade off between predicting not enough contacts to creating a meaning full model (L = 40%) ('''<xr id="hras_evfold_65"> Figure </xr>''') and predicting to much contacts resulting in to much false positives (L = 100%) ('''<xr id="hras_evfold_100"> Figure </xr>''').
 
   
 
{| border="0" cellpadding="5" cellspacing="0" align="center"
 
{| border="0" cellpadding="5" cellspacing="0" align="center"
Line 272: Line 271:
 
|align="center"|
 
|align="center"|
 
<figure id="aspa_evfold_40">
 
<figure id="aspa_evfold_40">
[[Image:ASPA contact map evfold 40.png|centre|thumb|300px|'''<caption>'''Visalization of the '''best 116 (L = 40 %) contacts predicted by EVfold''' as red dots. Displayed in grey are the residue pairs with a distance of less than 5Å in the HRAS crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 11 to 180).</caption>]]
+
[[Image:ASPA contact map evfold 40.png|centre|thumb|300px|'''<caption>'''Visualization of the '''best 116 (L = 40 %) contacts predicted by EVfold''' as red dots. Displayed in gray are the residue pairs with a distance of less than 5Å in the ASPA crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 11 to 180).</caption>]]
 
</figure>
 
</figure>
 
|align="center"|
 
|align="center"|
 
<figure id="aspa_evfold_65">
 
<figure id="aspa_evfold_65">
[[Image:ASPA contact map evfold 65.png|centre|thumb|300px|'''<caption>'''Visalization of the '''best 189 (L = 65%) contacts predicted by EVfold''' as red dots. Displayed in grey are the residue pairs with a distance of less than 5Å in the HRAS crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 11 to 180).</caption>]]
+
[[Image:ASPA contact map evfold 65.png|centre|thumb|300px|'''<caption>'''Visualization of the '''best 189 (L = 65%) contacts predicted by EVfold''' as red dots. Displayed in gray are the residue pairs with a distance of less than 5Å in the ASPA crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 11 to 180).</caption>]]
 
</figure>
 
</figure>
 
|align="center"|
 
|align="center"|
 
<figure id="aspa_evfold_100">
 
<figure id="aspa_evfold_100">
[[Image:ASPA contact map evfold 100.png|centre|thumb|300px|'''<caption>'''Visalization of the '''best 291 (L = 100%) contacts predicted by EVfold''' as red dots. Displayed in grey are the residue pairs with a distance of less than 5Å in the HRAS crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 11 to 180).</caption>]]
+
[[Image:ASPA contact map evfold 100.png|centre|thumb|300px|'''<caption>'''Visualization of the '''best 291 (L = 100%) contacts predicted by EVfold''' as red dots. Displayed in gray are the residue pairs with a distance of less than 5Å in the ASPA crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 11 to 180).</caption>]]
 
</figure>
 
</figure>
 
|-
 
|-
 
|}
 
|}
   
  +
The same effect can be observed when the RMSDs of the models to the crystal structure are computed ('''<xr id="aspa_rmsds"></xr>'''). No significant difference between the RMSDs for the different cut-offs of L is detectable. All RMSDs range between 17.6Å and 26.4Å. The final conclusion regarding the quality of the models is the same as for HRAS: the calculated models reflect small structural elements correctly, but the overall conformation of the protein is faulty and not usable.
Superimposing the five generated models for each parameter of L to 121P in pymol results in RSMDs ranging from 3.7Å to 9.1Å. However there is no clear correlation visible between RMSD and L. All RMSDs for all models are listed in '''<xr id="hras_rmsds"> Table </xr>'''. Examining the predicted models in pymol more closely (not displayed) it can be concluded that the contact prediction precision for HRAS by EVfold is good. However using this information does not necessarily yield accurate structural models.
 
   
 
<figtable id="aspa_rmsds">
 
<figtable id="aspa_rmsds">
 
{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
 
{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
 
|-
 
|-
! colspan="4" style="background:#87CEFA;" width="500"| RMSDs calculated by superimposing models from EVfold to 121P
+
! colspan="4" style="background:#87CEFA;" width="500"| RMSDs Calculated by Superimposing Models from EVfold to 2I3C
 
|-
 
|-
 
! style="background:#BFBFBF;"| Model
 
! style="background:#BFBFBF;"| Model
Line 308: Line 307:
 
|-
 
|-
 
|}
 
|}
<center><small>'''<caption>''' List of RMSDs calculated by super imposing the five structural models for each L to 121P using pymol. <br> No clear correlation between the chosen L and the RMSD is visible. The models are listed according to their predicted quality.</caption></small></center>
+
<center><small>'''<caption>''' List of RMSDs calculated by superimposing the five structural models for each measure of L to 2I3C using Pymol. <br> No clear correlation between the chosen L and the RMSD is visible. The models are listed according to their predicted quality.</caption></small></center>
 
</figtable>
 
</figtable>
   
Line 322: Line 321:
 
* Link to Task 09: [[Canavan_Disease:_Task_09_-_Structure-based_Mutation_Analysis|Structure-based Mutation Analysis]]
 
* Link to Task 09: [[Canavan_Disease:_Task_09_-_Structure-based_Mutation_Analysis|Structure-based Mutation Analysis]]
 
* Link to Task 10: [[Canavan_Disease:_Task_10_-_Normal_Mode_Analysis|Normal Mode Analysis]]
 
* Link to Task 10: [[Canavan_Disease:_Task_10_-_Normal_Mode_Analysis|Normal Mode Analysis]]
* Link to Task 11: [[Canavan_Disease:_Task_11_-_Molecular_Dynamics_Simulation|Molecular Dynamics Simulation]]
 

Latest revision as of 12:17, 5 September 2013

Protein structure prediction from evolutionary sequence variation is another approach to finding a protein structure, using evolutionary couplings. A structure can thus be predicted without using any 3D information.

Dataset

To obtain the HRAS multiple sequence alignment (MSA), the instructions were followed and the full MSA provided by Pfam (PF00071) was downloaded and used for further calculations and statistics. A search in Pfam for an MSA of aspartoacylase (ASPA) showed that the two criteria for gaining meaningful insights from freecontact, EVcouplings and EVfold are satisfied: the MSA contains more than 1000 sequences, and it covers large parts of the reference sequence. The MSA for the protein family containing ASPA (PF04952) includes 2822 sequences, and the aligned region of ASPA spans positions 10 to 301 of its 313 amino acids. Hence the Pfam MSA is regarded as viable input for the following calculations.

HRAS

Calculation and Analysis of Correlated Mutations

freecontact searches a multiple sequence alignment for conserved regions and correlated mutations in order to predict pairs of residues that are in contact in a protein. Residues that are close to each other in sequence are expected to be close in three-dimensional space as well, since their contacts often define the secondary structure elements and the small-scale conformation of the protein. Therefore freecontact ranks residue pairs that are close in sequence with a high CN score. More meaningful for the overall conformation of the protein, however, are stabilizing contacts between residues that are more distant in sequence. This is the reason for filtering the predicted contacts to exclude residue pairs that are less than five residues apart in sequence. The distribution of the CN scores (<xr id="hras_cn_distribution"></xr>) shows this effect as well.

<figure id="hras_cn_distribution">

Distribution of the CN scores for HRAS calculated by freecontact. The frequencies of the CN scores are displayed for all pairs of residues (orange) and for pairs of residues more than five positions apart in sequence (blue). Pairs with a CN score above 1 are considered high scoring. Only a tiny fraction of the pairs is high scoring, and reducing the set to pairs with a sequence distance of more than five has a huge impact on the number of high-scoring pairs.

</figure>

<figure id="hras_freecontact_contactmap">

Contact map of HRAS (121P) calculated from the crystal structure. Displayed in gray are the residue pairs with a distance of less than 5Å. Displayed as red dots are the contacts predicted by freecontact, reduced to residue pairs that have a high CN score (cn > 1) and are more than 5 residues apart in sequence (i & i+n, where n > 5). The dashed green rectangle marks the region in which freecontact calculated CN scores (residues 5 to 165).

</figure>

The first thing to note is that only a tiny fraction (514 out of 12561 possible pairs) has a CN score > 1, which is considered high scoring. If the set is reduced to residue pairs with a sequence distance greater than five, this subset of high-scoring pairs shrinks to 65 pairs. Secondly, the maximal CN score drops from 6.01 to 3.40. Reducing the set, however, has no great impact on the precision: the high-scoring contacts of the original set comprise 439 true positives and 75 false positives (precision of 0.854), while the reduced set contains 55 true positives among 65 predictions overall (precision of 0.846). The predicted contacts are visualized together with the actual contacts calculated from the crystal structure in <xr id="hras_freecontact_contactmap"></xr>. The top 10 predictions for HRAS are displayed in more detail in <xr id="top_20_hras"></xr>.

<figtable id="top_20_hras">

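{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
|-
! colspan="6" style="background:#87CEFA;"| Predicted Residue Contacts for HRAS by freecontact
|-
! colspan="2" style="background:#BFBFBF;"| Residue #1
! colspan="2" style="background:#BFBFBF;"| Residue #2
! rowspan="2" style="background:#BFBFBF;"| CN score
! rowspan="2" style="background:#BFBFBF;"| TP/FP
|-
! style="background:#BFBFBF;"| Position
! style="background:#BFBFBF;"| Amino acid
! style="background:#BFBFBF;"| Position
! style="background:#BFBFBF;"| Amino acid
|-
| 11 || A || 92 || D || 3.40454 || TP
|-
| 81 || V || 116 || N || 2.99937 || TP
|-
| 87 || T || 129 || Q || 2.68523 || FP
|-
| 82 || F || 141 || Y || 2.52755 || TP
|-
| 84 || I || 115 || G || 2.52502 || TP
|-
| 19 || L || 81 || V || 2.50464 || TP
|-
| 82 || F || 115 || G || 2.41709 || TP
|-
| 10 || G || 16 || K || 2.26384 || TP
|-
| 130 || A || 141 || Y || 2.24938 || TP
|-
| 123 || R || 143 || E || 2.21315 || TP
|}

{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
|-
! colspan="6" style="background:#87CEFA;"| Predicted Residue Contacts for HRAS by EVcouplings
|-
! colspan="2" style="background:#BFBFBF;"| Residue #1
! colspan="2" style="background:#BFBFBF;"| Residue #2
! rowspan="2" style="background:#BFBFBF;"| DI score
! rowspan="2" style="background:#BFBFBF;"| TP/FP
|-
! style="background:#BFBFBF;"| Position
! style="background:#BFBFBF;"| Amino acid
! style="background:#BFBFBF;"| Position
! style="background:#BFBFBF;"| Amino acid
|-
| 10 || G || 16 || K || 0.2058040 || TP
|-
| 13 || G || 21 || I || 0.1126380 || FP
|-
| 11 || A || 92 || D || 0.0946234 || TP
|-
| 87 || T || 129 || Q || 0.0766384 || FP
|-
| 114 || V || 155 || A || 0.0760636 || TP
|-
| 117 || K || 145 || S || 0.0758184 || TP
|-
| 82 || F || 141 || Y || 0.0757755 || TP
|-
| 116 || N || 146 || A || 0.0755790 || TP
|-
| 35 || T || 60 || G || 0.0707763 || TP
|-
| 81 || V || 116 || N || 0.0665576 || TP
|}

Overview of the top 10 residue pairs predicted to be in contact by freecontact and EVcouplings that are more than 5 residues apart in sequence (i & i+n, where n > 5).
The residue pairs are ranked in descending order of their CN/DI score calculated by freecontact/EVcouplings. Of the 10 residue pairs calculated by freecontact,
only one (Thr 87 -> Gln 129) has no actual contact when compared to the crystal structure. Within the top 10 residue pairs calculated by EVcouplings, two are false positives:
Gly 13 -> Ile 21 and, interestingly, Thr 87 -> Gln 129, the pair also mispredicted by freecontact.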

</figtable>
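The filtering and precision figures discussed for HRAS can be retraced with a short Python sketch. It is only an illustration: the function names and the tuple layout are assumptions and not part of freecontact, and the ten pairs are copied from the freecontact table above.

```python
# Illustrative sketch; names and data layout are assumptions, not freecontact's API.

def filter_pairs(pairs, min_separation=5, min_score=1.0):
    """Keep pairs more than `min_separation` apart in sequence with a score above `min_score`."""
    return [p for p in pairs if abs(p[0] - p[1]) > min_separation and p[2] > min_score]


def precision(true_positives, false_positives):
    """Fraction of predicted contacts that are contacts in the crystal structure."""
    return true_positives / (true_positives + false_positives)


# (position 1, position 2, CN score, TP/FP) from the freecontact table for HRAS
hras_top10 = [
    (11, 92, 3.40454, "TP"), (81, 116, 2.99937, "TP"), (87, 129, 2.68523, "FP"),
    (82, 141, 2.52755, "TP"), (84, 115, 2.52502, "TP"), (19, 81, 2.50464, "TP"),
    (82, 115, 2.41709, "TP"), (10, 16, 2.26384, "TP"), (130, 141, 2.24938, "TP"),
    (123, 143, 2.21315, "TP"),
]

kept = filter_pairs(hras_top10)
tp = sum(1 for p in kept if p[3] == "TP")
fp = len(kept) - tp
print(len(kept), precision(tp, fp))

# Counts reported in the text: all high-scoring pairs, then the filtered set
print(round(precision(439, 75), 3))
print(round(precision(55, 10), 3))
```

The last two lines reproduce the precision values quoted in the text from the reported true and false positive counts.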

To search for evolutionary hotspots, the L best-scoring residue couplings with a sequence distance greater than five were extracted, where L is the length of the aligned sequence in the multiple sequence alignment. In the case of HRAS, freecontact used a sequence segment of length 160 to create the couplings. The CN scores of these 160 couplings are then summed up for each amino acid. Normalizing these sums (dividing them by 160) gives hints on the evolutionary importance of each amino acid. This procedure suggests that for HRAS Phe 82, Val 81, Tyr 141, Glu 143 and Gly 115 (in descending order) are the evolutionarily most important residues in terms of forming and stabilizing the protein.
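The hotspot score described above can be sketched in a few lines of Python. This is a minimal illustration with an assumed function name; it uses only the ten freecontact couplings listed in the HRAS table rather than the full set of 160, so its ranking is not expected to match the one reported in the text.

```python
from collections import defaultdict


def hotspot_scores(couplings, length):
    """Sum the CN scores per residue position and normalize by the aligned length."""
    totals = defaultdict(float)
    for i, j, score in couplings:
        totals[i] += score
        totals[j] += score
    normalized = {pos: total / length for pos, total in totals.items()}
    # Highest normalized sum first
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)


# Only the top 10 HRAS couplings from the freecontact table (toy input, not all 160)
top10 = [
    (11, 92, 3.40454), (81, 116, 2.99937), (87, 129, 2.68523), (82, 141, 2.52755),
    (84, 115, 2.52502), (19, 81, 2.50464), (82, 115, 2.41709), (10, 16, 2.26384),
    (130, 141, 2.24938), (123, 143, 2.21315),
]

ranking = hotspot_scores(top10, length=160)
for position, score in ranking[:4]:
    print(position, round(score, 4))
```

Residues that take part in several strong couplings accumulate a larger sum, which is why positions such as 81 and 82 rise to the top even in this reduced input.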

Apart from freecontact, contacts can also be predicted with EVcouplings. The results of EVcouplings are filtered in the same way as the freecontact results: all scores for residue pairs that are less than five sequence positions apart are excluded. The remaining couplings are sorted by their DI score (a former version of the CN score). Comparing the top 50 DI scores from EVcouplings with the top 50 CN scores from freecontact reveals an overlap of 20 couplings. Within the 10 best pairs of each method, there is an overlap of five pairs, namely Gly 10 -> Lys 16, Ala 11 -> Asp 92, Thr 87 -> Gln 129, Phe 82 -> Tyr 141 and Val 81 -> Asn 116. Interestingly, one of the residue pairs predicted by both methods (Thr 87 -> Gln 129) is a false positive. A more detailed view of the top 10 ranked residue pairs calculated by EVcouplings can be seen in <xr id="top_20_hras"></xr>.
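The overlap between the two top-10 lists can be checked with a small set intersection. The sketch below is illustrative only; the helper name is an assumption, and the pairs are copied from the two HRAS tables.

```python
def as_pair_set(pairs):
    """Represent each residue pair as an unordered set so (i, j) equals (j, i)."""
    return {frozenset(pair) for pair in pairs}


# Top 10 HRAS pairs from the freecontact and EVcouplings tables
freecontact_top10 = [
    (11, 92), (81, 116), (87, 129), (82, 141), (84, 115),
    (19, 81), (82, 115), (10, 16), (130, 141), (123, 143),
]
evcouplings_top10 = [
    (10, 16), (13, 21), (11, 92), (87, 129), (114, 155),
    (117, 145), (82, 141), (116, 146), (35, 60), (81, 116),
]

shared = as_pair_set(freecontact_top10) & as_pair_set(evcouplings_top10)
print(sorted(tuple(sorted(pair)) for pair in shared))
```

The same comparison applied to the top 50 pairs of each method would give the larger overlap figures quoted in the text.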

Calculation of Structural Models

The EVfold server was used to calculate structural models for HRAS. EVfold tries to create structural models for the given protein sequence based on residue couplings that are calculated by the process described in the section above. Experience shows that in most cases the best structural model is created if the number of top couplings used is 60 to 70% of the protein's sequence length. To demonstrate this, three sets of structural models were created from the 64 best (L = 40%), 104 best (L = 65%) and 160 best (L = 100%) couplings for HRAS. The contact maps with the predicted couplings for each setting show clearly that taking the L = 65% best couplings (<xr id="hras_evfold_65"></xr>) is the best trade-off between predicting too few contacts to create a meaningful model (L = 40%, <xr id="hras_evfold_40"></xr>) and predicting too many contacts, resulting in too many false positives (L = 100%, <xr id="hras_evfold_100"></xr>).

<figure id="hras_evfold_40">

Visualization of the best 64 (L = 40%) contacts predicted by EVfold as red dots. Displayed in gray are the residue pairs with a distance of less than 5Å in the HRAS crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 5 to 165).

</figure>

<figure id="hras_evfold_65">

Visualization of the best 104 (L = 65%) contacts predicted by EVfold as red dots. Displayed in gray are the residue pairs with a distance of less than 5Å in the HRAS crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 5 to 165).

</figure>

<figure id="hras_evfold_100">

Visualization of the best 160 (L = 100%) contacts predicted by EVfold as red dots. Displayed in gray are the residue pairs with a distance of less than 5Å in the HRAS crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 5 to 165).

</figure>

Superimposing the five generated models for each value of L onto 121P in Pymol results in RMSDs ranging from 10.6Å to 16.7Å. However, no clear correlation between RMSD and L is visible. The RMSDs of all models are listed in <xr id="hras_rmsds"></xr>.

<figtable id="hras_rmsds">

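{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
|-
! colspan="4" style="background:#87CEFA;" width="500"| RMSDs Calculated by Superimposing Models from EVfold to 121P
|-
! style="background:#BFBFBF;"| Model
! style="background:#BFBFBF;"| L = 40%
! style="background:#BFBFBF;"| L = 65%
! style="background:#BFBFBF;"| L = 100%
|-
| #1 || 10.6Å || 15.3Å || 16.7Å
|-
| #2 || 12.2Å || 12.3Å || 14.4Å
|-
| #3 || 12.2Å || 13.7Å || 13.3Å
|-
| #4 || 11.1Å || 12.0Å || 12.8Å
|-
| #5 || 13.2Å || 13.7Å || 10.7Å
|}

List of RMSDs calculated by superimposing the five structural models for each measure of L to 121P using Pymol.
No clear correlation between the chosen L and the RMSD is visible. The models are listed according to their predicted quality.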

</figtable>
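To make the spread of the HRAS values easier to inspect, the table can be summarized per setting of L. The following Python sketch is only a convenience for reading the table; the dictionary layout is an assumption, and the numbers are copied from the table above.

```python
from statistics import mean

# RMSDs in Å of models #1 to #5 against 121P, per setting of L (from the table above)
rmsd_by_l = {
    "40%": [10.6, 12.2, 12.2, 11.1, 13.2],
    "65%": [15.3, 12.3, 13.7, 12.0, 13.7],
    "100%": [16.7, 14.4, 13.3, 12.8, 10.7],
}

# (minimum, mean, maximum) for each setting
summary = {label: (min(v), mean(v), max(v)) for label, v in rmsd_by_l.items()}
for label, (lowest, average, highest) in summary.items():
    print(f"L = {label}: min {lowest:.1f}, mean {average:.2f}, max {highest:.1f}")
```

The ranges of the three settings overlap strongly, and the best and worst single models do not follow the predicted quality ranking.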

Examining the predicted models in Pymol more closely (not displayed), it can be concluded that the contact prediction precision of EVfold for HRAS is good. However, using this information does not necessarily yield accurate structural models.

ASPA

Calculation and Analysis of Correlated Mutations

For ASPA the same approach as for HRAS was followed. The CN scores were calculated by freecontact, and the residue couplings with a sequence distance of less than 5 were excluded. As ASPA has approximately twice as many residues as HRAS, the number of possible contacts increases and the number of calculated CN scores rises as well. However, comparing the distribution of CN scores for ASPA (see <xr id="aspa_cn_distribution"></xr>) with that for HRAS (<xr id="hras_cn_distribution"></xr>) yields some interesting findings. The overall interval of scores, including the pairs with a sequence distance below five, stays roughly the same, ranging from approximately -1 to +7; excluding those pairs, however, the maximal CN score drops to 2.3, compared to 3.4 for HRAS. Additionally, the number of high-scoring pairs stays at the same order of magnitude despite the much larger number of possible pairings.

<figure id="aspa_cn_distribution">

Distribution of the CN scores for ASPA calculated by freecontact. The frequencies of the CN scores are displayed for all pairs of residues (orange) and for pairs of residues more than five positions apart in sequence (blue). Pairs with a CN score above 1 are considered high scoring. Only a tiny fraction of the pairs is high scoring, and reducing the set to pairs with a sequence distance of more than five has a huge impact on the number of high-scoring pairs.

</figure>

<figure id="aspa_freecontact_contactmap">

Contact map of ASPA (2I3C) calculated from the crystal structure. Displayed in gray are the residue pairs with a distance of less than 5Å. Displayed as red dots are the contacts predicted by freecontact, reduced to residue pairs that have a high CN score (cn > 1) and are more than 5 residues apart in sequence (i & i+n, where n > 5). The dashed green rectangle marks the region in which freecontact calculated CN scores (residues 10 to 301).

</figure>

A closer look at the numbers shows that 619 out of 39903 possible pairs have a CN score > 1. The filtered set contains 94 high-scoring pairs out of 38531 possible pairs. The maximal CN score drops from 6.68 to 2.29. Contrary to HRAS, reducing the set has a great impact on the precision: the original set contains 435 true positives and 184 false positives (precision of 0.703), while the reduced set contains 22 true positives among 94 predictions overall (precision of 0.234). This large decrease in precision could be explained by a weaker overall prediction, since the MSA used for ASPA is ten times smaller than the MSA of HRAS. The predicted contacts are visualized together with the crystal structure contacts in <xr id="aspa_freecontact_contactmap"></xr>. The top 10 predictions for ASPA are displayed in more detail in <xr id="top_10_aspa"></xr>.

<figtable id="top_10_aspa">

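{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
|-
! colspan="6" style="background:#87CEFA;"| Predicted Residue Contacts for ASPA by freecontact
|-
! colspan="2" style="background:#BFBFBF;"| Residue #1
! colspan="2" style="background:#BFBFBF;"| Residue #2
! rowspan="2" style="background:#BFBFBF;"| CN score
! rowspan="2" style="background:#BFBFBF;"| TP/FP
|-
! style="background:#BFBFBF;"| Position
! style="background:#BFBFBF;"| Amino acid
! style="background:#BFBFBF;"| Position
! style="background:#BFBFBF;"| Amino acid
|-
| 14 || V || 203 || L || 2.28549 || FP
|-
| 26 || T || 117 || N || 2.09665 || FP
|-
| 21 || H || 27 || G || 2.03355 || FP
|-
| 247 || L || 254 || P || 2.00055 || FP
|-
| 251 || D || 277 || T || 1.88627 || FP
|-
| 22 || G || 68 || D || 1.87523 || FP
|-
| 15 || A || 112 || I || 1.79110 || TP
|-
| 257 || P || 264 || T || 1.77483 || TP
|-
| 258 || G || 264 || T || 1.72534 || TP
|-
| 16 || I || 31 || V || 1.69869 || TP
|}

{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
|-
! colspan="6" style="background:#87CEFA;"| Predicted Residue Contacts for ASPA by EVcouplings
|-
! colspan="2" style="background:#BFBFBF;"| Residue #1
! colspan="2" style="background:#BFBFBF;"| Residue #2
! rowspan="2" style="background:#BFBFBF;"| DI score
! rowspan="2" style="background:#BFBFBF;"| TP/FP
|-
! style="background:#BFBFBF;"| Position
! style="background:#BFBFBF;"| Amino acid
! style="background:#BFBFBF;"| Position
! style="background:#BFBFBF;"| Amino acid
|-
| 26 || T || 117 || N || 0.208325 || FP
|-
| 15 || A || 112 || I || 0.174681 || TP
|-
| 63 || R || 71 || R || 0.152020 || FP
|-
| 143 || I || 162 || L || 0.130819 || FP
|-
| 44 || T || 50 || P || 0.124957 || FP
|-
| 21 || H || 114 || D || 0.123887 || FP
|-
| 24 || E || 116 || H || 0.116584 || TP
|-
| 11 || I || 44 || T || 0.108161 || FP
|-
| 22 || G || 71 || R || 0.105899 || FP
|-
| 23 || N || 63 || R || 0.100085 || TP
|}

Overview of the top 10 residue pairs predicted to be in contact by freecontact and EVcouplings that are more than 5 residues apart in sequence (i & i+n, where n > 5).
The residue pairs are ranked in descending order of their CN/DI score calculated by freecontact/EVcouplings.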

</figtable>

As in the search for evolutionary hot-spots in HRAS, the normalized CN scores for each residue in ASPA were calculated with L = 291. This resulted in the residues His 21, Gly 22, Asn 70, Ala 57 and Arg 63 (in descending order) being predicted as the top five evolutionary hot-spots. Comparing these positions to the SNPs extracted for Task 07 (available in the Supplement) shows that His 21, Asn 70 and Ala 57 each carry a non-synonymous SNP associated with Canavan Disease.

Comparing the 50 top-scoring residue pairs calculated with EVcouplings to the 50 top-scoring ones calculated by freecontact yields an overlap of 10. The lower overlap compared to HRAS could again be attributed to the much smaller multiple sequence alignment in which ASPA is contained. A more detailed view of the top 10 ranked residue pairs calculated by EVcouplings can be seen in <xr id="top_10_aspa"></xr>.

Calculation of Structural Models

As in the EVfold calculations for HRAS, three different model sets were calculated for L = 40%, L = 65% and L = 100%. The corresponding cut-offs are 116, 189 and 291 couplings. For each of the three settings there seems to be a high number of false positive contact predictions (compare <xr id="aspa_evfold_40"></xr>, <xr id="aspa_evfold_65"></xr> and <xr id="aspa_evfold_100"></xr>), and changing L does not change the overall precision.
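The coupling cut-offs quoted for both proteins follow from the aligned lengths (160 for HRAS, 291 for ASPA). The sketch below assumes that the percentage of L is rounded to the nearest integer, which reproduces the reported numbers but is not stated in the text; the function name is hypothetical.

```python
def coupling_cutoffs(aligned_length, percentages=(40, 65, 100)):
    """Number of top couplings for each percentage of the aligned length.

    Rounding to the nearest integer is an assumption; it matches the reported cut-offs.
    """
    return {pct: round(aligned_length * pct / 100) for pct in percentages}


print(coupling_cutoffs(160))  # HRAS
print(coupling_cutoffs(291))  # ASPA
```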

<figure id="aspa_evfold_40">

Visualization of the best 116 (L = 40%) contacts predicted by EVfold as red dots. Displayed in gray are the residue pairs with a distance of less than 5Å in the ASPA crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 11 to 180).

</figure>

<figure id="aspa_evfold_65">

Visualization of the best 189 (L = 65%) contacts predicted by EVfold as red dots. Displayed in gray are the residue pairs with a distance of less than 5Å in the ASPA crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 11 to 180).

</figure>

<figure id="aspa_evfold_100">

Visualization of the best 291 (L = 100%) contacts predicted by EVfold as red dots. Displayed in gray are the residue pairs with a distance of less than 5Å in the ASPA crystal structure. The dashed rectangle in green visualizes the borders in which EVfold calculated DI scores (residue 11 to 180).

</figure>

The same effect can be observed when the RMSDs of the models to the crystal structure are computed (<xr id="aspa_rmsds"></xr>). No significant difference between the RMSDs for the different cut-offs of L is detectable. All RMSDs range between 17.6Å and 26.4Å. The final conclusion regarding the quality of the models is the same as for HRAS: the calculated models reflect small structural elements correctly, but the overall conformation of the protein is faulty and not usable.

<figtable id="aspa_rmsds">

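{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center"
|-
! colspan="4" style="background:#87CEFA;" width="500"| RMSDs Calculated by Superimposing Models from EVfold to 2I3C
|-
! style="background:#BFBFBF;"| Model
! style="background:#BFBFBF;"| L = 40%
! style="background:#BFBFBF;"| L = 65%
! style="background:#BFBFBF;"| L = 100%
|-
| #1 || 23.4Å || 21.8Å || 18.8Å
|-
| #2 || 20.0Å || 18.8Å || 18.3Å
|-
| #3 || 26.4Å || 22.4Å || 20.0Å
|-
| #4 || 19.5Å || 20.3Å || 17.6Å
|-
| #5 || 19.0Å || 18.0Å || 19.0Å
|}

List of RMSDs calculated by superimposing the five structural models for each measure of L to 2I3C using Pymol.
No clear correlation between the chosen L and the RMSD is visible. The models are listed according to their predicted quality.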

</figtable>

Tasks