Canavan Disease: Task 02 - Alignments
Sequence Alignments for Canavan Disease are done, since it is known that it is caused by Aspartoacylase (ASPA) Deficiency, or "Aminoacylase 2 (ACY2) Deficiency" (Task 01). So in the further step sequence alignments using ASPA are done to search for related sequences.
Contents
LabJournal
Theoretical background talk
Ariane gave the introductory talk, which can be found here.
Pairwise sequence alignments
Pairwise sequence alignments help to find related sequences. In this task ACY2 also know as ASPA, which if containing a mutation, is the major cause of Canavan Disease, is used as input sequence for diverse pairwise sequence alignment methods. The sequence searches were performed with Blast, HHblits, and multiple PsiBlast runs using varying parameters. To get a good and clear overview, diminuatives and a colorscheme was used to describe and compare all methods (<xr id="Dimin">Table</xr>):
<figtable id="Dimin">
Diminuatives for Pairwise Sequence Alignment Searches | |
---|---|
Diminutive | Encryption |
Blast | Blast against big_80 |
HHblits | HHblits standard parameters against Uniprot |
PsiBlast D2 | PsiBlast using the default e-Value cutoff (0.002) with 2 iterations against big_80 |
PsiBlast e2 | PsiBlast using the e-Value cutoff 10E-10 with 2 iterations against big_80 |
PsiBlast D10 | PsiBlast using the default e-Value cutoff (0.002) with 10 iterations against big_80 |
PsiBlast e10 | PsiBlast using the e-Value cutoff 10E-10 with 10 iterations against big_80 |
Big | PsiBlast using the e-Value cutoff 10E-10 with 1 iteration and the checkfile from PsiBlast e10 against big |
</figtable>
As it can be seen in the <xr id="Dimin">previous Table</xr>, the simple Blast search was done against big_80 as well as the different PsiBlast searches, except for the Big, which is a single iteration against big (not big_80) but using the checkfile of the 10th iteration of PsiBlast e10. This decision was made because big_80 is already redundancy reduced and may overpredict using many iterations as well as to be able to compare the performance of PsiBlast and HHblits. Since it would be necessary for HHblits to create a special database, it is simply done against an already existing database of Uniprot.
Comparison of Search Tools
The next step was evaluating the different search tools. Here the focus was on the e-Value distribution, the sequence identity distribution and the overlap of the found sequences. Since HHblits consists of clusters, the calculations are always corresponding to a complete cluster, not just the cluster representative of Uniprot. This was done to get a better comparison between the methods, especially Blast, HHblits and Big, whereas Big should reflect a representative for the PsiBlast searches.
As it can be seen in <xr id="eVal">Figure</xr> the distribution of the e-Values found using the different methods is quite the same:
<figure id="eVal">
</figure>
As mentioned above the best way to compare the methods in the <xr id="eVal">Figure</xr> above is looking at the graphs for Blast, HHblits and Big. With a higher number of iterations in the PsiBlast search the e-Value composition gets larger. That is due to the fact that per iteration the specificity and therefore the really significant eValues are getting lost.
The next <xr id="seqID">Figure</xr> shows the distribution of the percentage sequence identities:
<figure id="seqID">
</figure>
In the <xr id="seqID">Figure</xr> above can be seen, the higher the number of iterations in the PsiBlast search the lower the percentage sequence identity. Again, per iteration the specificity and therefore the really significant hits are getting lost.
The following <xr id="Overlap">Table</xr> represents the overlapping sequence hits from each method to the other. The diagonal represents the number of sequences in the method itself. Again it is important to mention that HHblits consists of clusters. In the table the cluster representatives as well as all cluster members (in brackets) are represented.
<figtable id="Overlap">
Overlapping Sequence-Hits | ||||||||
---|---|---|---|---|---|---|---|---|
Blast | HHblits | PsiBlast D2 | PsiBlast e2 | PsiBlast D10 | PsiBlast e10 | Big | ||
Blast | 194 | 24 (145) | 186 | 186 | 183 | 185 | 185 | |
HHblits | 24 (145) | 276 (3128) | 94 (719) | 84 (666) | 145 (1043) | 143 (1062) | 195 (2847) | |
PsiBlast D2 | 186 | 94 (719) | 918 | 830 | 894 | 899 | 899 | |
PsiBlast e2 | 186 | 84 (666) | 830 | 838 | 818 | 824 | 824 | |
PsiBlast D10 | 183 | 145 (1043) | 894 | 818 | 3505 | 2513 | 2335 | |
PsiBlast e10 | 185 | 143 (1062) | 899 | 824 | 2513 | 2725 | 2432 | |
Big | 185 | 195 (2847) | 899 | 824 | 2335 | 2432 | 6170 |
of sequences found in this search. In HHblits there is a differentiation between cluster representative and complete cluster.
</figtable>
For a better overview the following <xr id="venn">Venn Diagram</xr> shows the number of sequences, which overlap between Blast, HHblits and Big:
<figure id="venn">
</figure>
Validation using GeneOntology
In the validation using GeneOntology all cluster members of HHblits were also taken into account. The resulting intersection of all methods, including the different PsiBlast searches is 142 proteins. For those proteins the GO-Annotations were analyzed. The next <xr id="GOAnn">Table</xr> shows the top 10 results with a counter (how often they appeared):
<figtable id="GOAnn">
GO annotations of all common proteins | |
---|---|
Count | Annotation |
141 | hydrolase activity, acting on ester bonds |
141 | metabolic process |
113 | hydrolase activity |
112 | metal ion binding |
75 | zinc ion binding |
69 | hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amides |
58 | aspartoacylase activity |
20 | arginine metabolic process |
20 | arginine catabolic process to glutamate |
20 | arginine catabolic process to succinate |
(Blast, HHblits, Psiblast D2/e2/D10/e10 and Big). Highlighted: aspartoacylase, since it is the major
enzyme causing Canavan Disease
</figtable>
As it can be seen in the <xr id="GOAnn">Table</xr> above, the top results show the typical associations to Canavan Diseases. The highlighted aspartoacylase is the major cause of Canavan Disease. Since aspartoacylase contains a zinc atom, it is no wonder that zinc / metal ion binding can be found, too. Furthermore aspartoacylase is a hydrolase. To explain the last three: The important L-Aspartate in the hydrolase is a prior substrate of the arginine metabolism (Task 01).
Multiple Sequence Alignments
To analyze methods for multiple sequence alignments, sets of different sequence identity were built. These sets were build randomly out of the Big data, because this brought the opportunity to take pdb-structures into account. The randomly generated sets contained 10 sequences per sequence identity above 60% and below 30% and 20 sequences over the whole range. The whole range set is a mixture between the above60 and below30 set, since any randomly generated set was around the same average sequence identity as the below30 set. This is due to the very large number of sequence identities below 25% as in <xr id="seqID">Figure</xr>.
The following <xr id="MSA">Table</xr> should give a nice overview of the resulting multiple sequence alignments.
<figtable id="MSA">
Comparison of Multiple Sequence Alignment Tools | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
MAFFT | MUSCLE | T-COFFEE | EXPRESSO | |||||||||
Set | >60 | <30 | whole | >60 | <30 | whole | >60 | <30 | whole | >60 | <30 | whole |
Gaps | 9 | 532 | 444 | 9 | 325 | 325 | 9 | 922 | 957 | 18 | 964 | 910 |
Blocks | 3 | 48 | 26 | 2 | 38 | 23 | 3 | 48 | 58 | 4 | 48 | 66 |
Avg Block Size | 104.3 | 4.6 | 11.7 | 156.5 | 8.2 | 13.5 | 104.3 | 4.0 | 5.3 | 87.0 | 3.2 | 4.6 |
well conserved Columns | 252 | 14 | 14 | 251 | 1 | 5 | 252 | 12 | 13 | 248 | 11 | 15 |
all Columns | 313 | 223 | 304 | 313 | 312 | 311 | 313 | 193 | 310 | 312 | 152 | 302 |
</figtable>
The <xr id="MSA">Table</xr> describes the number of gaps comparing the consensus sequence of the multiple sequence alignments. Second it refers to the number of blocks between those gaps and their average length. At last the number of well conserved columns and columns in general were compared.
To compare the multiple sequence alignment tools with respect to secondary structure and functional residues, the following <xr id="secstruc">Table</xr> should give an overview:
<figtable id="secstruc">
Secondary Structure and Functional Residues in MSA tools | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MAFFT | MUSCLE | T-COFFEE | EXPRESSO | |||||||||||
Set | >60 | <30 | whole | >60 | <30 | whole | >60 | <30 | whole | >60 | <30 | whole | ||
Gaps | gap in SecStruc | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 4 | 0 | 0 | 5 | 0 | |
SecStruc in gap | 0 | 96 | 2 | 0 | 48 | 0 | 0 | 116 | 1 | 0 | 141 | 2 | ||
SecStruc | breaks | 0 | 9 | 4 | 0 | 9 | 11 | 0 | 23 | 20 | 0 | 24 | 19 | |
in helix | 0 | 6 | 1 | 0 | 6 | 4 | 0 | 8 | 12 | 0 | 13 | 10 | ||
FuncRes | missing | - | - | - | - | - | - | - | B | - | - | - | - | |
conserved | AMBR | M | MBR | AMBR | - | R | AMBR | M | MR | AMBR | M | MBR | ||
less conserved | - | B | ABR | - | BM | AMBR | - | - | AMBR | - | B | ABR |
gap in secstruc - shows a gap in the secondary structure compared to consensus sequence
secstruc in gap - shows gaps in the consensus sequence compared to secondary structure sequence
A - active site M - metal binding B - binding site R - binding region
</figtable>
The first part of the <xr id="secstruc">Table</xr> focuses on the gaps in the data. It differentiates between gaps in the consensus sequence, but not in the secondary structure (secstruc in gap) and a consensus sequence which shows a gap in the secondary structure (gap in secstruc). Since the comparison was made the following way: Take the aligned sequence and map the secondary structure onto it, there are gaps in the secondary structure. Along all methods they performed very good on high sequence similarity.
The second part describes the breaks in the secondary structure elements. Breaks between loop regions were not taken into account. The reference sequence for the secondary structure elements in the below30 set has no active site.
At last the functinal residues as listed in Uniprot were compared to the consensus sequence of the multiple alignments. "A" stands for active center, "M" for metal binding (in this case Zinc), "B" binding site and "R" for binding region. Overall all functional important sites has been found quite well. The active site in the whole range set is found, but in lesser conservation. This may be due to the bad alignments of the below30 set.
MAFFT
MAFFT is very fast with good alignments: As MUSCLE it can find blocks and long consensus sequences. This brings an advantage comparing the columns with active sites, or metal binding properties (functionally important residues). These are almost all very good conserved throughout MAFFT. As MUSCLE it performs quite well in low sequence similarity scans, but is better concerning breaks in the secondary structure. Summed up, this method seems to be the most promising one.
MUSCLE
Muscle performs quite well and finds long blocks. It also brings a long consensus sequence. This also works with only low sequence similarity, but not as good concerning the conserved columns. MUSCLE works not as good as MAFFT concerning the secundary structure breaks. This method is very promising.
TCoffee
It has most gaps and looking at the blocks and columns found, it is comparable to MAFFT. TCoffee seems to cannot handle low to medium sequence similarity scans. It always shows the most number of breaks in the secondary structure elements especially in helices. TCoffee is the only method that even misses a functional residue of a binding site.
Expresso
Instead of 3D-Coffee Expresso was used for an alignment with secundary structure information. It makes TCoffee slightly better concerning the low sequence similarity searches. Especially in context of conserved columns. In contrast to T-Coffee it finds at least all functional residues, but the results are nearly the same.
Overall
The overall comparison shows that the sequence identity in the sets definitely influence the alignment. Since there was no big difference between the below 30% and the whole range scan, the whole range set was changed to a mixture between the above60 and the below30 set.
In conclusion MAFFT seems to be the best tool for a good alignment: It shows a good conversation of the aligned blocks, the secondary structure elements and functionally important residues are well conserved.
Example Alignment
All alignments can be found in the directory. To give an impression, <xr id="msaMafft">here</xr> is one multiple sequence alignment, showing the set above 60 percent sequence identity on MAFFT using UCSF Chimera:
<figure id="msaMafft">
</figure>
All other alignments can be found in the Supplement to keep track over the really interesting data.
Supplement
Tasks
- Link to Task 01: Canavan Disease
- Link to Task 02: Alignments
- Link to Task 03: Sequence-based Predictions
- Link to Task 04: Structural Alignments
- Link to Task 05: Homology Modelling
- Link to Task 06: Protein Structure Prediction from Evolutionary Sequence Variation
- Link to Task 07: Researching SNPs
- Link to Task 08: Sequence-based Mutation Analysis
- Link to Task 09: Structure-based Mutation Analysis
- Link to Task 10: Normal Mode Analysis