Difference between revisions of "Canavan Disease: Task 02 - Alignments"

Revision as of 16:39, 29 August 2013

Sequence Alignments for Canavan Disease are done, since it is known that it is caused by Aspartoacylase (ASPA) Deficiency, or "Aminoacylase 2 (ACY2) Deficiency" (Task 01). So in the further step sequence alignments using ASPA are done to search for related sequences.

LabJournal

Theoretical background talk

Ariane gave the introductory talk, which can be found here.

Pairwise sequence alignments

Pairwise sequence alignments help to find related sequences. In this task ACY2 also know as ASPA, which if containing a mutation, is the major cause of Canavan Disease, is used as input sequence for diverse pairwise sequence alignment methods. The sequence searches were performed with Blast, HHblits, and multiple PsiBlast runs using varying parameters. To get a good and clear overview, diminuatives and a colorscheme was used to describe and compare all methods (<xr id="Dimin">Table</xr>):

Diminuatives for Pairwise Sequence Alignment Searches
Diminutive	Encryption
Blast	Blast against big_80
HHblits	HHblits standard parameters against Uniprot
PsiBlast D2	PsiBlast using the default e-Value cutoff (0.002) with 2 iterations against big_80
PsiBlast e2	PsiBlast using the e-Value cutoff 10E-10 with 2 iterations against big_80
PsiBlast D10	PsiBlast using the default e-Value cutoff (0.002) with 10 iterations against big_80
PsiBlast e10	PsiBlast using the e-Value cutoff 10E-10 with 10 iterations against big_80
Big	PsiBlast using the e-Value cutoff 10E-10 with 1 iteration and the checkfile from PsiBlast e10 against big

Diminuatives, parameters and color scheme of the different pairwise sequence alignment search tools

</figtable>

As it can be seen in the <xr id="Dimin">previous Table</xr>, the simple Blast search was done against big_80 as well as the different PsiBlast searches, except for the Big, which is a single iteration against big (not big_80) but using the checkfile of the 10th iteration of PsiBlast e10. This decision was made because big_80 is already redundancy reduced and may overpredict using many iterations as well as to be able to compare the performance of PsiBlast and HHblits. Since it would be necessary for HHblits to create a special database, it is simply done against an already existing database of Uniprot.

Comparison of Search Tools

The next step was evaluating the different search tools. Here the focus was on the e-Value distribution, the sequence identity distribution and the overlap of the found sequences. Since HHblits consists of clusters, the calculations are always corresponding to a complete cluster, not just the cluster representative of Uniprot. This was done to get a better comparison between the methods, especially Blast, HHblits and Big, whereas Big should reflect a representative for the PsiBlast searches.
As it can be seen in <xr id="eVal">Figure</xr> the distribution of the e-Values found using the different methods is quite the same:

This figure shows the e-Value distribution of the different pairwise sequence alignment tools. The outliers were removed to keep track over the data.

</figure>

As mentioned above the best way to compare the methods is looking at the graphs for Blast, HHblits and Big (compare <xr id="eVal">Figure</xr>). With a higher number of iterations in the PsiBlast search the e-Value composition gets larger. That is due to the fact that per iteration the specificity decreases as insignificant hits are incorporated into the result.

This figure shows the percent sequence identity distribution of the different pairwise sequence alignment tools

</figure>

Comparing the distribution of sequence identities for the used methods (see <xr id="seqID">Figure</xr>) it get visible, that more iterations used for the PsiBlast search result in a lower percentage sequence identity. Again, per iteration the specificity decreases with a increase of not significant hits.

<xr id="Overlap">Table</xr> represents the overlapping sequence hits between methods. The diagonal represents the number of sequences in the method itself. Again it is important to mention that HHblits consists of clusters. In the table the cluster representatives as well as all cluster members (in brackets) are represented.

Overlapping Sequence-Hits
	Blast	HHblits	PsiBlast D2	PsiBlast e2	PsiBlast D10	PsiBlast e10	Big
Blast	194	24 (145)	186	186	183	185	185
HHblits	24 (145)	276 (3128)	94 (719)	84 (666)	145 (1043)	143 (1062)	195 (2847)
PsiBlast D2	186	94 (719)	918	830	894	899	899
PsiBlast e2	186	84 (666)	830	838	818	824	824
PsiBlast D10	183	145 (1043)	894	818	3505	2513	2335
PsiBlast e10	185	143 (1062)	899	824	2513	2725	2432
Big	185	195 (2847)	899	824	2335	2432	6170

Number of overlapping Sequences using the different methods. The diagonal (e.g. intersection Blast Blast) corresponds to the number
of sequences found in this search. In HHblits there is a differentiation between cluster representative and complete cluster.

</figtable>

For a better overview <xr id="venn">Venn Diagram</xr> shows the number of sequences, which overlap between Blast, HHblits and Big:

This figure shows the number of overlapping sequences between the different pairwise sequence alignment methods Blast, HHblits and Big.

</figure>

Validation using GeneOntology

The validation using GeneOntology takes all cluster members of HHblits into account. The resulting intersection of all methods, including the different PsiBlast searches is 142 proteins. For those proteins the GO-Annotations were analyzed. <xr id="GOAnn">Table</xr> shows the top 10 results with a counter (how often they appeared):

GO annotations of all common proteins
Count	Annotation
141	hydrolase activity, acting on ester bonds
141	metabolic process
113	hydrolase activity
112	metal ion binding
75	zinc ion binding
69	hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amides
58	aspartoacylase activity
20	arginine metabolic process
20	arginine catabolic process to glutamate
20	arginine catabolic process to succinate

Top 10 Go-Annotations among all common proteins of the pairwise sequence alignments
(Blast, HHblits, Psiblast D2/e2/D10/e10 and Big). Highlighted: aspartoacylase, since it is the major
enzyme causing Canavan Disease

</figtable>

As it can be seen in <xr id="GOAnn">Table</xr>, the top results show the typical associations to Canavan Diseases. The highlighted aspartoacylase activity is the enzymatic reaction of ACY2. As aspartoacylase contains a bound zinc atom, the high occurrence of metal & zinc ion binding is to be expected. Furthermore aspartoacylase is subtype of a hydrolase. The last three high ranking GO-Terms are explainable with the fact that L-Aspartate, a chemical compound in the reaction ASPA catalyses, is a prior substrate of the arginine metabolism (Task 01).

Multiple Sequence Alignments

To analyze methods for multiple sequence alignments, sets of different sequence identity were built. These sets were build randomly out of the Big data, as this made is possible to take pdb-structures into account. The randomly generated sets contained of sequence identity above 60% and below 30% each contain 10 sequences and the set over the whole range contains 20. The whole range set is a mixture between the above60 and below30 set, since any randomly generated set was around the same average sequence identity as the below30 set. This is due to the very large number of sequence identities below 25% as shown in <xr id="seqID">Figure</xr>.
Concerning the secondary structure, in the above60 set and in the whole range set the same protein was used to calculate the differences (2O4H). It contains information about the active site, binding regions, metal binding sites and simple binding sites. For the below30 set finding a pdb-structure with information comparable to aspartoacylase was quite difficult. The protein Succinylglutamate desuccinylase/aspartoacylase from a bacteria species seemed to be the best for this purpose. Unfortunately for this protein only metal binding sites and simple binding sites are annotated in Uniprot. Therefore there is no information concerning active site or binding region and the conservation of those sites within the multiple sequence alignment could not be detected.

<xr id="MSA">Table</xr> should give a nice overview of the resulting multiple sequence alignments.

Comparison of Multiple Sequence Alignment Tools
	MAFFT			MUSCLE			T-COFFEE			EXPRESSO
Set	>60	<30	whole	>60	<30	whole	>60	<30	whole	>60	<30	whole
Gaps	9	532	444	9	325	325	9	922	957	18	964	910
Blocks	3	48	26	2	38	23	3	48	58	4	48	66
Avg Block Size	104.3	4.6	11.7	156.5	8.2	13.5	104.3	4.0	5.3	87.0	3.2	4.6
well conserved Columns	252	14	14	251	1	5	252	12	13	248	11	15
all Columns	313	223	304	313	312	311	313	193	310	312	152	302

Comparison of MSA tools with respect to gaps, blocks and conserved columns

</figtable>

<xr id="MSA">Table</xr> describes the number of gaps comparing the consensus sequence of the multiple sequence alignments. Additionally it refers to the number of blocks between those gaps and their average length. At last the number of well conserved columns and columns in general were compared.
To compare the multiple sequence alignment tools with respect to secondary structure and functional residues, <xr id="secstruc">Table</xr> should give an overview: <figtable id="secstruc">

Secondary Structure and Functional Residues in MSA tools
		MAFFT			MUSCLE			T-COFFEE			EXPRESSO
	Set	>60	<30	whole	>60	<30	whole	>60	<30	whole	>60	<30	whole
Gaps	gap in SecStruc	0	0	0	0	2	0	0	4	0	0	5	0
Gaps	SecStruc in gap	0	96	2	0	48	0	0	116	1	0	141	2
SecStruc	breaks	0	9	4	0	9	11	0	23	20	0	24	19
SecStruc	in helix	0	6	1	0	6	4	0	8	12	0	13	10
FuncRes	missing	-	-	-	-	-	-	-	B	-	-	-	-
	well conserved	AMBR	M	MBR	AMBR	-	R	AMBR	M	MR	AMBR	M	MBR
	less conserved	-	B	ABR	-	BM	AMBR	-	-	AMBR	-	B	ABR

Comparison of secondary structure (SecStruc) and functional residues (FuncRes) in MSA tools.
gap in secstruc - shows a gap in the secondary structure compared to consensus sequence
secstruc in gap - shows gaps in the consensus sequence compared to secondary structure sequence
A - active site M - metal binding B - binding site R - binding region

</figtable>

The first part of <xr id="secstruc">Table</xr> focuses on the gaps in the data. It differentiates between gaps in the consensus sequence, but not in the secondary structure (secstruc in gap) and a consensus sequence which shows a gap in the secondary structure (gap in secstruc). The calculation was made the following way: The secondary structure of one aligned sequence (2O4H for above60 and whole range, 3LWU for below30) was compared to the consensus sequence. Thus there might be a gap in the secondary structure.
The second part describes the breaks in the secondary structure elements. Breaks within loop regions were not taken into account.
At last the functinal residues as listed in Uniprot were compared to the consensus sequence of the multiple alignments. The reference sequence for the secondary structure elements in the below30 set has no entry for active site or binding region in Uniprot, as explained above. "A" stands for active center, "M" for metal binding (in this case the zinc ion), "B" binding site and "R" for binding region.

MAFFT

MAFFT is very fast with good alignments: Like MUSCLE it finds blocks and long consensus sequences. This is an advantage for comparing the columns with active sites, or metal binding properties (functionally important residues). These are almost all very good conserved throughout MAFFT. Like MUSCLE it performs quite well in low sequence similarity scans, but is better concerning breaks in the secondary structure. Summed up, this method seems to be the most promising one.

MUSCLE

Muscle performs quite well and finds long blocks. It also brings a long consensus sequence. This also works with only low sequence similarity, but not as good concerning the conserved columns. MUSCLE works not as good as MAFFT concerning the secundary structure breaks. This method is very promising.

TCoffee

The alignment produced by TCoffee contains the most gaps and looking at the blocks and columns found, it is comparable to MAFFT. TCoffee seems not to be able to handle low to medium sequence similarity scans. It always shows the most number of breaks in the secondary structure elements especially in helices. TCoffee is the only method that even misses a functional residue of a binding site.

Expresso

Instead of 3D-Coffee Expresso was used for an alignment with secundary structure information. It improves TCoffee slightly concerning the low sequence similarity searches, especially in context of conserved columns. In contrast to T-Coffee it finds at all functional residues, but the results are nearly the same.

Overall

As there was no big difference between the below 30% and the original whole range scan, the whole range set was changed to a mixture between the above60 and the below30 set. The overall comparison shows that the sequence identity in the sets definitely influence the alignment: The higher the sequence identity within the alignments the better the algorithms.
All functional important sites as refered either to 2O4H or 3LWU have been detected well. The active site in the whole range set is found, but in lesser conservation. This may be due to the bad alignments of the below30 set.
In conclusion MAFFT seems to be the best tool for a good alignment: It shows a good conversation of the aligned blocks, the secondary structure elements and functionally important residues are well conserved.

Example Alignment

To give an impression, in <xr id="msaMafft">Figure</xr> one multiple sequence alignment is displayed, showing the set above 60 percent sequence identity on MAFFT using UCSF Chimera:

This figure shows the multiple sequence alignment of the above 60% sequence identity set using MAFFT

</figure>

For reasons of clarity all other alignments can be found in the Supplement.

Supplement

Tasks

Link to Task 01: Canavan Disease
Link to Task 02: Alignments
Link to Task 03: Sequence-based Predictions
Link to Task 04: Structural Alignments
Link to Task 05: Homology Modelling
Link to Task 06: Protein Structure Prediction from Evolutionary Sequence Variation
Link to Task 07: Researching SNPs
Link to Task 08: Sequence-based Mutation Analysis
Link to Task 09: Structure-based Mutation Analysis
Link to Task 10: Normal Mode Analysis

@@ Line 385: / Line 385: @@
 </figtable>
-The first part of '''<xr id="secstruc">Table</xr>''' focuses on the gaps in the data. It differentiates between gaps in the consensus sequence, but not in the secondary structure (secstruc in gap) and a consensus sequence which shows a gap in the secondary structure (gap in secstruc). The comparison was made the following way: Take the aligned sequences and map the secondary structure onto the consensus sequence. Thus there might be gaps in the secondary structure. All methods performed very good on high sequence similarity.<br>
+The first part of '''<xr id="secstruc">Table</xr>''' focuses on the gaps in the data. It differentiates between gaps in the consensus sequence, but not in the secondary structure (secstruc in gap) and a consensus sequence which shows a gap in the secondary structure (gap in secstruc). The calculation was made the following way: The secondary structure of one aligned sequence (2O4H for above60 and whole range, 3LWU for below30) was compared to the consensus sequence. Thus there might be a ''gap in the secondary structure''.<br>
-The second part describes the breaks in the secondary structure elements. Breaks within loop regions were not taken into account. The reference sequence for the secondary structure elements in the below30 set has no entry for active site or binding region in Uniprot.<br>
+The second part describes the breaks in the secondary structure elements. Breaks within loop regions were not taken into account.<br>
-At last the functinal residues as listed in Uniprot were compared to the consensus sequence of the multiple alignments. "A" stands for active center, "M" for metal binding (in this case the zinc ion), "B" binding site and "R" for binding region. Overall all functional important sites have been detected well. The active site in the whole range set is found, but in lesser conservation. This may be due to the ''bad'' alignments of the below30 set.<br>
+At last the functinal residues as listed in Uniprot were compared to the consensus sequence of the multiple alignments. The reference sequence for the secondary structure elements in the below30 set has no entry for active site or binding region in Uniprot, as explained above. "A" stands for active center, "M" for metal binding (in this case the zinc ion), "B" binding site and "R" for binding region.
 ====MAFFT====
@@ Line 402: / Line 402: @@
 ====Overall====
-The overall comparison shows that the sequence identity in the sets definitely influence the alignment. As there was no big difference between the below 30% and the original whole range scan, the whole range set was changed to a mixture between the above60 and the below30 set.<br>
+As there was no big difference between the below 30% and the original whole range scan, the whole range set was changed to a mixture between the above60 and the below30 set.
+The overall comparison shows that the sequence identity in the sets definitely influence the alignment: The higher the sequence identity within the alignments the better the algorithms.<br>
+All functional important sites as refered either to 2O4H or 3LWU have been detected well. The active site in the whole range set is found, but in lesser conservation. This may be due to the ''bad'' alignments of the below30 set.<br>
 In conclusion MAFFT seems to be the best tool for a good alignment: It shows a good conversation of the aligned blocks, the secondary structure elements and functionally important residues are well conserved.<br>

Difference between revisions of "Canavan Disease: Task 02 - Alignments"

Revision as of 16:39, 29 August 2013

Contents

LabJournal

Theoretical background talk

Pairwise sequence alignments

Comparison of Search Tools

Validation using GeneOntology

Multiple Sequence Alignments

MAFFT

MUSCLE

TCoffee

Expresso

Overall

Example Alignment

Supplement

Tasks

Navigation menu

Views

Personal tools

Bioinformatik navigation

MediaWiki navigation

Search

Tools