Difference between revisions of "Canavan Disease: Task 02 - Alignments"
(→Multiple Sequence Alignments) |
|||
(260 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
+ | The first step to gain more insight on the genetic level of aspartoacylase (ASPA) would be conducting a sequence search for similar sequences. This is done by performing '''sequence alignments''' using ASPA as template. |
||
− | Most prediction methods are based on comparisons to related proteins. Therefore, the search for related sequences and the alignment to other proteins is a prerequisite for most of the analyses in this practical. Hence we will investigate the recall and alignment quality of different alignment methods. |
||
− | |||
− | == Theoretical background talks == |
||
− | [[User:Boehma|Ariane]] gave the introductory talk, which can be found [https://dl.dropboxusercontent.com/u/9441182/SequenceAlignmentPresentation.pdf here]. |
||
== Pairwise sequence alignments== |
== Pairwise sequence alignments== |
||
+ | Pairwise sequence alignments help to find related sequences. In this Task ASPA that is the major cause of Canavan Disease, if it contains specific mutations, is used as input sequence for diverse pairwise sequence alignment methods. The sequence searches were performed with Blast, HHblits, and multiple PsiBlast runs using varying parameters. |
||
− | Illias - Homer |
||
+ | To get a good and clear overview, diminutives and a color scheme were used to describe and compare all methods (see '''<xr id="Dimin"></xr>'''): |
||
+ | |||
+ | <figtable id="Dimin"> |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |- |
||
+ | ! colspan="2" style="background:#87cefa;" | Diminutives for Pairwise Sequence Alignment Searches |
||
+ | |- |
||
+ | ! style="background:#bfbfbf;" align="center" | Diminutive |
||
+ | ! style="background:#bfbfbf;" align="center" | Encryption |
||
+ | |- |
||
+ | |style="background:#ff7f00;" align="center" | '''Blast''' |
||
+ | | Blast against big_80 |
||
+ | |- |
||
+ | |style="background:#00cd00;" align="center" | '''HHblits''' |
||
+ | | HHblits standard parameters against Uniprot |
||
+ | |- |
||
+ | |style="background:#00bfff;" align="center" | '''PsiBlast D2''' |
||
+ | | PsiBlast using the '''''d'''efault'' e-Value cutoff (0.002) with '''''2''' iterations'' against big_80 |
||
+ | |- |
||
+ | |style="background:#1c86ee;" align="center" | '''PsiBlast e2''' |
||
+ | | PsiBlast using the '''''e'''-Value'' cutoff 10E-10 with '''''2''' iterations'' against big_80 |
||
+ | |- |
||
+ | |style="background:#009acd;" align="center" | '''PsiBlast D10''' |
||
+ | | PsiBlast using the '''''d'''efault'' e-Value cutoff (0.002) with '''''10''' iterations'' against big_80 |
||
+ | |- |
||
+ | |style="background:#104e8b;" align="center" | '''PsiBlast e10''' |
||
+ | | PsiBlast using the '''''e'''-Value'' cutoff 10E-10 with '''''10''' iterations'' against big_80 |
||
+ | |- |
||
+ | |style="background:#9a32cd;" align="center" | '''Big''' |
||
+ | | PsiBlast using the ''e-Value'' cutoff 10E-10 with 1 iteration and the checkfile from PsiBlast e10 against '''''big''''' |
||
+ | |- |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Diminutives, parameters and color scheme of the different pairwise sequence alignment search tools. </caption></small></center> |
||
+ | </figtable> |
||
+ | |||
+ | As it can be seen in '''<xr id="Dimin"></xr>''', the simple Blast search was done against big_80 as well as the different PsiBlast searches. '''Big''' is a single iteration against big (not big_80) but using the checkfile of the 10th iteration of PsiBlast e10. This decision was made because big_80 is already redundancy reduced and may overpredict using many iterations as well as to be able to compare the performance of PsiBlast and HHblits. Since it would be necessary for HHblits to create a special database, it is run with an already existing database of Uniprot. |
||
+ | |||
+ | === Comparison of Search Tools === |
||
+ | |||
+ | The next step was evaluating the different search tools. Here it was focused on the e-Value distribution, the sequence identity distribution and the overlap of the found sequences. Since HHblits consists of clusters, the calculations are always corresponding to a complete cluster, not just the cluster representative of Uniprot. This approach was chosen to gain comparable results for Blast, HHblits and Big, whereas Big should reflect a representative for the PsiBlast searches.<br> |
||
+ | As it can be seen in '''<xr id="eVal">Figure</xr>''' the distribution of the e-Values found using the different methods is comparable: |
||
+ | |||
+ | <figure id="eVal"> |
||
+ | [[Image:CanavanEValueDistribution.png|centre|thumb|694px|'''<caption>'''e-Value distribution of different pairwise sequence alignment tools. Outliers were removed to keep track over the data.</caption>]] |
||
+ | </figure> |
||
+ | |||
+ | As mentioned above the best way to compare the methods is looking at the graphs for Blast, HHblits and Big (compare '''<xr id="eVal">Figure</xr>'''). With a higher number of iterations in the PsiBlast search the e-Value composition gets larger. That is due to the fact that per iteration the specificity decreases as insignificant hits are incorporated into the result. |
||
+ | |||
+ | <figure id="seqID"> |
||
+ | [[Image:CanavanDistributionSeqId.png|centre|thumb|694px|'''<caption>'''Percent sequence identity distribution of different pairwise sequence alignment tools.</caption>]] |
||
+ | </figure> |
||
+ | |||
+ | Comparing the distribution of sequence identities for the used methods (see '''<xr id="seqID">Figure</xr>''') it gets visible, that more iterations used for the PsiBlast search result in a lower percentage sequence identity. Again, per iteration the specificity decreases with an increase of not significant hits. |
||
+ | |||
+ | '''<xr id="Overlap"></xr>''' represents the overlapping sequence hits between methods. The diagonal shows the number of sequences in the method itself. Again it is important to mention that HHblits consists of clusters. In the table the cluster representatives as well as all cluster members (in brackets) are displayed. |
||
+ | |||
+ | <figtable id="Overlap"> |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |- |
||
+ | ! colspan="9" style="background:#87cefa;" | Overlapping Sequence-Hits |
||
+ | |- |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| Blast |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| HHblits |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| PsiBlast D2 |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| PsiBlast e2 |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| PsiBlast D10 |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| PsiBlast e10 |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| Big |
||
+ | |- |
||
+ | ! style="background:#ff7f00;" align="center" | Blast |
||
+ | | align="center" |194 |
||
+ | | align="center" |24 (145) |
||
+ | | align="center" |186 |
||
+ | | align="center" |186 |
||
+ | | align="center" |183 |
||
+ | | align="center" |185 |
||
+ | | align="center" |185 |
||
+ | |- |
||
+ | ! style="background:#00cd00;" align="center" | HHblits |
||
+ | | align="center" |24 (145) |
||
+ | | align="center" |276 (3128) |
||
+ | | align="center" |94 (719) |
||
+ | | align="center" |84 (666) |
||
+ | | align="center" |145 (1043) |
||
+ | | align="center" |143 (1062) |
||
+ | | align="center" |195 (2847) |
||
+ | |||
+ | |- |
||
+ | ! style="background:#00bfff;" align="center" | PsiBlast D2 |
||
+ | | align="center" |186 |
||
+ | | align="center" |94 (719) |
||
+ | | align="center" |918 |
||
+ | | align="center" |830 |
||
+ | | align="center" |894 |
||
+ | | align="center" |899 |
||
+ | | align="center" |899 |
||
+ | |- |
||
+ | ! style="background:#1c86ee;" align="center" | PsiBlast e2 |
||
+ | | align="center" |186 |
||
+ | | align="center" |84 (666) |
||
+ | | align="center" |830 |
||
+ | | align="center" |838 |
||
+ | | align="center" |818 |
||
+ | | align="center" |824 |
||
+ | | align="center" |824 |
||
+ | |- |
||
+ | ! style="background:#009acd;" align="center" | PsiBlast D10 |
||
+ | | align="center" |183 |
||
+ | | align="center" |145 (1043) |
||
+ | | align="center" |894 |
||
+ | | align="center" |818 |
||
+ | | align="center" |3505 |
||
+ | | align="center" |2513 |
||
+ | | align="center" |2335 |
||
+ | |- |
||
+ | ! style="background:#104e8b;" align="center" | PsiBlast e10 |
||
+ | | align="center" |185 |
||
+ | | align="center" |143 (1062) |
||
+ | | align="center" |899 |
||
+ | | align="center" |824 |
||
+ | | align="center" |2513 |
||
+ | | align="center" |2725 |
||
+ | | align="center" |2432 |
||
+ | |- |
||
+ | ! style="background:#9a32cd;" align="center" | Big |
||
+ | | align="center" |185 |
||
+ | | align="center" |195 (2847) |
||
+ | | align="center" |899 |
||
+ | | align="center" |824 |
||
+ | | align="center" |2335 |
||
+ | | align="center" |2432 |
||
+ | | align="center" |6170 |
||
+ | |- |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Number of overlapping sequences using different methods. The diagonal (e.g. intersection Blast Blast) corresponds to the number<br>of sequences found in this search. In HHblits there is a differentiation between cluster representative and complete cluster (in brackets).</caption></small></center> |
||
+ | </figtable> |
||
+ | |||
+ | For a better overview '''<xr id="venn"></xr>''' shows the number of sequences, which overlap between Blast, HHblits and Big: |
||
+ | |||
+ | <figure id="venn"> |
||
+ | [[Image:Canavan_Overlapping_sequence_hits_venn-diagram.png|centre|thumb|350px|'''<caption>'''Number of overlapping sequences between different pairwise sequence alignment methods (Blast, HHblits and Big).</caption>]] |
||
+ | </figure> |
||
+ | |||
+ | === Validation using GeneOntology === |
||
+ | |||
+ | The validation using GeneOntology takes all cluster members of HHblits into account. The resulting intersection of all methods, including the different PsiBlast searches is 142 proteins. For those proteins the GO-Annotations were analyzed. '''<xr id="GOAnn">Table</xr>''' shows the top 10 results with a counter (how often they appeared): |
||
+ | |||
+ | <figtable id="GOAnn"> |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |- |
||
+ | ! colspan="2" style="background:#87cefa;" | GO-Annotations of all Common Proteins |
||
+ | |- |
||
+ | ! style="background:#bfbfbf;" align="center" | Count |
||
+ | ! style="background:#bfbfbf;" align="center" | Annotation |
||
+ | |- |
||
+ | | 141 |
||
+ | | hydrolase activity, acting on ester bonds |
||
+ | |- |
||
+ | | 141 |
||
+ | | metabolic process |
||
+ | |- |
||
+ | | 113 |
||
+ | | hydrolase activity |
||
+ | |- |
||
+ | | 112 |
||
+ | | metal ion binding |
||
+ | |- |
||
+ | | 75 |
||
+ | | zinc ion binding |
||
+ | |- |
||
+ | | 69 |
||
+ | | hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amides |
||
+ | |- |
||
+ | | style="background:#87cefa;" | '''58''' |
||
+ | | style="background:#87cefa;" | '''aspartoacylase activity''' |
||
+ | |- |
||
+ | | 20 |
||
+ | | arginine metabolic process |
||
+ | |- |
||
+ | | 20 |
||
+ | | arginine catabolic process to glutamate |
||
+ | |- |
||
+ | | 20 |
||
+ | | arginine catabolic process to succinate |
||
+ | |- |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Top 10 GO-Annotations among all common proteins of the pairwise sequence alignments<br>(Blast, HHblits, PsiBlast D2/e2/D10/e10 and Big). Highlighted: aspartoacylase, since it is the major<br>enzyme causing Canavan Disease. </caption></small></center> |
||
+ | </figtable> |
||
+ | As displayed in '''<xr id="GOAnn"></xr>''', the top results show the typical associations to Canavan Diseases. The highlighted aspartoacylase activity is the enzymatic reaction of ASPA. As aspartoacylase contains a bound zinc ion, the high occurrence of metal & zinc ion binding is to be expected. Furthermore aspartoacylase is a hydrolase subtype. The last three high ranking GO-Terms are explainable with the fact that L-Aspartate, a chemical compound in the reaction ASPA catalyses, is a prior substrate of the arginine metabolism ('''[[Canavan_Disease#Disease Mechanism|Task 01]]'''). |
||
− | === BLAST === |
||
− | maenin aeide |
||
− | === PSI-BLAST === |
||
− | thea |
||
− | ==== 2 Iterations ==== |
||
− | paelaeiadeo achilleos |
||
− | ===== default E-value cutoff (0.002) ===== |
||
− | oulomenaen |
||
− | ===== E-value cutoff 10E-10 ===== |
||
− | hae myri archaiois |
||
− | ==== 10 Iterations ==== |
||
− | algae etaeke |
||
− | ===== default E-value cutoff (0.002) ===== |
||
− | pollas diphtimous |
||
− | ===== E-value cutoff 10E-10 ===== |
||
− | psychas |
||
+ | == Multiple Sequence Alignments == |
||
− | === HHblits === |
||
+ | To analyze methods for multiple sequence alignments, sets of different sequence identity were built. These sets were built randomly based on the '''Big''' data, as it gave the possibility to take PDB-structures into account. The randomly generated sets of sequence identity above 60% and below 30% each contain 10 sequences. The ''whole range set'' contains 20 sequences and '''is a mixture''' between the above60 and below30 set, since any randomly generated set was around the same average sequence identity as the below30 set. This is due to the very large number of sequence identities below 25% as shown in '''<xr id="seqID"></xr>'''.<br> |
||
− | aidi proiapsen heroon<br> |
||
+ | Concerning the secondary structure, in the above60 set and in the whole range set the same protein was used to calculate the differences ([http://www.uniprot.org/uniprot/P45381 2O4H]). It contains information about the active site, binding regions, metal binding sites and simple binding sites. For the below30 set finding a PDB-structure with information comparable to aspartoacylase was quite difficult. The protein [http://www.uniprot.org/uniprot/Q088B8 Succinylglutamate desuccinylase/aspartoacylase] from a bacteria species seemed to be the best for this purpose. Unfortunately only metal binding sites and simple binding sites are annotated in Uniprot for this protein. Therefore no information concerning active site or binding region is available and the conservation of those sites within the multiple sequence alignment could not be detected. |
||
− | * Data can be found in <code>/mnt/project/pracstrucfunc13/data/</code> or <code>/mnt/project/rost_db/data/</code> |
||
+ | '''<xr id="MSA">Table</xr>''' should give a nice overview of the resulting multiple sequence alignments. |
||
− | === Notes === |
||
− | '''Note:''' Save intermediate files, e.g. a3m and hhm for the alignments and HMMs generated by HHblits, or checkpoint files for PSI-Blast. We will reuse this later. |
||
+ | <figtable id="MSA"> |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |- |
||
+ | ! colspan="13" style="background:#87cefa;" | Comparison of Multiple Sequence Alignment Tools |
||
+ | |- |
||
+ | ! style="background:#BFBFBF;" align="center" | |
||
+ | ! colspan="3" style="background:#BFBFBF;" | MAFFT |
||
+ | ! colspan="3" style="background:#BFBFBF;" | MUSCLE |
||
+ | ! colspan="3" style="background:#BFBFBF;" | T-COFFEE |
||
+ | ! colspan="3" style="background:#BFBFBF;" | EXPRESSO |
||
+ | |- |
||
+ | ! style="background:#E5E5E5;" align="center" | Set |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | |- |
||
+ | ! align="center" | Gaps |
||
+ | | align="center" | 9 |
||
+ | | align="center" | 532 |
||
+ | | align="center" | 444 |
||
+ | | align="center" | 9 |
||
+ | | align="center" | 325 |
||
+ | | align="center" | 325 |
||
+ | | align="center" | 9 |
||
+ | | align="center" | 922 |
||
+ | | align="center" | 957 |
||
+ | | align="center" | 18 |
||
+ | | align="center" | 964 |
||
+ | | align="center" | 910 |
||
+ | |- |
||
+ | ! align="center" | Blocks |
||
+ | | align="center" | 3 |
||
+ | | align="center" | 48 |
||
+ | | align="center" | 26 |
||
+ | | align="center" | 2 |
||
+ | | align="center" | 38 |
||
+ | | align="center" | 23 |
||
+ | | align="center" | 3 |
||
+ | | align="center" | 48 |
||
+ | | align="center" | 58 |
||
+ | | align="center" | 4 |
||
+ | | align="center" | 48 |
||
+ | | align="center" | 66 |
||
+ | |- |
||
+ | ! align="center" | Avg Block Size |
||
+ | | align="center" | 104.3 |
||
+ | | align="center" | 4.6 |
||
+ | | align="center" | 11.7 |
||
+ | | align="center" | 156.5 |
||
+ | | align="center" | 8.2 |
||
+ | | align="center" | 13.5 |
||
+ | | align="center" | 104.3 |
||
+ | | align="center" | 4.0 |
||
+ | | align="center" | 5.3 |
||
+ | | align="center" | 87.0 |
||
+ | | align="center" | 3.2 |
||
+ | | align="center" | 4.6 |
||
+ | |- |
||
+ | ! align="center" | Well Conserved Columns |
||
+ | | align="center" | 252 |
||
+ | | align="center" | 14 |
||
+ | | align="center" | 14 |
||
+ | | align="center" | 251 |
||
+ | | align="center" | 1 |
||
+ | | align="center" | 5 |
||
+ | | align="center" | 252 |
||
+ | | align="center" | 12 |
||
+ | | align="center" | 13 |
||
+ | | align="center" | 248 |
||
+ | | align="center" | 11 |
||
+ | | align="center" | 15 |
||
+ | |- |
||
+ | ! align="center" | All Columns |
||
+ | | align="center" | 313 |
||
+ | | align="center" | 223 |
||
+ | | align="center" | 304 |
||
+ | | align="center" | 313 |
||
+ | | align="center" | 312 |
||
+ | | align="center" | 311 |
||
+ | | align="center" | 313 |
||
+ | | align="center" | 193 |
||
+ | | align="center" | 310 |
||
+ | | align="center" | 312 |
||
+ | | align="center" | 152 |
||
+ | | align="center" | 302 |
||
+ | |- |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Comparison of MSA tools using sets of different sequence identity with respect to gaps, blocks and conserved columns. </caption></small></center> |
||
+ | </figtable> |
||
+ | '''<xr id="MSA">Table</xr>''' describes the number of gaps comparing the consensus sequence of the multiple sequence alignments. Additionally it refers to the number of blocks between those gaps and their average length. At last the number of well conserved columns and columns in general were compared.<br> |
||
− | For '''evaluating the differences of the search methods''': |
||
+ | To compare the multiple sequence alignment tools with respect to secondary structure and functional residues, '''<xr id="secstruc">Table</xr>''' should give an overview: |
||
− | * compare the result lists (e.g. how much overlap, distribution of %identity and E-values) |
||
+ | <figtable id="secstruc"> |
||
− | * validate the result lists -- e.g. (you do no need to do all!) |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
− | ** using CATH, SCOP or COPS (<code>/mnt/project/pracstrucfunc12/data/COPS/</code>) to check whether found pdb entries fall into the same fold class |
||
+ | |- |
||
− | ** using GO to check whether sequences have common GO classifications |
||
+ | ! colspan="15" style="background:#87cefa;" | Secondary Structure and Functional Residues in MSA Tools |
||
− | ** any other ideas how you could validate that the hits are really related? |
||
+ | |- |
||
+ | ! colspan="2" style="background:#BFBFBF;" align="center" | |
||
+ | ! colspan="3" style="background:#BFBFBF;" | MAFFT |
||
+ | ! colspan="3" style="background:#BFBFBF;" | MUSCLE |
||
+ | ! colspan="3" style="background:#BFBFBF;" | T-COFFEE |
||
+ | ! colspan="3" style="background:#BFBFBF;" | EXPRESSO |
||
+ | |- |
||
+ | ! style="background:#E5E5E5;" align="center" | Set |
||
+ | ! style="background:#E5E5E5;" align="center" | |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | |- |
||
+ | ! rowspan="2" align="center" | Gaps |
||
+ | | Gap in SecStruc |
||
+ | || 0 || 0 || 0 |
||
+ | || 0 || 2 || 0 |
||
+ | || 0 || 4 || 0 |
||
+ | || 0 || 5 || 0 |
||
+ | |- |
||
+ | | SecStruc in Gap |
||
+ | || 0 || 96 || 2 |
||
+ | || 0 || 48 || 0 |
||
+ | || 0 || 116 || 1 |
||
+ | || 0 || 141 || 2 |
||
+ | |- |
||
+ | ! rowspan="2" align="center" | SecStruc |
||
+ | | Breaks |
||
+ | || 0 || 9 || 4 |
||
+ | || 0 || 9 || 11 |
||
+ | || 0 || 23 || 20 |
||
+ | || 0 || 24 || 19 |
||
+ | |- |
||
+ | | in Helix |
||
+ | || 0 || 6 || 1 |
||
+ | || 0 || 6 || 4 |
||
+ | || 0 || 8 || 12 |
||
+ | || 0 || 13 || 10 |
||
+ | |- |
||
+ | ! rowspan="3" align="center" | FuncRes |
||
+ | | Missing |
||
+ | || - || - || - |
||
+ | || - || - || - |
||
+ | || - || B || - |
||
+ | || - || - || - |
||
+ | |- |
||
+ | | Well Conserved |
||
+ | || AMBR || M || MBR |
||
+ | || AMBR || - || R |
||
+ | || AMBR || M || MR |
||
+ | || AMBR || M || MBR |
||
+ | |- |
||
+ | | Less Conserved |
||
+ | || - || B || ABR |
||
+ | || - || BM || AMBR |
||
+ | || - || - || AMBR |
||
+ | || - || B || ABR |
||
+ | |- |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Comparison of secondary structure (SecStruc) and functional residues (FuncRes) in MSA tools.<br>gap in secstruc - shows a gap in the secondary structure compared to consensus sequence <br> secstruc in gap - shows gaps in the consensus sequence compared to secondary structure sequence.<br> A=active site M=metal binding B=binding site R=binding region</caption></small></center> |
||
+ | </figtable> |
||
+ | The first part of '''<xr id="secstruc">Table</xr>''' focuses on the gaps in the data. It differentiates between gaps in the consensus sequence, but not in the secondary structure (secstruc in gap) and a consensus sequence which shows a gap in the secondary structure (gap in secstruc). The calculation was made the following way: The secondary structure of one aligned sequence (2O4H for above60 and whole range, 3LWU for below30) was compared to the consensus sequence. Thus there might be a ''gap in the secondary structure''.<br> |
||
+ | The second part describes the breaks in the secondary structure elements. Breaks within loop regions were not taken into account.<br> |
||
+ | At last the functional residues as listed in Uniprot were compared to the consensus sequence of the multiple alignments. The reference sequence for the secondary structure elements in the below30 set has no entry for active site or binding region in Uniprot, as explained above. "A" stands for active site, "M" for metal binding (in this case the zinc ion), "B" binding site and "R" for binding region. |
||
+ | ====MAFFT==== |
||
− | '''Note:''' When we do comparisons, the data needs to be comparable. Therefore: |
||
+ | '''[http://www.ebi.ac.uk/Tools/msa/mafft/ MAFFT]''' is a very fast method producing good alignments: Like MUSCLE it finds blocks and long consensus sequences. This is an advantage for comparing the columns with active sites, or metal binding properties (functional important residues). These are almost all very good conserved throughout MAFFT. Like MUSCLE it performs quite well in low sequence similarity scans, but is better concerning breaks in the secondary structure. Summed up, this method seems to be the most promising one. |
||
− | * Check the outcome of your simple blast search. If there are many significant hits, increase the number of reported hits (-v, -b or max_target_seqs depending on blast version and output format) until no more relevant hits are found. Use that parameter also for the PSI-Blast searches and use a similar setting for HHblits (Think about why we ask you to do this.) |
||
− | *And of course: If your PSI-Blast and HHblits searches hit the limit (and your blast search didn't), also increase the number of reported hits! |
||
− | '''CAVE''': If your data set gets large, the PSI-Blast searches will take a while. |
||
+ | ====MUSCLE==== |
||
− | '''Note:''' There are a few catches that arise from the differences in how the tools operate: |
||
+ | '''[http://www.ebi.ac.uk/Tools/msa/muscle/ MUSCLE]''' performs quite well and finds long blocks. It also brings a long consensus sequence. This also works with only low sequence similarity, but not as good concerning the conserved columns. MUSCLE works not as good as MAFFT concerning the secondary structure breaks. This method is very promising. |
||
− | * HHblits searches against the clustered Uniprot version. In the output the cluster representatives are listed together with the cluster members. |
||
− | ** If you compare the representatives against a PSI-Blast result for big_80, you will get more hits for big_80. |
||
− | ** If you compare the representatives plus the cluster members against big_80, you will get fewer hits for big_80. |
||
− | * Come up with a way to generate comparable results. (There is also a complete database "big" which you can use for searching -- reusing the profiles from your big_80 search. -- Think about why we don't ask you to start out with a search against big.) |
||
− | * big_80 is generated with [http://weizhong-lab.ucsd.edu/cd-hit/ CD-HIT], which prefers long sequences over shorter ones. Hence the number of pdb hits in your big_80 search is going to be low. Likewise, the Uniprot database for hhblits does not contain pdb structures. So, if you want to do the quality check using structure data, come up with a way to generate comparable results. |
||
− | * To get all hits in pdb with HHblits (not just clustered hits), you can also use the pdb_full database. |
||
+ | ====T-COFFEE==== |
||
− | == Multiple sequence alignments== |
||
+ | The alignment produced by '''[http://tcoffee.crg.cat/apps/tcoffee/do:regular T-COFFEE]''' contains the most gaps and looking at the blocks and columns found, it is comparable to MAFFT. T-COFFEE seems not to be able to handle low to medium sequence similarity scans. It always shows the most number of breaks in the secondary structure elements especially in helices. T-COFFEE is the only method that even misses a functional residue of a binding site. |
||
− | autous de eloria |
||
− | === Clustal W === |
||
− | euche kynessin |
||
− | === Muscle === |
||
− | oionoisi te pasi |
||
− | === T-Coffee === |
||
− | dios deteleieto boulae |
||
− | ==== |
+ | ====EXPRESSO==== |
+ | Instead of 3D-COFFEE '''[http://tcoffee.crg.cat/apps/tcoffee/do:expresso EXPRESSO]''' was used for an alignment with secondary structure information. It improves T-COFFEE slightly concerning the low sequence similarity searches, especially in context of conserved columns. In contrast to T-COFFEE it finds at all functional residues, but the results are nearly the same. |
||
− | ex hou de ta prota |
||
− | ==== 3D- Coffee ==== |
||
− | diastaetaen erisante atreides te anax andron kai dios achilleus. |
||
− | === |
+ | ====Overall==== |
+ | As there was no big difference between the below 30% and the original whole range scan, the whole range set was changed to a mixture between the above60 and the below30 set. |
||
− | For calculating multiple sequence alignments, create a dataset of diverse sequences. |
||
+ | The overall comparison shows that the sequence identity in the sets definitely influence the alignment: The higher the sequence identity within the alignments the better did the algorithms perform.<br> |
||
− | Generate groups of 10 (20 for the third group) sequences where |
||
+ | All functional important sites as referred either to 2O4H or 3LWU have been detected well. The active site in the whole range set is found, but in lesser conservation. This may be due to the "bad" alignments of the below30 set.<br> |
||
− | * one contains only sequences with low sequence identity (<30%) (also low mutual similarity!) |
||
+ | In conclusion MAFFT seems to be the best tool for a good alignment: It shows a good conversation of the aligned blocks, the secondary structure elements and functionally important residues are well conserved.<br> |
||
− | * one contains only sequences with high sequence identity (>60%) |
||
− | * one contains sequences covering the whole range of sequence identity. |
||
− | Ideally there should be at least two sequences with pdb-structures in each group. You can use the structure of your target sequence as a second structure for 3D-Coffee. |
||
+ | ====Example Alignment==== |
||
− | The alignment methods to use on each of these groups are: |
||
+ | To give an impression, in '''<xr id="msaMafft">Figure</xr>''' one multiple sequence alignment is displayed, showing the set above 60 percent sequence identity on MAFFT using [http://www.cgl.ucsf.edu/chimera/ '''UCSF Chimera''']: |
||
− | * ClustalW |
||
− | * Muscle |
||
− | * T-Coffee with |
||
− | ** default parameters ("t_coffee your_sequences.fasta) |
||
− | ** use of 3D-Coffee |
||
− | '''Note''': |
||
− | * ClustalW should be on your path on the student machines, there is a version of T-Coffee on <code>/mnt/opt/T-Coffee/bin/</code>. If you include that in your path, you also have muscle. |
||
− | * You may also use MAFFT if you like. Probably you should then leave out one of the other methods. ;-) |
||
+ | <figure id="msaMafft"> |
||
+ | [[Image:CanavanMafftAbove60Alignment.png|centre|thumb|694px|'''<caption>'''Multiple sequence alignment of the above 60% sequence identity set using '''MAFFT'''.</caption>]] |
||
+ | </figure> |
||
+ | For reasons of clarity other alignments can be found in the '''[[Canavan_Disease:_Task_02_-_Alignments#Supplement|Supplement]]'''. |
||
− | Compare your alignments (qualitatively! You do not need to run statistics!). Things to look for are: |
||
− | * How many conserved columns? |
||
− | * How many gaps? |
||
− | * Are functionally important residues conserved? |
||
− | * Are there gaps in secondary structure elements? |
||
− | * Where do functionally important residues stand out most? |
||
+ | == [[Canavan_Disease:_Task_02_-_Supplement|Supplement]] == |
||
+ | == Tasks == |
||
− | Points for discussion: |
||
+ | * Link to Task 01: [[Canavan_Disease|Canavan Disease]] |
||
− | * Observe how the sequence identity in the groups of sequences influences the alignments. |
||
+ | * Link to Task 02: [[Canavan_Disease:_Task_02_-_Alignments|Alignments]] |
||
− | ** Do all methods cope with low similarity? |
||
+ | * Link to Task 03: [[Canavan_Disease:_Task_03_-_Sequence-based_Predictions|Sequence-based Predictions]] |
||
− | ** Are residues that are aligned in the high similarity group still aligned when the low similarity sequences are added? |
||
+ | * Link to Task 04: [[Canavan_Disease:_Task_04_-_Structural_Alignments|Structural Alignments]] |
||
− | * Does the incorporation of structural information (3D Coffee) help? |
||
+ | * Link to Task 05: [[Canavan_Disease:_Task_05_-_Homology_Modelling|Homology Modelling]] |
||
− | ** Does it make a difference how many structures you include? |
||
+ | * Link to Task 06: [[Canavan_Disease:_Task_06_-_Protein_Structure_Prediction|Protein Structure Prediction from Evolutionary Sequence Variation]] |
||
− | * Overall, what would be your criteria for a good alignment? |
||
+ | * Link to Task 07: [[Canavan_Disease:_Task_07_-_Researching_SNPs|Researching SNPs]] |
||
− | * Based on your experience, which method would you like to use in the future? |
||
+ | * Link to Task 08: [[Canavan_Disease:_Task_08_-_Sequence-based_Mutation_Analysis|Sequence-based Mutation Analysis]] |
||
+ | * Link to Task 09: [[Canavan_Disease:_Task_09_-_Structure-based_Mutation_Analysis|Structure-based Mutation Analysis]] |
||
+ | * Link to Task 10: [[Canavan_Disease:_Task_10_-_Normal_Mode_Analysis|Normal Mode Analysis]] |
Latest revision as of 11:46, 5 September 2013
The first step to gain more insight on the genetic level of aspartoacylase (ASPA) would be conducting a sequence search for similar sequences. This is done by performing sequence alignments using ASPA as template.
Contents
Pairwise sequence alignments
Pairwise sequence alignments help to find related sequences. In this Task ASPA that is the major cause of Canavan Disease, if it contains specific mutations, is used as input sequence for diverse pairwise sequence alignment methods. The sequence searches were performed with Blast, HHblits, and multiple PsiBlast runs using varying parameters. To get a good and clear overview, diminutives and a color scheme were used to describe and compare all methods (see <xr id="Dimin"></xr>):
<figtable id="Dimin">
Diminutives for Pairwise Sequence Alignment Searches | |
---|---|
Diminutive | Encryption |
Blast | Blast against big_80 |
HHblits | HHblits standard parameters against Uniprot |
PsiBlast D2 | PsiBlast using the default e-Value cutoff (0.002) with 2 iterations against big_80 |
PsiBlast e2 | PsiBlast using the e-Value cutoff 10E-10 with 2 iterations against big_80 |
PsiBlast D10 | PsiBlast using the default e-Value cutoff (0.002) with 10 iterations against big_80 |
PsiBlast e10 | PsiBlast using the e-Value cutoff 10E-10 with 10 iterations against big_80 |
Big | PsiBlast using the e-Value cutoff 10E-10 with 1 iteration and the checkfile from PsiBlast e10 against big |
</figtable>
As it can be seen in <xr id="Dimin"></xr>, the simple Blast search was done against big_80 as well as the different PsiBlast searches. Big is a single iteration against big (not big_80) but using the checkfile of the 10th iteration of PsiBlast e10. This decision was made because big_80 is already redundancy reduced and may overpredict using many iterations as well as to be able to compare the performance of PsiBlast and HHblits. Since it would be necessary for HHblits to create a special database, it is run with an already existing database of Uniprot.
Comparison of Search Tools
The next step was evaluating the different search tools. Here it was focused on the e-Value distribution, the sequence identity distribution and the overlap of the found sequences. Since HHblits consists of clusters, the calculations are always corresponding to a complete cluster, not just the cluster representative of Uniprot. This approach was chosen to gain comparable results for Blast, HHblits and Big, whereas Big should reflect a representative for the PsiBlast searches.
As it can be seen in <xr id="eVal">Figure</xr> the distribution of the e-Values found using the different methods is comparable:
<figure id="eVal">
</figure>
As mentioned above the best way to compare the methods is looking at the graphs for Blast, HHblits and Big (compare <xr id="eVal">Figure</xr>). With a higher number of iterations in the PsiBlast search the e-Value composition gets larger. That is due to the fact that per iteration the specificity decreases as insignificant hits are incorporated into the result.
<figure id="seqID">
</figure>
Comparing the distribution of sequence identities for the used methods (see <xr id="seqID">Figure</xr>) it gets visible, that more iterations used for the PsiBlast search result in a lower percentage sequence identity. Again, per iteration the specificity decreases with an increase of not significant hits.
<xr id="Overlap"></xr> represents the overlapping sequence hits between methods. The diagonal shows the number of sequences in the method itself. Again it is important to mention that HHblits consists of clusters. In the table the cluster representatives as well as all cluster members (in brackets) are displayed.
<figtable id="Overlap">
Overlapping Sequence-Hits | ||||||||
---|---|---|---|---|---|---|---|---|
Blast | HHblits | PsiBlast D2 | PsiBlast e2 | PsiBlast D10 | PsiBlast e10 | Big | ||
Blast | 194 | 24 (145) | 186 | 186 | 183 | 185 | 185 | |
HHblits | 24 (145) | 276 (3128) | 94 (719) | 84 (666) | 145 (1043) | 143 (1062) | 195 (2847) | |
PsiBlast D2 | 186 | 94 (719) | 918 | 830 | 894 | 899 | 899 | |
PsiBlast e2 | 186 | 84 (666) | 830 | 838 | 818 | 824 | 824 | |
PsiBlast D10 | 183 | 145 (1043) | 894 | 818 | 3505 | 2513 | 2335 | |
PsiBlast e10 | 185 | 143 (1062) | 899 | 824 | 2513 | 2725 | 2432 | |
Big | 185 | 195 (2847) | 899 | 824 | 2335 | 2432 | 6170 |
of sequences found in this search. In HHblits there is a differentiation between cluster representative and complete cluster (in brackets).
</figtable>
For a better overview <xr id="venn"></xr> shows the number of sequences, which overlap between Blast, HHblits and Big:
<figure id="venn">
</figure>
Validation using GeneOntology
The validation using GeneOntology takes all cluster members of HHblits into account. The resulting intersection of all methods, including the different PsiBlast searches is 142 proteins. For those proteins the GO-Annotations were analyzed. <xr id="GOAnn">Table</xr> shows the top 10 results with a counter (how often they appeared):
<figtable id="GOAnn">
GO-Annotations of all Common Proteins | |
---|---|
Count | Annotation |
141 | hydrolase activity, acting on ester bonds |
141 | metabolic process |
113 | hydrolase activity |
112 | metal ion binding |
75 | zinc ion binding |
69 | hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amides |
58 | aspartoacylase activity |
20 | arginine metabolic process |
20 | arginine catabolic process to glutamate |
20 | arginine catabolic process to succinate |
(Blast, HHblits, PsiBlast D2/e2/D10/e10 and Big). Highlighted: aspartoacylase, since it is the major
enzyme causing Canavan Disease.
</figtable>
As displayed in <xr id="GOAnn"></xr>, the top results show the typical associations to Canavan Diseases. The highlighted aspartoacylase activity is the enzymatic reaction of ASPA. As aspartoacylase contains a bound zinc ion, the high occurrence of metal & zinc ion binding is to be expected. Furthermore aspartoacylase is a hydrolase subtype. The last three high ranking GO-Terms are explainable with the fact that L-Aspartate, a chemical compound in the reaction ASPA catalyses, is a prior substrate of the arginine metabolism (Task 01).
Multiple Sequence Alignments
To analyze methods for multiple sequence alignments, sets of different sequence identity were built. These sets were built randomly based on the Big data, as it gave the possibility to take PDB-structures into account. The randomly generated sets of sequence identity above 60% and below 30% each contain 10 sequences. The whole range set contains 20 sequences and is a mixture between the above60 and below30 set, since any randomly generated set was around the same average sequence identity as the below30 set. This is due to the very large number of sequence identities below 25% as shown in <xr id="seqID"></xr>.
Concerning the secondary structure, in the above60 set and in the whole range set the same protein was used to calculate the differences (2O4H). It contains information about the active site, binding regions, metal binding sites and simple binding sites. For the below30 set finding a PDB-structure with information comparable to aspartoacylase was quite difficult. The protein Succinylglutamate desuccinylase/aspartoacylase from a bacteria species seemed to be the best for this purpose. Unfortunately only metal binding sites and simple binding sites are annotated in Uniprot for this protein. Therefore no information concerning active site or binding region is available and the conservation of those sites within the multiple sequence alignment could not be detected.
<xr id="MSA">Table</xr> should give a nice overview of the resulting multiple sequence alignments.
<figtable id="MSA">
Comparison of Multiple Sequence Alignment Tools | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
MAFFT | MUSCLE | T-COFFEE | EXPRESSO | |||||||||
Set | >60 | <30 | whole | >60 | <30 | whole | >60 | <30 | whole | >60 | <30 | whole |
Gaps | 9 | 532 | 444 | 9 | 325 | 325 | 9 | 922 | 957 | 18 | 964 | 910 |
Blocks | 3 | 48 | 26 | 2 | 38 | 23 | 3 | 48 | 58 | 4 | 48 | 66 |
Avg Block Size | 104.3 | 4.6 | 11.7 | 156.5 | 8.2 | 13.5 | 104.3 | 4.0 | 5.3 | 87.0 | 3.2 | 4.6 |
Well Conserved Columns | 252 | 14 | 14 | 251 | 1 | 5 | 252 | 12 | 13 | 248 | 11 | 15 |
All Columns | 313 | 223 | 304 | 313 | 312 | 311 | 313 | 193 | 310 | 312 | 152 | 302 |
</figtable>
<xr id="MSA">Table</xr> describes the number of gaps comparing the consensus sequence of the multiple sequence alignments. Additionally it refers to the number of blocks between those gaps and their average length. At last the number of well conserved columns and columns in general were compared.
To compare the multiple sequence alignment tools with respect to secondary structure and functional residues, <xr id="secstruc">Table</xr> should give an overview:
<figtable id="secstruc">
Secondary Structure and Functional Residues in MSA Tools | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MAFFT | MUSCLE | T-COFFEE | EXPRESSO | |||||||||||
Set | >60 | <30 | whole | >60 | <30 | whole | >60 | <30 | whole | >60 | <30 | whole | ||
Gaps | Gap in SecStruc | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 4 | 0 | 0 | 5 | 0 | |
SecStruc in Gap | 0 | 96 | 2 | 0 | 48 | 0 | 0 | 116 | 1 | 0 | 141 | 2 | ||
SecStruc | Breaks | 0 | 9 | 4 | 0 | 9 | 11 | 0 | 23 | 20 | 0 | 24 | 19 | |
in Helix | 0 | 6 | 1 | 0 | 6 | 4 | 0 | 8 | 12 | 0 | 13 | 10 | ||
FuncRes | Missing | - | - | - | - | - | - | - | B | - | - | - | - | |
Well Conserved | AMBR | M | MBR | AMBR | - | R | AMBR | M | MR | AMBR | M | MBR | ||
Less Conserved | - | B | ABR | - | BM | AMBR | - | - | AMBR | - | B | ABR |
gap in secstruc - shows a gap in the secondary structure compared to consensus sequence
secstruc in gap - shows gaps in the consensus sequence compared to secondary structure sequence.
A=active site M=metal binding B=binding site R=binding region
</figtable>
The first part of <xr id="secstruc">Table</xr> focuses on the gaps in the data. It differentiates between gaps in the consensus sequence, but not in the secondary structure (secstruc in gap) and a consensus sequence which shows a gap in the secondary structure (gap in secstruc). The calculation was made the following way: The secondary structure of one aligned sequence (2O4H for above60 and whole range, 3LWU for below30) was compared to the consensus sequence. Thus there might be a gap in the secondary structure.
The second part describes the breaks in the secondary structure elements. Breaks within loop regions were not taken into account.
At last the functional residues as listed in Uniprot were compared to the consensus sequence of the multiple alignments. The reference sequence for the secondary structure elements in the below30 set has no entry for active site or binding region in Uniprot, as explained above. "A" stands for active site, "M" for metal binding (in this case the zinc ion), "B" binding site and "R" for binding region.
MAFFT
MAFFT is a very fast method producing good alignments: Like MUSCLE it finds blocks and long consensus sequences. This is an advantage for comparing the columns with active sites, or metal binding properties (functional important residues). These are almost all very good conserved throughout MAFFT. Like MUSCLE it performs quite well in low sequence similarity scans, but is better concerning breaks in the secondary structure. Summed up, this method seems to be the most promising one.
MUSCLE
MUSCLE performs quite well and finds long blocks. It also brings a long consensus sequence. This also works with only low sequence similarity, but not as good concerning the conserved columns. MUSCLE works not as good as MAFFT concerning the secondary structure breaks. This method is very promising.
T-COFFEE
The alignment produced by T-COFFEE contains the most gaps and looking at the blocks and columns found, it is comparable to MAFFT. T-COFFEE seems not to be able to handle low to medium sequence similarity scans. It always shows the most number of breaks in the secondary structure elements especially in helices. T-COFFEE is the only method that even misses a functional residue of a binding site.
EXPRESSO
Instead of 3D-COFFEE EXPRESSO was used for an alignment with secondary structure information. It improves T-COFFEE slightly concerning the low sequence similarity searches, especially in context of conserved columns. In contrast to T-COFFEE it finds at all functional residues, but the results are nearly the same.
Overall
As there was no big difference between the below 30% and the original whole range scan, the whole range set was changed to a mixture between the above60 and the below30 set.
The overall comparison shows that the sequence identity in the sets definitely influence the alignment: The higher the sequence identity within the alignments the better did the algorithms perform.
All functional important sites as referred either to 2O4H or 3LWU have been detected well. The active site in the whole range set is found, but in lesser conservation. This may be due to the "bad" alignments of the below30 set.
In conclusion MAFFT seems to be the best tool for a good alignment: It shows a good conversation of the aligned blocks, the secondary structure elements and functionally important residues are well conserved.
Example Alignment
To give an impression, in <xr id="msaMafft">Figure</xr> one multiple sequence alignment is displayed, showing the set above 60 percent sequence identity on MAFFT using UCSF Chimera:
<figure id="msaMafft">
</figure>
For reasons of clarity other alignments can be found in the Supplement.
Supplement
Tasks
- Link to Task 01: Canavan Disease
- Link to Task 02: Alignments
- Link to Task 03: Sequence-based Predictions
- Link to Task 04: Structural Alignments
- Link to Task 05: Homology Modelling
- Link to Task 06: Protein Structure Prediction from Evolutionary Sequence Variation
- Link to Task 07: Researching SNPs
- Link to Task 08: Sequence-based Mutation Analysis
- Link to Task 09: Structure-based Mutation Analysis
- Link to Task 10: Normal Mode Analysis