Difference between revisions of "Canavan Disease: Task 02 - Alignments"
(→Multiple Sequence Alignments) |
|||
(16 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
+ | The first step toward more insight into aspartoacylase (ASPA) at the genetic level is a search for similar sequences. This is done by performing '''sequence alignments''' with ASPA as the query sequence. |
||
− | <figure id="ASPA"> |
||
− | [[Image:ASPA_CANAVAN_2O4H.png|thumb|450px|'''<caption>'''Crystal structure of aspartoacylase (2O4H - PDB-file).</caption>]] |
||
− | </figure> |
||
− | '''Canavan Disease''' ([http://apps.who.int/classifications/icd10/browse/2010/en#/E75.2 ICD-10 E75.2]) is an autosomal recessive disorder, in which a dysfunctional enzyme causes severe brain damage. It is also known under a variety of other names describing the chemical basis or phenotype of the disease. Examples are "Spongy Degeneration Of Central Nervous System", "Aspartoacylase (ASPA) Deficiency", or "Aminoacylase 2 (ACY2) Deficiency"[http://omim.org/entry/271900]. The trivial name, Canavan Disease, originates from the name of Myrtelle Canavan (1879 – 1953)[http://en.wikipedia.org/wiki/Myrtelle_Canavan], an American physician, who first described the disease in 1931. |
||
− | There is no cure and almost all patients die within the first decade of their life. The mild / juvenile type is less severe. The treatment is based on the symptoms and is supportive. |
||
+ | == Pairwise sequence alignments== |
||
− | == Inheritance == |
||
− | Canavan Disease is an autosomal recessive genetic defect of the ASPA (aspartoacylase) gene on chromosome 17 (for the crystal structure of the ASPA protein see '''<xr id="ASPA">Figure</xr>'''). With this pattern of inheritance, a child of two carriers of the defective gene has a 25% chance of being neither affected by Canavan Disease nor a carrier. For some time, children of Ashkenazi Jewish ancestry had a higher prevalence of Canavan Disease, but in recent years this prevalence has been decreasing due to ongoing prenatal screening programs. Other ethnic groups in which Canavan Disease is more prevalent include, for example, populations of Saudi Arabian ancestry. <br> |
||
− | According to [http://ghr.nlm.nih.gov/condition/canavan-disease ''Genetics Home''] about one in 6400 to 13500 of the Ashkenazi Jewish are affected. No further information about prevalences in other populations was found. However the different populations have also different frequencies regarding the mutation they are based on. For further information see section [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Canavan_Disease#Disease_Causing_Mutations ''Disease Causing Mutations'']. |
||
+ | Pairwise sequence alignments help to find related sequences. In this task, the sequence of ASPA, the enzyme whose mutations are the major cause of Canavan Disease, is used as the input for several pairwise sequence alignment methods. The sequence searches were performed with Blast, HHblits, and multiple PsiBlast runs with varying parameters. |
||
− | == Phenotype == |
||
+ | For a clear overview, diminutives (short names) and a color scheme are used to describe and compare all methods (see '''<xr id="Dimin"></xr>'''): |
||
+ | |||
+ | <figtable id="Dimin"> |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |- |
||
+ | ! colspan="2" style="background:#87cefa;" | Diminutives for Pairwise Sequence Alignment Searches |
||
+ | |- |
||
+ | ! style="background:#bfbfbf;" align="center" | Diminutive |
||
+ | ! style="background:#bfbfbf;" align="center" | Description |
||
+ | |- |
||
+ | |style="background:#ff7f00;" align="center" | '''Blast''' |
||
+ | | Blast against big_80 |
||
+ | |- |
||
+ | |style="background:#00cd00;" align="center" | '''HHblits''' |
||
+ | | HHblits standard parameters against Uniprot |
||
+ | |- |
||
+ | |style="background:#00bfff;" align="center" | '''PsiBlast D2''' |
||
+ | | PsiBlast using the '''''d'''efault'' e-Value cutoff (0.002) with '''''2''' iterations'' against big_80 |
||
+ | |- |
||
+ | |style="background:#1c86ee;" align="center" | '''PsiBlast e2''' |
||
+ | | PsiBlast using the '''''e'''-Value'' cutoff 10E-10 with '''''2''' iterations'' against big_80 |
||
+ | |- |
||
+ | |style="background:#009acd;" align="center" | '''PsiBlast D10''' |
||
+ | | PsiBlast using the '''''d'''efault'' e-Value cutoff (0.002) with '''''10''' iterations'' against big_80 |
||
+ | |- |
||
+ | |style="background:#104e8b;" align="center" | '''PsiBlast e10''' |
||
+ | | PsiBlast using the '''''e'''-Value'' cutoff 10E-10 with '''''10''' iterations'' against big_80 |
||
+ | |- |
||
+ | |style="background:#9a32cd;" align="center" | '''Big''' |
||
+ | | PsiBlast using the ''e-Value'' cutoff 10E-10 with 1 iteration and the checkfile from PsiBlast e10 against '''''big''''' |
||
+ | |- |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Diminutives, parameters and color scheme of the different pairwise sequence alignment search tools. </caption></small></center> |
||
+ | </figtable> |
||
+ | As shown in '''<xr id="Dimin"></xr>''', the simple Blast search and the different PsiBlast searches were run against big_80. '''Big''' is a single iteration against big (not big_80) that uses the checkfile from the 10th iteration of PsiBlast e10. This setup was chosen for two reasons: big_80 is already redundancy reduced and may overpredict when many iterations are used, and a search against big makes it possible to compare the performance of PsiBlast and HHblits. Since HHblits would require building a special database, it was run against an already existing Uniprot database. |
||
− | Canavan Disease has a variety of different phenotypes ranging across all body parts. |
||
− | Here is a short overview: |
||
− | * Head |
||
− | ** macrocephaly (increased head circumference) |
||
− | ** mental retardation and impairment (losing mental skills) |
||
− | ** losing ability to move head |
||
− | * Eyes |
||
− | ** becoming blind |
||
− | ** nystagmus (greek: νυσταζω ''nytaxoo'' "sleep, nod", german: "Augenzittern") |
||
− | * Ears |
||
− | ** becoming deaf |
||
− | * Mouth |
||
− | ** problems with swallowing |
||
− | ** losing communicational abilities (cannot talk, stay quiet) |
||
− | * Body |
||
− | ** paralysis |
||
− | ** seizures |
||
− | ** problems moving the muscles |
||
+ | === Comparison of Search Tools === |
||
− | Children suffering from Canavan Disease usually die within the first decade. |
||
− | In the mild/juvenile form of Canavan Disease, the children usually have some developmental delay and some speech problems. |
||
+ | The next step was to evaluate the different search tools. The evaluation focused on the e-Value distribution, the sequence identity distribution and the overlap of the sequences found. Since HHblits hits are clusters, the calculations always refer to the complete cluster, not just the Uniprot cluster representative. This approach was chosen to obtain comparable results for Blast, HHblits and Big, with Big serving as the representative of the PsiBlast searches.<br> |
||
+ | As '''<xr id="eVal">Figure</xr>''' shows, the e-Value distributions obtained with the different methods are comparable: |
||
+ | <figure id="eVal"> |
||
− | == Disease Mechanism == |
||
+ | [[Image:CanavanEValueDistribution.png|centre|thumb|694px|'''<caption>'''e-Value distribution of different pairwise sequence alignment tools. Outliers were removed to keep the plot readable.</caption>]] |
||
− | <figure id="KEGG"> |
||
− | [[Image:Canavan disease pathway KEGG.png|thumb|750px|'''<caption>'''Alanine, Aspartate and Glutamate Metabolism (source: [http://www.kegg.jp/kegg-bin/show_pathway?hsadd00250+443 KEGG]) highlighting disease associated enzymes of Canavan Disease.</caption>]] |
||
</figure> |
</figure> |
||
− | Canavan Disease belongs to the group of leukodystrophies. The name derives from the Greek words λευκος ''leukos'' "white", δυς ''dys'' "bad, wrong" and τροφη ''trophae'' "feeding, growth". It is a genetically induced metabolic disorder that affects the white matter of the nervous system. If the white matter does not grow properly, the myelin, which surrounds the nerve cells for protection, is degraded. This is especially true for Canavan Disease. The visible phenotypes are the result of a genetic defect that negatively affects the growth of the myelin sheath covering the nerve fibers. An improperly built myelin sheath results in a reduced ability to transmit the electric signal along the nerve fibers, eventually in its complete loss, and finally in the degradation of whole nerve cells. <br> |
||
− | The cause of the malfunctioning myelin sheath growth is a genetic defect of the aspartoacylase (ASPA) gene. The product of the gene, the enzyme aspartoacylase, is crucial in the degradation of N-acetyl-L-aspartate (NAA), which is present at much higher levels than normal in patients suffering from Canavan Disease. Normally ASPA degrades NAA into smaller fragments that are prerequisites for the production of the myelin sheath (see '''<xr id="KEGG">Figure</xr>''' for an overview of where ASPA is located in the metabolic map). Therefore the missing or defective ASPA is the reason for the defective build-up of myelin. As a consequence of the degradation of the nerve cells and white brain matter, empty spaces arise that fill with brain fluid, leading to even more degradation of nerve cells and to signal transduction problems. |
||
+ | As mentioned above, the methods are best compared by looking at the graphs for Blast, HHblits and Big (compare '''<xr id="eVal">Figure</xr>'''). With a higher number of iterations in the PsiBlast search, the e-Values spread over a wider range. This is because specificity decreases with each iteration as insignificant hits are incorporated into the result. |
||
− | == Diagnosis == |
||
+ | <figure id="seqID"> |
||
− | There are a couple of possibilities how and when an affected patient is diagnosed with Canavan Disease. The time points are prenatal, postnatal, and when a mild or juvenile form of Canavan Disease is already present. Nevertheless one of the most important things to know before is if both parents carry one copy of the disease causing gene. This can simply be done by DNA testing. |
||
+ | [[Image:CanavanDistributionSeqId.png|centre|thumb|694px|'''<caption>'''Percent sequence identity distribution of different pairwise sequence alignment tools.</caption>]] |
||
+ | </figure> |
||
+ | Comparing the distribution of sequence identities for the methods used (see '''<xr id="seqID">Figure</xr>''') shows that more iterations in the PsiBlast search result in a lower percentage sequence identity. Again, specificity decreases with each iteration as the number of insignificant hits increases. |
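The percent sequence identity discussed above can be illustrated with a minimal sketch. It assumes that identity is counted over the aligned columns in which neither sequence has a gap; the search tools may normalize differently (for example over the full alignment length), so the function name and the definition are illustrative assumptions, not the method used for the figure.

```python
def percent_identity(aligned_query: str, aligned_hit: str) -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumption: both strings come from the same pairwise alignment,
    have equal length and use '-' as the gap character.
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for q, h in zip(aligned_query, aligned_hit):
        if q == "-" or h == "-":
            continue  # gap columns are not counted in this definition
        compared += 1
        if q == h:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy alignment (made-up residues): 4 identical out of 5 compared columns.
example_identity = percent_identity("ACDE-F", "ACDQGF")
```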
||
− | ==== Prenatal Diagnosis ==== |
||
+ | '''<xr id="Overlap"></xr>''' lists the sequence hits shared between methods. The diagonal shows the number of sequences found by each method itself. Note again that HHblits hits are clusters: the table gives the number of cluster representatives and, in brackets, the number of all cluster members. |
||
− | There are several types of prenatal testing possibilities depending on whether the carrier status of both parents is known or not. For couples where it is only known that one of the parents is a carrier and the remaining parents status is not known, normally testing is done by measuring the concentration of N-acetyl-L-aspartic acid (NAA) in the amniotic fluid within the time between the 16th and 18th week of pregnancy. |
||
− | Another possibility is molecular genetic testing. Following this method an analysis of DNA extracted from fetal cells is done. These fetal cells are obtained either between the tenth to 12th week of pregnancy by chorionic villus (“proto-”placental tissue that has the same genetic material as the fetus) sampling or between the 15th and 18th week by amniocentesis, also known as amniotic fluid testing (AFT). However for the molecular genetic testing both disease causing genes of the parents have to be identified first. |
||
+ | <figtable id="Overlap"> |
||
− | ==== Neonatal / Infantile Diagnosis ==== |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |- |
||
+ | ! colspan="9" style="background:#87cefa;" | Overlapping Sequence-Hits |
||
+ | |- |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| Blast |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| HHblits |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| PsiBlast D2 |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| PsiBlast e2 |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| PsiBlast D10 |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| PsiBlast e10 |
||
+ | ! style="background:#bfbfbf;" align="center" width="80"| Big |
||
+ | |- |
||
+ | ! style="background:#ff7f00;" align="center" | Blast |
||
+ | | align="center" |194 |
||
+ | | align="center" |24 (145) |
||
+ | | align="center" |186 |
||
+ | | align="center" |186 |
||
+ | | align="center" |183 |
||
+ | | align="center" |185 |
||
+ | | align="center" |185 |
||
+ | |- |
||
+ | ! style="background:#00cd00;" align="center" | HHblits |
||
+ | | align="center" |24 (145) |
||
+ | | align="center" |276 (3128) |
||
+ | | align="center" |94 (719) |
||
+ | | align="center" |84 (666) |
||
+ | | align="center" |145 (1043) |
||
+ | | align="center" |143 (1062) |
||
+ | | align="center" |195 (2847) |
||
+ | |- |
||
− | Postnatal testing for Canavan Disease can be done in several ways. One possibility is to test for a raised N-acetyl-L-aspartic acid (NAA) concentration in urine, blood and cerebrospinal fluid (CSF) (comparable to prenatal testing with the carrier status of one parent unknown). |
||
+ | ! style="background:#00bfff;" align="center" | PsiBlast D2 |
||
− | Other possibilities may be cultivating skin fibroblasts and test them for reduced aspartoacylase activity, perform neuroimaging of the brain and look for spongy degeneration, or test the gene itself for a defect in the newborn child. However it takes between three to nine months after birth until most of the symptoms become apparent. |
||
+ | | align="center" |186 |
||
+ | | align="center" |94 (719) |
||
+ | | align="center" |918 |
||
+ | | align="center" |830 |
||
+ | | align="center" |894 |
||
+ | | align="center" |899 |
||
+ | | align="center" |899 |
||
+ | |- |
||
+ | ! style="background:#1c86ee;" align="center" | PsiBlast e2 |
||
+ | | align="center" |186 |
||
+ | | align="center" |84 (666) |
||
+ | | align="center" |830 |
||
+ | | align="center" |838 |
||
+ | | align="center" |818 |
||
+ | | align="center" |824 |
||
+ | | align="center" |824 |
||
+ | |- |
||
+ | ! style="background:#009acd;" align="center" | PsiBlast D10 |
||
+ | | align="center" |183 |
||
+ | | align="center" |145 (1043) |
||
+ | | align="center" |894 |
||
+ | | align="center" |818 |
||
+ | | align="center" |3505 |
||
+ | | align="center" |2513 |
||
+ | | align="center" |2335 |
||
+ | |- |
||
+ | ! style="background:#104e8b;" align="center" | PsiBlast e10 |
||
+ | | align="center" |185 |
||
+ | | align="center" |143 (1062) |
||
+ | | align="center" |899 |
||
+ | | align="center" |824 |
||
+ | | align="center" |2513 |
||
+ | | align="center" |2725 |
||
+ | | align="center" |2432 |
||
+ | |- |
||
+ | ! style="background:#9a32cd;" align="center" | Big |
||
+ | | align="center" |185 |
||
+ | | align="center" |195 (2847) |
||
+ | | align="center" |899 |
||
+ | | align="center" |824 |
||
+ | | align="center" |2335 |
||
+ | | align="center" |2432 |
||
+ | | align="center" |6170 |
||
+ | |- |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Number of overlapping sequences using different methods. The diagonal (e.g. intersection Blast Blast) corresponds to the number<br>of sequences found in this search. In HHblits there is a differentiation between cluster representative and complete cluster (in brackets).</caption></small></center> |
||
+ | </figtable> |
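An overlap table of this kind can be derived with plain set intersections. The following is a minimal sketch that assumes each method's hits are available as a set of sequence identifiers; the identifiers and the helper name `overlap_matrix` are hypothetical and the toy data do not reproduce the numbers in the table.

```python
from itertools import product


def overlap_matrix(hits):
    """Map every (method_a, method_b) pair to the number of shared hit IDs.

    The diagonal (method, method) holds the size of that method's hit set.
    """
    return {(a, b): len(hits[a] & hits[b]) for a, b in product(hits, repeat=2)}


# Hypothetical toy hit sets; real input would be the parsed search results.
hits = {
    "Blast": {"P1", "P2", "P3"},
    "HHblits": {"P2", "P3", "P4", "P5"},
    "Big": {"P1", "P2", "P3", "P4", "P6"},
}
matrix = overlap_matrix(hits)
```

For HHblits, the sets would have to contain either the cluster representatives or all cluster members, depending on which of the two counts is wanted.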
||
+ | For a better overview, '''<xr id="venn"></xr>''' shows the number of sequences that overlap between Blast, HHblits and Big: |
||
− | ==== Mild / Juvenile Diagnosis ==== |
||
+ | <figure id="venn"> |
||
− | Diagnosing a patient with Canavan Disease if he or she is suffering from a mild or juvenile form, is a bit more challenging, as the postnatal diagnosis methods, except testing the gene itself, will not yield in a satisfactory result or may even overlook the disease completely. The concentration of NAA may be elevated only slightly and not as significant such that a proper diagnosis can be made. The same being true for the results of neuroimaging, and the mild developmental delay that is a result of Canavan Disease which can simply be unrecognized. |
||
+ | [[Image:Canavan_Overlapping_sequence_hits_venn-diagram.png|centre|thumb|350px|'''<caption>'''Number of overlapping sequences between different pairwise sequence alignment methods (Blast, HHblits and Big).</caption>]] |
||
+ | </figure> |
||
+ | === Validation using GeneOntology === |
||
+ | The validation using GeneOntology takes all cluster members of HHblits into account. The intersection of all methods, including the different PsiBlast searches, contains 142 proteins. For those proteins the GO-Annotations were analyzed. '''<xr id="GOAnn">Table</xr>''' shows the top 10 annotations with a counter (how often each appeared): |
||
− | == Treatment == |
||
+ | <figtable id="GOAnn"> |
||
− | Right now there is no cure for Canavan Disease, but there are treatments depending on the symptoms, which work in a supportive manner. |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
− | |||
+ | |- |
||
− | ==== Prenatal Treatment ==== |
||
+ | ! colspan="2" style="background:#87cefa;" | GO-Annotations of all Common Proteins |
||
− | |||
+ | |- |
||
− | There is a possibility of prenatal screening to check whether or not someone is a carrier of the disease (as described in the section before). Other prenatal treatments are under investigation and depend on animal models. |
||
+ | ! style="background:#bfbfbf;" align="center" | Count |
||
− | |||
+ | ! style="background:#bfbfbf;" align="center" | Annotation |
||
− | ==== Neonatal / Infantile Treatment ==== |
||
+ | |- |
||
+ | | 141 |
||
+ | | hydrolase activity, acting on ester bonds |
||
+ | |- |
||
+ | | 141 |
||
+ | | metabolic process |
||
+ | |- |
||
+ | | 113 |
||
+ | | hydrolase activity |
||
+ | |- |
||
+ | | 112 |
||
+ | | metal ion binding |
||
+ | |- |
||
+ | | 75 |
||
+ | | zinc ion binding |
||
+ | |- |
||
+ | | 69 |
||
+ | | hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amides |
||
+ | |- |
||
+ | | style="background:#87cefa;" | '''58''' |
||
+ | | style="background:#87cefa;" | '''aspartoacylase activity''' |
||
+ | |- |
||
+ | | 20 |
||
+ | | arginine metabolic process |
||
+ | |- |
||
+ | | 20 |
||
+ | | arginine catabolic process to glutamate |
||
+ | |- |
||
+ | | 20 |
||
+ | | arginine catabolic process to succinate |
||
+ | |- |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Top 10 GO-Annotations among all common proteins of the pairwise sequence alignments<br>(Blast, HHblits, PsiBlast D2/e2/D10/e10 and Big). Highlighted: aspartoacylase activity, since defects in<br>this enzyme are the major cause of Canavan Disease. </caption></small></center> |
||
+ | </figtable> |
||
+ | As displayed in '''<xr id="GOAnn"></xr>''', the top results show the typical associations with Canavan Disease. The highlighted aspartoacylase activity is the enzymatic reaction of ASPA. As aspartoacylase contains a bound zinc ion, the high occurrence of metal and zinc ion binding is to be expected. Furthermore, aspartoacylase is a hydrolase subtype. The last three high-ranking GO-Terms can be explained by the fact that L-aspartate, a compound in the reaction ASPA catalyses, is an upstream substrate of arginine metabolism ('''[[Canavan_Disease#Disease Mechanism|Task 01]]'''). |
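The counting behind such a top 10 list can be sketched as follows. The sketch assumes that the GO annotations are already available as a mapping from protein identifier to a list of GO term names; how the annotations were actually retrieved is not described here, and the names `top_go_terms` and `annotations` are hypothetical.

```python
from collections import Counter


def top_go_terms(annotations, common_ids, n=10):
    """Count in how many of the common proteins each GO term appears."""
    counts = Counter()
    for protein_id in common_ids:
        # set() ensures a term is counted at most once per protein
        counts.update(set(annotations.get(protein_id, ())))
    return counts.most_common(n)


# Hypothetical toy data; P3 is annotated but not part of the intersection.
annotations = {
    "P1": ["hydrolase activity", "zinc ion binding"],
    "P2": ["hydrolase activity"],
    "P3": ["metal ion binding"],
}
top_terms = top_go_terms(annotations, common_ids={"P1", "P2"})
```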
||
− | Since Canavan Disease also affects the metabolism there is need to control the nutrition and hydration. This includes specialized food to compensate missing metabolites and nutrients as well as different ways of feeding / providing nutrition to the child to prevent problems arising from swallowing difficulties and other physical disabilities. To improve those physical disabilities and muscle problems, it is recommended that children need physical therapy. Additionally there are anti-epileptic drugs against seizures and spastic behavior. |
||
+ | == Multiple Sequence Alignments == |
||
− | ==== Mild / Juvenile Treatment ==== |
||
+ | To analyze methods for multiple sequence alignments, sets with different sequence identity were built. The sets were drawn randomly from the '''Big''' data, as this made it possible to take PDB-structures into account. The randomly generated sets with sequence identity above 60% and below 30% each contain 10 sequences. The ''whole range set'' contains 20 sequences and '''is a mixture''' of the above60 and below30 sets, since every randomly generated whole range set had about the same average sequence identity as the below30 set. This is due to the very large number of hits with a sequence identity below 25%, as shown in '''<xr id="seqID"></xr>'''.<br> |
||
+ | Concerning the secondary structure, the same protein was used to calculate the differences in the above60 set and in the whole range set ([http://www.uniprot.org/uniprot/P45381 2O4H]). Its annotation contains information about the active site, binding regions, metal binding sites and simple binding sites. For the below30 set, finding a PDB-structure with information comparable to aspartoacylase was quite difficult. The protein [http://www.uniprot.org/uniprot/Q088B8 Succinylglutamate desuccinylase/aspartoacylase] from a bacterial species seemed to be the best choice for this purpose. Unfortunately, only metal binding sites and simple binding sites are annotated in Uniprot for this protein. Therefore no information about the active site or binding region is available, and the conservation of those sites within the multiple sequence alignment could not be assessed. |
||
+ | '''<xr id="MSA">Table</xr>''' gives an overview of the resulting multiple sequence alignments. |
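The random construction of the sequence sets described above might look like the following sketch. It assumes a mapping from hit identifier to percent sequence identity with ASPA; the helper `sample_set`, the fixed seed and the toy identities are illustrative assumptions and not the procedure that was actually used.

```python
import random


def sample_set(identities, keep, size, seed=0):
    """Randomly draw `size` hit IDs whose identity satisfies `keep`."""
    candidates = sorted(pid for pid, ident in identities.items() if keep(ident))
    if len(candidates) < size:
        raise ValueError("not enough candidate sequences for this set")
    # A seeded generator keeps the draw reproducible.
    return random.Random(seed).sample(candidates, size)


# Hypothetical toy identities: 50 hits with 0 %, 2 %, ..., 98 % identity.
identities = {f"hit{i}": float(i) for i in range(0, 100, 2)}

above60 = sample_set(identities, lambda ident: ident > 60, size=10)
below30 = sample_set(identities, lambda ident: ident < 30, size=10)
# Whole range set as a mixture of both sets, as described in the text.
whole_range = above60 + below30
```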
||
− | Since mild and juvenile Canavan patients only have some delays in the development and speech, a speech therapy may be useful. Further deep medical care is not necessary. |
||
+ | <figtable id="MSA"> |
||
− | == Future Work == |
||
− | |||
− | There are some clinical trials and animal models under investigation to find a cure for Canavan Disease. |
||
− | |||
− | ==== Gene Therapy ==== |
||
− | |||
− | There were several studies in the gene therapy, using viral and non viral vectors to transfer genes into the patients that were thought to improve the course of the disease. However none of the children showed an improvement and the disease showed a development similar to an untreated patient. |
||
− | |||
− | ==== Lithium Citrate as Pharmaceutical ==== |
||
− | |||
− | Since N-acetyl-L-aspartate (NAA) is one important factor in the biochemical background of Canavan Disease, where the NAA level is too high, lithium citrate may be able to reduce the NAA concentration. Rat models have shown that treating a rat with lithium citrate resulted in a reduced level of NAA. Furthermore, if the drug is administered to a human, the same effect can be observed, with a return to an elevated NAA concentration when the lithium citrate is washed out of the body after roughly 2 weeks. So far no larger controlled clinical studies have been conducted, but lithium citrate appears to be a potential treatment that is worth pursuing. |
||
− | |||
− | ==== Animal Models ==== |
||
− | |||
− | Several gene models in knockout mice and rats have been studied, with lithium citrate and an enzyme replacement therapy showing the best result so far and therefore being the most promising at the moment. |
||
− | |||
− | |||
− | == Aspartoacylase (ASPA) == |
||
− | <figure id="AspaKegg"> |
||
− | [[Image:NAA hydrolyzation.gif|thumb|450px|'''<caption>'''The hydrolyzation of N-acetyl-L-aspartate (C01042) catalyzed by aspartoacylase to acetyl (C00033) and aspartate (C00049). (source: [http://www.kegg.jp/dbget-bin/www_bget?R00488 KEGG])</caption>]] |
||
− | </figure> |
||
− | |||
− | ==== Summary ==== |
||
− | Aspartoacylase is the enzyme that hydrolyzes N-acetyl-L-aspartate into acetate and L-aspartate, which are essential for the build-up process of the myelin sheath (chemical reaction displayed in '''<xr id="AspaKegg">Figure</xr>'''). Crystallized ASPA exists as a homodimer however it is assumed that the in-vivo form only works as a monomer. The active site of ASPA contains a zinc ion which acts catalytic in the hydrolyzation process and is only accessible through a channel like surface fold of the protein. This channel like structure serves two purposes. On the one hand it hinders polypeptides to enter and bind at the active site, therefore ASPA does not function as protease. On the other hand and more importantly it is assumed, that the positive electrostatic potential that is present on the channel serves as a form of transport mechanism to properly carry the negatively charged substrate (NAA) to the hydrolyzing site. Furthermore, the binding pocket is highly specific to N-acetyl-L-aspartate with a far lower hydrolyzing activity towards other N-acetyl-amino complexes like N-acetylglutamate. |
||
− | |||
− | ==== Gene Position and Mutations ==== |
||
− | |||
− | The ASPA gene is located on chromosome 17 on the p-arm (upper part, short arm) band 1 subband 3 subsubband 2 (short 17p13.2) (see '''<xr id="Location">Figure</xr>'''). |
||
− | <figure id="Location"> |
||
− | [[Image:ASPA gene location.png|thumb|centre|750px|'''<caption>'''Chromosome 17 with highlighted position of ASPA-gene. (source: [http://www.genecards.org/cgi-bin/carddisp.pl?gene=ASPA Genecards])</caption>]] |
||
− | </figure> |
||
− | |||
− | ===== Reference Sequence ===== |
||
− | *[[ASPA#Genomic Sequence|Reference sequence (genomic) of ASPA]] |
||
− | *[[ASPA#Protein Sequence|Reference sequence (protein) of ASPA]] |
||
− | |||
− | ===== Disease Causing Mutations ===== |
||
− | The disease causing mutations can be found in '''<xr id="DisCausMut">Table</xr>''' and '''<xr id="AllelicVar">Table</xr>''' below. Very interesting in this Table is the frequency of some mutations across different populations. |
||
− | <figtable id="DisCausMut"> |
||
{| border="1" cellpadding="5" cellspacing="0" align="center" |
{| border="1" cellpadding="5" cellspacing="0" align="center" |
||
|- |
|- |
||
− | ! colspan=" |
+ | ! colspan="13" style="background:#87cefa;" | Comparison of Multiple Sequence Alignment Tools |
|- |
|- |
||
− | ! style="background:#BFBFBF;" align="center" | |
+ | ! style="background:#BFBFBF;" align="center" | |
− | ! style="background:#BFBFBF;" | |
+ | ! colspan="3" style="background:#BFBFBF;" | MAFFT |
− | ! colspan=" |
+ | ! colspan="3" style="background:#BFBFBF;" | MUSCLE |
− | ! colspan=" |
+ | ! colspan="3" style="background:#BFBFBF;" | T-COFFEE |
− | ! style="background:#BFBFBF;" | |
+ | ! colspan="3" style="background:#BFBFBF;" | EXPRESSO |
|- |
|- |
||
− | ! style="background:#E5E5E5;" align="center" | |
+ | ! style="background:#E5E5E5;" align="center" | Set |
− | ! style="background:#E5E5E5;" align="center" | |
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
− | ! |
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
− | ! style="background:#E5E5E5;" align="center" | |
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
− | ! style="background:#E5E5E5;" align="center" | |
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
− | ! style="background:#E5E5E5;" align="center" | |
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
|- |
|- |
||
− | + | ! align="center" | Gaps |
|
+ | | align="center" | 9 |
||
− | |rowspan="2"| Targeted mutation analysis |
||
− | | |
+ | | align="center" | 532 |
+ | | align="center" | 444 |
||
− | || p.Glu282Ala, p.Tyr231X |
||
+ | | align="center" | 9 |
||
− | || 98% |
||
+ | | align="center" | 325 |
||
− | || 3% |
||
+ | | align="center" | 325 |
||
− | |rowspan="4"| Clinical Testing |
||
+ | | align="center" | 9 |
||
+ | | align="center" | 922 |
||
+ | | align="center" | 957 |
||
+ | | align="center" | 18 |
||
+ | | align="center" | 964 |
||
+ | | align="center" | 910 |
||
|- |
|- |
||
+ | ! align="center" | Blocks |
||
− | || p.Ala305Glu |
||
+ | | align="center" | 3 |
||
− | || 1% |
||
+ | | align="center" | 48 |
||
− | || 30%-60% |
||
+ | | align="center" | 26 |
||
+ | | align="center" | 2 |
||
+ | | align="center" | 38 |
||
+ | | align="center" | 23 |
||
+ | | align="center" | 3 |
||
+ | | align="center" | 48 |
||
+ | | align="center" | 58 |
||
+ | | align="center" | 4 |
||
+ | | align="center" | 48 |
||
+ | | align="center" | 66 |
||
|- |
|- |
||
+ | ! align="center" | Avg Block Size |
||
− | || Sequence analysis |
||
+ | | align="center" | 104.3 |
||
− | |colspan="2"| Sequence variants |
||
+ | | align="center" | 4.6 |
||
− | || N/A |
||
+ | | align="center" | 11.7 |
||
− | || 87% |
||
+ | | align="center" | 156.5 |
||
+ | | align="center" | 8.2 |
||
+ | | align="center" | 13.5 |
||
+ | | align="center" | 104.3 |
||
+ | | align="center" | 4.0 |
||
+ | | align="center" | 5.3 |
||
+ | | align="center" | 87.0 |
||
+ | | align="center" | 3.2 |
||
+ | | align="center" | 4.6 |
||
|- |
|- |
||
+ | ! align="center" | Well Conserved Columns |
||
− | || Deletion / duplication analysis |
||
+ | | align="center" | 252 |
||
− | |colspan="2"| Large genomic deletions/duplications <br> comprising one or more exons |
||
+ | | align="center" | 14 |
||
− | || N/A |
||
+ | | align="center" | 14 |
||
− | || Unknown (<10%) |
||
+ | | align="center" | 251 |
||
+ | | align="center" | 1 |
||
+ | | align="center" | 5 |
||
+ | | align="center" | 252 |
||
+ | | align="center" | 12 |
||
+ | | align="center" | 13 |
||
+ | | align="center" | 248 |
||
+ | | align="center" | 11 |
||
+ | | align="center" | 15 |
||
+ | |- |
||
+ | ! align="center" | All Columns |
||
+ | | align="center" | 313 |
||
+ | | align="center" | 223 |
||
+ | | align="center" | 304 |
||
+ | | align="center" | 313 |
||
+ | | align="center" | 312 |
||
+ | | align="center" | 311 |
||
+ | | align="center" | 313 |
||
+ | | align="center" | 193 |
||
+ | | align="center" | 310 |
||
+ | | align="center" | 312 |
||
+ | | align="center" | 152 |
||
+ | | align="center" | 302 |
||
|- |
|- |
||
|} |
|} |
||
− | <center><small>'''<caption>''' |
+ | <center><small>'''<caption>''' Comparison of MSA tools using sets of different sequence identity with respect to gaps, blocks and conserved columns. </caption></small></center> |
</figtable> |
</figtable> |
||
+ | '''<xr id="MSA">Table</xr>''' lists the number of gaps in the consensus sequence of each multiple sequence alignment, together with the number of blocks between those gaps and their average length. Finally, the number of well conserved columns and the total number of columns are compared.<br> |
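The gap and block statistics can be sketched for a single consensus sequence as shown below. The sketch assumes that "gaps" counts the gap characters in the consensus, that a "block" is a maximal run of non-gap characters, and that '-' is the gap character; these definitions are assumptions made for illustration.

```python
def gap_block_stats(consensus, gap_char="-"):
    """Count gap characters, gap-free blocks and the average block length."""
    gaps = consensus.count(gap_char)
    # Splitting on the gap character leaves the gap-free blocks;
    # empty strings produced by consecutive gaps are dropped.
    blocks = [block for block in consensus.split(gap_char) if block]
    average = sum(len(block) for block in blocks) / len(blocks) if blocks else 0.0
    return {"gaps": gaps, "blocks": len(blocks), "avg_block_size": average}


# Made-up consensus: three blocks (MKV, LLA, GG) separated by three gaps.
stats = gap_block_stats("MKV--LLA-GG")
```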
||
− | <figtable id="AllelicVar"> |
||
+ | To compare the multiple sequence alignment tools with respect to secondary structure and functional residues, '''<xr id="secstruc">Table</xr>''' gives an overview: |
||
+ | <figtable id="secstruc"> |
||
{| border="1" cellpadding="5" cellspacing="0" align="center" |
{| border="1" cellpadding="5" cellspacing="0" align="center" |
||
|- |
|- |
||
− | ! colspan=" |
+ | ! colspan="15" style="background:#87cefa;" | Secondary Structure and Functional Residues in MSA Tools |
|- |
|- |
||
− | ! style="background:#BFBFBF;" align="center" | |
+ | ! colspan="2" style="background:#BFBFBF;" align="center" | |
− | ! style="background:#BFBFBF; |
+ | ! colspan="3" style="background:#BFBFBF;" | MAFFT |
− | ! style="background:#BFBFBF; |
+ | ! colspan="3" style="background:#BFBFBF;" | MUSCLE |
+ | ! colspan="3" style="background:#BFBFBF;" | T-COFFEE |
||
+ | ! colspan="3" style="background:#BFBFBF;" | EXPRESSO |
||
|- |
|- |
||
+ | ! style="background:#E5E5E5;" align="center" | Set |
||
− | || c.433-2A>G |
||
+ | ! style="background:#E5E5E5;" align="center" | |
||
− | || - |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
− | |rowspan="5"| NM_000049.2 <br> NP_000040.1 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| >60 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| <30 |
||
+ | ! style="background:#E5E5E5;" align="center" width="50"| whole |
||
|- |
|- |
||
+ | ! rowspan="2" align="center" | Gaps |
||
− | || c.693C>A |
||
+ | | Gap in SecStruc |
||
− | || p.Tyr231X |
||
+ | || 0 || 0 || 0 |
||
+ | || 0 || 2 || 0 |
||
+ | || 0 || 4 || 0 |
||
+ | || 0 || 5 || 0 |
||
|- |
|- |
||
+ | | SecStruc in Gap |
||
− | || c.854A>C |
||
+ | || 0 || 96 || 2 |
||
− | || p.Glu285Ala |
||
+ | || 0 || 48 || 0 |
||
+ | || 0 || 116 || 1 |
||
+ | || 0 || 141 || 2 |
||
|- |
|- |
||
+ | ! rowspan="2" align="center" | SecStruc |
||
− | || c.863A>G |
||
+ | | Breaks |
||
− | || p.Tyr288Cys |
||
+ | || 0 || 9 || 4 |
||
+ | || 0 || 9 || 11 |
||
+ | || 0 || 23 || 20 |
||
+ | || 0 || 24 || 19 |
||
|- |
|- |
||
+ | | in Helix |
||
− | || c.914C>A |
||
+ | || 0 || 6 || 1 |
||
− | || p.Ala305Glu |
||
+ | || 0 || 6 || 4 |
||
+ | || 0 || 8 || 12 |
||
+ | || 0 || 13 || 10 |
||
+ | |- |
||
+ | ! rowspan="3" align="center" | FuncRes |
||
+ | | Missing |
||
+ | || - || - || - |
||
+ | || - || - || - |
||
+ | || - || B || - |
||
+ | || - || - || - |
||
+ | |- |
||
+ | | Well Conserved |
||
+ | || AMBR || M || MBR |
||
+ | || AMBR || - || R |
||
+ | || AMBR || M || MR |
||
+ | || AMBR || M || MBR |
||
+ | |- |
||
+ | | Less Conserved |
||
+ | || - || B || ABR |
||
+ | || - || BM || AMBR |
||
+ | || - || - || AMBR |
||
+ | || - || B || ABR |
||
|- |
|- |
||
|} |
|} |
||
+ | <center><small>'''<caption>''' Comparison of secondary structure (SecStruc) and functional residues (FuncRes) in MSA tools.<br>gap in secstruc - shows a gap in the secondary structure compared to consensus sequence <br> secstruc in gap - shows gaps in the consensus sequence compared to secondary structure sequence.<br> A=active site M=metal binding B=binding site R=binding region</caption></small></center> |
||
− | <center><small>'''<caption>''' Disease causing Mutations in Canavan Disease. (source: [http://www.ncbi.nlm.nih.gov/books/NBK1234/ NCBI]) </caption></small></center> |
||
</figtable> |
</figtable> |
||
+ | The first part of '''<xr id="secstruc">Table</xr>''' focuses on gaps. It distinguishes between gaps in the consensus sequence that are not gaps in the secondary structure (secstruc in gap) and positions where the secondary structure shows a gap relative to the consensus sequence (gap in secstruc). The calculation was made as follows: the secondary structure of one aligned sequence (2O4H for above60 and whole range, 3LWU for below30) was compared to the consensus sequence. Thus there might be a ''gap in the secondary structure''.<br> |
||
− | == References == |
||
+ | The second part describes the breaks in the secondary structure elements. Breaks within loop regions were not taken into account.<br> |
||
− | The written text is based on a summary of different sources: <br> |
||
+ | Finally, the functional residues listed in Uniprot were compared to the consensus sequence of the multiple alignments. As explained above, the reference sequence for the secondary structure elements in the below30 set has no Uniprot entry for active site or binding region. "A" stands for active site, "M" for metal binding (in this case the zinc ion), "B" for binding site and "R" for binding region. |
||
− | Canavan Disease: |
||
+ | |||
− | * [http://en.wikipedia.org/wiki/Myrtelle_Canavan Wikipedia: Mytelle Canavan] |
||
+ | ====MAFFT==== |
||
− | * [http://ghr.nlm.nih.gov/condition/canavan-disease Genetics Home Reference] |
||
+ | '''[http://www.ebi.ac.uk/Tools/msa/mafft/ MAFFT]''' is a very fast method that produces good alignments. Like MUSCLE, it finds blocks and long consensus sequences. This is an advantage when comparing the columns containing functionally important residues such as active sites or metal binding sites. Almost all of these are very well conserved in the MAFFT alignments. Like MUSCLE, it performs quite well on sets with low sequence similarity, but it is better with respect to breaks in the secondary structure. In summary, this method seems to be the most promising one. |
||
− | * [http://www.pnas.org/content/104/2/456.short PNAS] |
||
+ | |||
− | * [https://www.counsyl.com/diseases/canavan-disease/ Counsyl] |
||
+ | ====MUSCLE==== |
||
− | * [http://omim.org/entry/271900 OMIM] |
||
+ | '''[http://www.ebi.ac.uk/Tools/msa/muscle/ MUSCLE]''' performs quite well and finds long blocks. It also yields a long consensus sequence. This also holds for low sequence similarity, although fewer columns are well conserved. MUSCLE does not perform as well as MAFFT with respect to secondary structure breaks. This method is very promising. |
||
− | * [http://www.canavanfoundation.org Canavan Foundation] |
||
+ | |||
− | * [http://www.canavandisease.net CanavanDisease.net] |
||
+ | ====T-COFFEE==== |
||
− | * [http://www.nlm.nih.gov/medlineplus/ency/article/001586.htm Medline Plus] |
||
+ | The alignment produced by '''[http://tcoffee.crg.cat/apps/tcoffee/do:regular T-COFFEE]''' is among those with the most gaps; with respect to the blocks and columns found, it is comparable to MAFFT. T-COFFEE does not seem to handle sets of low to medium sequence similarity well. Together with EXPRESSO, it shows the highest numbers of breaks in the secondary structure elements, especially in helices. T-COFFEE is the only method that even misses a functional residue of a binding site. |
||
− | * [http://www.ninds.nih.gov/disorders/canavan/canavan.htm National Institute of Neurological Disorders] |
||
+ | |||
− | * [http://www.ncbi.nlm.nih.gov/books/NBK1234/ NCBI bookshelf] |
||
+ | ====EXPRESSO==== |
||
− | * [http://www.kegg.jp/kegg-bin/get_htext?htext=br08402.keg&query=canavan KEGG] |
||
+ | '''[http://tcoffee.crg.cat/apps/tcoffee/do:expresso EXPRESSO]''' was used instead of 3D-COFFEE for an alignment with secondary structure information. It improves slightly on T-COFFEE for sets of low sequence similarity, especially with respect to conserved columns. In contrast to T-COFFEE it finds all functional residues, but otherwise the results are nearly the same. |
||
− | * [http://rarediseases.info.nih.gov/gard/5984/canavan-disease/resources/1 US Department of Health and Human Services - Genetic Rare Disease Information Center] |
||
+ | |||
+ | ====Overall==== |
||
+ | As there was no big difference between the below30 set and the original whole range set, the whole range set was changed to a mixture of the above60 and below30 sets. |
||
+ | The overall comparison shows that the sequence identity within the sets clearly influences the alignment: the higher the sequence identity within a set, the better the algorithms performed.<br> |
||
+ | All functionally important sites annotated for either 2O4H or 3LWU were detected well. The active site in the whole range set is found, but is less conserved. This may be due to the "bad" alignments of the below30 sequences.<br> |
||
+ | In conclusion, MAFFT seems to be the best tool for a good alignment: it shows good conservation of the aligned blocks, and the secondary structure elements and functionally important residues are well conserved.<br> |
||
+ | |||
+ | ====Example Alignment==== |
||
+ | To give an impression, '''<xr id="msaMafft">Figure</xr>''' displays one multiple sequence alignment: the MAFFT alignment of the set above 60 percent sequence identity, visualized with [http://www.cgl.ucsf.edu/chimera/ '''UCSF Chimera''']: |
||
+ | |||
+ | <figure id="msaMafft"> |
||
+ | [[Image:CanavanMafftAbove60Alignment.png|centre|thumb|694px|'''<caption>'''Multiple sequence alignment of the above 60% sequence identity set using '''MAFFT'''.</caption>]] |
||
+ | </figure> |
||
+ | |||
+ | For reasons of clarity, the other alignments can be found in the '''[[Canavan_Disease:_Task_02_-_Alignments#Supplement|Supplement]]'''. |
||
+ | == [[Canavan_Disease:_Task_02_-_Supplement|Supplement]] == |
||
− | ASPA: |
||
− | * [http://omim.org/entry/608034 OMIM] |
||
− | * [http://www.uniprot.org/uniprot/P45381 Uniprot] |
||
== Tasks == |
== Tasks == |
Latest revision as of 11:46, 5 September 2013
The first step towards more insight into aspartoacylase (ASPA) at the genetic level is a search for similar sequences. This is done by performing sequence alignments with ASPA as the query sequence.
Pairwise sequence alignments
Pairwise sequence alignments help to find related sequences. In this task, ASPA, whose mutations are the major cause of Canavan Disease, is used as the input sequence for several pairwise sequence alignment methods. The sequence searches were performed with Blast, HHblits, and multiple PsiBlast runs with varying parameters. To give a clear overview, diminutives (short names) and a color scheme are used to describe and compare all methods (see <xr id="Dimin"></xr>):
<figtable id="Dimin">
Diminutives for Pairwise Sequence Alignment Searches | |
---|---|
Diminutive | Description |
Blast | Blast against big_80 |
HHblits | HHblits standard parameters against Uniprot |
PsiBlast D2 | PsiBlast using the default e-Value cutoff (0.002) with 2 iterations against big_80 |
PsiBlast e2 | PsiBlast using the e-Value cutoff 10E-10 with 2 iterations against big_80 |
PsiBlast D10 | PsiBlast using the default e-Value cutoff (0.002) with 10 iterations against big_80 |
PsiBlast e10 | PsiBlast using the e-Value cutoff 10E-10 with 10 iterations against big_80 |
Big | PsiBlast using the e-Value cutoff 10E-10 with 1 iteration and the checkfile from PsiBlast e10 against big |
</figtable>
As can be seen in <xr id="Dimin"></xr>, the simple Blast search and the different PsiBlast searches were run against big_80. Big is a single iteration against big (not big_80) that uses the checkfile of the 10th iteration of PsiBlast e10. This setup was chosen for two reasons: big_80 is already redundancy-reduced and may overpredict when many iterations are used, and a search against big makes the performance of PsiBlast and HHblits comparable. Since HHblits would require a specially built database, it was run against an already existing Uniprot database.
Comparison of Search Tools
The next step was to evaluate the different search tools. The evaluation focused on the e-Value distribution, the sequence identity distribution and the overlap of the sequences found. Since HHblits results consist of clusters, the calculations always refer to the complete cluster, not just the Uniprot cluster representative. This approach was chosen to obtain comparable results for Blast, HHblits and Big, where Big serves as the representative of the PsiBlast searches.
As can be seen in <xr id="eVal">Figure</xr>, the distributions of the e-Values found by the different methods are comparable:
<figure id="eVal">
</figure>
As mentioned above, the methods are best compared by looking at the graphs for Blast, HHblits and Big (compare <xr id="eVal">Figure</xr>). With a higher number of iterations in the PsiBlast search, the e-Value distribution becomes broader. This is because specificity decreases with each iteration as insignificant hits are incorporated into the result.
<figure id="seqID">
</figure>
Comparing the distributions of sequence identities for the methods used (see <xr id="seqID">Figure</xr>), it becomes visible that more PsiBlast iterations result in lower percentage sequence identities. Again, specificity decreases with each iteration as the number of insignificant hits increases.
<xr id="Overlap"></xr> shows the overlapping sequence hits between methods. The diagonal gives the number of sequences found by each method itself. Again, it is important to note that HHblits results consist of clusters: the table lists the cluster representatives as well as all cluster members (in brackets).
<figtable id="Overlap">
Overlapping Sequence-Hits | ||||||||
---|---|---|---|---|---|---|---|---|
Blast | HHblits | PsiBlast D2 | PsiBlast e2 | PsiBlast D10 | PsiBlast e10 | Big | ||
Blast | 194 | 24 (145) | 186 | 186 | 183 | 185 | 185 | |
HHblits | 24 (145) | 276 (3128) | 94 (719) | 84 (666) | 145 (1043) | 143 (1062) | 195 (2847) | |
PsiBlast D2 | 186 | 94 (719) | 918 | 830 | 894 | 899 | 899 | |
PsiBlast e2 | 186 | 84 (666) | 830 | 838 | 818 | 824 | 824 | |
PsiBlast D10 | 183 | 145 (1043) | 894 | 818 | 3505 | 2513 | 2335 | |
PsiBlast e10 | 185 | 143 (1062) | 899 | 824 | 2513 | 2725 | 2432 | |
Big | 185 | 195 (2847) | 899 | 824 | 2335 | 2432 | 6170 |
The diagonal shows the number of sequences found in the respective search. For HHblits, cluster representatives and complete clusters (in brackets) are distinguished.
</figtable>
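The overlap counts in such a table can be reproduced with plain set intersections. The following is a minimal sketch, assuming each method's hits are available as a set of sequence identifiers; the function name `overlap_matrix` and the toy identifiers are hypothetical and not taken from the original analysis.

```python
from itertools import combinations_with_replacement


def overlap_matrix(hits):
    """Return {(a, b): number of shared hits} for every pair of methods.

    `hits` maps a method name to the set of sequence identifiers it found.
    The diagonal entry (a, a) is the size of that method's own hit set.
    """
    matrix = {}
    for a, b in combinations_with_replacement(sorted(hits), 2):
        shared = len(hits[a] & hits[b])
        matrix[(a, b)] = shared
        matrix[(b, a)] = shared  # the table is symmetric
    return matrix


# Toy identifiers for illustration only, not the real search results.
example_hits = {
    "Blast": {"P1", "P2", "P3"},
    "HHblits": {"P2", "P3", "P4", "P5"},
    "Big": {"P1", "P2", "P3", "P4", "P6"},
}
overlaps = overlap_matrix(example_hits)
```

For HHblits, the same function could be called twice, once with the cluster representatives and once with all cluster members, to obtain both numbers shown per cell.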
For a better overview, <xr id="venn"></xr> shows the number of sequences that overlap between Blast, HHblits and Big:
<figure id="venn">
</figure>
Validation using GeneOntology
The validation using GeneOntology takes all cluster members of HHblits into account. The intersection of all methods, including the different PsiBlast searches, comprises 142 proteins. For those proteins the GO annotations were analyzed. <xr id="GOAnn">Table</xr> shows the top 10 annotations together with their counts (how often they appeared):
<figtable id="GOAnn">
GO-Annotations of all Common Proteins | |
---|---|
Count | Annotation |
141 | hydrolase activity, acting on ester bonds |
141 | metabolic process |
113 | hydrolase activity |
112 | metal ion binding |
75 | zinc ion binding |
69 | hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amides |
58 | aspartoacylase activity |
20 | arginine metabolic process |
20 | arginine catabolic process to glutamate |
20 | arginine catabolic process to succinate |
GO annotations of the proteins common to all methods (Blast, HHblits, PsiBlast D2/e2/D10/e10 and Big). Highlighted: aspartoacylase activity, since aspartoacylase is the major enzyme causing Canavan Disease.
</figtable>
As displayed in <xr id="GOAnn"></xr>, the top results show the associations typical of Canavan Disease. The highlighted aspartoacylase activity is the enzymatic reaction of ASPA. As aspartoacylase contains a bound zinc ion, the high occurrence of metal ion and zinc ion binding is to be expected. Furthermore, aspartoacylase is a hydrolase subtype. The last three high-ranking GO terms can be explained by the fact that L-aspartate, a compound in the reaction catalyzed by ASPA, is a precursor in arginine metabolism (Task 01).
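Counting GO annotations over a set of common proteins amounts to a simple frequency count. The sketch below assumes the annotations are already available as a mapping from protein identifier to a list of GO term names; `top_go_terms` and the toy data are hypothetical and only illustrate the counting step.

```python
from collections import Counter


def top_go_terms(annotations, n=10):
    """Count in how many proteins each GO term occurs and return the n most common."""
    counts = Counter()
    for terms in annotations.values():
        counts.update(set(terms))  # count each term at most once per protein
    return counts.most_common(n)


# Toy annotations for illustration only.
example_annotations = {
    "protA": ["hydrolase activity", "zinc ion binding"],
    "protB": ["hydrolase activity", "metal ion binding", "zinc ion binding"],
    "protC": ["hydrolase activity"],
}
top = top_go_terms(example_annotations, n=2)
```

Counting each term once per protein keeps the count bounded by the number of proteins, which matches the table, where no count exceeds 142.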
Multiple Sequence Alignments
To analyze methods for multiple sequence alignment, sets of different sequence identity were built. The sets were drawn randomly from the Big data, as this made it possible to take PDB structures into account. The randomly generated sets with sequence identity above 60% and below 30% each contain 10 sequences. The whole range set contains 20 sequences and is a mixture of the above60 and below30 sets, since every randomly generated whole range set had about the same average sequence identity as the below30 set. This is due to the very large number of sequences with identities below 25%, as shown in <xr id="seqID"></xr>.
Concerning the secondary structure, the same protein (2O4H) was used to calculate the differences in the above60 set and in the whole range set. Its annotation covers the active site, binding regions, metal binding sites and simple binding sites. For the below30 set, finding a PDB structure with information comparable to aspartoacylase was quite difficult. The protein succinylglutamate desuccinylase/aspartoacylase from a bacterial species seemed to be best suited for this purpose. Unfortunately, only metal binding sites and simple binding sites are annotated in Uniprot for this protein. Therefore no information on the active site or binding region is available, and the conservation of those sites within the multiple sequence alignment could not be assessed.
<xr id="MSA">Table</xr> gives an overview of the resulting multiple sequence alignments.
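Sorting hits into identity sets can be sketched as follows. This is only an illustration under assumptions: the original text does not state how sequence identity was computed, so the definition used here (identical residues divided by the number of columns in which both sequences have a residue) and the names `percent_identity` and `assign_set` are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the columns where both aligned sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def assign_set(identity, high=60.0, low=30.0):
    """Return 'above60', 'below30', or None for identities in between."""
    if identity > high:
        return "above60"
    if identity < low:
        return "below30"
    return None


# Toy pairwise alignment: 5 comparable columns, 4 of them identical.
identity = percent_identity("ACDE-G", "ACDFKG")
label = assign_set(identity)
```

From the labelled hits, 10 sequences per set could then be drawn with `random.sample`, mirroring the random selection described above.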
<figtable id="MSA">
Comparison of Multiple Sequence Alignment Tools | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
MAFFT | MUSCLE | T-COFFEE | EXPRESSO | |||||||||
Set | >60 | <30 | whole | >60 | <30 | whole | >60 | <30 | whole | >60 | <30 | whole |
Gaps | 9 | 532 | 444 | 9 | 325 | 325 | 9 | 922 | 957 | 18 | 964 | 910 |
Blocks | 3 | 48 | 26 | 2 | 38 | 23 | 3 | 48 | 58 | 4 | 48 | 66 |
Avg Block Size | 104.3 | 4.6 | 11.7 | 156.5 | 8.2 | 13.5 | 104.3 | 4.0 | 5.3 | 87.0 | 3.2 | 4.6 |
Well Conserved Columns | 252 | 14 | 14 | 251 | 1 | 5 | 252 | 12 | 13 | 248 | 11 | 15 |
All Columns | 313 | 223 | 304 | 313 | 312 | 311 | 313 | 193 | 310 | 312 | 152 | 302 |
</figtable>
<xr id="MSA">Table</xr> lists the number of gaps in the consensus sequence of each multiple sequence alignment. It also gives the number of blocks between those gaps and their average length. Finally, the number of well conserved columns and the total number of columns are compared.
To compare the multiple sequence alignment tools with respect to secondary structure and functional residues, <xr id="secstruc">Table</xr> gives an overview:
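The gap and block statistics can be illustrated with a short sketch. It assumes the consensus is given as a string in which `-` marks a gap column and that a block is a maximal run of non-gap columns; how the original counts were derived is not stated, so `consensus_stats` is a hypothetical helper and its numbers need not match the table exactly.

```python
from itertools import groupby


def consensus_stats(consensus, gap="-"):
    """Summarise gap columns and ungapped blocks of a consensus string."""
    # Collapse the string into runs of (is_gap, run_length).
    runs = [
        (is_gap, len(list(group)))
        for is_gap, group in groupby(consensus, key=lambda c: c == gap)
    ]
    block_lengths = [length for is_gap, length in runs if not is_gap]
    gaps = sum(length for is_gap, length in runs if is_gap)
    blocks = len(block_lengths)
    avg_block_size = sum(block_lengths) / blocks if blocks else 0.0
    return {
        "gaps": gaps,
        "blocks": blocks,
        "avg_block_size": avg_block_size,
        "columns": len(consensus),
    }


# Toy consensus: blocks of length 3, 4 and 2 separated by 2 + 1 gap columns.
stats = consensus_stats("ACD--EFGH-KL")
```

With this definition, few long blocks indicate an alignment whose consensus is rarely interrupted, which is the property the block counts and average block sizes are used to compare.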
<figtable id="secstruc">
Secondary Structure and Functional Residues in MSA Tools | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MAFFT | MUSCLE | T-COFFEE | EXPRESSO | |||||||||||
Set | >60 | <30 | whole | >60 | <30 | whole | >60 | <30 | whole | >60 | <30 | whole | ||
Gaps | Gap in SecStruc | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 4 | 0 | 0 | 5 | 0 | |
SecStruc in Gap | 0 | 96 | 2 | 0 | 48 | 0 | 0 | 116 | 1 | 0 | 141 | 2 | ||
SecStruc | Breaks | 0 | 9 | 4 | 0 | 9 | 11 | 0 | 23 | 20 | 0 | 24 | 19 | |
in Helix | 0 | 6 | 1 | 0 | 6 | 4 | 0 | 8 | 12 | 0 | 13 | 10 | ||
FuncRes | Missing | - | - | - | - | - | - | - | B | - | - | - | - | |
Well Conserved | AMBR | M | MBR | AMBR | - | R | AMBR | M | MR | AMBR | M | MBR | ||
Less Conserved | - | B | ABR | - | BM | AMBR | - | - | AMBR | - | B | ABR |
gap in secstruc - shows a gap in the secondary structure compared to consensus sequence
secstruc in gap - shows gaps in the consensus sequence compared to secondary structure sequence.
A=active site M=metal binding B=binding site R=binding region
</figtable>
The first part of <xr id="secstruc">Table</xr> focuses on gaps. It distinguishes between gaps in the consensus sequence that are not gaps in the secondary structure (secstruc in gap) and positions where the secondary structure shows a gap relative to the consensus sequence (gap in secstruc). The calculation was made as follows: the secondary structure of one aligned sequence (2O4H for above60 and whole range, 3LWU for below30) was compared to the consensus sequence. Thus there might be a gap in the secondary structure.
The second part describes the breaks in the secondary structure elements. Breaks within loop regions were not taken into account.
Finally, the functional residues listed in Uniprot were compared to the consensus sequence of the multiple alignments. As explained above, the reference sequence for the secondary structure elements in the below30 set has no Uniprot entry for active site or binding region. "A" stands for active site, "M" for metal binding (in this case the zinc ion), "B" for binding site and "R" for binding region.
MAFFT
MAFFT is a very fast method that produces good alignments. Like MUSCLE, it finds blocks and long consensus sequences. This is an advantage when comparing the columns containing functionally important residues such as active sites or metal binding sites. Almost all of these are very well conserved in the MAFFT alignments. Like MUSCLE, it performs quite well on sets with low sequence similarity, but it is better with respect to breaks in the secondary structure. In summary, this method seems to be the most promising one.
MUSCLE
MUSCLE performs quite well and finds long blocks. It also yields a long consensus sequence. This also holds for low sequence similarity, although fewer columns are well conserved. MUSCLE does not perform as well as MAFFT with respect to secondary structure breaks. This method is very promising.
T-COFFEE
The alignment produced by T-COFFEE is among those with the most gaps; with respect to the blocks and columns found, it is comparable to MAFFT. T-COFFEE does not seem to handle sets of low to medium sequence similarity well. Together with EXPRESSO, it shows the highest numbers of breaks in the secondary structure elements, especially in helices. T-COFFEE is the only method that even misses a functional residue of a binding site.
EXPRESSO
EXPRESSO was used instead of 3D-COFFEE for an alignment with secondary structure information. It improves slightly on T-COFFEE for sets of low sequence similarity, especially with respect to conserved columns. In contrast to T-COFFEE it finds all functional residues, but otherwise the results are nearly the same.
Overall
As there was no big difference between the below30 set and the original whole range set, the whole range set was changed to a mixture of the above60 and below30 sets.
The overall comparison shows that the sequence identity within the sets clearly influences the alignment: the higher the sequence identity within a set, the better the algorithms performed.
All functionally important sites annotated for either 2O4H or 3LWU were detected well. The active site in the whole range set is found, but is less conserved. This may be due to the "bad" alignments of the below30 sequences.
In conclusion, MAFFT seems to be the best tool for a good alignment: it shows good conservation of the aligned blocks, and the secondary structure elements and functionally important residues are well conserved.
Example Alignment
To give an impression, <xr id="msaMafft">Figure</xr> displays one multiple sequence alignment: the MAFFT alignment of the set above 60 percent sequence identity, visualized with UCSF Chimera:
<figure id="msaMafft">
</figure>
For reasons of clarity, the other alignments can be found in the Supplement.
Supplement
Tasks
- Link to Task 01: Canavan Disease
- Link to Task 02: Alignments
- Link to Task 03: Sequence-based Predictions
- Link to Task 04: Structural Alignments
- Link to Task 05: Homology Modelling
- Link to Task 06: Protein Structure Prediction from Evolutionary Sequence Variation
- Link to Task 07: Researching SNPs
- Link to Task 08: Sequence-based Mutation Analysis
- Link to Task 09: Structure-based Mutation Analysis
- Link to Task 10: Normal Mode Analysis