Canavan Task 2 - Sequence alignments
- 1 Protocol
- 2 GO Term Enrichment
- 3 Pairwise Sequence Search
- 4 Validation and Comparison
- 5 Multiple Sequence Alignments
Further information can be found in the protocol.
GO Term Enrichment
In the following we are performing different sequence searches with the protein Aspartoacylase (UniProt ID: P45381). In order to validate the found hits, we are looking for common GO classifications of the hits with the query sequence.
For our protein Aspartoacylase there are 17 annotated GO terms (using EMBL's QuickGO):
|Category||GO ID||GO Name|
|Cellular Component||GO:0005634||nucleus (3X)|
|Biological Process||GO:0006533||aspartate catabolic process|
|Biological Process||GO:0022010||central nervous system myelination|
|Biological Process||GO:0048714||positive regulation of oligodendrocyte differentiation|
|Molecular Function||GO:0046872||metal ion binding|
|Molecular Function||GO:0016788||hydrolase activity, acting on ester bonds|
|Molecular Function||GO:0016811||hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amides|
Pairwise Sequence Search
Even with the non-restrictive default E-Value cutoff, we only find 196 hits. With a stricter cutoff, we find 94 significant hits. Most of the resulting proteins are Aspartoacylases of other species. Most of the results with an E-Value > e-15 are Succinylglutamate Desuccinylases, which belong to the same protein family (Desuccinylase / Aspartoacylase family) and catalyze a reaction similar to that of Aspartoacylase.
We run PSIBlast with four different parameter combinations:
- 2 iterations (j=2), default E-Value cutoff for inclusion of .. (h=0.002)
- 2 iterations (j=2), strict E-Value cutoff for inclusion of .. (h=10e-10)
- 10 iterations (j=10), default E-Value cutoff for inclusion of .. (h=0.002)
- 10 iterations (j=10), strict E-Value cutoff for inclusion of .. (h=10e-10)
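The four parameter combinations above form a small grid, which can be generated rather than typed by hand. The following is only a sketch: it assumes the legacy `blastpgp` program (whose `-j` and `-h` options match the parameter names used above), and the query file name, database name and output file names are hypothetical placeholders, not the ones used in this task.

```python
from itertools import product

# Parameter values taken from the list above.
ITERATIONS = [2, 10]
INCLUSION_CUTOFFS = ["0.002", "10e-10"]


def psiblast_commands(query="query.fasta", database="example_db"):
    """Build one blastpgp argument list per parameter combination.

    `query` and `database` are hypothetical placeholders.
    """
    commands = []
    for j, h in product(ITERATIONS, INCLUSION_CUTOFFS):
        output = f"psiblast_j{j}_h{h}.out"  # hypothetical naming scheme
        commands.append(
            ["blastpgp", "-i", query, "-d", database,
             "-j", str(j), "-h", h, "-o", output]
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them.
    for command in psiblast_commands():
        print(" ".join(command))
```

Each argument list could be passed to `subprocess.run` once the placeholders are replaced with real paths.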
As a cutoff value for significance of hits we chose 2e-10.
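Applying such a significance cutoff can be sketched as follows. The sketch assumes that hits are available as (identifier, E-Value string) pairs; the identifiers in the example are made up. Because E-Values are sometimes printed in a shortened form such as `e-142`, the helper prepends the missing `1` before converting.

```python
def parse_evalue(text):
    """Convert an E-Value string to a float, accepting the short form 'e-142'."""
    text = text.strip()
    if text.lower().startswith("e"):
        text = "1" + text  # 'e-142' means 1e-142
    return float(text)


def significant_hits(hits, cutoff=2e-10):
    """Keep the (identifier, E-Value string) pairs at or below the cutoff."""
    return [(name, e) for name, e in hits if parse_evalue(e) <= cutoff]


if __name__ == "__main__":
    # Hypothetical identifiers; the E-Values are examples quoted on this page.
    example = [("hitA", "e-142"), ("hitB", "2e-29"), ("hitC", "0.0011")]
    print(significant_hits(example))
```

Note that the notation `10e-10` used above is numerically the same as `1e-9`.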
The PSI-Blast run with default parameters (2 iterations, E-Value 0.002) yields 597 hits. The most significant hit has an E-Value of e-142. When restricting the search to a lower E-Value of 10e-10, we still get 502 hits.
only get 93 hits. This is similar to the result of the BlastP search, except that the worst hit still has a significant E-Value of 2e-29. Hits with good E-Values are mostly Aspartoacylases, while hits with worse E-Values include more and more Succinylglutamate Desuccinylases.
Increasing the number of iterations results in many more hits. Using different E-Value cutoffs has little influence on the results, though. After 8 rounds, the search converged with the default E-Value, while it ran all 10 rounds with the more restrictive E-Value cutoff. Interestingly, the best hits are less significant (~e-70) than for the PSI-Blast search with only 2 iterations (~e-140). The most significant results also appear to include more Succinylglutamate Desuccinylases than Aspartoacylases.
|Parameters||it2, def E-Value (h=2e-3)||it2, E-Value h=10e-10||it10 def E-Value (h=2e-3)||it10 E-Value h=10e-10|
|results (EVal: 10)||915||835||500||500|
|results (EVal: 2e-3)||597||502||500||500|
|comment||mixed results with Aspartoacylases and Succi||very varying results: Aspartoacylases, Succinylases, Zinc Proteins|
|Parameters||it 2||it 8|
Validation and Comparison
As expected, one finds more hits with Psi-Blast than with a simple Blast search.
In general, one can distinguish between two kinds of proteins that are frequently identified by the sequence searches:
- Aspartoacylases
- Succinylglutamate Desuccinylases
A simple Blast search yields only about 90 significant hits if one uses 10e-10 as the significance cutoff. As one can see in Figure <xr id="blastp_comp"/>, restricting the E-Value results in fewer hits with a low sequence similarity.
For all 94 hits found with an E-Value cutoff of 10e-10, there are annotated GO terms. Furthermore, all found hits share the GO terms "hydrolase activity, acting on ester bonds" and "metabolic process". Also, as one can see in <xr id="go_blastp_10e10" />, the hits share most of their GO terms with Aspartoacylase. Again, "Zinc binding" could also be associated with Aspartoacylase. Therefore, all GO terms that are found more than 5 times are also associated with Aspartoacylase. The results are more accurate with respect to GO terms shared with Aspartoacylase. This is what one would expect when restricting the E-Value in order to find the more closely related proteins.
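The check described here (which GO terms occur more than 5 times among the hits, and whether they are also annotated for the query) can be sketched as below. This is a minimal illustration, assuming the annotations are available as a dictionary from hit identifier to a list of GO IDs; the function names and the toy data are hypothetical.

```python
from collections import Counter


def go_term_frequencies(hit_annotations):
    """Count in how many hits each GO term occurs (once per hit)."""
    counts = Counter()
    for terms in hit_annotations.values():
        counts.update(set(terms))
    return counts


def frequent_terms_not_in_query(hit_annotations, query_terms, min_count=5):
    """Return GO terms found more than `min_count` times that the query lacks."""
    counts = go_term_frequencies(hit_annotations)
    return sorted(
        term for term, n in counts.items()
        if n > min_count and term not in query_terms
    )


if __name__ == "__main__":
    # Toy data: six made-up hits that all carry two of the query's GO terms.
    query = {"GO:0016788", "GO:0046872"}
    hits = {f"hit{i}": ["GO:0016788", "GO:0046872"] for i in range(6)}
    print(go_term_frequencies(hits))
    print(frequent_terms_not_in_query(hits, query))
```

An empty result of `frequent_terms_not_in_query` corresponds to the statement that every frequently found GO term is also associated with the query.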
Increasing the number of iterations performed in a PSI-Blast search increases the running time. The best-ranked hits of the runs with 10 iterations have lower E-Values than the best hits of the runs with fewer iterations. Yet the result includes a larger number of significant hits with higher E-Values. This means that increasing the number of iterations finds more distantly related sequences, which is the expected outcome. This outcome is also reflected in the distribution of sequence identities. As one can see in figure ??, running PSI-Blast with 10 iterations results in hits with a lower sequence identity to our query sequence than the hits from the run with 2 iterations.
When restricting the E-Value cutoff for the profile build-up, we found that more hits are classified as Aspartoacylases than as Succinylglutamate Desuccinylases. The running time as well as the E-Values of the resulting hits did not change significantly. The majority of the results from the runs with only two iterations have moderate sequence identities with a broad distribution between 10% and 50%. In contrast, the results from the runs with 10 iterations split into two groups of hits, which form clusters at about 15% and 35% sequence identity. This difference is also reflected in the E-Value distribution. The runs with 2 iterations result in hits with moderate E-Values, with log(E-Value) between -200 and -40. The runs with 10 iterations, in contrast, result in many hits of low significance (log(E-Value) > -20) and a variety of highly significant hits.
For the run with 2 iterations and the default cutoff value of 0.002, we received 915 hits, of which we considered 597 significant (E-Value cutoff 2e-3). 586 of these significant proteins have GO terms annotated. For the run with 2 iterations and an E-Value cutoff of 10e-10, we received 835 hits, of which we considered 502 proteins significant. 496 of these proteins have GO terms annotated.
Running HHBlits with 2 iterations yields a small number of hits (270), with E-Values ranging from very low (2e-110) to very high (0.0011). To increase the number of hits, we repeated the HHBlits search with the maximum of 8 iterations, which resulted in a broader output with more hits with lower average E-Values (compare figure ??). Regarding the sequence identity distribution, running HHBlits with 8 iterations results in more distantly related hits (see Figure ??).
As one can see in Figure ??, roughly 40 percent of the resulting hits are unique to each method. By our criteria, about 25 percent of the hits are significant hits that could be investigated further (overlap of 50 percent).
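The two quantities discussed here, the fraction of hits unique to one method and the hits supported by a minimum number of methods, could be computed along these lines. The sketch assumes that the hit identifiers of each search run are collected in a dictionary keyed by method name; the method names and identifiers in the example are made up.

```python
from collections import Counter


def method_support(hits_by_method):
    """Count for each hit how many methods found it."""
    support = Counter()
    for hits in hits_by_method.values():
        support.update(set(hits))
    return support


def unique_fraction(hits_by_method):
    """Fraction of each method's hits that no other method found."""
    support = method_support(hits_by_method)
    fractions = {}
    for method, hits in hits_by_method.items():
        distinct = set(hits)
        unique = sum(1 for hit in distinct if support[hit] == 1)
        fractions[method] = unique / len(distinct) if distinct else 0.0
    return fractions


def supported_by(hits_by_method, min_methods):
    """Hits found by at least `min_methods` methods, sorted by identifier."""
    support = method_support(hits_by_method)
    return sorted(hit for hit, n in support.items() if n >= min_methods)


if __name__ == "__main__":
    example = {
        "blast": ["a", "b"],
        "psiblast": ["a", "c"],
        "hhblits": ["a", "d"],
    }
    print(unique_fraction(example))
    print(supported_by(example, min_methods=3))
```

With eight search runs, an overlap of 50 percent corresponds to `min_methods=4`.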
We tried to further validate the sequence search hits via structural similarity. Unfortunately, none of the resulting hits was a PDB hit. We also tried to map the sequence identifiers against the UniProtKB/Swiss-Prot PDB cross-references (http://www.uniprot.org/docs/pdbtosp.txt). Again, this mapping yielded no results, which is why we cannot include any structural information in our ongoing research. When inspecting the annotation of the sequence hits, we already found that the majority of the hits code for Aspartoacylases or the closely related Succinylglutamate Desuccinylases. Since a crystal structure of the human Aspartoacylase already exists, it is plausible that one will not find other structures for this class of proteins. Additionally, a large number of hits code for not yet characterized proteins, which are also unlikely to be interesting targets for crystallization.
Multiple Sequence Alignments
For generating our dataset for the MSA we clustered all Hits into Sequence Identity groups:
- >90%: 1
- 60-89%: 59
- 40-59%: 197
- 20-39%: 1141
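The grouping above amounts to binning each hit by its sequence identity to the query. A minimal sketch, assuming identities are given as percentages; the source does not say where a value of exactly 90 belongs, so assigning it to the top group is an assumption, as is dropping hits below 20%.

```python
from collections import Counter


def identity_group(identity):
    """Map a percent identity to one of the group labels used above."""
    if identity >= 90:  # assumption: exactly 90 goes to the top group
        return ">90%"
    if identity >= 60:
        return "60-89%"
    if identity >= 40:
        return "40-59%"
    if identity >= 20:
        return "20-39%"
    return None  # below 20%: not part of any group


def group_sizes(identities):
    """Count how many hits fall into each identity group."""
    groups = (identity_group(value) for value in identities)
    return Counter(group for group in groups if group is not None)


if __name__ == "__main__":
    # Toy identities, not the real hit list.
    print(group_sizes([95, 72, 63, 43, 31, 35, 12]))
```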
Since we only got one hit with a sequence identity >90%, we decided to group our hits into three groups of sequences with eight members each.
From the respective groups we chose hits that were found by at least 4 methods (overlap of 50%).
id      eVal     identity
# 60-99% sequence identity
Q8BZC2  1.7e-25  90
E1BVP5  e-140    72
H2RVG4  e-141    63
G3VM93  e-105    72
F6ZFQ0  e-139    78
F8WFU8  e-145    86
Q28C61  e-132    68
H2M5L4  e-133    64
# 40-59% sequence identity
G5BTW1  e-133    43
G6FRX8  e-103    39
F7NV91  e-112    39
G1Q6P7  e-120    42
H0WH68  e-135    44
F2PFG6  e-119    40
H2MX25  5e-81    40
Q1Z2X2  e-115    38
# 20-39% sequence identity
Q2F9Q7  e-109    31
Q8YQC1  e-117    41
E1SMZ8  e-108    39
D7E1T3  e-110    36
A5GQV1  7e-92    33
E8LP14  e-107    31
F9TUZ3  e-106    30
A6VUE4  e-101    35
All in all, the three alignment methods yield comparable results. One can identify several conserved regions. The two groups with sequence identities <60% in particular show very similar MSAs.
There are three strongly conserved motifs located in the first half of the sequences:
For the second half of the sequence alignments there is no clear consensus about conserved motifs, but several residues are strongly conserved and may be of functional or structural importance.
In the alignment of the >60% group the first two motifs are not colored in the alignment. This is due to two very short sequences which produce gaps in the alignment and thus lower the consensus.
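The effect of short sequences on the consensus can be illustrated with a simple per-column conservation score. This is a sketch only: it assumes the score is the fraction of sequences carrying the most common residue, with gapped sequences counted in the denominator; the coloring threshold of the alignment viewer may be defined differently, and the toy sequences are made up.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue per column.

    Gaps ('-') never count as the consensus residue but still count as
    sequences, so gapped columns score lower.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in range(length):
        residues = Counter(
            seq[column] for seq in alignment if seq[column] != "-"
        )
        top = residues.most_common(1)[0][1] if residues else 0
        scores.append(top / len(alignment))
    return scores


if __name__ == "__main__":
    complete = ["HGG", "HGG", "HGG", "HGG"]
    with_short = ["HGG", "HGG", "-GG", "-GG"]  # two sequences lack column 1
    print(column_conservation(complete))
    print(column_conservation(with_short))
```

In the second toy alignment the first column is perfectly conserved among the sequences that cover it, yet its score drops to one half because of the two gaps.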
clustalw -align -infile=./db_over60.fa -outfile=./clustalw_msa_60.aln
Concerning the wildtype human Aspartoacylase
The three identified motifs can also be found in the wildtype protein (compare coloring of sequence on top of page). We colored the respective residues in the structure. They are all positioned in the same region of the protein and thus might indicate an important functional region.