Difference between revisions of "Canavan Task 2 - Sequence alignments"

Revision as of 15:12, 31 August 2012

Protocol

Further information can be found in the protocol.

GO Term Enrichment

In the following, we perform different sequence searches with the protein Aspartoacylase (UniProt ID: P45381). To validate the hits, we look for GO classifications that they share with the query sequence.

For our protein Aspartoacylase, there are 17 annotated GO terms (according to EMBL's QuickGO):

**<xr nolink id="aspa_go_terms"/>**
The 17 annotated GO terms for Aspartoacylase (P45381)
	GO ID	GO Name
Cellular Component	GO:0005634	nucleus (3X)
	GO:0005737	cytoplasm (4X)
Biological Process	GO:0006533	aspartate catabolic process
	GO:0008152	metabolic process
	GO:0022010	central nervous system myelination
	GO:0048714	positive regulation of oligodendrocyte differentiation
Molecular Function	GO:0046872	metal ion binding
	GO:0004046	aminoacylase activity
	GO:0016787	hydrolase activity
	GO:0016788	hydrolase activity, acting on ester bonds
	GO:0016811	hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amides
	GO:0019807	aspartoacylase activity

</figtable>
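The validation idea described above (counting how often the hits carry GO terms that are also annotated for the query) can be sketched as follows. This is a minimal illustration, not the script used for this task: the hit identifiers and their annotations are made up, and only a subset of the query terms from the table is included.

```python
from collections import Counter

# Subset of the query's GO terms, taken from the table above.
QUERY_GO = {"GO:0008152", "GO:0016788", "GO:0016787", "GO:0004046"}

def shared_go_counts(hit_annotations, query_go):
    """Per GO term: (number of hits carrying it, whether the query has it too)."""
    counts = Counter()
    for terms in hit_annotations.values():
        counts.update(set(terms))  # count each term at most once per hit
    return {term: (n, term in query_go) for term, n in counts.items()}

# Hypothetical hits and annotations, for illustration only.
hits = {
    "hitA": ["GO:0008152", "GO:0016788"],
    "hitB": ["GO:0008152", "GO:0008270"],
    "hitC": ["GO:0016788", "GO:0008152", "GO:0008152"],
}

result = shared_go_counts(hits, QUERY_GO)
for term, (n, shared) in sorted(result.items(), key=lambda kv: -kv[1][0]):
    label = "shared with query" if shared else "not in query annotation"
    print(term, n, label)
```

Counting each term once per hit keeps duplicated annotations (such as the repeated nucleus and cytoplasm entries in the table) from inflating the counts.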

Pairwise Sequence Search

BLASTP

The BLAST search with default parameters yielded 196 results at the default E-value cutoff of 10. With a more restrictive cutoff that keeps only hits with E-values below 0.002, we get 104 hits. The best alignment is with an uncharacterized protein from Rattus norvegicus and has an E-value of 1e-155. Most of the resulting proteins are Aspartoacylases of other species. There are also many uncharacterized proteins, like our best hit. Most of the results with E-values > 1e-15 are Succinylglutamate Desuccinylases, which are in the same protein family (Desuccinylase / Aspartoacylase family) and catalyze a reaction similar to that of Aspartoacylase.

**<xr nolink id="aspa_blastp"/>**
Results from a BlastP search with Aspartoacylase using different E-values.
E-Value	default E-Value (10)
results (EVal: 10)	196
results (EVal: 2e-3)	104
best E-Value	1e-155

</figtable>
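The E-value filtering used for the counts above can be sketched as follows. The sketch assumes that the hits are available as 12-column tabular BLAST output (subject ID in column 2, E-value in column 11); the protocol does not state which output format was actually used, and the example lines are invented.

```python
def significant_hits(tabular_lines, max_evalue=0.002):
    """Return (subject_id, evalue) for all hits with an E-value below max_evalue.

    Assumes tab-separated BLAST output with 12 columns, where column 2 is
    the subject ID and column 11 is the E-value.
    """
    hits = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        evalue = float(fields[10])
        if evalue < max_evalue:
            hits.append((fields[1], evalue))
    return hits

# Invented example lines, for illustration only.
example = [
    "P45381\thitA\t82.0\t310\t55\t1\t1\t310\t1\t310\t1e-155\t540",
    "P45381\thitB\t30.1\t290\t190\t8\t5\t290\t10\t300\t3e-12\t70.1",
    "P45381\thitC\t25.0\t80\t58\t2\t20\t99\t15\t94\t4.2\t28.9",
]
filtered = significant_hits(example)
print(len(filtered), "of", len(example), "hits are below the cutoff")
```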

PSIBLAST

We ran PSI-BLAST with four different parameter combinations:

2 iterations (j=2), default E-Value cutoff for inclusion of sequences into profile (h=0.002)
2 iterations (j=2), strict E-Value cutoff for inclusion of sequences into profile (h=10e-10)
10 iterations (j=10), default E-Value cutoff for inclusion of sequences into profile (h=0.002)
10 iterations (j=10), strict E-Value cutoff for inclusion of sequences into profile (h=10e-10)
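
The four combinations above can be generated programmatically. The sketch below assumes the legacy blastpgp command line, whose -j (iterations) and -h (inclusion E-value) options match the parameter names used in the list; the query file, database, and output file names are placeholders and not taken from the protocol.

```python
from itertools import product

# Values copied from the list above; kept as strings so "10e-10" is passed on unchanged.
ITERATIONS = ["2", "10"]
INCLUSION_CUTOFFS = ["0.002", "10e-10"]

def psiblast_commands(query, database):
    """Build one blastpgp argument list per (iterations, cutoff) combination."""
    commands = []
    for j, h in product(ITERATIONS, INCLUSION_CUTOFFS):
        output = f"psiblast_j{j}_h{h}.out"  # hypothetical output file name
        commands.append(["blastpgp", "-i", query, "-d", database,
                         "-j", j, "-h", h, "-o", output])
    return commands

# "P45381.fasta" and "my_database" are placeholder names.
commands = psiblast_commands("P45381.fasta", "my_database")
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list could be passed to subprocess.run once the placeholder names are replaced with real paths.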

Again, we considered only hits with an E-value of up to 0.002 as significant. The PSI-BLAST run with default parameters (2 iterations, inclusion E-value 0.002) yields 597 hits. The most significant hit is again the uncharacterized protein from Rattus norvegicus, with an E-value of 1e-142. When restricting the profile to sequences with an E-value of up to 10e-10, we still get 502 hits. The best hit is still the Rattus norvegicus protein.

In contrast to the simple BLAST search, the PSI-BLAST runs with two iterations find more distantly related proteins. This can be seen in the large number of Succinylglutamate Desuccinylases that are found (though with higher E-values). As already mentioned, they belong to the same Pfam family as Aspartoacylase.

Increasing the number of iterations results in many more hits. With the less restrictive inclusion E-value, more than 3000 hits are found, compared with about 1500 with the more restrictive one. Interestingly, the best hits (Rattus norvegicus again) are less significant (~1e-70) than for the PSI-BLAST search with only 2 iterations (~1e-140). The majority of the proteins found are now Succinylglutamate Desuccinylases, even among the most significant hits (the first Succinylglutamate Desuccinylase is ranked 15th). Orthologs of Aspartoacylase can be found only among the first 15 significant results. Additionally, even more distant relatives are found, such as Zinc Carboxypeptidases, Carboxypeptidases, and Endopeptidases.

**<xr nolink id="aspa_psiblast"/>**
Results from a PSI-BLAST search with Aspartoacylase (UniProt: P45381) using different E-values and iterations.
Parameters	it2, def E-Value (h=2e-3)	it2, E-Value h=10e-10	it10 def E-Value (h=2e-3)	it10 E-Value h=10e-10
time	~2m30	~2m30	~25m	~22m
best E-Value	1e-142	1e-145	2e-46	1e-68
results (EVal: 2e-3)	597	502	3211	1515

</figtable>

HHBLITS

Comment: the results vary considerably and include Aspartoacylases, Succinylglutamate Desuccinylases, and zinc proteins.

**<xr nolink id="aspa_psiblast"/>**
Results from a HHblits search with Aspartoacylase using different numbers of iterations.
Parameters	it 2	it 8
time	2m50	~6m
results	274	500
best E-Value	2e-110	2.9e-68
worst E-Value	0.0011	9.5e-09

</figtable>

Validation and Comparison

As expected, PSI-BLAST finds more hits than a simple BLAST search.

In general, one can distinguish between two kinds of proteins that are frequently identified by the sequence searches:

Aspartoacylases
Succinylglutamate Desuccinylases

BlastP

With a simple BLAST search we were able to identify the most closely related sequences. The most significant hit (Rattus norvegicus) has a sequence identity of 82%. <xr id="blastp_comp"/> shows the distribution of the sequence identity of all hits with an E-value < 0.002. There are only a few hits with high sequence identities, and the majority of hits have sequence identities of about 30%.
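
A sequence identity distribution like the one described here can be tabulated with a simple histogram. The following is a minimal sketch with invented identity values; the bin width of 10 percentage points is an assumption and not necessarily the binning used in the figure.

```python
from collections import Counter

def identity_histogram(identities, bin_width=10):
    """Count sequence identities (in percent) per bin, keyed by the bin's lower bound."""
    counts = Counter()
    for pid in identities:
        # 100% falls into the top bin instead of opening a new one
        lower = min(int(pid // bin_width) * bin_width, 100 - bin_width)
        counts[lower] += 1
    return dict(sorted(counts.items()))

# Invented identities, for illustration only.
identities = [82.0, 31.5, 29.9, 33.0, 45.2, 28.0, 100.0]
histogram = identity_histogram(identities)
for lower, n in histogram.items():
    print(f"{lower}-{lower + 10}%: {n}")
```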

192 out of the 196 hits have annotated GO terms. As one can see in <xr id="go_blastp_def" />, the hits share many GO terms with Aspartoacylase. The most common GO terms shared with Aspartoacylase are "metabolic process" (185x) and "hydrolase activity, acting on ester bonds" (184x). The GO term "Zinc binding" is not officially associated with Aspartoacylase, yet we know that it does bind zinc. As already mentioned, some of the proteins found have succinylglutamate desuccinylase activity. They belong to the same family as Aspartoacylase, so their occurrence is not surprising.

For all 94 hits found with an E-value cutoff of 10e-10, there are annotated GO terms. Furthermore, all hits found share the GO terms "hydrolase activity, acting on ester bonds" and "metabolic process". Also, as one can see in <xr id="go_blastp_10e10" />, most of the GO terms of the hits are shared with Aspartoacylase. Again, "Zinc binding" could also be associated with Aspartoacylase. With that in mind, all GO terms that are found more than 5 times are also associated with Aspartoacylase. The shared GO terms thus agree more closely with the Aspartoacylase annotation than for the default cutoff, which is what one would expect when restricting the E-value to find the more closely related proteins.

<xr nolink id="blastp_comp"/>
Distribution of Sequence Identity of the proteins found with a simple BlastP run with the Aspartoacylase sequence (P45381).

</figure>

</figtable>

**<xr nolink id="aspa_blastp"/>**
Results from a BlastP search with Aspartoacylase using different E-values.
E-Value	default (10)	10e-10
GO terms	<figure id="go_blastp_def"> <xr nolink id="go_blastp_def"/> GO term enrichment for the hits found with BlastP using the default E-value cutoff of 10. All GO terms that occur more than once are shown. GO terms that are identical with the GO annotation for Aspartoacylase (P45381) are colored red. </figure>	<figure id="go_blastp_10e10"> <xr nolink id="go_blastp_10e10"/> GO term enrichment for the hits found with BlastP using an E-value cutoff of 10e-10. All GO terms that occur more than once are shown. GO terms that are identical with the GO annotation for Aspartoacylase (P45381) are colored red. </figure>

</figtable>

Psi Blast

Increasing the number of iterations performed in a PSI-BLAST search increases the running time. The best-ranked hits of the runs with 10 iterations have higher (less significant) E-values than the best hits of the runs with fewer iterations, and the result includes a larger number of significant hits with higher E-values. This means that increasing the number of iterations finds more distantly related sequences, which is the expected outcome. This outcome is also reflected in the distribution of sequence identities. As one can see in <xr id="PSI_10e10_seqd"/>, running PSI-BLAST with 10 iterations results in more significant hits, and most of these hits have a lower sequence identity than those of the run with two iterations. Comparing the effect of the E-value restriction for the inclusion of sequences into the profile, one finds that hits with lower sequence identity are included in the final hit list when the default cutoff of 0.002 is used. Furthermore, when the more restrictive cutoff is used, fewer hits are found overall (see <xr id="PSI_10it_seqid"/>).

When restricting the E-value cutoff for building the profile, we found that more hits are classified as Aspartoacylases than as Succinylglutamate Desuccinylases. The running time and the E-values of the resulting hits did not change significantly. The majority of the results from the runs with only two iterations have moderate sequence identities with a broad distribution between 10% and 50%. In contrast, the results from the runs with 10 iterations split into two groups of hits, which form clusters at about 15% and 35% sequence identity. This difference is also reflected in the E-value distribution. The runs with 2 iterations result in hits with moderate E-values, with log(E-value) between -200 and -40. The runs with 10 iterations, in contrast, result in many hits of low significance (log(E-value) > -20) and a variety of highly significant hits.
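
The logarithmic E-value comparison above can be reproduced with a few lines. This is a minimal sketch with invented E-values; it assumes base-10 logarithms, and the floor used for E-values reported as 0 is an arbitrary choice.

```python
import math

def log10_evalues(evalues, floor=1e-300):
    """log10 of each E-value; values of 0 are clamped to `floor` first."""
    return [math.log10(max(e, floor)) for e in evalues]

def split_by_significance(evalues, threshold=-20.0):
    """Split log10 E-values into highly significant (<= threshold) and the rest."""
    logs = log10_evalues(evalues)
    strong = [v for v in logs if v <= threshold]
    weak = [v for v in logs if v > threshold]
    return strong, weak

# Invented E-values, for illustration only.
evalues = [1e-142, 2e-46, 1e-5, 0.0]
strong, weak = split_by_significance(evalues)
print(len(strong), "hits with log(E-value) <= -20;", len(weak), "hits above")
```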

<xr nolink id="PSI_10e10_seqd"/>
Distribution of Sequence Id between Psi-Blast runs with 2 iterations vs 10 iterations (using E-Value 10e-10)

</figure>

<xr nolink id="PSI_2it_seqid"/>
Distribution of Sequence Id for Psi-Blast runs with 2 iterations with different E-Values (def E-Value vs E-Value of 10e-10)

</figure>

<xr nolink id="PSI_10it_seqid"/>
Distribution of Sequence Id for Psi-Blast runs with 10 iterations with different E-Values (def E-Value vs E-Value of 10e-10)

</figure>

<xr nolink id="eval_distri_psiblast"/>
Distribution of logarithmic E_Values for the four different PSIBlast runs

</figure>

For the run with 2 iterations and the default cutoff value of 0.002, we received 915 hits, of which we considered 597 significant (E-value cutoff 2e-3). 586 of these significant proteins have GO terms annotated. For the run with 2 iterations and an E-value cutoff of 10e-10, we received 835 hits, of which we considered 502 significant; 496 of these have GO terms annotated.

For 10 iterations with 10e-10 there were 1515 significant hits and 1461 have annotated GO terms. 3152

<xr nolink id="psi_2it_def"/>
Distribution of Sequence Id between Psi-Blast runs with 2 iterations vs 10 iterations (using E-Value 10e-10)

</figure>

<xr nolink id="psi_2it_10e10"/>
Distribution of Sequence Id for Psi-Blast runs with 2 iterations with different E-Values (def E-Value vs E-Value of 10e-10)

</figure>

<xr nolink id="psi_10it_def"/>
Distribution of Sequence Id for Psi-Blast runs with 10 iterations with different E-Values (def E-Value vs E-Value of 10e-10)

</figure>

<xr nolink id="psi_10it_10e10"/>
Distribution of logarithmic E_Values for the four different PSIBlast runs

</figure>

HHBlits

Running HHBlits with 2 iterations yields a small number of hits (274) with E-values ranging from very low (2e-110) to comparatively high (0.0011). To increase the number of hits, we repeated the HHBlits search with the maximum of 8 iterations, which resulted in a broader output with more hits and lower E-values on average (compare figure ??). Regarding the sequence identity distribution, running HHBlits with 8 iterations results in more distantly related hits (see Figure ??).

Figure ??
Sequence identity distributions of HHBlits run with 2 and with 8 iterations.

Figure ??
Logarithmic E-value distributions of HHBlits run with 2 and with 8 iterations.

Overlap

As one can see in Figure ??, roughly 40 percent of the resulting hits are unique to a single method. About 25 percent of the hits are found by at least half of the searches (overlap of 50 percent); we consider these significant hits that could be investigated further.

Figure ??
Distribution of overlapping hits for the eight different sequence searches.

Default E-values: as could be expected, the hits of the normal BLAST search are mostly contained in the PSI-BLAST search with two iterations. HHBlits found a large number of different hits, sharing only 48 of its 274 hits with the BLAST searches.

Taking PSI-BLAST with 10 iterations into account brings in a large number of sequences common to all three searches (110), which could be interesting since there seems to be high conservation among them.

Strict E-values for PSI-BLAST and default E-value for HHBlits with 2 iterations: the number of common hits among all three is now substantially lower, while PSI-BLAST with two and ten iterations share a great number of their hits.

Increasing the number of HHBlits iterations yields more hits for HHBlits, but does not increase the number of hits shared with PSI-BLAST with 2 or 10 iterations. However, 10 sequences are common and could be interesting for further investigation.
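
The overlap analysis described in this section amounts to counting, for every hit, how many of the searches reported it. The following is a minimal sketch; the method names and hit identifiers are made up for illustration.

```python
from collections import Counter

def method_support(hits_by_method):
    """Map each hit ID to the number of searches that reported it."""
    support = Counter()
    for ids in hits_by_method.values():
        support.update(set(ids))  # each search counts a hit at most once
    return support

def supported_by(support, min_methods):
    """Hit IDs found by at least `min_methods` searches, sorted by ID."""
    return sorted(hit for hit, n in support.items() if n >= min_methods)

# Invented hit lists, for illustration only.
hits_by_method = {
    "blastp": ["A", "B", "C"],
    "psiblast_2it": ["A", "B", "D"],
    "hhblits_2it": ["A", "E"],
}
support = method_support(hits_by_method)
unique_hits = sorted(hit for hit, n in support.items() if n == 1)
print("found by all three:", supported_by(support, 3))
print("unique to one search:", unique_hits)
```

With eight searches, a 50 percent overlap corresponds to calling supported_by with min_methods=4.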

Further Evaluation

We tried to further validate the sequence search hits via structural similarity. Unfortunately, none of the resulting hits was a PDB hit. We also tried to map the sequence identifiers against the UniProtKB/Swiss-Prot PDB cross-references (http://www.uniprot.org/docs/pdbtosp.txt). This mapping yielded no results either, which is why we cannot include any structural information in our ongoing research. When inspecting the annotation of the sequence hits, we found that the majority of the hits are Aspartoacylases or the closely related Succinylglutamate Desuccinylases. Since a crystal structure of the human Aspartoacylase already exists, it is plausible that no further structures have been solved for this class of proteins. Additionally, a large number of hits are not yet characterized proteins, which are unlikely to be attractive targets for crystallization.

Multiple Sequence Alignments

To generate our dataset for the MSA, we clustered all hits into sequence identity groups:

>90%: 1
60-89%: 59
40-59%: 197
20-39%: 1141

Since we got only one hit with a sequence identity >90%, we decided to group our hits as follows: three groups of sequences with eight members each:

60-99%
40-59%
20-39%

From the respective groups, we chose hits that were found by at least 4 methods (overlap of 50%).
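
The grouping and selection steps above can be sketched as follows. The text does not say how the eight members were picked among the qualifying hits, so taking the first eight in input order is an assumption; the hit tuples are invented.

```python
def identity_group(identity):
    """Assign a sequence identity (in percent) to one of the three groups."""
    if identity >= 60:
        return "60-99%"
    if identity >= 40:
        return "40-59%"
    if identity >= 20:
        return "20-39%"
    return None  # below 20%: not used for the MSA

def select_for_msa(hits, min_methods=4, per_group=8):
    """hits: iterable of (hit_id, identity_percent, number_of_methods)."""
    groups = {"60-99%": [], "40-59%": [], "20-39%": []}
    for hit_id, identity, n_methods in hits:
        group = identity_group(identity)
        if group is None or n_methods < min_methods:
            continue
        if len(groups[group]) < per_group:  # assumption: keep the first eight
            groups[group].append(hit_id)
    return groups

# Invented hits, for illustration only.
hits = [("a", 72, 5), ("b", 86, 3), ("c", 43, 4),
        ("d", 31, 6), ("e", 15, 8), ("f", 64, 4)]
selection = select_for_msa(hits)
print(selection)
```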

id       eVal  identity

# 60-99% sequence identity
Q8BZC2	1.7e-25	90
E1BVP5	e-140	72
H2RVG4	e-141	63
G3VM93	e-105	72
F6ZFQ0	e-139	78
F8WFU8	e-145	86
Q28C61	e-132	68
H2M5L4	e-133	64

# 40-59% sequence identity
G5BTW1	e-133	43
G6FRX8	e-103	39
F7NV91	e-112	39
G1Q6P7	e-120	42
H0WH68	e-135	44
F2PFG6	e-119	40
H2MX25	5e-81	40
Q1Z2X2	e-115	38

# 20-39% sequence identity
Q2F9Q7	e-109	31
Q8YQC1	e-117	41
E1SMZ8	e-108	39
D7E1T3	e-110	36
A5GQV1	7e-92	33
E8LP14	e-107	31
F9TUZ3	e-106	30
A6VUE4	e-101	35

General Results

All in all, the three alignment methods yield comparable results. One can identify several conserved regions. In particular, the two groups with sequence identities <60% show very similar MSAs.

There are three strongly conserved motifs located in the first half of the sequences:

GGTHGNE
DLNR
DLHNT

For the second half of the sequence alignments, there is no clear consensus on conserved motifs, but several residues are strongly conserved and may be of functional or structural importance.

In the alignment of the >60% group, the first two motifs are not colored. This is due to two very short sequences, which produce gaps in the alignment and thus lower the consensus.
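
The effect of gapped sequences on the consensus can be illustrated with a small conservation check. The sketch below marks a column as conserved when its most frequent residue occurs in more than 65% of all sequences, counting gaps toward the column height; this is an assumption about how the coloring threshold behaves, and the toy alignments are invented.

```python
from collections import Counter

def conserved_columns(alignment, threshold=0.65):
    """Return (column_index, residue) for columns conserved above `threshold`.

    The fraction is computed over all sequences, so gaps lower it.
    """
    n_sequences = len(alignment)
    conserved = []
    for index, column in enumerate(zip(*alignment)):
        counts = Counter(c for c in column if c != "-")
        if not counts:
            continue  # column consists of gaps only
        residue, count = counts.most_common(1)[0]
        if count / n_sequences > threshold:
            conserved.append((index, residue))
    return conserved

# Toy alignments, for illustration only.
full_length = ["DLNR", "DLNR", "DLNR", "DLNK"]
with_short_sequences = ["DLNR", "DLNR", "--NR", "--NK"]
print(conserved_columns(full_length))
print(conserved_columns(with_short_sequences))
```

In the second toy alignment the first two columns drop to 50% because two sequences contribute only gaps there, which mirrors the uncolored motifs in the >60% group.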

ClustalW

command: clustalw -align -infile=./db_over60.fa -outfile=./clustalw_msa_60.aln

Jalview Representation of the ClustalW Alignment with the dataset with 20-39% sequence identity. Colored are conserved residues (>65%).

Jalview Representation of the ClustalW Alignment with the dataset with 40-59% sequence identity. Colored are conserved residues (>65%).

Jalview Representation of the ClustalW Alignment with the dataset with 60-100% sequence identity. Colored are conserved residues (>65%).

TCoffee

Jalview Representation of the T-Coffee Alignment with the dataset with 20-39% sequence identity. Colored are conserved residues (>65%).

Jalview Representation of the T-Coffee Alignment with the dataset with 40-59% sequence identity. Colored are conserved residues (>65%).

Jalview Representation of the T-Coffee Alignment with the dataset with 60-100% sequence identity. Colored are conserved residues (>65%).

Muscle

Jalview Representation of the Muscle Alignment with the dataset with 20-39% sequence identity. Colored are conserved residues (>65%).

Jalview Representation of the Muscle Alignment with the dataset with 40-59% sequence identity. Colored are conserved residues (>65%).

Jalview Representation of the Muscle Alignment with the dataset with 60-100% sequence identity. Colored are conserved residues (>65%).

Concerning the wildtype human Aspartoacylase

The three identified motifs can also be found in the wildtype protein (compare coloring of sequence on top of page). We colored the respective residues in the structure. They are all positioned in the same region of the protein and thus might indicate an important functional region.
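
Locating the motifs in a sequence is a plain substring search. The following minimal sketch uses a short made-up sequence, not the real Aspartoacylase sequence, so the reported positions are purely illustrative.

```python
MOTIFS = ["GGTHGNE", "DLNR", "DLHNT"]

def motif_positions(sequence, motifs):
    """Return the 1-based start positions of all exact matches of each motif."""
    positions = {}
    for motif in motifs:
        found = []
        start = sequence.find(motif)
        while start != -1:
            found.append(start + 1)  # convert to 1-based residue numbering
            start = sequence.find(motif, start + 1)
        positions[motif] = found
    return positions

# Made-up toy sequence, for illustration only.
toy_sequence = "MAAGGTHGNEKKDLNRAAADLHNTWW"
positions = motif_positions(toy_sequence, MOTIFS)
for motif, starts in positions.items():
    print(motif, starts)
```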

Human Aspartoacylase (PDB 2O53) with the motif residues DLHNT

Human Aspartoacylase (PDB 2O53) with the motif residues DLNR

Human Aspartoacylase (PDB 2O53) with the motif residues GGTHGNE.


@@ Line 199: / Line 199: @@
 Increasing the amount of iterations performed in a PSI-Blast search, obviously increases the running time. One can see, that the best ranked hits of the runs with 10 iterations have lower E-Values than the best hits of the runs with less iterations. Yet, the result includes a larger amount of significant hits with higher E-Values. This means, increasing the iterations finds further distantly related sequences, which is the expected outcome.
-This outcome is also represented in the distribution of sequence identities. As one can see in figure ??, running PSI-Blast with 10 iterations results in hits with a lower sequence identity to our query sequence than the hits from the run with 2 iterations.
+This outcome is also represented in the distribution of sequence identities. As one can see in <xr id="PSI_10e10_seqd"/>, running PSI-Blast with 10 iterations firstly results in more significant hits and secondly most hits have lower sequence identity compared to the run with two iterations.
+Comparing the effect of the E-Value restriction for inclusion of sequences into the profile, one finds that hits with lower sequence identity are included in the final hit list when the default cutoff of 0.002 is used. Furthermore, when the more restrictive cutoff is used, simply less hits are being found (see <xr id="PSI_10it_seqid"/>).