Difference between revisions of "Canavan Task 2 - Sequence alignments"
(→Psi Blast) |
(→HHBLITS) |
||
(147 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
+ | ==Protocol== |
||
− | == Task 2 : Canavans Disease == |
||
− | Sorry, guys, we're a bit behind schedule! Hope to have everything finished before 11pm tonight (Monday) and hope that's early enough for you to read. |
||
− | Sorry again! |
||
− | Susi and Fanny |
||
+ | Further information can be found in the [[CD_task2_protocol|protocol]]. |
||
− | <!-- I'm sitting on ours with a wide-open end anyway, so you don't need to worry at all :) regards, alice --> |
||
− | === Sequence === |
||
+ | == <div id="go_terms">GO Term Enrichment</div> == |
||
− | The native ASPA sequence that we used for the current task is shown below: |
||
+ | In the following, we perform several sequence searches with the protein Aspartoacylase (UniProt ID: P45381). To validate the hits, we look for GO classifications that the hits share with the query sequence. |
||
− | UniProt: P45381 |
||
+ | Aspartoacylase has 17 annotated GO terms (according to [http://www.ebi.ac.uk/QuickGO/ EMBL's QuickGO]): |
||
− | <code> |
||
− | >hsa:443 ASPA, ACY2, ASP; aspartoacylase; K01437 aspartoacylase [EC:3.5.1.15] (A)<br> |
||
− | MTSCHIAEEHIQKVAIFGGTHGNELTGVFLVKHWLENGAEIQRTGLEVKPFITNPRAVKK<br> |
||
− | CTRYIDCDLNRIFDLENLGKKMSEDLPYEVRRAQEINHLFGPKDSEDSYDIIFDLHNTTS<br> |
||
− | NMGCTLILEDSRNNFLIQMFHYIKTSLAPLPCYVYLIEHPSLKYATTRSIAKYPVGIEVG<br> |
||
− | PQPQGVLRADILDQMRKMIKHALDFIHHFNEGKEFPPCAIEVYKIIEKVDYPRDENGEIA<br> |
||
− | AIIHPNLQDQDWKPLHPGDPMFLTLDGKTIPLGGDCTVYPVFVNEAAYYEKKEAFAKTTK<br> |
||
− | LTLNAKSIRCCLH </code> |
||
+ | <figtable id="aspa_go_terms"> |
||
+ | <table cellspacing=0 align="center" cellpadding=5> |
||
+ | <caption align="center"><b><xr nolink id="aspa_go_terms"/></b> <br> The 17 annotated GO terms for Aspartoacylase (P45381) </caption> |
||
+ | <tr><td style="border-right:solid;border-bottom:solid" align="left"></td><td style="border-bottom:solid;" align="left"><b>GO ID</b></td><td style="border-bottom:solid;" align="left"><b>GO Name</b></td></tr> |
||
− | ---- |
||
+ | <tr><td style="border-right:solid;" align="left"><b>Cellular Component</b> |
||
− | === Search=== |
||
+ | </td><td> GO:0005634 </td> |
||
+ | <td>nucleus (3X)</td></tr> |
||
+ | <tr><td style="border-bottom:solid;border-right:solid;" align="left"></td> |
||
+ | <td style="border-bottom:solid;" align="left">GO:0005737 </td> |
||
+ | <td style="border-bottom:solid;" align="left">cytoplasm (4X)</td></tr> |
||
+ | <tr><td style="border-right:solid;" align="left"><b>Biological Process</b></td><td>GO:0006533</td><td> aspartate catabolic process</td></tr> |
||
− | ==== BLASTP ==== |
||
+ | <tr><td style="border-right:solid;" align="left"></td><td>GO:0008152</td><td> metabolic process</td></tr> |
||
− | We ran BlastP on student machines with the big_80 as a reference database. |
||
+ | <tr><td style="border-right:solid;" align="left"></td><td>GO:0022010</td><td> central nervous system myelination</td></tr> |
||
+ | <tr><td style="border-bottom:solid;border-right:solid;" align="left"></td><td style="border-bottom:solid;" align="left">GO:0048714</td><td style="border-bottom:solid;" align="left"> positive regulation of oligodendrocyte differentiation</td></tr> |
||
+ | <tr><td style="border-right:solid;" align="left"><b>Molecular Function</b></td><td>GO:0046872 </td><td>metal ion binding</td></tr> |
||
− | Command: |
||
+ | <tr><td style="border-right:solid;" align="left"></td><td>GO:0004046</td><td> aminoacylase activity</td></tr> |
||
− | <code>blastall -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P45381_wt.fasta -o blastp_p45381_wt_big80.out</code> |
||
+ | <tr><td style="border-right:solid;" align="left"></td><td>GO:0016787</td><td> hydrolase activity</td></tr> |
||
+ | <tr><td style="border-right:solid;" align="left"></td><td>GO:0016788</td><td> hydrolase activity, acting on ester bonds</td></tr> |
||
+ | <tr><td style="border-right:solid;" align="left"></td><td>GO:0016811</td><td> hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amides</td></tr> |
||
+ | <tr><td style="border-right:solid;" align="left"></td><td>GO:0019807</td><td> aspartoacylase activity</td></tr> |
||
+ | </table> |
||
+ | </figtable> |
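The validation idea described above (counting which GO terms of the query reappear among the hits) can be sketched as follows. This is a minimal illustration, not the script used for this task: the hit names and their annotations are hypothetical, and only the biological process and molecular function IDs from the table above are used as the query set.

```python
from collections import Counter

# Biological process and molecular function GO IDs of Aspartoacylase (P45381),
# copied from the table above.
QUERY_GO = {
    "GO:0006533", "GO:0008152", "GO:0022010", "GO:0048714",
    "GO:0046872", "GO:0004046", "GO:0016787", "GO:0016788",
    "GO:0016811", "GO:0019807",
}


def shared_go_counts(hit_annotations, query_go=QUERY_GO):
    """Count, per GO ID of the query, how many hits carry that ID.

    hit_annotations maps a hit name to an iterable of GO IDs.
    GO IDs that the query does not have are ignored.
    """
    counts = Counter()
    for go_ids in hit_annotations.values():
        # set() ensures a hit is counted at most once per GO ID
        counts.update(set(go_ids) & query_go)
    return counts


# Hypothetical hits, for illustration only.
example_hits = {
    "hitA": ["GO:0008152", "GO:0016788"],
    "hitB": ["GO:0008152", "GO:0008270"],
    "hitC": [],
}
example_counts = shared_go_counts(example_hits)
```

Hits without any annotation (like `hitC`) simply contribute nothing, which mirrors the distinction made below between all hits and hits with GO terms.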
||
+ | == Pairwise Sequence Search== |
||
+ | === BLASTP === |
||
− | <table border=1 cellspacing = 0> |
||
+ | |||
− | <tr><td><b>Parameters</b></td><td>default E-Value = 10</td><td> E-Value 10e-10</td></tr> |
||
+ | <table> |
||
− | <tr><td><b>results</b></td><td>196</td><td>94</td></tr> |
||
+ | <tr> |
||
− | <tr><td><b>best E-Value</b></td><td>1e-155</td><td>1e-155</td></tr> |
||
+ | <td width="70%"> |
||
− | <tr><td><b>worst E-Value</b></td><td>9.6</td><td>e-15</td></tr> |
||
+ | The BLAST search with default parameters yielded 196 results at the default E-Value cutoff of 10. Restricting the results to hits with E-Values below 0.002 leaves 104 hits. The best alignment is with an uncharacterized protein from ''Rattus norvegicus'' and has an E-Value of e-155. |
||
− | <tr><td><b>comment</b></td><td width=300>Most of the resulting proteins are Aspartoacylases of other species. Most of the results with EValue > e-15 are Succinylglutamate Desuccinylases, which catalyze a reaction similar to Aspartoacylase.</td><td>The results are the same as for the first run, just with an earlier cutoff</td></tr> |
||
+ | Most of the resulting proteins are Aspartoacylases of other species. There are also many uncharacterized proteins, like our best hit. Most of the results with E-Value > e-15 are Succinylglutamate Desuccinylases, which belong to the same protein family ([http://pfam.sanger.ac.uk/family/PF04952 Succinylglutamate Desuccinylase / Aspartoacylase family]) and catalyze a reaction similar to that of Aspartoacylase. |
||
+ | </td> |
||
+ | |||
+ | <td> |
||
+ | <figtable id="aspa_blastp"> |
||
+ | <table cellspacing=0 align="center" cellpadding=5> |
||
+ | <caption align="center"><b><xr nolink id="aspa_blastp"/></b> <br> Results from a BlastP search with Aspartoacylase using different E-Values.</caption> |
||
+ | <tr> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="left" width="160px"> <b>E-Value</b></td> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="center"> default E-Value (10)</td> |
||
+ | |||
+ | </tr> |
||
+ | <tr> |
||
+ | <td style="border-right:solid;" align="left"><b>results (EVal: 10)</b></td><td style="border-right:solid;" align="center">196</td> |
||
+ | </tr> |
||
+ | <tr> |
||
+ | <td style="border-right:solid;" align="left"><b>results (EVal: 2e-3)</b></td><td style="border-right:solid;" align="center">104</td> |
||
+ | </tr> |
||
+ | <tr> |
||
+ | <td style="border-right:solid;" align="left"><b>best E-Value</b></td><td style="border-right:solid;" align="center">1e-155</td> |
||
+ | </tr> |
||
</table> |
||
+ | </figtable> |
||
+ | </td></tr></table> |
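The significance filter used above (keep only hits with an E-Value below 0.002) can be sketched like this. It is an assumption-laden illustration: the hit list is hypothetical, and the helper names `parse_evalue` and `significant_hits` are made up for this example. The only BLAST-specific detail is that very small E-Values are reported in the shortened form `e-155`, which has to be read as `1e-155`.

```python
def parse_evalue(text):
    """Convert an E-Value string to float, accepting the short form 'e-155'."""
    text = text.strip()
    if text.lower().startswith("e"):
        # 'e-155' is shorthand for '1e-155'
        text = "1" + text
    return float(text)


def significant_hits(hits, cutoff=0.002):
    """Return the (name, evalue_string) pairs whose E-Value is below cutoff."""
    return [(name, e) for name, e in hits if parse_evalue(e) < cutoff]


# Hypothetical hit list, for illustration only.
example_hits = [
    ("hit1", "e-155"),
    ("hit2", "3e-4"),
    ("hit3", "9.6"),
    ("hit4", "0.002"),
]
kept = significant_hits(example_hits)
```

With a strict "less than" comparison, a hit exactly at the cutoff (`hit4`) is not counted as significant.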
||
− | ====PSIBLAST==== |
||
+ | ===PSIBLAST=== |
||
− | PSIBlast was used in the same fashion as BLAST, with the big_80 as the background database. |
||
− | Commands: |
||
+ | We run PSIBlast with four different parameter combinations: |
||
− | * Running 2 iterations and default E-Value 0.002 |
||
+ | * 2 iterations (j=2), default E-Value cutoff for inclusion of sequences into profile (h=0.002) |
||
− | **<code> blastpgp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P45381_wt.fasta -o psiblast_it2_p45381_wt_big80.out -j 2</code> |
||
+ | * 2 iterations (j=2), strict E-Value cutoff for inclusion of sequences into profile (h=10e-10) |
||
+ | * 10 iterations (j=10), default E-Value cutoff for inclusion of sequences into profile (h=0.002) |
||
+ | * 10 iterations (j=10), strict E-Value cutoff for inclusion of sequences into profile (h=10e-10) |
||
+ | Again, we considered only hits with an E-Value of up to 0.002 as significant. The PsiBlast run with default parameters (2 iterations, E-Value 0.002) yields 597 hits. The most significant hit is again the uncharacterized protein of ''Rattus norvegicus'', with an E-Value of e-142. When restricting profile inclusion to sequences with an E-Value of up to 10e-10, we still get 502 hits. The best hit is still the ''Rattus norvegicus'' protein. |
||
+ | In contrast to the simple Blast search, the PsiBlast runs with two iterations find more distantly related proteins. This can be seen in the large number of Succinylglutamate Desuccinylases that are found (though with higher E-Values). As already mentioned, they belong to the same Pfam family as Aspartoacylase. |
||
− | *2 iterations, more strict E-value cutoff of 10E-10 |
||
− | **<code> blastpgp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P45381_wt.fasta -o psiblast_it2_h10e10_p45381_wt_big80.out -j 2 -h 10e-10</code> |
||
+ | Increasing the number of iterations results in many more hits: more than 3000 hits with the less restrictive E-Value, compared with about 1500 with the more restrictive one. Interestingly, the best hits (''Rattus norvegicus'' again) are less significant (~e-70) than for the PsiBlast search with only 2 iterations (~e-140). The majority of the proteins found are now Succinylglutamate Desuccinylases, even among the most significant hits (the first Succinylglutamate Desuccinylase is ranked 15th). Orthologs of Aspartoacylase appear only among the first 15 significant results. Additionally, even more distant relatives are found, such as Zinc Carboxypeptidases, Carboxypeptidases and Endopeptidases. |
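The four PsiBlast parameter combinations described above can be enumerated programmatically. The sketch below only builds the argument lists and does not run anything. The `blastpgp` flags (`-d`, `-i`, `-o`, `-j`, `-h`) are the ones used in the earlier revision of this page; the database name and the output file naming scheme are placeholders chosen for this example.

```python
from itertools import product

ITERATIONS = (2, 10)                      # passed to blastpgp as -j
INCLUSION_CUTOFFS = ("0.002", "10e-10")   # passed to blastpgp as -h


def psiblast_commands(query, database):
    """Build one blastpgp argument list per (iterations, cutoff) combination."""
    commands = []
    for j, h in product(ITERATIONS, INCLUSION_CUTOFFS):
        out = f"psiblast_it{j}_h{h}.out"  # hypothetical naming scheme
        commands.append(
            ["blastpgp", "-d", database, "-i", query,
             "-o", out, "-j", str(j), "-h", h]
        )
    return commands


# "big_80" stands in for the real database path, which is not repeated here.
example_commands = psiblast_commands("P45381_wt.fasta", "big_80")
```

Note that the cutoff written as `10e-10` is numerically 1e-9; it is kept here exactly as it appears in the text.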
||
− | *10 iterations, default Evalue 0.002 |
||
− | ** <code> blastpgp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P45381_wt.fasta -o psiblast_it10_p45381_wt_big80.out -j 10</code> |
||
+ | <figtable id="aspa_psiblast"> |
||
+ | <table cellspacing=0 align="center" cellpadding=5> |
||
+ | <caption align="center"><b><xr nolink id="aspa_psiblast"/></b> <br> Results from a psiblast search with Aspartoacylase(UniProt: P45381) using different EValues and iterations.</caption> |
||
+ | <tr> |
||
− | *10 iterations, E-value cutoff 10E-10 |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="left"> <b>Parameters</b></td> |
||
− | ** <code> blastpgp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P45381_wt.fasta -o psiblast_it10_h10e10_p45381_wt_big80.out -j 10 -h 10e-10</code> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="center"><b> it2, def E-Value (h=2e-3)</b></td> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="center"><b> it2, E-Value h=10e-10</b></td> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="center"><b> it10 def E-Value (h=2e-3)</b></td> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="center"><b> it10 E-Value h=10e-10</b></td> |
||
+ | </tr> |
||
+ | |||
+ | |||
+ | <tr><td style="border-right:solid;" align="left"><b>time</b></td> <td style="border-right:solid;" align="center">~2m30</td> <td style="border-right:solid;" align="center">~2m30</td> <td style="border-right:solid;" align="center">~25m</td> <td style="border-right:solid;" align="center">~22m</td></tr> |
||
+ | |||
+ | <tr><td style="border-right:solid;" align="left"><b>best E-Value</b></td><td style="border-right:solid;" align="center">1e-142</td> <td style="border-right:solid;" align="center">1e-145</td> <td style="border-right:solid;" align="center">2e-46</td> <td style="border-right:solid;" align="center">e-68</td></tr> |
||
+ | |||
+ | <tr><td style="border-right:solid;" align="left"><b>results (EVal: 2e-3)</b></td> <td style="border-right:solid;" align="center">597</td> <td style="border-right:solid;" align="center">502</td> <td style="border-right:solid;" align="center">3211</td> <td style="border-right:solid;" align="center">1515</td></tr> |
||
+ | <tr><td style="border-right:solid;" align="left"><b>hits with GO terms </b></td> <td style="border-right:solid;" align="center">586</td> <td style="border-right:solid;" align="center">496</td> <td style="border-right:solid;" align="center">3152</td> <td style="border-right:solid;" align="center">1461</td></tr> |
||
− | <table border=1 cellspacing=0> |
||
− | <tr><td><b>Parameters</b></td> <td>it2, def E-Value (0.002)</td> <td>it2 E-Value 10e-10</td> <td>it10 def E-Value (0.002)</td><td>it10 E-Value 10e-10</td></tr> |
||
− | <tr><td><b>time</b></td> <td>~2m30</td> <td>~2m30</td> <td>~10m</td> <td>time: ~10m</td></tr> |
||
− | <tr><td><b>results</b></td> <td>500</td> <td>93</td> <td>500</td> <td>500</td></tr> |
||
− | <tr><td><b>best E-Value</b></td><td>1e-142</td> <td>1e-145</td> <td>5e-70</td> <td>7e-70</td></tr> |
||
− | <tr><td><b>worst E-Value</b></td><td>3e-4</td> <td> 2e-29</td> <td>8e-38</td> <td>1e-38</td></tr> |
||
− | <tr><td><b>comments</b></td><td>Results with best EValues are mostly Aspartoacylases, Sequences previously not found are mostly Succinylglutamate Desuccinylases</td><td>results mainly Aspartoacylases</td><td>- converged after 8 rounds <br> - most significant results include more Succinylglutamate Desuccinylases than Aspartoacylases</td><td>- all 10 iterations were done (no early convergence)<br> - aspartoacylases slightly more frequent in lower E-Values (< E-58), but no significant difference in E-Values for aspas and succis</td></tr> |
||
</table> |
||
+ | </figtable> |
||
+ | ===HHBLITS=== |
|
+ | For the HHblits search we used parameters analogous to those of the PsiBlast search, with 8 instead of 10 iterations for the long runs. This means: |
||
− | Run HHBlits on student machines with Uniprot20 database. |
||
+ | * 2 iterations (n=2), E-Value cutoff for inclusion in result alignment (e=0.002) |
||
− | Commands: |
||
+ | * 2 iterations (n=2), strict E-Value cutoff for inclusion in result alignment (e=10e-10) |
||
+ | * 8 iterations (n=8), E-Value cutoff for inclusion in result alignment (e=0.002) |
||
+ | * 8 iterations (n=8), strict E-Value cutoff for inclusion in result alignment (e=10e-10) |
||
+ | For the runs with 2 iterations, only a few clusters contain Aspartoacylases. Most other clusters consist of uncharacterized proteins or Succinylases. Yet the best hit in both runs is a cluster containing Aspartoacylase proteins, even though the E-Value of this best hit cluster differs between the two runs. |
||
− | *2 iterations: |
||
− | **<code> hhblits -i P45381_wt.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o hhblits_p45381_def.out </code> |
||
+ | For the runs with 8 iterations and an E-Value of 0.002, many more clusters are found, and these contain very distantly related proteins. The best hit is again a cluster containing Aspartoacylases. With the stricter E-Value of 10e-10, one obtains fewer hits (even fewer than for 2 iterations and E-Value = 0.002), many of them more closely related sequences, mostly Succinylases. |
||
− | *8 iterations: |
||
− | **<code> hhblits -i P45381_wt.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -n 8 -o hhblits_p45381_n10.out </code> |
||
+ | Note that for all runs the first five hit clusters are always the same, just with a slightly different ranking. |
||
− | -n number of iterations (def 2)<br> |
||
+ | |||
+ | <figtable id="aspa_hhblits"> |
||
+ | <table cellspacing=0 align="center" cellpadding=5> |
||
+ | <caption align="center"><b><xr nolink id="aspa_hhblits"/></b> <br> Results from a HHBlits search with Aspartoacylase(UniProt: P45381) using different E-Values and iterations.</caption> |
||
+ | |||
+ | <tr> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="left"> <b>Parameters</b></td> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="center"><b> it2, E-Value e=2e-3</b></td> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="center"><b> it2, E-Value e=10e-10</b></td> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="center"><b> it8, E-Value e=2e-3</b></td> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="center"><b> it8, E-Value e=10e-10</b></td> |
||
+ | </tr> |
||
+ | |||
+ | |||
+ | <tr><td style="border-right:solid;" align="left"><b>time</b></td> <td style="border-right:solid;" align="center">~2m20</td> <td style="border-right:solid;" align="center">~4m</td> <td style="border-right:solid;" align="center">~11m10</td> <td style="border-right:solid;" align="center">~10m55</td></tr> |
||
+ | |||
+ | <tr><td style="border-right:solid;" align="left"><b>best E-Value</b></td><td style="border-right:solid;" align="center">1e-107</td> <td style="border-right:solid;" align="center">2e-141</td> <td style="border-right:solid;" align="center">5.1e-74</td> <td style="border-right:solid;" align="center">5.5e-74</td></tr> |
||
+ | |||
+ | <tr><td style="border-right:solid;" align="left"><b>results (EVal: 2e-3)</b></td> <td style="border-right:solid;" align="center">305</td> <td style="border-right:solid;" align="center">76</td> <td style="border-right:solid;" align="center">776</td> <td style="border-right:solid;" align="center">246</td></tr> |
||
− | <table border=1 cellspacing = 0> |
||
− | <tr><td><b>Parameters</b></td><td>it 2</td><td> it 8</td></tr> |
||
− | <tr><td><b>time</b></td><td>2m50</td><td>~6m</td></tr> |
||
− | <tr><td><b>results</b></td><td>274</td><td>500</td></tr> |
||
− | <tr><td><b>best E-Value</b></td><td>2e-110</td><td>2.9e-68</td></tr> |
||
− | <tr><td><b>worst E-Value</b></td><td>0.0011</td><td>9.5e-09</td></tr> |
||
− | <tr><td><b>comment</b></td><td width=300>mixed results with Aspartoacylases and Succi</td><td>very varying results: Aspartoacylases, Succinylases, Zinc Proteins</td></tr> |
||
</table> |
||
+ | </figtable> |
||
− | == |
+ | ==Validation and Comparison== |
As expected, one can find more hits with Psi-Blast than with a simple Blast search. |
||
Line 103: | Line 156: | ||
* Succinylglutamate Desuccinylases |
||
− | ===Overlap=== |
||
− | As one can see in Figure ??, roughly 40 percent of the resulting hits are unique to each method. From our considerations, about 25 percent of the hits are significant hits, that could be further investigated (overlap of 50 percent). |
||
− | [[File:seq_overlap.png|thumb|300px|center|<b>Figure ??</b> <br> Distribution of overlapping Hits for the eight different used Sequence Searches.]] |
||
+ | ===BlastP=== |
||
+ | |||
+ | <figtable id="blastp_comp"> |
||
<table> |
||
+ | <tr><td width="60%"> |
||
− | <tr> |
||
+ | With a simple blast search we were able to identify the most closely related sequences. The most significant hit (''Rattus norvegicus'') has a sequence identity of 82%. <xr id="blastp_comp"/> shows the distribution of the sequence identity of all hits with E-Value < 0.002. Only a few hits have high sequence identities; the majority of hits have sequence identities of about 30%. |
||
− | <td>[[File:defaults.png|200px|thumb|Default E-Values: as could be expected, the normal BLAST search is mostly contained in the PsiBLAST search with two iterations. HHBlits found a large number of different hits, with only 48 out of 274 common hits in common with the BLAST searches.]]</td> |
||
− | <td>[[File:Psi2_psi10_blast_defs.png|200px|thumb|Taking PsiBLAST with 10 iterations into account brings in a large number of common sequences among the three searches (110), which could be interesting since there seems to be high conversation among them.]]</td> |
||
− | <td>[[File:blast_psi2_e10_hhblits.png|200px|thumb|Strict E-Values for PsiBLAST and default E-Value for HHBlits with 2 iterations: The number of common hits among all three is now substantially lower, while PsiBLAST with two and ten iterations share a great number of their hits. ]]</td> |
||
− | <td>[[File:Psi2_psi10_e10_hhblits_n10.png|200px|thumb|Increasing the number of HHBlits-iteration yields more hits for HHBlits, but does not increase the number of common hits with PSI-Blast in 2 or 10 iterations. However, 10 sequences are common and could be interesting for further investigation.]]</td> |
||
− | </tr></table> |
||
− | ===BlastP=== |
||
+ | 192 out of the 196 hits have annotated GO terms. As one can see in <xr id="go_blastp_def" />, all hits share many GO terms with Aspartoacylase. The most common GO terms shared with Aspartoacylase are "metabolic process" (185x) and "hydrolase activity, acting on ester bonds" (184x). The GO term "Zinc binding" is not officially associated with Aspartoacylase, yet we know that it does bind zinc. As already mentioned, some of the proteins found have succinylglutamate desuccinylase activity. They belong to the same family as Aspartoacylase, so their occurrence is not surprising. |
||
− | A simple blast search yields only about 90 significant hits if one considers a threshold of 10e-10 as a significance cutoff. |
||
+ | |||
− | As one can see in Figure ??, the restriction of the E-Value results in less hits with a low sequence similarity. |
||
+ | All 94 hits found with an E-Value cutoff of 10e-10 have annotated GO terms. Furthermore, all of these hits share the GO terms "hydrolase activity, acting on ester bonds" and "metabolic process". As one can see in <xr id="go_blastp_10e10" />, all hits share most of their GO terms with Aspartoacylase. Again, "Zinc binding" could also be associated with Aspartoacylase. With that, all GO terms that are found more than 5 times are also associated with Aspartoacylase. The results thus agree better with the GO annotation of Aspartoacylase, which is what one would expect when a stricter E-Value restricts the hits to more closely related proteins. |
||
− | [[File:CD_blast_seqid_distri.png|thumb|300px|center|Comparison of distribution of Sequence Identiy between the two BlastP runs ]] |
||
+ | |||
+ | </td> |
||
+ | <td> |
||
+ | <figure id="blastp_comp"> |
||
+ | [[File:CD_blast_seqid_distri.png|thumb|350px|center|<b><xr nolink id="blastp_comp"/></b><br>Distribution of Sequence Identity of the proteins found with a simple BlastP run with the Aspartoacylase sequence (P45381). ]]</figure> |
||
+ | </td></tr></table> |
||
+ | </figtable> |
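A sequence identity distribution like the one discussed above can be obtained by binning the percent identities of all significant hits. The following is a minimal sketch with made-up identity values; the bin width of 10 percentage points is an assumption for this example (it must divide 100) and is not necessarily the binning used for the figure.

```python
from collections import Counter


def identity_histogram(identities, bin_width=10):
    """Bin percent identities into [0, w), [w, 2w), ...; 100% goes into the last bin.

    Returns a dict mapping the lower bound of each non-empty bin to its count,
    sorted by lower bound. bin_width is assumed to divide 100.
    """
    counts = Counter()
    for value in identities:
        if not 0 <= value <= 100:
            raise ValueError(f"identity out of range: {value}")
        start = int(value // bin_width) * bin_width
        # put exactly 100% into the top bin instead of opening a new one
        start = min(start, 100 - bin_width)
        counts[start] += 1
    return dict(sorted(counts.items()))


# Hypothetical identities (in percent), for illustration only.
example_histogram = identity_histogram([82, 31, 29.5, 35, 100])
```

Plotting the resulting counts per bin gives the kind of overview used here to see that most hits cluster around moderate identities.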
||
+ | |||
+ | |||
+ | |||
+ | <figtable id="go_blastp"> |
||
+ | <table cellspacing=0 align="center" cellpadding=5> |
||
+ | <caption align="center"><b><xr nolink id="go_blastp"/></b> <br> GO term enrichment for the hits of a BlastP search with Aspartoacylase using different E-Values.</caption> |
||
+ | <tr> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="left" width="100px"> <b>E-Value</b></td> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="center"> default (10)</td> |
||
+ | <td style="border-bottom:solid;border-right:solid;" align="center"> 10e-10</td> |
||
+ | </tr> |
||
+ | |||
+ | <tr> |
||
+ | <td style="border-right:solid;" align="left" valign="top"><b>GO terms</b></td> |
||
+ | <td style="border-right:solid;" align="left" valign="top"> |
||
+ | <figure id="go_blastp_def"> |
||
+ | [[File:cd_go_blastp_def.png|thumb|center|450px|<b><xr nolink id="go_blastp_def"/></b><br>GO term enrichment for the hits found with BlastP using the default E-Value cutoff of 10. All GO terms that occur more than once are shown. GO terms that are identical to the GO annotation of Aspartoacylase (P45381) are colored red.]]</figure> |
||
+ | </td> |
||
+ | <td valign="top" align="left"> |
||
+ | <figure id="go_blastp_10e10"> |
||
+ | [[File:cd_go_blastp_10e10.png|thumb|center|350px|<b><xr nolink id="go_blastp_10e10"/></b><br>GO term enrichment for the hits found with BlastP using an E-Value cutoff of 10e-10. All GO terms that occur more than once are shown. GO terms that are identical to the GO annotation of Aspartoacylase (P45381) are colored red.]]</figure> |
||
+ | </td> |
||
+ | </tr> |
||
+ | </table> |
||
+ | </figtable> |
||
===Psi Blast=== |
||
Increasing the number of iterations performed in a PSI-Blast search obviously increases the running time. The best-ranked hits of the runs with 10 iterations have higher (less significant) E-Values than the best hits of the runs with fewer iterations. In addition, the result includes a larger number of significant hits with higher E-Values. This means that increasing the number of iterations finds more distantly related sequences, which is the expected outcome. |
||
− | This outcome is also represented in the distribution of sequence identities. As one can see in |
+ | This outcome is also reflected in the distribution of sequence identities. As one can see in <xr id="PSI_10e10_seqd"/>, running PSI-Blast with 10 iterations results in more significant hits, and most of these hits have lower sequence identity than those from the run with two iterations.
+ | When the E-Value cutoff for building the profile is not restricted, hits with lower sequence identity (meaning more distantly related) are included in the final hit list. This agrees with our observation that more hits are classified as Succinylglutamate Desuccinylases than as Aspartoacylases when using the default cutoff of 0.002. Furthermore, when the more restrictive cutoff is used, fewer hits are found (see <xr id="PSI_10it_seqid"/>). |
||
− | [[File:CD_PSI_10e10_seqd_distri.png|thumb|300px|center|<b>Figure ??</b><br> Distribution of Sequence Id between Psi-Blast runs with 2 iterations vs 10 iterations (using E-Value 10e-10)]] |
||
+ | The majority of the results from the runs with only two iterations have moderate sequence identities, with a broad distribution between 15% and 45%. In contrast, the results from the runs with 10 iterations split into two groups of hits, which form clusters at about 10% and between 30% and 40% sequence identity. |
||
− | When restricting the E-Value Cutoff for the profile built-up, we found that more hits are classified as Aspartoacylases than as Succinylglutamate Desuccinylases. The running time, as well as the E-Values of the resulting hits did not change significantly. Interisting is the split |
||
+ | |||
+ | These observations are also reflected in the E-Value distribution. For the runs with two iterations, some results cover the range of E-Values between e-145 and e-60, and the majority of hits have E-Values between e-8 and the cutoff at 0.002. The runs with 10 iterations result almost exclusively in less significant hits. For the run with the restricted cutoff, hits have E-Values ranging from e-60 to the cutoff at 0.002, with a peak for hits with E-Values close to the cutoff value of 0.002. The run with the default E-Value cutoff results in hits with E-Values ranging from e-30 to 0.002. |
||
<center> |
||
<table> |
||
− | <tr |
+ | <tr> |
+ | <td> |
||
− | [[File:CD_PSI_2it_seqid_distri.png |thumb|300px|center|<b>Figure ??</b><br> Distribution of Sequence Id for Psi-Blast runs with 2 iterations with different E-Values (def E-Value vs E-Value of10e-10)]] |
||
+ | <figure id="PSI_10e10_seqd"> |
||
+ | [[File:CD_PSI_10e10_seqd_distri.png|thumb|380px|center|<b><xr nolink id="PSI_10e10_seqd"/></b><br> Distribution of Sequence Id between Psi-Blast runs with 2 iterations against 10 iterations (using E-Value 10e-10)]]</figure> |
||
</td> |
||
+ | <td> |
||
− | <td>[[File:CD_PSI_10it_seqid_distri.png |thumb|300px|center|<b>Figure ??</b><br> Distribution of Sequence Id for Psi-Blast runs with 10 iterations with different E-Values (def E-Value vs E-Value of10e-10)]] |
||
+ | <figure id="PSI_10it_seqid"> |
||
+ | [[File:CD_PSI_10it_seqid_distri.png |thumb|340px|center|<b><xr nolink id="PSI_10it_seqid"/></b><br> Distribution of Sequence Id for Psi-Blast runs with 10 iterations with different E-Values (def E-Value against E-Value of 10e-10)]] |
||
+ | </figure> |
||
</td> |
||
<td> |
||
+ | <figure id="eval_distri_psiblast"> |
||
− | [[File:cd_eval_distri_psiblast.png|thumb|300px|center|<b>Figure ??</b><br> Distribution of logarithmic E_Values for the four different PSIBlast runs]]</td> |
||
+ | [[File:cd_eval_distri_psiblast.png|thumb|380px|center|<b><xr nolink id="eval_distri_psiblast"/></b><br> Distribution of E-Values for the four different PsiBlast runs]] |
||
+ | </figure> |
||
+ | </td> |
||
+ | </tr> |
||
+ | </table> |
||
+ | </center> |
||
+ | |||
+ | |||
+ | For the run with 2 iterations and the default cutoff value of 0.002, we got 915 hits, of which we considered 597 significant (E-Value cutoff 2e-3). 586 of these significant proteins have GO terms annotated. As one can see in <xr id="psi_2it_def"/>, the four most frequently occurring GO terms are shared with Aspartoacylase. The fifth most frequent GO term is "Zinc ion binding", which, as already mentioned, is not annotated for Aspartoacylase even though it does bind zinc. Other frequent terms concern processes involving arginine, which might stem from the Succinylglutamate Desuccinylases. |
||
+ | |||
+ | A similar GO term statistic is obtained for the run with two iterations and an E-Value cutoff of 10e-10 (see <xr id="psi_2it_10e10"/>). Here we received 835 hits, of which we considered 502 proteins significant; 496 of them have GO terms annotated. |
||
+ | |||
+ | For the PsiBlast run with 10 iterations and the default cutoff, however, the GO term analysis looks quite different (see <xr id="psi_10it_def"/>). This run resulted in 3211 hits, of which 3152 proteins had annotated GO terms. The most frequent GO term is "Zinc ion binding", which might as well be associated with Aspartoacylase. Other frequently occurring GO terms belong to other protein families such as Succinylglutamate Desuccinylases or Carboxypeptidases, which we also observed when checking the hit list. |
||
+ | |||
+ | Interestingly, when restricting the E-Value to 10e-10, the most frequently occurring GO terms are again shared with Aspartoacylase (see <xr id="psi_10it_10e10"/>). This underlines the effect of finding closer relatives, which was already visible in the higher sequence identity of the hits from the restricted run. |
||
+ | |||
+ | <center> |
||
+ | <table> |
||
+ | <tr> |
||
+ | <td> |
||
+ | <figure id="psi_2it_def"> |
||
+ | [[File:cd_go_psiblast_2_def.png|thumb|280px|center|<b><xr nolink id="psi_2it_def"/></b><br> GO Term Enrichment for Hits of the PsiBlast run with 2 iterations and E-Value cutoff at 0.002. Most hits share several GO terms with Aspartoacylase.]]</figure> |
||
+ | </td> |
||
+ | <td> |
||
+ | <figure id="psi_2it_10e10"> |
||
+ | [[File:cd_go_psiblast_2_10e10.png|thumb|250px|center|<b><xr nolink id="psi_2it_10e10"/></b><br> GO Term Enrichment for Hits of the PsiBlast run with 2 iterations and E-Value cutoff at 10e-10. Most hits share several GO terms with Aspartoacylase.]] |
||
+ | </figure> |
||
+ | </td> |
||
+ | <td> |
||
+ | <figure id="psi_10it_def"> |
||
+ | [[File:cd_go_psiblast_10_def.png|thumb|280px|center|<b><xr nolink id="psi_10it_def"/></b><br>GO Term Enrichment for Hits of the PsiBlast run with 10 iterations and E-Value cutoff at 0.002. Most hits have different GO terms than Aspartoacylase, since they are mainly Succinylglutamate Desuccinylases. Yet, a second class of hits shares some GO terms with Aspartoacylase.]] |
||
+ | </figure> |
||
+ | </td> |
||
+ | <td> |
||
+ | <figure id="psi_10it_10e10"> |
||
+ | [[File:cd_go_psiblast_10_10e10.png|thumb|300px|center|<b><xr nolink id="psi_10it_10e10"/></b><br> GO Term Enrichment for Hits of the PsiBlast run with 10 iterations and E-Value cutoff at 10e-10. Most hits share GO terms with Aspartoacylase.]] |
||
+ | </figure> |
||
+ | </td> |
||
</tr> |
</tr> |
||
</table> |
</table> |
||
Line 146: | Line 271: | ||
===HHBlits=== |
===HHBlits=== |
||
− | Running HHBlits with 2 iterations yields |
+ | Running HHBlits with 2 iterations yields 305 clusters; when the E-Value for inclusion of sequences in further iterations is restricted to 10e-10, only 76 clusters remain. The best hits have very low E-Values and can thus be considered very significant. To increase the number of hits, we repeated the HHBlits search with the maximum of 8 iterations, which resulted in a broader output with more hits with lower averaged E-Values (compare <xr id="hhblits_eval_distri"/>). |
− | Regarding the Sequence Identity distribution, running HHBlits with 8 iterations results in more distant related Hits (see |
+ | Regarding the sequence identity distribution, running HHBlits with 8 iterations results in more distantly related hits (see <xr id="hhblits_seq_distri"/>). |
+ | <table> |
||
− | [[File:CD_HHBlits_seqid_distri.png|thumb|300px|center|<b>Figure ??</b><br> Sequence identity distributions of HHBlits run with 2 and with 8 iterations.]] |
||
+ | <tr> |
||
+ | <td> |
||
+ | <figure id="hhblits_seq_distri">[[File:CD_HHBlits_seqid_distri.png|thumb|300px|<b><xr nolink id="hhblits_seq_distri"/></b><br> Sequence identity distributions of HHBlits run with 2 and with 8 iterations.]]</figure > |
||
+ | </td> |
||
+ | <td><figure id="hhblits_eval_distri">[[File:cd_eval_distri_HHblits.png|thumb|300px|<b><xr nolink id="hhblits_eval_distri"/></b><br> Logarithmic E-Value distributions of HHBlits run with 2 and with 8 iterations.]]</figure></td> |
||
+ | </tr> |
||
+ | </table> |
||
+ | ===Overlap=== |
||
− | ===Comparison of found sequences === |
||
+ | As one can see in <xr nolink id="overlap_distri"/>, roughly 40 percent of the resulting hits are unique to a single method. By our criterion, about 25 percent of the hits are significant hits that could be investigated further, i.e. they were found by at least half of the searches (overlap of 50 percent). |
||
+ | <figure id="overlap_distri">[[File:seq_overlap.png|thumb|300px|center|<b><xr nolink id="overlap_distri"/></b> <br> Distribution of overlapping Hits for the eight different used Sequence Searches.]]</figure> |
||
+ | |||
+ | |||
+ | <table> |
||
+ | <tr> |
||
+ | <td>[[File:defaults.png|200px|thumb|Default E-Values: as could be expected, the normal BLAST search is mostly contained in the PsiBLAST search with two iterations. HHBlits found a large number of different hits, with only 48 out of 274 hits in common with the BLAST searches.]]</td> |
||
+ | <td>[[File:Psi2_psi10_blast_defs.png|200px|thumb|Taking PsiBLAST with 10 iterations into account brings in a large number of common sequences among the three searches (110), which could be interesting since there seems to be high conservation among them.]]</td> |
||
+ | <td>[[File:blast_psi2_e10_hhblits.png|200px|thumb|Strict E-Values for PsiBLAST and default E-Value for HHBlits with 2 iterations: The number of common hits among all three is now substantially lower, while PsiBLAST with two and ten iterations share a great number of their hits. ]]</td> |
||
+ | <td>[[File:Psi2_psi10_e10_hhblits_n10.png|200px|thumb|Increasing the number of HHBlits iterations yields more hits for HHBlits, but does not increase the number of hits shared with PSI-Blast with 2 or 10 iterations. However, 10 sequences are common and could be interesting for further investigation.]]</td> |
||
+ | </tr></table> |
||
+ | |||
+ | == Multiple Sequence Alignments == |
||
+ | |||
+ | For generating our dataset for the MSA we clustered all Hits into Sequence Identity groups: |
||
+ | * >90%: 1 |
||
+ | *60-89%: 59 |
||
+ | *40-59%: 197 |
||
+ | *20-39%: 1141 |
||
+ | |||
+ | Since we only got one hit with a sequence identity >90%, we decided to group our hits as follows: |
||
+ | three groups of sequences with eight members each: |
||
+ | *60-99% |
||
+ | *40-59% |
||
+ | *20-39% |
||
+ | |||
+ | We chose hits from the respective groups that were found by at least 4 methods (overlap of 50%). |
||
+ | |||
+ | |||
+ | id eVal identity |
||
+ | |||
+ | # 60-99% sequence identity |
||
+ | Q8BZC2 1.7e-25 90 |
||
+ | E1BVP5 e-140 72 |
||
+ | H2RVG4 e-141 63 |
||
+ | G3VM93 e-105 72 |
||
+ | F6ZFQ0 e-139 78 |
||
+ | F8WFU8 e-145 86 |
||
+ | Q28C61 e-132 68 |
||
+ | H2M5L4 e-133 64 |
||
+ | |||
+ | # 40-59% sequence identity |
||
+ | G5BTW1 e-133 43 |
||
+ | G6FRX8 e-103 39 |
||
+ | F7NV91 e-112 39 |
||
+ | G1Q6P7 e-120 42 |
||
+ | H0WH68 e-135 44 |
||
+ | F2PFG6 e-119 40 |
||
+ | H2MX25 5e-81 40 |
||
+ | Q1Z2X2 e-115 38 |
||
+ | |||
+ | # 20-39% sequence identity |
||
+ | Q2F9Q7 e-109 31 |
||
+ | Q8YQC1 e-117 41 |
||
+ | E1SMZ8 e-108 39 |
||
+ | D7E1T3 e-110 36 |
||
+ | A5GQV1 7e-92 33 |
||
+ | E8LP14 e-107 31 |
||
+ | F9TUZ3 e-106 30 |
||
+ | A6VUE4 e-101 35 |
||
+ | |||
+ | |||
+ | ===General Results=== |
||
+ | |||
+ | All in all the three Alignment methods yield comparable results. One can identify several conserved regions. Especially the two groups with sequence identities <60% show very similar MSAs. |
||
+ | |||
+ | There are three strongly conserved motifs located in the first half of the sequences: |
||
+ | *GGTHGNE |
||
+ | *DLNR |
||
+ | *DLHNT |
||
+ | |||
+ | For the second half of the sequence alignments there is no clear consensus on conserved motifs, but several residues are strongly conserved and may be of functional or structural importance. |
||
+ | |||
+ | In the alignment of the >60% group the first two motifs are not colored in the alignment. This is due to two very short sequences which produce gaps in the alignment and thus lower the consensus. |
||
+ | |||
+ | ===ClustalW=== |
||
+ | command: <code> clustalw -align -infile=./db_over60.fa -outfile=./clustalw_msa_60.aln </code> |
||
+ | |||
+ | |||
+ | |||
+ | <table> |
||
+ | <tr> |
||
+ | <td>[[File:clustal_20.png|thumb|300px|Jalview Representation of the ClustalW Alignment with the dataset with 20-39% sequence identity. Colored are conserved residues (>65%).]]</td> |
||
+ | |||
+ | <td>[[File:clustal_40.png|thumb|300px|Jalview Representation of the ClustalW Alignment with the dataset with 40-59% sequence identity. Colored are conserved residues (>65%).]]</td> |
||
+ | |||
+ | <td>[[File:clustal_60.png|thumb|300px|Jalview Representation of the ClustalW Alignment with the dataset with 60-100% sequence identity. Colored are conserved residues (>65%).]]</td> |
||
+ | </tr> |
||
+ | </table> |
||
+ | |||
+ | ===TCoffee=== |
||
+ | <table> |
||
+ | <tr> |
||
+ | <td>[[File:tcoffee_20.png|thumb|300px|Jalview Representation of the T-Coffee Alignment with the dataset with 20-39% sequence identity. Colored are conserved residues (>65%).]]</td> |
||
+ | |||
+ | <td>[[File:tcoffee_40.png|thumb|300px|Jalview Representation of the T-Coffee Alignment with the dataset with 40-59% sequence identity. Colored are conserved residues (>65%).]]</td> |
||
+ | |||
+ | <td>[[File:tcoffee_60.png|thumb|300px|Jalview Representation of the T-Coffee Alignment with the dataset with 60-100% sequence identity. Colored are conserved residues (>65%).]]</td> |
||
+ | </tr> |
||
+ | </table> |
||
+ | |||
+ | ===Muscle=== |
||
+ | |||
+ | <table> |
||
+ | <tr> |
||
+ | <td>[[File:muscle_20.png|thumb|300px|Jalview Representation of the Muscle Alignment with the dataset with 20-39% sequence identity. Colored are conserved residues (>65%).]]</td> |
||
+ | |||
+ | <td>[[File:muscle_40.png|thumb|300px|Jalview Representation of the Muscle Alignment with the dataset with 40-59% sequence identity. Colored are conserved residues (>65%).]]</td> |
||
+ | |||
+ | <td>[[File:muscle_60.png|thumb|300px|Jalview Representation of the Muscle Alignment with the dataset with 60-100% sequence identity. Colored are conserved residues (>65%).]]</td> |
||
+ | </tr> |
||
+ | </table> |
||
+ | |||
+ | ===Concerning the wildtype human Aspartoacylase=== |
||
+ | |||
+ | The three identified motifs can also be found in the wildtype protein (compare the coloring of the sequence at the top of the page). We colored the respective residues in the structure. They are all located in the same region of the protein and thus might indicate an important functional region. |
||
+ | |||
+ | <table> |
||
+ | <tr> |
||
+ | <td>[[File:DLHNT.png|thumb|300px|Human Aspartoacylase (PDB 2O53) with the motif residues DLHNT]]</td> |
||
+ | |||
+ | <td>[[File:DLNR.png|thumb|300px|Human Aspartoacylase (PDB 2O53) with the motif residues DLNR]]</td> |
||
+ | |||
+ | <td>[[File:GGLNR.png|thumb|300px|Human Aspartoacylase (PDB 2O53) with the motif residues GGTHGNE.]]</td> |
||
+ | </tr> |
||
+ | </table> |
Latest revision as of 17:49, 31 August 2012
Protocol
Further information can be found in the protocol.
GO Term Enrichment
In the following we are performing different sequence searches with the protein Aspartoacylase (UniProt ID: P45381). In order to validate the found hits, we are looking for common GO classifications of the hits with the query sequence.
For our protein Aspartoacylase there are 17 annotated GO terms (using EMBLs QuickGO):
<figtable id="aspa_go_terms">
Category | GO ID | GO Name |
Cellular Compartment | GO:0005634 | nucleus (3X) |
Cellular Compartment | GO:0005737 | cytoplasm (4X) |
Biological Process | GO:0006533 | aspartate catabolic process |
Biological Process | GO:0008152 | metabolic process |
Biological Process | GO:0022010 | central nervous system myelination |
Biological Process | GO:0048714 | positive regulation of oligodendrocyte differentiation |
Molecular Function | GO:0046872 | metal ion binding |
Molecular Function | GO:0004046 | aminoacylase activity |
Molecular Function | GO:0016787 | hydrolase activity |
Molecular Function | GO:0016788 | hydrolase activity, acting on ester bonds |
Molecular Function | GO:0016811 | hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amides |
Molecular Function | GO:0019807 | aspartoacylase activity |
</figtable>
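The validation idea described above (checking which GO terms of the hits are shared with the query) can be sketched in a few lines of Python. This is only an illustration, not the script we used: the hit names and their annotations are hypothetical, and `QUERY_GO` holds just a subset of the GO terms listed for Aspartoacylase.

```python
from collections import Counter

# Subset of the GO terms annotated for Aspartoacylase (see table above).
QUERY_GO = {"GO:0008152", "GO:0016787", "GO:0016788", "GO:0019807", "GO:0046872"}


def go_term_counts(hit_annotations):
    """Count in how many hits each GO term occurs."""
    counts = Counter()
    for terms in hit_annotations.values():
        counts.update(set(terms))  # count each term once per hit
    return counts


def shared_fraction(hit_annotations, query_terms):
    """Fraction of annotated hits that share at least one GO term with the query."""
    annotated = [set(terms) for terms in hit_annotations.values() if terms]
    if not annotated:
        return 0.0
    sharing = sum(1 for terms in annotated if terms & query_terms)
    return sharing / len(annotated)


# Hypothetical hits; GO:0008270 is "zinc ion binding".
hits = {
    "hit_A": ["GO:0016787", "GO:0016788", "GO:0019807"],
    "hit_B": ["GO:0016788", "GO:0008270"],
    "hit_C": ["GO:0008270"],
    "hit_D": [],  # hit without GO annotation
}

counts = go_term_counts(hits)
for term, n in counts.most_common():
    label = "shared with query" if term in QUERY_GO else "not annotated for query"
    print(term, n, label)
print("fraction of annotated hits sharing a term:", shared_fraction(hits, QUERY_GO))
```

Counting each term once per hit keeps a single heavily annotated protein from dominating the statistic.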
Pairwise Sequence Search
BLASTP
The BLAST search with default parameters yielded 196 results when using the default E-Value cutoff of 10. Being more restrictive and considering only hits with E-Values below 0.002, we get 104 hits. The best alignment is with an uncharacterized protein from Rattus norvegicus and has an E-Value of e-155. Most of the resulting proteins are Aspartoacylases of other species. There are also many uncharacterized proteins, like our best hit. Most of the results with E-Values > e-15 are Succinylglutamate Desuccinylases, which belong to the same protein family (Desuccinylase / Aspartoacylase family) and catalyze a reaction similar to that of Aspartoacylase. |
<figtable id="aspa_blastp">
</figtable> |
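The significance filter described above (keeping only hits with an E-Value below 0.002) could look roughly like the following Python sketch. It assumes the hits are already available as (id, E-Value string) pairs; the hit names are hypothetical. Note that the shorthand used on this page (e.g. e-155 for 1e-155) cannot be passed to `float()` directly, so the helper prepends the missing mantissa.

```python
def parse_evalue(text):
    """Parse an E-Value string, accepting the shorthand 'e-155' for 1e-155."""
    text = text.strip()
    if text.lower().startswith("e"):
        text = "1" + text  # 'e-155' -> '1e-155'
    return float(text)


def significant_hits(hits, cutoff=0.002):
    """Return (id, E-Value) pairs whose E-Value lies below the cutoff."""
    parsed = ((hit_id, parse_evalue(raw)) for hit_id, raw in hits)
    return [(hit_id, evalue) for hit_id, evalue in parsed if evalue < cutoff]


# Hypothetical hit list for illustration only.
raw_hits = [
    ("hit_1", "e-155"),
    ("hit_2", "3e-20"),
    ("hit_3", "0.001"),
    ("hit_4", "0.5"),
    ("hit_5", "10"),
]

for hit_id, evalue in significant_hits(raw_hits):
    print(hit_id, evalue)
```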
PSIBLAST
We ran PSIBlast with four different parameter combinations:
- 2 iterations (j=2), default E-Value cutoff for inclusion of sequences into profile (h=0.002)
- 2 iterations (j=2), strict E-Value cutoff for inclusion of sequences into profile (h=10e-10)
- 10 iterations (j=10), default E-Value cutoff for inclusion of sequences into profile (h=0.002)
- 10 iterations (j=10), strict E-Value cutoff for inclusion of sequences into profile (h=10e-10)
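The four parameter combinations above can be generated programmatically. The sketch below only builds the argument lists and does not run anything. It assumes the legacy `blastpgp` program, whose `-j` (iterations) and `-h` (E-Value threshold for inclusion in the profile) options match the parameter names used above; the query file name, the database name and the output file names are hypothetical placeholders.

```python
from itertools import product

ITERATIONS = (2, 10)                     # values of -j used on this page
INCLUSION_CUTOFFS = ("0.002", "10e-10")  # values of -h used on this page


def psiblast_commands(query_fasta, database):
    """Build one blastpgp argument list per parameter combination."""
    commands = []
    for iterations, cutoff in product(ITERATIONS, INCLUSION_CUTOFFS):
        output = f"psiblast_j{iterations}_h{cutoff}.out"  # hypothetical naming scheme
        commands.append([
            "blastpgp",
            "-i", query_fasta,        # query sequence in FASTA format
            "-d", database,           # sequence database to search
            "-j", str(iterations),    # number of iterations
            "-h", cutoff,             # E-Value cutoff for inclusion into the profile
            "-o", output,             # report file
        ])
    return commands


# "P45381.fasta" and "protein_db" are placeholders, not the files we used.
for command in psiblast_commands("P45381.fasta", "protein_db"):
    print(" ".join(command))
```

Each list could be handed to `subprocess.run` unchanged, which avoids shell quoting issues.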
Again we considered only hits with an E-Value up to 0.002 as significant. The PsiBlast run with default parameters (2 iterations, E-Value 0.002) yields 597 hits. The most significant hit is again the uncharacterized protein of Rattus norvegicus with an E-Value of e-142. When restricting the search to include only sequences with an E-Value of up to 10e-10, we still get 502 hits. The best hit is still the Rattus norvegicus protein.
In contrast to the simple Blast search, the PsiBlast runs with two iterations find more distantly related proteins. This can be seen in the large number of Succinylglutamate Desuccinylases that are found (though with higher E-Values). As already mentioned, they belong to the same Pfam family as Aspartoacylase.
Increasing the number of iterations results in many more hits. With the less restrictive E-Value more than 3000 hits are found, compared to about 1500 with the more restrictive E-Value. Interestingly, the best hits (Rattus norvegicus again) are less significant (~e-70) than for the PsiBlast search with only 2 iterations (~e-140). The majority of the proteins found are now Succinylglutamate Desuccinylases, even among the most significant hits (the first Succinylglutamate Desuccinylase is ranked 15th). Orthologs of Aspartoacylase are found only among the first 15 significant results. Additionally, even more distant relatives are found, such as Zinc Carboxypeptidases, Carboxypeptidases, Endopeptidases, etc.
<figtable id="aspa_psiblast">
Parameters | it2, def E-Value (h=2e-3) | it2, E-Value h=10e-10 | it10 def E-Value (h=2e-3) | it10 E-Value h=10e-10 |
time | ~2m30 | ~2m30 | ~25m | ~22m |
best E-Value | 1e-142 | 1e-145 | 2e-46 | e-68 |
results (EVal: 2e-3) | 597 | 502 | 3211 | 1515 |
hits with GO terms | 586 | 496 | 3152 | 1461 |
</figtable>
HHBLITS
We used the same parameters for the HHblits search as we used for PsiBlast. This means:
- 2 iterations (n=2), E-Value cutoff for inclusion in result alignment (e=0.002)
- 2 iterations (n=2), strict E-Value cutoff for inclusion in result alignment (e=10e-10)
- 8 iterations (n=8), E-Value cutoff for inclusion in result alignment (e=0.002)
- 8 iterations (n=8), strict E-Value cutoff for inclusion in result alignment (e=10e-10)
For the runs with 2 iterations there are only a few clusters containing Aspartoacylases. Most other clusters consist of uncharacterized proteins or Succinylases. Yet the best hit in both runs contains Aspartoacylase proteins, even though the E-Value of this best hit cluster differs between the two runs.
For the runs with 8 iterations and an E-Value of 0.002, many more clusters are found, which contain very distantly related proteins. The best hit is again a cluster containing Aspartoacylases. With the stricter E-Value of 10e-10 one obtains fewer hits (even fewer than for 2 iterations and E-Value = 0.002), with many closely related sequences, mostly Succinylases.
Note that for all runs the first five hit clusters are always the same, just with a slightly different ranking.
<figtable id="aspa_hhblits">
Parameters | it2, E-Value e=2e-3 | it2, E-Value e=10e-10 | it8, E-Value e=2e-3 | it8, E-Value e=10e-10 |
time | ~2m20 | ~4m | ~11m10 | ~10m55 |
best E-Value | 1e-107 | 2e-141 | 5.1e-74 | 5.5e-74 |
results (EVal: 2e-3) | 305 | 76 | 776 | 246 |
</figtable>
Validation and Comparison
As expected, one finds more hits with Psi-Blast than with a simple Blast search.
In general, one can distinguish between two kinds of proteins that are frequently identified by the sequence searches:
- Aspartoacylases
- Succinylglutamate Desuccinylases
BlastP
<figtable id="blastp_comp">
With a simple BLAST search we were able to identify the most closely related sequences. The most significant hit (Rattus norvegicus) has a sequence identity of 82%. <xr id="blastp_comp"/> shows the distribution of the sequence identity of all hits with E-Value < 0.002. As one can easily see, there are only a few hits with high sequence identities, and the majority of hits have sequence identities of about 30%.
For all 94 hits found with an E-Value cutoff of 10e-10, there are annotated GO terms. Furthermore, all hits found share the GO terms "hydrolase activity, acting on ester bonds" and "metabolic process". Also, as one can see in <xr id="go_blastp_10e10" />, all hits share most GO terms with Aspartoacylase. Again, "Zinc binding" could also be associated with Aspartoacylase. Therefore, all GO terms that are found more than 5 times are also associated with Aspartoacylase. The results are more accurate concerning GO terms shared with Aspartoacylase, which is what one would expect when restricting the E-Value to find more closely related proteins. |
<figure id="blastp_comp"> </figure> |
</figtable>
<figtable id="go_blastp">
E-Value | default (10) | 10e-10 |
GO terms |
<figure id="go_blastp_def"> </figure> |
<figure id="go_blastp_10e10"> </figure> |
</figtable>
Psi Blast
Increasing the number of iterations performed in a PSI-Blast search increases the running time. One can see that the best-ranked hits of the runs with 10 iterations have higher (less significant) E-Values than the best hits of the runs with fewer iterations. Yet the result includes a larger number of significant hits overall, most of them with higher E-Values. This means that increasing the number of iterations finds more distantly related sequences, which is the expected outcome. This outcome is also reflected in the distribution of sequence identities. As one can see in <xr id="PSI_10e10_seqd"/>, running PSI-Blast with 10 iterations firstly results in more significant hits, and secondly most hits have lower sequence identity compared to the run with two iterations.
When the E-Value cutoff for building the profile is not restricted, hits with lower sequence identity (i.e. more distantly related) are included in the final hit list. This agrees with our observation that more hits are classified as Succinylglutamate Desuccinylases than as Aspartoacylases when using the default cutoff of 0.002. Furthermore, when the more restrictive cutoff is used, fewer hits are found overall (see <xr id="PSI_10it_seqid"/>).
The majority of the results from the runs with only two iterations have moderate sequence identities with a broad distribution between 15% and 45%. In contrast, the results from the runs with 10 iterations split into two groups of hits, which form clusters at about 10% and between 30% and 40% sequence identity.
These observations are also reflected in the E-Value distribution. For the runs with two iterations, some results cover the range of E-Values between e-145 and e-60, while the majority of hits have E-Values between e-8 and the cutoff at 0.002. The runs with 10 iterations almost exclusively yield less significant hits. For the run with the restricted cutoff, hits have E-Values ranging from e-60 to the cutoff at 0.002, with a peak close to the cutoff value of 0.002. The run with the default E-Value cutoff results in hits with E-Values ranging from e-30 to 0.002.
<figure id="PSI_10e10_seqd"> </figure> |
<figure id="PSI_10it_seqid"> </figure> |
<figure id="eval_distri_psiblast"> </figure> |
For the run with 2 iterations and the default cutoff value of 0.002, we got 915 hits. We considered 597 hits as significant (E-Value cutoff 2e-3). 586 of these significant proteins have GO terms annotated. As one can see in <xr id="psi_2it_def"/>, the four most frequently occurring GO terms are shared with Aspartoacylase. The fifth most frequent GO term is "Zinc ion binding", which, as already mentioned, is not annotated for Aspartoacylase even though the enzyme does bind zinc. Other frequently occurring terms concern processes involving arginine, which might stem from the Succinylglutamate Desuccinylases.
A similar GO term statistic is obtained for the run with two iterations and an E-Value cutoff of 10e-10 (see <xr id="psi_2it_10e10"/>). Here we received 835 hits, of which we considered 502 proteins as significant. 496 of them have GO terms annotated.
For the PsiBlast run with 10 iterations and the default cutoff, the GO term analysis looks quite different (see <xr id="psi_10it_def"/>). This run resulted in 3211 hits, of which 3152 proteins had annotated GO terms. The most frequent GO term is "Zinc ion binding", which could arguably also be associated with Aspartoacylase. Other frequently occurring GO terms belong to other protein families such as Succinylglutamate Desuccinylases or Carboxypeptidases, which we also observed when checking the hit list.
Interestingly, when restricting the E-value to 10e-10, the most frequently occurring GO terms are again shared with Aspartoacylase (see <xr id="psi_10it_10e10"/>). This underlines the effect of finding closer relatives, which was already apparent from the higher sequence identity of the hits for the restricted run.
<figure id="psi_2it_def"> </figure> |
<figure id="psi_2it_10e10"> </figure> |
<figure id="psi_10it_def"> </figure> |
<figure id="psi_10it_10e10"> </figure> |
HHBlits
Running HHBlits with 2 iterations yields 305 clusters; when the E-Value for inclusion of sequences in further iterations is restricted to 10e-10, only 76 clusters remain. The best hits have very low E-Values and can thus be considered very significant. To increase the number of hits, we repeated the HHBlits search with the maximum of 8 iterations, which resulted in a broader output with more hits with lower averaged E-Values (compare <xr id="hhblits_eval_distri"/>). Regarding the sequence identity distribution, running HHBlits with 8 iterations results in more distantly related hits (see <xr id="hhblits_seq_distri"/>).
<figure id="hhblits_seq_distri"></figure > | <figure id="hhblits_eval_distri"></figure> |
Overlap
As one can see in <xr nolink id="overlap_distri"/>, roughly 40 percent of the resulting hits are unique to a single method. By our criterion, about 25 percent of the hits are significant hits that could be investigated further, i.e. they were found by at least half of the searches (overlap of 50 percent).
<figure id="overlap_distri">
</figure>
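The overlap statistic can be reproduced with plain set operations: for every hit, count how many of the eight searches reported it, then keep the hits reported by at least half of them. The following Python sketch uses hypothetical search names and hit IDs and is not the script we used.

```python
from collections import Counter


def method_counts(results):
    """Map each hit ID to the number of searches that reported it."""
    counts = Counter()
    for hit_ids in results.values():
        counts.update(set(hit_ids))
    return counts


def consensus_hits(results, min_methods):
    """Hits reported by at least min_methods searches, sorted by ID."""
    counts = method_counts(results)
    return sorted(hit for hit, n in counts.items() if n >= min_methods)


# Hypothetical results of eight searches (IDs are placeholders).
results = {
    "search_1": {"A", "B"},
    "search_2": {"A"},
    "search_3": {"A", "B", "C"},
    "search_4": {"A", "B"},
    "search_5": {"A", "B", "C", "D"},
    "search_6": {"A", "C"},
    "search_7": {"A", "E"},
    "search_8": {"A", "E", "F"},
}

counts = method_counts(results)
unique = sorted(hit for hit, n in counts.items() if n == 1)
print("unique to one search:", unique)
print("found by at least 4 of 8 searches:", consensus_hits(results, 4))
```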
Multiple Sequence Alignments
For generating our dataset for the MSA we clustered all Hits into Sequence Identity groups:
- >90%: 1
- 60-89%: 59
- 40-59%: 197
- 20-39%: 1141
Since we only got one hit with a sequence identity >90%, we decided to group our hits as follows: three groups of sequences with eight members each:
- 60-99%
- 40-59%
- 20-39%
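Assigning hits to these three groups is a simple binning step. The Python sketch below assumes that sequence identity is given in percent and that everything from 60% upwards goes into the top group; the hit IDs and identity values are hypothetical.

```python
from collections import Counter


def identity_group(identity):
    """Return the identity group for a percentage, or None below 20%."""
    if identity >= 60:
        return "60-99%"
    if identity >= 40:
        return "40-59%"
    if identity >= 20:
        return "20-39%"
    return None  # too distant, not used for the MSA datasets


# Hypothetical hits: (ID, sequence identity in percent).
hits = [("hit_1", 86), ("hit_2", 63), ("hit_3", 44), ("hit_4", 31), ("hit_5", 12)]

group_sizes = Counter(identity_group(identity) for _, identity in hits)
for group in ("60-99%", "40-59%", "20-39%"):
    print(group, group_sizes[group])
```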
We chose hits from the respective groups that were found by at least 4 methods (overlap of 50%).
id      eVal     identity
# 60-99% sequence identity
Q8BZC2  1.7e-25  90
E1BVP5  e-140    72
H2RVG4  e-141    63
G3VM93  e-105    72
F6ZFQ0  e-139    78
F8WFU8  e-145    86
Q28C61  e-132    68
H2M5L4  e-133    64
# 40-59% sequence identity
G5BTW1  e-133    43
G6FRX8  e-103    39
F7NV91  e-112    39
G1Q6P7  e-120    42
H0WH68  e-135    44
F2PFG6  e-119    40
H2MX25  5e-81    40
Q1Z2X2  e-115    38
# 20-39% sequence identity
Q2F9Q7  e-109    31
Q8YQC1  e-117    41
E1SMZ8  e-108    39
D7E1T3  e-110    36
A5GQV1  7e-92    33
E8LP14  e-107    31
F9TUZ3  e-106    30
A6VUE4  e-101    35
General Results
All in all the three Alignment methods yield comparable results. One can identify several conserved regions. Especially the two groups with sequence identities <60% show very similar MSAs.
There are three strongly conserved motifs located in the first half of the sequences:
- GGTHGNE
- DLNR
- DLHNT
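To check where these motifs lie in the query itself, one can search for them in the human ASPA sequence (UniProt P45381) that we used for the searches. The snippet below is a minimal Python sketch; the helper name `motif_positions` is ours and positions are reported 1-based.

```python
# Human Aspartoacylase (UniProt P45381), the query sequence used for all searches.
ASPA = (
    "MTSCHIAEEHIQKVAIFGGTHGNELTGVFLVKHWLENGAEIQRTGLEVKPFITNPRAVKK"
    "CTRYIDCDLNRIFDLENLGKKMSEDLPYEVRRAQEINHLFGPKDSEDSYDIIFDLHNTTS"
    "NMGCTLILEDSRNNFLIQMFHYIKTSLAPLPCYVYLIEHPSLKYATTRSIAKYPVGIEVG"
    "PQPQGVLRADILDQMRKMIKHALDFIHHFNEGKEFPPCAIEVYKIIEKVDYPRDENGEIA"
    "AIIHPNLQDQDWKPLHPGDPMFLTLDGKTIPLGGDCTVYPVFVNEAAYYEKKEAFAKTTK"
    "LTLNAKSIRCCLH"
)

MOTIFS = ("GGTHGNE", "DLNR", "DLHNT")


def motif_positions(sequence, motif):
    """Return all 1-based start positions of motif in sequence."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        start = sequence.find(motif, start + 1)
    return positions


for motif in MOTIFS:
    print(motif, motif_positions(ASPA, motif))
```

Each motif occurs exactly once, and all three start within the first half of the 313-residue sequence.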
For the second half of the sequence alignments there is no clear consensus on conserved motifs, but several residues are strongly conserved and may be of functional or structural importance.
In the alignment of the >60% group the first two motifs are not colored in the alignment. This is due to two very short sequences which produce gaps in the alignment and thus lower the consensus.
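The effect of short sequences on the coloring can be illustrated with a toy calculation. The Python sketch below uses a simplified conservation measure (fraction of all sequences that show the most common residue in a column, with gaps counting against it) and the 65% threshold from the figures. This is an assumption for illustration and not necessarily how Jalview computes its coloring; the toy alignment is made up.

```python
from collections import Counter


def column_conservation(alignment):
    """Per column: (most common residue, fraction of all sequences showing it)."""
    n_sequences = len(alignment)
    result = []
    for column in zip(*alignment):
        residues = Counter(c for c in column if c != "-")  # gaps are never the consensus
        if residues:
            residue, count = residues.most_common(1)[0]
            result.append((residue, count / n_sequences))
        else:
            result.append(("-", 0.0))
    return result


def conserved_columns(alignment, threshold=0.65):
    """0-based indices of columns whose conservation exceeds the threshold."""
    return [i for i, (_, fraction) in enumerate(column_conservation(alignment))
            if fraction > threshold]


# Toy alignment of one motif region: six full-length rows and two short rows.
full_length = ["DLNR"] * 5 + ["DINR"]
short = ["D---"] * 2

print(conserved_columns(full_length))          # all four columns pass
print(conserved_columns(full_length + short))  # the gaps push column 1 below 65%
```

In the toy example the second column drops from 5/6 to 5/8 once the two gapped rows are added, so it would no longer be colored.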
ClustalW
command: clustalw -align -infile=./db_over60.fa -outfile=./clustalw_msa_60.aln
TCoffee
Muscle
Concerning the wildtype human Aspartoacylase
The three identified motifs can also be found in the wildtype protein (compare the coloring of the sequence at the top of the page). We colored the respective residues in the structure. They are all located in the same region of the protein and thus might indicate an important functional region.