Difference between revisions of "Canavan Task 2 - Sequence alignments"

From Bioinformatikpedia

Revision as of 21:10, 7 May 2012

Sequence Search

Sorry, guys, we're a bit behind schedule! Hope to have everything finished before 11pm tonight (Monday) and hope that's early enough for you to read. Sorry again! Susi and Fanny


Sequence

The native ASPA sequence that we used for the current task is shown below:

UniProt: P45381

>hsa:443 ASPA, ACY2, ASP; aspartoacylase; K01437 aspartoacylase [EC:3.5.1.15] (A)
MTSCHIAEEHIQKVAIFGGTHGNELTGVFLVKHWLENGAEIQRTGLEVKPFITNPRAVKK
CTRYIDCDLNRIFDLENLGKKMSEDLPYEVRRAQEINHLFGPKDSEDSYDIIFDLHNTTS
NMGCTLILEDSRNNFLIQMFHYIKTSLAPLPCYVYLIEHPSLKYATTRSIAKYPVGIEVG
PQPQGVLRADILDQMRKMIKHALDFIHHFNEGKEFPPCAIEVYKIIEKVDYPRDENGEIA
AIIHPNLQDQDWKPLHPGDPMFLTLDGKTIPLGGDCTVYPVFVNEAAYYEKKEAFAKTTK
LTLNAKSIRCCLH



Search

BLASTP

We ran BlastP on the student machines with big_80 as the reference database.

Command: blastall -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P45381_wt.fasta -o blastp_p45381_wt_big80.out


Parameters    | default E-Value = 10 | E-Value 10e-10
results       | 196                  | 94
best E-Value  | 1e-155               | 1e-155
worst E-Value | 9.6                  | e-15

Comments:

  • default E-Value = 10: Most of the resulting proteins are Aspartoacylases of other species. Most of the results with E-Value > e-15 are Succinylglutamate Desuccinylases, which catalyze a reaction similar to that of Aspartoacylase.
  • E-Value 10e-10: The results are the same as for the first run, just with an earlier cutoff.
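
The stricter cutoff can also be applied after the fact to the hit list of the default run. The following is a minimal sketch, assuming the search was rerun with tabular output (blastall's `-m 8` option, which is not part of the command above); the names `Hit`, `parse_tabular` and `filter_hits` are made up for this illustration.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    subject: str
    identity: float  # percent identity
    length: int      # alignment length
    evalue: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse BLAST tabular lines (12 tab-separated columns per hit)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Columns: query, subject, %identity, alignment length, mismatches,
        # gap opens, q.start, q.end, s.start, s.end, e-value, bit score
        hits.append(Hit(cols[1], float(cols[2]), int(cols[3]), float(cols[10])))
    return hits


def filter_hits(hits: Iterable[Hit], max_evalue: float = 10e-10) -> List[Hit]:
    """Keep only hits at or below the E-value cutoff (10e-10 by default)."""
    return [h for h in hits if h.evalue <= max_evalue]
```

Applied to the default run, `filter_hits(parse_tabular(open(path)))` should leave roughly the hits of the strict run, since the cutoff only truncates the list.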

PSIBLAST

PSIBlast was used in the same fashion as BLAST, with the big_80 as the background database. Commands:

  • Running 2 iterations and default E-Value 0.002
    • blastpgp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P45381_wt.fasta -o psiblast_it2_p45381_wt_big80.out -j 2


  • 2 iterations, more strict E-value cutoff of 10E-10
    • blastpgp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P45381_wt.fasta -o psiblast_it2_h10e10_p45381_wt_big80.out -j 2 -h 10e-10


  • 10 iterations, default Evalue 0.002
    • blastpgp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P45381_wt.fasta -o psiblast_it10_p45381_wt_big80.out -j 10


  • 10 iterations, E-value cutoff 10E-10
    • blastpgp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P45381_wt.fasta -o psiblast_it10_h10e10_p45381_wt_big80.out -j 10 -h 10e-10


Parameters    | it2, def E-Value (0.002) | it2, E-Value 10e-10 | it10, def E-Value (0.002) | it10, E-Value 10e-10
time          | ~2m30                    | ~2m30               | ~10m                      | ~10m
results       | 500                      | 93                  | 500                       | 500
best E-Value  | 1e-142                   | 1e-145              | 5e-70                     | 7e-70
worst E-Value | 3e-4                     | 2e-29               | 8e-38                     | 1e-38

Comments:

  • it2, def E-Value (0.002): Results with the best E-Values are mostly Aspartoacylases; sequences previously not found are mostly Succinylglutamate Desuccinylases.
  • it2, E-Value 10e-10: Results are mainly Aspartoacylases.
  • it10, def E-Value (0.002): Converged after 8 rounds. The most significant results include more Succinylglutamate Desuccinylases than Aspartoacylases.
  • it10, E-Value 10e-10: All 10 iterations were done (no early convergence). Aspartoacylases are slightly more frequent at lower E-Values (< E-58), but there is no significant difference in E-Values between Aspartoacylases and Succinylglutamate Desuccinylases.

HHBLITS

We ran HHBlits on the student machines with the Uniprot20 database.

Commands:

  • 2 iterations:
    • hhblits -i P45381_wt.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o hhblits_p45381_def.out
  • 8 iterations:
    • hhblits -i P45381_wt.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -n 8 -o hhblits_p45381_n10.out

-n number of iterations (def 2)

Parameters    | it 2   | it 8
time          | 2m50   | ~6m
results       | 274    | 500
best E-Value  | 2e-110 | 2.9e-68
worst E-Value | 0.0011 | 9.5e-09

Comments:

  • it 2: Mixed results with Aspartoacylases and Succinylglutamate Desuccinylases.
  • it 8: Very varied results: Aspartoacylases, Succinylases, zinc proteins.

Summary and Comparison

As expected, PSI-BLAST finds more hits than a simple BLAST search.

In general, two kinds of proteins are frequently identified by the sequence searches:

  • Aspartoacylases
  • Succinylglutamate Desuccinylases


BlastP

A simple BLAST search yields only about 90 significant hits if one takes an E-value of 10e-10 as the significance cutoff. As one can see in Figure ??, restricting the E-value results in fewer hits with a low sequence similarity.

Comparison of the distribution of sequence identity between the two BlastP runs

Psi Blast

Increasing the number of iterations in a PSI-BLAST search naturally increases the running time (about 10 minutes for 10 iterations versus about 2.5 minutes for 2). The best-ranked hits of the runs with 10 iterations have higher, i.e. less significant, E-values (5e-70 and 7e-70) than the best hits of the runs with 2 iterations (1e-142 and 1e-145). Yet even the weakest reported hits of the 10-iteration runs are still highly significant (8e-38 and 1e-38), so these runs contain a larger number of significant hits overall. In other words, additional iterations find more distantly related sequences, which is the expected outcome. This is also reflected in the distribution of sequence identities: as one can see in Figure ??, running PSI-BLAST with 10 iterations yields hits with a lower sequence identity to our query sequence than the run with 2 iterations.


When restricting the E-value cutoff for the profile build-up, we found that more hits are classified as Aspartoacylases than as Succinylglutamate Desuccinylases. The running time and the E-values of the resulting hits did not change significantly. The majority of the results from the runs with only 2 iterations have moderate sequence identities, broadly distributed between 10% and 50%. In contrast, the results from the runs with 10 iterations split into two groups of hits that cluster at about 15% and 35% sequence identity. This difference is also reflected in the E-value distribution. The runs with 10 iterations result in hits with moderate E-values, with log(E-value) between -200 and -40. The runs with 2 iterations, in contrast, result in many hits of low significance (log(E-value) > -20) alongside a variety of highly significant hits.
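
The logarithmic E-values used for this comparison can be derived as in the following sketch. The base of the logarithm is not stated in the text; base 10 is assumed here, as is the clamping of E-values reported as 0 to a small floor. The function name `log10_evalue` is hypothetical; the example values are the best and worst E-values of the four PSI-BLAST runs listed above.

```python
import math


def log10_evalue(evalue: float, floor: float = 1e-300) -> float:
    """Return log10 of an E-value; values reported as 0 are clamped to `floor`."""
    return math.log10(max(evalue, floor))


# (best, worst) E-values of the four PSI-BLAST runs
runs = {
    "it2, def E-Value": (1e-142, 3e-4),
    "it2, E-Value 10e-10": (1e-145, 2e-29),
    "it10, def E-Value": (5e-70, 8e-38),
    "it10, E-Value 10e-10": (7e-70, 1e-38),
}

for name, (best, worst) in runs.items():
    # The span between best and worst hit shows how broad each run's
    # E-value distribution is on the logarithmic scale.
    print(f"{name}: {log10_evalue(best):.1f} to {log10_evalue(worst):.1f}")
```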

Figure ??
Distribution of Sequence Id between Psi-Blast runs with 2 iterations vs 10 iterations (using E-Value 10e-10)
Figure ??
Distribution of Sequence Id for Psi-Blast runs with 2 iterations with different E-Values (def E-Value vs E-Value of 10e-10)
Figure ??
Distribution of Sequence Id for Psi-Blast runs with 10 iterations with different E-Values (def E-Value vs E-Value of 10e-10)
Figure ??
Distribution of logarithmic E_Values for the four different PSIBlast runs

HHBlits

Running HHBlits with 2 iterations yields a small number of hits (274) whose E-values range from very low (2e-110) to comparatively high (0.0011). To increase the number of hits, we repeated the HHBlits search with the maximum of 8 iterations, which produced a broader output with more hits and lower average E-values (compare Figure ??). Regarding the sequence identity distribution, running HHBlits with 8 iterations yields more distantly related hits (see Figure ??).

Figure ??
Sequence identity distributions of HHBlits run with 2 and with 8 iterations.
Figure ??
logarithmic E_Value distributions of HHBlits run with 2 and with 8 iterations.

Overlap

As one can see in Figure ??, roughly 40 percent of the resulting hits are unique to a single method. About 25 percent of the hits were found by at least half of the methods (overlap of 50 percent); we consider these significant hits that could be investigated further.

Figure ??
Distribution of overlapping hits across the eight sequence searches.


Default E-Values: as expected, the hits of the normal BLAST search are mostly contained in the PSI-BLAST search with two iterations. HHBlits found a large number of different hits; only 48 of its 274 hits are shared with the BLAST searches.
Taking PSI-BLAST with 10 iterations into account yields a large number of sequences common to the three searches (110), which could be interesting since there seems to be high conservation among them.
Strict E-Values for PSI-BLAST and default E-Value for HHBlits with 2 iterations: the number of hits common to all three is now substantially lower, while PSI-BLAST with two and ten iterations share a great number of their hits.
Increasing the number of HHBlits iterations yields more hits for HHBlits but does not increase the number of hits shared with PSI-BLAST with 2 or 10 iterations. However, 10 sequences are common and could be interesting for further investigation.
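
The overlap figures discussed above can be computed from the hit identifiers of each search. This is a minimal sketch, assuming each search result has already been reduced to a set of identifiers; the function names and the toy identifiers are invented for illustration and are not the scripts we used.

```python
from collections import Counter
from typing import Dict, Set


def method_counts(hits_by_method: Dict[str, Set[str]]) -> Counter:
    """Count for every hit identifier how many searches reported it."""
    counts: Counter = Counter()
    for ids in hits_by_method.values():
        counts.update(ids)  # sets: each identifier counts once per search
    return counts


def shared_hits(hits_by_method: Dict[str, Set[str]], min_methods: int) -> Set[str]:
    """Identifiers found by at least `min_methods` searches."""
    counts = method_counts(hits_by_method)
    return {ident for ident, n in counts.items() if n >= min_methods}


def unique_fraction(hits_by_method: Dict[str, Set[str]]) -> float:
    """Fraction of all distinct identifiers found by exactly one search."""
    counts = method_counts(hits_by_method)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n == 1) / len(counts)


# Toy input with made-up identifiers, three searches instead of eight
toy = {
    "blastp": {"A", "B"},
    "psiblast_it2": {"A", "B", "C"},
    "hhblits_it2": {"A", "D"},
}
print(sorted(shared_hits(toy, min_methods=2)))
print(unique_fraction(toy))
```

With eight searches, an overlap of 50 percent corresponds to `min_methods=4`.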

Further Evaluation

We tried to further validate the sequence search hits via structural similarity. Unfortunately, none of the resulting hits was a PDB hit. We also tried to map the sequence identifiers against the UniProtKB/Swiss-Prot PDB cross-references (http://www.uniprot.org/docs/pdbtosp.txt). This mapping yielded no results either, which is why we cannot include any structural information in our ongoing work. Inspecting the annotation of the sequence hits, we found that the majority of the hits are Aspartoacylases or the closely related Succinylglutamate Desuccinylases. Since a crystal structure of the human Aspartoacylase already exists, it is plausible that no further structures have been solved for this class of proteins. Additionally, a large number of hits are not yet characterized proteins, which are unlikely to be attractive targets for crystallization.
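
The identifier mapping can be sketched as follows. The sketch assumes that the cross-reference file lists UniProt accessions in parentheses, e.g. `(P45381)`, and that hit identifiers have the form `db|ACCESSION|ENTRY_NAME`; both the regular expression and the function names are assumptions made for this illustration.

```python
import re
from typing import Iterable, List, Set

# Assumption: accessions appear in parentheses in the cross-reference file.
_ACC_IN_PARENS = re.compile(r"\(([A-Z][A-Z0-9]{5,9})\)")


def accession(identifier: str) -> str:
    """Return the accession of a 'db|ACCESSION|ENTRY_NAME' identifier."""
    parts = identifier.split("|")
    return parts[1] if len(parts) >= 3 else identifier


def accessions_with_structure(xref_lines: Iterable[str]) -> Set[str]:
    """Collect all accessions mentioned in the cross-reference lines."""
    found: Set[str] = set()
    for line in xref_lines:
        found.update(_ACC_IN_PARENS.findall(line))
    return found


def hits_with_structure(identifiers: Iterable[str],
                        xref_lines: Iterable[str]) -> List[str]:
    """Return the hit identifiers whose accession has a PDB cross-reference."""
    known = accessions_with_structure(xref_lines)
    return [ident for ident in identifiers if accession(ident) in known]
```

An empty list returned for all hits corresponds to the outcome described above.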

Multiple Sequence Alignments

To generate the dataset for the MSA, we clustered all hits into sequence identity groups:

  • >90%: 1
  • 60-89%: 59
  • 40-59%: 197
  • 20-39%: 1141
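
The grouping can be expressed as a simple binning by percent identity. The following is a sketch only: how boundary values are assigned (a hit with exactly 90% goes to the top group here) and the names `BOUNDS`, `identity_group` and `group_hits` are assumptions. The example rows are taken from the hit listing further down.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (group label, inclusive lower bound in percent identity), highest first
BOUNDS = [(">90%", 90), ("60-89%", 60), ("40-59%", 40), ("20-39%", 20)]


def identity_group(identity: float) -> Optional[str]:
    """Return the group label for a percent identity, or None below 20%."""
    for label, lower in BOUNDS:
        if identity >= lower:
            return label
    return None


def group_hits(hits: Iterable[Tuple[str, float]]) -> Dict[str, List[str]]:
    """Sort (identifier, identity) pairs into the identity groups."""
    groups: Dict[str, List[str]] = {label: [] for label, _ in BOUNDS}
    for ident, identity in hits:
        label = identity_group(identity)
        if label is not None:
            groups[label].append(ident)
    return groups


example = [
    ("tr|G1P280|G1P280_MYOLU", 78),
    ("tr|B3RSE1|B3RSE1_TRIAD", 49),
    ("tr|C7PCU7|C7PCU7_CHIPD", 21),
    ("tr|F8FLU8|F8FLU8_PAEMK", 10),  # below 20%, falls into no group
]
for label, members in group_hits(example).items():
    print(label, len(members))
```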

Since we got only one hit with a sequence identity >90%, we decided to group our hits as follows: three groups of sequences with eight members each:

  • 60-99%
  • 40-59%
  • 20-39%

From each group we chose hits that were found by at least 4 of the 8 searches (overlap of 50%).


id                     eVal  identity coverage alignment_length

# whole range = Set100
tr|C7PCU7|C7PCU7_CHIPD	5e-63	21	0.9933	474
tr|B3RSE1|B3RSE1_TRIAD	2e-93	49	0.8415	362
tr|G2PG26|G2PG26_STRVO	1e-105	33	0.5854	427
tr|Q8RX86|Q8RX86_ARATH	1e-105	35	0.9814	422
tr|G8NYA7|G8NYA7_GRAMM	4e-61	22	0.6186	470
tr|H1Q7I8|H1Q7I8_9ACTO	1e-97	37	0.5409	396
tr|E1ZHK5|E1ZHK5_CHLVA	8e-80	38	0.8485	368
sp|Q0CEF5|AGALG_ASPTN	4e-63	12	0.611	478
tr|F5BFS9|F5BFS9_TOBAC	1e-106	36	0.9534	410
tr|F8FLU8|F8FLU8_PAEMK	1e-76	10	0.6091	474

# <40% sequence identity = Set40
tr|B8P149|B8P149_POSPM	3e-80	28	0.9425	432
tr|G2TQE8|G2TQE8_BACCO	7e-68	8	0.5795	452
tr|F9HJT9|F9HJT9_9STRE	3e-70	11	0.5709	452
tr|H2JN17|H2JN17_STRHY	3e-69	23	0.774	504
tr|C5AKH4|C5AKH4_BURGB	2e-67	26	0.5488	403
sp|Q0CEF5|AGALG_ASPTN	4e-63	12	0.611	478
tr|B3CFN7|B3CFN7_9BACE	1e-78	26	0.5828	412
tr|D4KDQ2|D4KDQ2_9FIRM	8e-76	10	0.611	483
tr|D4W2N5|D4W2N5_9FIRM	1e-67	10	0.5478	435
tr|F2USV1|F2USV1_SALS5	1e-88	35	1	467
 
# >60% sequence identity = Set60
tr|G1P280|G1P280_MYOLU	1e-108	78	0.9699	420
tr|Q4RTE7|Q4RTE7_TETNG	7e-89	71	0.7319	314
tr|F1Q5G5|F1Q5G5_DANRE	1e-106	67	0.9138	392
tr|E1B725|E1B725_BOVIN	1e-111	76	0.9727	428
tr|H2U095|H2U095_TAKRU	1e-101	65	0.9424	412
tr|G1T044|G1T044_RABIT	1e-109	82	0.9698	417
tr|C0HA45|C0HA45_SALSA	1e-102	63	0.9534	409
tr|H0WQ54|H0WQ54_OTOGA	4e-87	71	0.9953	428
tr|G3WK18|G3WK18_SARHA	1e-108	72	0.9388	414
tr|H2L5H7|H2L5H7_ORYLA	1e-100	61	0.9534	411
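
For downstream steps, the listing above can be read back into per-set records. This is a minimal sketch, assuming the layout shown (a `# ... = SetName` line opens each set, followed by five whitespace-separated columns per hit); `Row` and `parse_sets` are hypothetical names, and the excerpt reuses two rows from the listing.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Row(NamedTuple):
    identifier: str
    evalue: float
    identity: int          # percent sequence identity
    coverage: float
    alignment_length: int


# A set header looks like "# >60% sequence identity = Set60"
_SET_HEADER = re.compile(r"^#.*=\s*(\S+)\s*$")


def parse_sets(text: str) -> Dict[str, List[Row]]:
    """Group the rows of the listing by the set they belong to."""
    sets: Dict[str, List[Row]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        header = _SET_HEADER.match(line)
        if header:
            current = header.group(1)
            sets[current] = []
            continue
        fields = line.split()
        if current is None or len(fields) != 5:
            continue  # column header before the first set, or malformed line
        ident, evalue, identity, coverage, length = fields
        sets[current].append(
            Row(ident, float(evalue), int(identity), float(coverage), int(length))
        )
    return sets


EXCERPT = """\
id\teVal\tidentity\tcoverage\talignment_length
# <40% sequence identity = Set40
tr|B8P149|B8P149_POSPM\t3e-80\t28\t0.9425\t432
# >60% sequence identity = Set60
tr|G1P280|G1P280_MYOLU\t1e-108\t78\t0.9699\t420
"""

for name, rows in parse_sets(EXCERPT).items():
    print(name, [row.identifier for row in rows])
```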



ClustalW

command: clustalw -align -infile=./db_over60.fa -outfile=./clustalw_msa_60.aln

TCoffee

Muscle