ASPA Sequence Alignments
Revision as of 21:57, 31 May 2011
Sequence Searches
BLAST
Prior to running BLAST, we concatenated all of the individual FASTA files of the nonredundant library into one big file, /data/nr/nr. Since we are working with proteins, BlastP was used. Command: blast -d /data/nr/nr -p blastp -i ../seq.fasta
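The concatenation step could look roughly like the following. This is a minimal sketch, not the procedure actually used: the function name concat_fasta and the input directory in the usage note are hypothetical, and only the target path /data/nr/nr comes from the text.

```python
from pathlib import Path
from typing import Iterable


def concat_fasta(inputs: Iterable[Path], output: Path) -> int:
    """Concatenate FASTA files into one file; return the number of records written.

    Each input is copied verbatim. A newline is appended if a file does not end
    with one, so that the next header always starts on its own line.
    """
    records = 0
    with open(output, "w") as out:
        for path in sorted(inputs):
            text = Path(path).read_text()
            if not text:
                continue  # skip empty files
            if not text.endswith("\n"):
                text += "\n"
            # A FASTA record starts with a header line beginning with '>'.
            records += sum(1 for line in text.splitlines() if line.startswith(">"))
            out.write(text)
    return records
```

Assuming the individual files sit in a directory such as nr_parts/ (a made-up name), the call would be concat_fasta(Path("nr_parts").glob("*.fasta"), Path("/data/nr/nr")).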
FASTA
We had to download and build this before being able to use it. Command: fasta36 -q ../seq.fasta /data/nr/nr
PSI Blast
The preinstalled version of PSI Blast aborted with an error message; to fix this, we replaced /bin/sh with a symlink to /bin/bash.
Commands:
blastpgp -i ../seq.fasta -d /data/nr/nr -j 3 -h 0.00001
blastpgp -i ../seq.fasta -d /data/nr/nr -j 5 -h 0.00001
blastpgp -i ../seq.fasta -d /data/nr/nr -j 3 -h 0.005
blastpgp -i ../seq.fasta -d /data/nr/nr -j 5 -h 0.005
Runtime was surprisingly long: approx. 23 minutes for 3 iterations and approx. 54 minutes for 5 iterations.
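The four runs above are the combinations of two iteration counts (-j) and two E-value thresholds (-h). As a sketch of how such a parameter grid could be generated instead of typed by hand, the following builds the same command lines; the helper name psiblast_commands is hypothetical, and the flags are exactly those listed above.

```python
from itertools import product
from typing import List, Sequence


def psiblast_commands(query: str, db: str,
                      iterations: Sequence[int],
                      evalues: Sequence[str]) -> List[List[str]]:
    """Build one blastpgp argument list per (E-value threshold, iteration count) pair."""
    return [
        ["blastpgp", "-i", query, "-d", db, "-j", str(j), "-h", h]
        for h, j in product(evalues, iterations)
    ]


commands = psiblast_commands("../seq.fasta", "/data/nr/nr", (3, 5), ("0.00001", "0.005"))
for cmd in commands:
    print(" ".join(cmd))
    # To actually run a search: subprocess.run(cmd, check=True)
```

E-value thresholds are kept as strings so that they are passed to the program exactly as written.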
HHSearch
We used the downloadable HHSearch binary and the latest snapshot of the HMM database for PDB from ftp://toolkit.lmb.uni-muenchen.de/HHsearch/databases. Since the composition of this database diverges rather strongly from that of NR, we did not include HHSearch results in any comparisons. The downloaded database (we used the HMM version) had to be concatenated into a single file before HHSearch would accept it as database input; we did this with a small Ruby script.
To obtain e-values, our input sequence had to be converted to an HHM file and calibrated against a calibration data set provided on the HHSearch website. This was done via the following commands:
hhmake -i seq.fasta
hhsearch -d cal.hhm -i seq.hhm -cal
We then searched the downloaded HMM database using the newly generated HHM profile for our input sequence:
Command: hhsearch -i ../seq.hhm -d ../hhsearch-db/pdb/db -o hhsearch-ali -p 0 -E 1000 -Z 100000000000 -B 10000000000
By specifying extreme values for the minimum probability (-p), the maximum e-value (-E) and the maximum number of entries shown in the hit list (-Z) and the alignment list (-B), we ensured that low-identity matches were reported as well.
Overlap
We visualize the overlap of the results between different programs and settings using Venn diagrams; HHSearch is not included, as the difference in database composition precludes a meaningful comparison. For the comparison between Blast, Fasta and PSI-Blast, we chose the PSI-Blast run with the fewest differences to the other PSI-Blast runs. In this particular case, the differences between the runs are negligible.
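The region sizes of such a Venn diagram can be derived from plain sets of hit identifiers. The following is a minimal sketch under the assumption that each program's hits have already been parsed into a set of IDs; the function name venn_regions and the IDs in the usage note are placeholders, not real results.

```python
from typing import Dict, FrozenSet, Set


def venn_regions(hits: Dict[str, Set[str]]) -> Dict[FrozenSet[str], Set[str]]:
    """Map each combination of methods to the hit IDs found by exactly those methods."""
    regions: Dict[FrozenSet[str], Set[str]] = {}
    for hit_id in set().union(*hits.values()):
        # The region of a hit is the set of all methods that reported it.
        found_by = frozenset(name for name, ids in hits.items() if hit_id in ids)
        regions.setdefault(found_by, set()).add(hit_id)
    return regions
```

For example, venn_regions({"blast": {"a", "b"}, "fasta": {"b", "c"}}) places "b" in the region shared by both methods and "a" and "c" in the regions exclusive to one method each.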
Score Distribution
Identity Distribution
HSSP Recall
[Figure: Identity distribution of the sequence search methods used]
Multiple Sequence Alignments
Sequences selected from Sequence Searches
We selected the following sequences for the MSA task:
Multiple Alignments
For building the multiple alignments, the following 20 sequences were chosen:
ID | Identity | Description
---|---|---
90-99% Sequence Identity | |
197102934 | 93% | Aspartoacylase (Pongo abelii)
296476730 | 92% | Aspartoacylase (Bos taurus)
178056458 | 92% | Aspartoacylase (Sus scrofa)
62510428 | 96% | Aspartoacylase (Macaca fascicularis)
60-89% Sequence Identity | |
19354304 | 82% | Aspartoacylase (Mus musculus)
13242314 | 81% | Aspartoacylase (Rattus norvegicus)
166796542 | 67% | LOC733935 protein [Xenopus (Silurana) tropicalis]
227437329 | 65% | Aspartoacylase (Hypomesus transpacificus)
158534041 | 62.5% | Aspartoacylase (Danio rerio)
2jxm_B | 60% | Cytochrome F (Prochlorothrix hollandica)
40-59% Sequence Identity | |
57526957 | 45% | Aspartoacylase 2 (Rattus norvegicus)
89076336 | 40% | Aspartoacylase (Photobacterium sp.)
18087825 | 41% | Aspartoacylase 2 (Homo sapiens)
21309896 | 42% | Hepatitis C virus core-binding protein
308321821 | 42% | Aspartoacylase 2A (Ictalurus furcatus)
20-39% Sequence Identity | |
88807290 | 34% | Aspartoacylase (Synechococcus sp.)
158339037 | 37% | Aspartoacylase (Acaryochloris marina)
186683012 | 37% | Aspartoacylase (Nostoc punctiforme)
308050175 | 37% | Aspartoacylase (Ferrimonas balearica)
260771522 | 34% | Aspartoacylase (Vibrio furnissii)
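Sorting hits into the four identity ranges of the table can be automated. The following is a minimal sketch; the function name group_by_identity is hypothetical, and treating the upper bound of each range as inclusive up to the next whole percent (so that e.g. 89.5% still falls into 60-89%) is an assumption.

```python
from typing import Dict, List, Tuple

# Identity ranges as used in the table above (bounds in percent).
BINS: List[Tuple[int, int]] = [(90, 99), (60, 89), (40, 59), (20, 39)]


def group_by_identity(identities: Dict[str, float]) -> Dict[Tuple[int, int], List[str]]:
    """Assign each sequence ID to the identity range its value falls into.

    Values outside all ranges (below 20% or at 100%) are ignored.
    """
    groups: Dict[Tuple[int, int], List[str]] = {b: [] for b in BINS}
    for seq_id, ident in identities.items():
        for low, high in BINS:
            if low <= ident < high + 1:
                groups[(low, high)].append(seq_id)
                break
    return groups
```

With a few rows of the table as input, for instance {"197102934": 93, "158534041": 62.5, "88807290": 34}, each ID ends up in the list of its range.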
ClustalW
Command: clustalw selected.fasta
Cobalt
We first had to download and install Cobalt; we then ran it using the following command:
cobalt -i selected.fasta -norps T
Muscle
Command: muscle -in selected.fasta -out msa_muscle.aln
T-Coffee (default parameters)
Command: t_coffee selected.fasta -outfile msa_tcoffee.aln
T-Coffee (3D)
The corresponding mode for this flavor is expresso; it was preinstalled on the VM used for the practical.
Command: t_coffee selected.fasta -outfile msa_tcoffee_3d.aln -mode expresso
Multiple Sequence Alignments
These are the MSAs that were created in this sub-task; the images were created using JalView.
Number of conserved columns
...
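One way to count conserved columns is sketched below. It assumes that the alignment has already been read into a list of equal-length strings with '-' as the gap character, and that "conserved" means identical residues in all sequences; both the function name conserved_columns and this strict definition are assumptions, not necessarily the criterion used for the results.

```python
from typing import List


def conserved_columns(alignment: List[str]) -> int:
    """Count columns in which every sequence has the same residue and no gap."""
    if not alignment:
        return 0
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return sum(
        1 for column in zip(*alignment)
        if column[0] != "-" and len(set(column)) == 1
    )
```

A looser definition (for example, columns where a given fraction of sequences agree) would only require changing the condition inside the sum.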
Conservation of functionally important residues
...
Number of Gaps
...
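Gap statistics can be computed directly from the aligned sequences. The following is a minimal sketch, assuming the alignment is available as a mapping from sequence name to aligned string with '-' as the gap character; the function names gap_counts and gapped_columns are hypothetical.

```python
from typing import Dict


def gap_counts(alignment: Dict[str, str]) -> Dict[str, int]:
    """Return the number of gap characters ('-') per aligned sequence."""
    return {name: row.count("-") for name, row in alignment.items()}


def gapped_columns(alignment: Dict[str, str]) -> int:
    """Count alignment columns that contain at least one gap."""
    return sum(1 for column in zip(*alignment.values()) if "-" in column)
```

Summing the values returned by gap_counts gives the total number of gap characters in an alignment, which allows a direct comparison between the different MSA programs.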
Gaps in Secondary Structure Elements
...