Difference between revisions of "ASPA Sequence Alignments"

Revision as of 21:57, 31 May 2011

Sequence Searches

BLAST

Prior to running BLAST, we concatenated all individual FASTA files of the non-redundant (NR) library into a single file, /data/nr/nr. Since we are working with proteins, BlastP was used. Command: blast -d /data/nr/nr -p blastp -i ../seq.fasta

FASTA

We had to download and build FASTA before we could use it. Command: fasta36 -q ../seq.fasta /data/nr/nr

PSI Blast

The preinstalled version of PSI Blast aborted with an error message; to fix this, we replaced /bin/sh with a symlink to /bin/bash. Commands:
blastpgp -i ../seq.fasta -d /data/nr/nr -j 3 -h 0.00001
blastpgp -i ../seq.fasta -d /data/nr/nr -j 5 -h 0.00001
blastpgp -i ../seq.fasta -d /data/nr/nr -j 3 -h 0.005
blastpgp -i ../seq.fasta -d /data/nr/nr -j 5 -h 0.005
Runtime was surprisingly long: approx. 23 minutes for 3 iterations and approx. 54 minutes for 5 iterations.
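The four PSI Blast runs form a small grid over the number of iterations (-j) and the E-value threshold for inclusion in the profile (-h). As a minimal sketch, and not the way the runs were actually launched, the command lines could be generated as follows; the function name psiblast_commands is hypothetical, and the snippet only prints the commands rather than executing them.

```python
from itertools import product


def psiblast_commands(query, database, iterations, evalues):
    """Build one blastpgp argument list per (E-value threshold, iterations) pair."""
    return [
        ["blastpgp", "-i", query, "-d", database, "-j", str(j), "-h", h]
        for h, j in product(evalues, iterations)
    ]


# Paths and parameter values are taken from the commands listed above.
commands = psiblast_commands(
    "../seq.fasta", "/data/nr/nr", iterations=[3, 5], evalues=["0.00001", "0.005"]
)
for cmd in commands:
    print(" ".join(cmd))
```

Keeping the E-value thresholds as strings avoids Python reformatting 0.00001 as 1e-05 on the command line.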

HHSearch

We used the downloadable HHSearch binary and the latest snapshot of the HMM database for PDB from ftp://toolkit.lmb.uni-muenchen.de/HHsearch/databases. Since the composition of this database differs strongly from that of NR, we did not include HHSearch results in any comparisons. The downloaded database (we used the HMM version) had to be concatenated into a single file before HHSearch would accept it as database input; we did this using a small Ruby script. To obtain e-values, our input sequence had to be converted to an hhm file and calibrated against a calibration data set provided on the HHSearch website. This was done via the following commands:
hhmake -i seq.fasta
hhsearch -d cal.hhm -i seq.hhm -cal

We then searched the downloaded HMM database using the newly generated HHM profile for our input sequence:

Command: hhsearch -i ../seq.hhm -d ../hhsearch-db/pdb/db -o hhsearch-ali -p 0 -E 1000 -Z 100000000000 -B 10000000000

By specifying extreme values for the probability threshold (-p), the e-value cutoff (-E) and the limits on the number of entries shown in the hit list (-Z) and the alignment list (-B), we ensured that low-identity matches were reported too.
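The Ruby script used to concatenate the downloaded HMM database is not shown here. As a rough sketch of what such a step might look like in Python, the following function appends all files matching a pattern to one output file. The function name concatenate_files, the directory layout and the *.hhm file extension are assumptions for illustration, not details of the original script.

```python
from pathlib import Path


def concatenate_files(input_dir, pattern, output_path):
    """Append every file in input_dir matching pattern to output_path, in sorted order.

    Returns the number of files that were concatenated.
    """
    # Collect the inputs before opening the output, so that a freshly created
    # output file is never picked up as an input. Choose an output name that
    # does not match the pattern to keep reruns safe as well.
    paths = sorted(Path(input_dir).glob(pattern))
    with open(output_path, "w") as out:
        for path in paths:
            text = path.read_text()
            out.write(text)
            # Make sure the next record starts on a fresh line.
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(paths)
```

A call such as concatenate_files("../hhsearch-db/pdb", "*.hhm", "../hhsearch-db/pdb/db") would then produce the single database file passed to hhsearch via -d, assuming the individual models are stored with an .hhm extension.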

Overlap

We visualize the overlap of the results between different programs and settings using Venn diagrams; HHSearch is not included, as the difference in database composition precludes a meaningful comparison. For the comparison between Blast, Fasta and PSI-Blast, we chose the PSI-Blast run with the fewest differences to the other PSI-Blast runs. In this particular case, the differences between the runs are negligible.
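The numbers behind such Venn diagrams are plain set operations on the hit identifiers. The sketch below is an illustration under assumptions, not the procedure actually used: it assumes each program's hits are available as a set of sequence IDs, and it interprets "fewest differences" as the smallest summed symmetric difference to all other runs. The function names and the example IDs are hypothetical.

```python
def venn_regions(hits):
    """Map each combination of program names to the IDs found by exactly those programs."""
    regions = {}
    all_ids = set().union(*hits.values())
    for seq_id in all_ids:
        key = frozenset(name for name, ids in hits.items() if seq_id in ids)
        regions.setdefault(key, set()).add(seq_id)
    return regions


def most_representative(runs):
    """Return the run whose hit set has the smallest summed symmetric difference to all other runs."""
    def total_diff(name):
        return sum(len(runs[name] ^ ids) for other, ids in runs.items() if other != name)
    return min(runs, key=total_diff)


# Made-up IDs, only to show the shape of the input.
example = {
    "blast": {"a", "b", "c"},
    "fasta": {"b", "c", "d"},
    "psiblast": {"c", "d", "e"},
}
regions = venn_regions(example)
```

Each key of regions corresponds to one area of the Venn diagram, so the size of regions[frozenset({"blast", "fasta"})] is the number of hits found by Blast and Fasta but not by PSI-Blast.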

Overlap of hits between Blast, Fasta and Psi-Blast.

Overlap of hits between different Psi-Blast runs

Score Distribution

Identity Distribution

HSSP Recall

Multiple Sequence Alignments

Sequences selected from Sequence Searches

We selected the following sequences for the MSA task:

Multiple Alignments

For building the multiple alignments, the following 20 sequences were chosen:

ID	Identity	Description
90-99% Sequence Identity
197102934	93%	Aspartoacylase (Pongo abelii)
296476730	92%	Aspartoacylase (Bos taurus)
178056458	92%	Aspartoacylase (Sus scrofa)
62510428	96%	Aspartoacylase (Macaca fascicularis)
60-89% Sequence Identity
19354304	82%	Aspartoacylase (Mus musculus)
13242314	81%	Aspartoacylase (Rattus norvegicus)
166796542	67%	LOC733935 protein [Xenopus (Silurana) tropicalis]
227437329	65%	Aspartoacylase (Hypomesus transpacificus)
158534041	62.5%	Aspartoacylase (Danio rerio)
2jxm_B	60%	Cytochrome F (Prochlorothrix hollandica)
40-59% Sequence Identity
57526957	45%	Aspartoacylase 2 (Rattus norvegicus)
89076336	40%	Aspartoacylase (Photobacterium sp.)
18087825	41%	Aspartoacylase 2 (Homo sapiens)
21309896	42%	Hepatitis C virus core-binding protein
308321821	42%	Aspartoacylase 2A (Ictalurus furcatus)
20-39% Sequence Identity
88807290	34%	Aspartoacylase (Synechococcus sp.)
158339037	37%	Aspartoacylase (Acaryochloris marina)
186683012	37%	Aspartoacylase (Nostoc punctiforme)
308050175	37%	Aspartoacylase (Ferrimonas balearica)
260771522	34%	Aspartoacylase (Vibrio furnissii)
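The grouping into identity ranges used in the table can be expressed as a small binning function. This is only a sketch: the names identity_bin and group_by_bin are hypothetical, the bin boundaries are read off the table headings, and fractional values such as 62.5% are assumed to fall into the bin whose lower bound they exceed.

```python
# Identity ranges as used in the table above (lower and upper bound in percent).
BINS = [(90, 99), (60, 89), (40, 59), (20, 39)]


def identity_bin(identity):
    """Return the label of the identity range containing the value, or None if outside all ranges."""
    for low, high in BINS:
        # high + 1 makes fractional values such as 89.5 fall into the lower bin.
        if low <= identity < high + 1:
            return f"{low}-{high}%"
    return None


def group_by_bin(hits):
    """Group (sequence ID, percent identity) pairs by identity range."""
    groups = {}
    for seq_id, identity in hits:
        label = identity_bin(identity)
        if label is not None:
            groups.setdefault(label, []).append(seq_id)
    return groups


# A few rows from the table above.
selected = [
    ("197102934", 93),
    ("19354304", 82),
    ("158534041", 62.5),
    ("57526957", 45),
    ("88807290", 34),
]
groups = group_by_bin(selected)
```

Applied to a full hit list, such a grouping makes it easy to check that each identity range is represented before picking the sequences by hand.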

ClustalW

Command: clustalw selected.fasta

Cobalt

We first had to download and install Cobalt; we then ran it using the following command:
cobalt -i selected.fasta -norps T

Muscle

Command: muscle -in selected.fasta -out msa_muscle.aln

T-Coffee (default parameters)

Command: t_coffee selected.fasta -outfile msa_tcoffee.aln

T-Coffee (3D)

The T-Coffee mode for this flavor is expresso; it was preinstalled on the VM used for the practical.

Command: t_coffee selected.fasta -outfile msa_tcoffee_3d.aln -mode expresso

Multiple Sequence Alignments

These are the MSAs that were created in this sub-task; the images were created using JalView.

ClustalW MSA

T-Coffee MSA

Cobalt MSA

Muscle MSA

Number of conserved columns

...

Conservation of functionally important residues

...

Number of Gaps

...

Gaps in Secondary Structure Elements

...

@@ Line 47: / Line 47: @@
 ===HSSP Recall===
+[[File:Aspa hssp.png|500px|Identity distribution of the sequence search methods used]]
-...
