ASPA Sequence Alignments

From Bioinformatikpedia

Sequence Searches

BLAST

Prior to running BLAST, we concatenated all of the individual FASTA files of the nonredundant library into one big file, /data/nr/nr. Since we are working with proteins, BlastP was used. Command: blast -d /data/nr/nr -p blastp -i ../seq.fasta
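The concatenation step can be sketched in Python as follows. This is only an illustration and not the procedure actually used; the function name `concatenate_fasta` and the example file names are hypothetical.

```python
def concatenate_fasta(parts, target):
    """Append every file in `parts` to `target`, one after the other.

    A newline is inserted after any part that does not end with one, so
    that the next '>' header always starts on its own line.
    """
    with open(target, "wb") as out:
        for part in parts:
            last = b"\n"
            with open(part, "rb") as src:
                # Copy in 1 MiB chunks so that large databases need not fit in memory.
                while chunk := src.read(1 << 20):
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                out.write(b"\n")
```

A call such as `concatenate_fasta(["part1.fasta", "part2.fasta"], "nr")` would then produce one combined file (the file names here are placeholders).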

FASTA

We had to download and build this before being able to use it. Command: fasta36 -q ../seq.fasta /data/nr/nr

PSI Blast

The preinstalled version of PSI-Blast aborted with an error message; to fix this, we replaced /bin/sh with a symlink to /bin/bash. Commands:
blastpgp -i ../seq.fasta -d /data/nr/nr -j 3 -h 0.00001
blastpgp -i ../seq.fasta -d /data/nr/nr -j 5 -h 0.00001
blastpgp -i ../seq.fasta -d /data/nr/nr -j 3 -h 0.005
blastpgp -i ../seq.fasta -d /data/nr/nr -j 5 -h 0.005
The runtime was surprisingly long: approx. 23 minutes for 3 iterations and approx. 54 minutes for 5 iterations.
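The four runs above are all combinations of two iteration counts (-j) and two e-value thresholds (-h). As a minimal sketch, such a parameter grid could be generated like this; the helper `psiblast_commands` is a hypothetical name, and the code only builds the command lines without executing them.

```python
from itertools import product


def psiblast_commands(query, database,
                      iterations=(3, 5),
                      evalues=("0.00001", "0.005")):
    """Build one blastpgp command per (e-value, iteration count) pair."""
    commands = []
    for evalue, n_iter in product(evalues, iterations):
        commands.append([
            "blastpgp", "-i", query, "-d", database,
            "-j", str(n_iter), "-h", evalue,
        ])
    return commands


for cmd in psiblast_commands("../seq.fasta", "/data/nr/nr"):
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run` to actually start the search; given the runtimes reported above, running the four searches in parallel might be worth considering.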

HHSearch

We used the downloadable HHSearch binary and the latest snapshot of the HMM database for PDB from ftp://toolkit.lmb.uni-muenchen.de/HHsearch/databases. Since the composition of this database differs rather strongly from that of NR, we did not include HHSearch results in any comparisons. The downloaded database (we used the HMM version) had to be concatenated into a single file before HHSearch would accept it as database input; we did this using a small Ruby script. To obtain e-values, our input sequence had to be converted to an HHM file and calibrated against a calibration data set provided on the HHSearch website. This was done via the following commands:
hhmake -i seq.fasta
hhsearch -d cal.hhm -i seq.hhm -cal

We then searched the downloaded HMM database using the newly generated HHM profile for our input sequence:

Command: hhsearch -i ../seq.hhm -d ../hhsearch-db/pdb/db -o hhsearch-ali -p 0 -E 1000 -Z 100000000000 -B 10000000000

By specifying extreme values for the minimum probability (-p), the e-value cutoff (-E) and the maximum number of entries shown in the hit list (-Z) and the alignment list (-B), we ensured that low-identity matches were reported too.


Overlap

We visualize the overlap of the results between different programs and settings using Venn diagrams; HHSearch is not included, as the difference in database composition precludes a meaningful comparison. For the comparison between Blast, Fasta and PSI-Blast, we chose the PSI-Blast run with the fewest differences to the other PSI-Blast runs. In this particular case, the difference is negligible.

Overlap of hits between Blast, Fasta and Psi-Blast.
Overlap of hits between different Psi-Blast runs
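The numbers behind such Venn diagrams can be derived with plain set operations. The following sketch assumes that the hit IDs of each method have already been parsed into Python sets; the function name `venn_regions` and the example IDs are invented for illustration and are not taken from our results.

```python
def venn_regions(hits_by_method):
    """Map each combination of methods to the IDs found by exactly those methods."""
    regions = {}
    all_ids = set().union(*hits_by_method.values())
    for seq_id in all_ids:
        # The region of a hit is the set of all methods that reported it.
        key = frozenset(
            method for method, hits in hits_by_method.items() if seq_id in hits
        )
        regions.setdefault(key, set()).add(seq_id)
    return regions


# Invented toy data, only to show the shape of the input.
example = {
    "Blast": {"A", "B", "C"},
    "Fasta": {"B", "C", "D"},
    "PSI-Blast": {"C", "E"},
}
for methods, ids in venn_regions(example).items():
    print(sorted(methods), len(ids))
```

Every hit lands in exactly one region, so the region sizes sum to the size of the union of all hit lists.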


Score Distribution

Score distribution of the sequence search methods used

Identity Distribution

Identity distribution of the sequence search methods used


HSSP Recall

Identity distribution of the sequence search methods used


Multiple Sequence Alignments

Sequences selected from Sequence Searches

We selected the following sequences for the MSA task:

Multiple Alignments

For building the multiple alignments, the following 20 sequences were chosen:

ID Identity Description
90-99% Sequence Identity
197102934 93% Aspartoacylase (Pongo abelii)
296476730 92% Aspartoacylase (Bos taurus)
178056458 92% Aspartoacylase (Sus scrofa)
62510428 96% Aspartoacylase (Macaca fascicularis)
60-89% Sequence Identity
19354304 82% Aspartoacylase (Mus musculus)
13242314 81% Aspartoacylase (Rattus norvegicus)
166796542 67% LOC733935 protein [Xenopus (Silurana) tropicalis]
227437329 65% Aspartoacylase (Hypomesus transpacificus)
158534041 62.5% Aspartoacylase (Danio rerio)
2jxm_B 60% Cytochrome F (Prochlorothrix hollandica)
40-59% Sequence Identity
57526957 45% Aspartoacylase 2 (Rattus norvegicus)
89076336 40% Aspartoacylase (Photobacterium sp.)
18087825 41% Aspartoacylase 2 (Homo sapiens)
21309896 42% Hepatitis C virus core-binding protein
308321821 42% Aspartoacylase 2A (Ictalurus furcatus)
20-39% Sequence Identity
88807290 34% Aspartoacylase (Synechococcus sp.)
158339037 37% Aspartoacylase (Acaryochloris marina)
186683012 37% Aspartoacylase (Nostoc punctiforme)
308050175 37% Aspartoacylase (Ferrimonas balearica)
260771522 34% Aspartoacylase (Vibrio furnissii)

Neither Cytochrome F nor the Hepatitis C virus core-binding protein fits well in this collection; however, we could not find more suitable sequences in our search results and therefore decided to include them to see what would happen. As it turns out, Cytochrome F affects the MSAs rather strongly, whereas the Hepatitis C virus core-binding protein has little impact and in fact shows a surprisingly strong similarity to the aspartoacylase sequences.


ClustalW

Command: clustalw selected.fasta


Cobalt

We had to install cobalt from the web; this being done, we ran it using the following command:
cobalt -i selected.fasta -norps T


Muscle

Command: muscle -in selected.fasta -out msa_muscle.aln


T-Coffee (default parameters)

Command: t_coffee selected.fasta -outfile msa_tcoffee.aln

T-Coffee (3D)

The corresponding mode for this flavor is expresso; it was already preinstalled on the VM used for the practical.

Command: t_coffee selected.fasta -outfile msa_tcoffee_3d.aln -mode expresso

Multiple Sequence Alignments

These are the MSAs that were created in this sub-task; the images were created using JalView.

ClustalW MSA
T-Coffee MSA
Cobalt MSA
Muscle MSA
Muscle MSA

Number of conserved columns

Method Number of conserved columns Comment
Cobalt 7 Can be increased to 31 by ignoring the Cytochrome F sequence
ClustalW 3 Can be increased to 41 by ignoring Cytochrome F and Hypomesus transpacificus Aspartoacylase
Muscle 2 Can be increased to 42 by ignoring Cytochrome F
T-Coffee 6 Can be increased to 89 by ignoring Cytochrome F
3D-Coffee 1 Can be increased to 45 by ignoring Cytochrome F

The effect of Cytochrome F came as no surprise; what is surprising in this context is that we could not (at least without recalculating the alignments) find such effects with Hepatitis C.
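Counting conserved columns with and without an outlier sequence can be sketched as follows. The sketch assumes that a column counts as conserved only if all considered sequences carry the same residue and none has a gap; this definition, the function name `conserved_columns` and the toy alignment are assumptions made for illustration.

```python
def conserved_columns(alignment, ignore=()):
    """Count columns in which all non-ignored sequences share one residue.

    `alignment` maps sequence names to aligned strings of equal length;
    columns containing a gap ('-') are never counted as conserved.
    """
    rows = [seq for name, seq in alignment.items() if name not in ignore]
    count = 0
    for column in zip(*rows):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            count += 1
    return count


# Invented toy alignment with one divergent sequence.
toy = {
    "seq1": "MKT-AL",
    "seq2": "MKTGAL",
    "outlier": "MR--GL",
}
print(conserved_columns(toy))                      # all three sequences
print(conserved_columns(toy, ignore={"outlier"}))  # outlier left out
```

Note that this only masks the ignored sequence in an existing alignment; it does not recalculate the alignment without it.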

Conservation of functionally important residues

The putative glycosylating residue is the aspartic acid at position 117. We checked two residues to the left and right for mutations. In three sequences (88807290, 158339037 and 186683012), the functional residue was replaced with valine. Since all of them had sequence identities of only 34-37%, this is not especially surprising.
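Inspecting a window around a functional residue requires translating the ungapped sequence position into an alignment column first. A minimal sketch, assuming the alignment is available as a mapping from names to aligned strings; the function names and the toy sequences are hypothetical.

```python
def alignment_column(aligned_ref, position):
    """Return the 0-based alignment column of the 1-based ungapped `position`."""
    seen = 0
    for column, char in enumerate(aligned_ref):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError("position lies beyond the end of the reference sequence")


def residue_window(alignment, ref_name, position, flank=2):
    """Cut out the columns around `position` of the reference for every sequence."""
    col = alignment_column(alignment[ref_name], position)
    start = max(0, col - flank)
    return {name: seq[start:col + flank + 1] for name, seq in alignment.items()}


# Invented toy alignment; position 3 of "ref" is the D in column 3.
toy_alignment = {"ref": "MK-DAL", "other": "MKTVAL"}
print(residue_window(toy_alignment, "ref", 3, flank=1))
```

For the check described above, the call would use the reference sequence's name, position 117 and the default flank of 2.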

In 2JXM, the functional site was found to be almost completely destroyed. This is interesting since 2JXM originates from the alga Prochlorothrix hollandica, whereas some of the other proteins are actually of bacterial origin. We assume that 2JXM is not a true homolog and is therefore out of place in these alignments.

Number of Gaps

Sequence ClustalW T-Coffee 3D-Coffee Muscle Cobalt
197102934|ref|NP_001126392. 8 85 102 49 15
62510428|sp|Q60HH2.1|ACY2_M 8 106 83 26 15
296476730|gb|DAA18845.1| 8 85 101 33 15
178056458|ref|NP_001116549. 8 85 102 27 15
19354304|gb|AAH24934.1| 9 86 102 32 16
13242314|ref|NP_077375.1| 7 84 100 29 14
166796542|gb|AAI59076.1| 9 85 102 27 16
227437329|gb|ACP30427.1| 6 163 72 29 11
158534041|ref|NP_001103573. 8 83 98 28 13
57526957|ref|NP_001009603.1 8 149 109 28 29
21309896|gb|AAM46090.1|AF37 9 79 100 29 15
18087825|ref|NP_542389.1| 8 103 100 21 26
308321821|gb|ADO28053.1| 9 79 96 28 15
89076336|ref|ZP_01162673.1| 12 80 103 31 16
260771522|ref|ZP_05880446.1 12 84 90 31 14
308050175|ref|YP_003913741. 12 91 94 29 26
158339037|ref|YP_001520214. 14 103 95 30 26
186683012|ref|YP_001866208. 12 104 95 30 26
88807290|ref|ZP_01122802.1| 11 92 94 25 17
PDBID|CHAIN|SEQUENCE 24 97 99 30 26
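Per-sequence gap counts like those in the table can be obtained with a few lines of Python. The sketch below assumes that the alignment has been exported in aligned FASTA format and that a "gap" means a single gap character rather than a gap opening; both are assumptions, and the function names are hypothetical.

```python
def read_aligned_fasta(lines):
    """Parse aligned FASTA lines into a dict mapping name -> aligned sequence."""
    alignment = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = line[1:].split()[0]
            alignment[name] = ""
        elif name is not None:
            alignment[name] += line
    return alignment


def gap_counts(alignment):
    """Count the gap characters ('-') in every aligned sequence."""
    return {name: seq.count("-") for name, seq in alignment.items()}


# Invented two-sequence example.
toy_lines = [">seq1 first", "MK--TA", "L-", ">seq2", "MKGGTA", "LV"]
print(gap_counts(read_aligned_fasta(toy_lines)))
```

Passing an open file handle instead of a list of lines works the same way, since the parser only iterates over lines.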

Gaps in Secondary Structure Elements

...