Metachromatic leukodystrophy reference aminoacids
Sequence
>sp|P15289|ARSA_HUMAN Arylsulfatase A OS=Homo sapiens GN=ARSA PE=1 SV=3
MGAPRSLLLALAAGLAVARPPNIVLIFADDLGYGDLGCYGHPSSTTPNLDQLAAGGLRFT
DFYVPVSLCTPSRAALLTGRLPVRMGMYPGVLVPSSRGGLPLEEVTVAEVLAARGYLTGM
AGKWHLGVGPEGAFLPPHQGFHRFLGIPYSHDQGPCQNLTCFPPATPCDGGCDQGLVPIP
LLANLSVEAQPPWLPGLEARYMAFAHDLMADAQRQDRPFFLYYASHHTHYPQFSGQSFAE
RSGRGPFGDSLMELDAAVGTLMTAIGDLGLLEETLVIFTADNGPETMRMSRGGCSGLLRC
GKGTTYEGGVREPALAFWPGHIAPGVTHELASSLDLLPTLAALAGAPLPNVTLDGFDLSP
LLLGTGKSPRQSLFFYPSYPDEVRGVFAVRTGKYKAHFFTQGSAHSDTTADPACHASSSL
TAHEPPLLYDLSKDPGENYNLLGGVAGATPEVLQALKQLQLLKAQLDAAVTFGPSQVARG
EDPALQICCHPGCTPRPACCHCPDPHA
Source
Database Searches
FASTA, BLAST and PSI-BLAST were run against the non-redundant database (NR). HHsearch was run through the web interface<ref>http://toolkit.lmb.uni-muenchen.de/hhpred</ref> against the PDB and InterPro databases. The following parameter settings were used:
- BLAST:
blastall -p blastp -i refSeq.fasta -d /data/blast/nr/nr > blastp
with refSeq.fasta being the file containing the reference sequence and blastp the output file.
- PSI-BLAST:
blastpgp -i refSeq.fasta -d /data/blast/nr/nr -e "e-value" -j "#iterations" > psiblast_"e-value"_"#iterations"
- PSI-BLAST was run with the following parameter settings:
- e-value cutoff 0.005, 3 iterations (Psi-blast1)
- e-value cutoff 0.005, 5 iterations (Psi-blast2)
- e-value cutoff 10E-6, 3 iterations (Psi-blast3)
- e-value cutoff 10E-6, 5 iterations (Psi-blast4)
- HHsearch: We used the online version of HHpred <ref>http://toolkit.lmb.uni-muenchen.de/hhpred</ref> with default parameters. One search was performed against PDB and one against InterPro.
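The four PSI-BLAST runs differ only in the e-value and the iteration count, so the command lines can be generated rather than typed. The following is a minimal sketch, assuming the file names and database path quoted above; the function name `psiblast_commands` is hypothetical, and the e-value strings are copied literally from the text without checking how blastpgp interprets them.

```python
from itertools import product


def psiblast_commands(query="refSeq.fasta", db="/data/blast/nr/nr",
                      evalues=("0.005", "10E-6"), iterations=(3, 5)):
    """Build one blastpgp command line per (e-value, iterations) pair.

    The command template mirrors the one given in the text; nothing is
    executed here, the strings are only assembled.
    """
    commands = []
    for evalue, n_iter in product(evalues, iterations):
        outfile = f"psiblast_{evalue}_{n_iter}"
        commands.append(
            f"blastpgp -i {query} -d {db} -e {evalue} -j {n_iter} > {outfile}"
        )
    return commands


# Order matches Psi-blast1 to Psi-blast4 in the list above.
for command in psiblast_commands():
    print(command)
```

Printing the commands first makes it easy to check them before passing them to a shell.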
Alignment results
We wrote a Perl script to parse the output files of the individual programs and extract the identifier, the alignment score and the percentage of identical residues for each alignment.
Mapping of identifier
The non-redundant database contains entries from various databases, including RefSeq, PDB, PIR, PRF, GenBank and Swiss-Prot. In order to compare the results of the NR database searches with those of the HHpred searches, the IDs have to be mapped onto each other. Furthermore, the entries in HSSP, which is used later to benchmark the alignment results, contain only references to the UniProtKB accession number (ACCNUM). To overcome this problem we downloaded an ID mapping table from PIR<ref>http://pir.georgetown.edu/pirwww/search/idmapping.shtml</ref>. This table was used, together with some short Perl scripts, to map IDs between the databases and compare the results.
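The mapping step can be illustrated with a short sketch. It is written in Python rather than Perl and is not the script used here. It assumes a tab-separated table in which one column holds the source ID and another the UniProtKB accession; the column positions, the function names `load_mapping` and `map_ids`, and the sample identifiers are all hypothetical.

```python
import csv
import io


def load_mapping(handle, key_col, value_col):
    """Read a tab-separated mapping table into {source_id: {target_ids}}."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) <= max(key_col, value_col):
            continue  # skip short or malformed lines
        key, value = row[key_col].strip(), row[value_col].strip()
        if key and value:
            mapping.setdefault(key, set()).add(value)
    return mapping


def map_ids(ids, mapping):
    """Translate ids; return the mapped targets and the ids without a mapping."""
    mapped, unmapped = set(), []
    for identifier in ids:
        targets = mapping.get(identifier)
        if targets:
            mapped |= targets
        else:
            unmapped.append(identifier)
    return mapped, unmapped


# Made-up example table: column 0 = source ID, column 1 = accession.
table = "111\tP00001\n222\tP00002\n222\tP00003\n"
mapping = load_mapping(io.StringIO(table), key_col=0, value_col=1)
mapped, unmapped = map_ids(["111", "222", "333"], mapping)
print(sorted(mapped), unmapped)
```

Keeping a set of targets per source ID accounts for one ID mapping to several accessions, and the list of unmapped IDs shows how many hits are lost in the translation.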
Summary of database searches
In this section, we give a short summary of the search results of the individual programs and compare them to each other.
Comparison of the methods
- FASTA yielded the highest number of hits, with 4733 alignments.
- BLAST produced 252 alignments.
- PSI-BLAST
- Using an E-value cutoff of 0.005, PSI-BLAST produced 756 alignments for 3 iterations and 1257 for 5 iterations.
- Using an E-value cutoff of 10E-6, PSI-BLAST produced 756 alignments for 3 iterations and 1257 for 5 iterations.
- HHsearch produced 33 alignments for the search against PDB and 74 alignments for the search against InterPro.
FASTA shows the highest number of alignments, probably because no E-value cutoff was applied. In contrast, HHsearch returned very few alignments. This can be attributed to the different databases that were searched: InterPro and the PDB simply do not contain as many homologous sequences as the NR database. This is also supported by the benchmark against HSSP (see the HSSP section below). Another interesting observation is that, for our parameter settings, the PSI-BLAST results depended only on the number of iterations. For a given number of iterations, both E-value cutoffs yielded the same aligned target sequences from the database, apart from a few isolated exceptions.
Overlap between methods
The results of the four different PSI-BLAST runs show the highest overlap with one another. The additional iterations find more related sequences and therefore yield a higher number of alignments. Interestingly, almost all BLAST hits are also found in the PSI-BLAST and FASTA results. The overlaps between the searches show that the number of hits depends strongly on the database. hhpred1 does not overlap well with any of the other methods, but hhpred2 shares a significant part of its results with PSI-BLAST.
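Once all hits are mapped to a common identifier, the overlap between two methods is a set intersection. The sketch below shows one way to compute it for every pair of methods; the function name `pairwise_overlap`, the use of the Jaccard index as a summary value and the toy identifiers are assumptions made for illustration, not part of the original analysis.

```python
from itertools import combinations


def pairwise_overlap(hits_by_method):
    """Return {(a, b): (shared_count, jaccard)} for every pair of methods."""
    result = {}
    for a, b in combinations(sorted(hits_by_method), 2):
        set_a, set_b = hits_by_method[a], hits_by_method[b]
        shared = len(set_a & set_b)
        union = len(set_a | set_b)
        result[(a, b)] = (shared, shared / union if union else 0.0)
    return result


# Toy hit lists with placeholder identifiers, not real search results.
toy_hits = {
    "BLAST": {"id1", "id2", "id3"},
    "PSI-BLAST": {"id1", "id2", "id3", "id4", "id5"},
    "FASTA": {"id1", "id2", "id3", "id4", "id6", "id7"},
}
for pair, (shared, jaccard) in pairwise_overlap(toy_hits).items():
    print(pair, shared, round(jaccard, 2))
```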
Scores and identity of aligned residues
In general, the identity of aligned residues is very low, i.e. only a few highly similar hits are detected by the methods. These high-scoring matches mainly represent homologs of Arylsulfatase A, as can be seen in the table of sequences chosen for the multiple sequence alignment. The majority of hits in the identity ranges 99-90% and 89-60% are annotated as Arylsulfatases or Arylsulfatase A. Lowering the identity cutoff yields an increasing number of Sulfatases, which might be more distantly related to our query sequence.
The alignment scores show a very similar distribution. The lowest scores are produced by FASTA, which reflects the low sensitivity of the method for the detection of true homologs. Many of these matches might be false positives, i.e. they are not evolutionarily related to the query sequence. The BLAST scores are slightly higher than those of FASTA. The highest scores come from the PSI-BLAST searches, and they increase when the number of iterations is raised.
HSSP
HSSP is a database that contains, for each protein in the PDB, information about homologous protein sequences that are very likely to have a structure similar to that of the query. HSSP uses a position-weighted dynamic programming method for sequence profile alignment of PDB entries against the Swiss-Prot database. The database can therefore also be used to infer homology-based secondary structure predictions.
We now use the homology information in HSSP to benchmark our alignment results. The table below lists the recall and precision with which our alignments recover the HSSP homologs of our query. FASTA shows the highest recall (92 %), which could lead to the misleading interpretation that this method performs best. The high value can be ascribed to the fact that FASTA reports a very large number of alignments, many of them false positives, and selecting the true homologs from these results without prior knowledge is quite challenging. This is reflected in the precision of the method, which is only 23 %, i.e. only 23 % of the FASTA hits are true homologs according to the HSSP annotation.
BLAST performs a little better than FASTA, and PSI-BLAST performs best. The latter shows a recall of around 20 %, which is still quite low, but a precision of 62 to 65 %, i.e. up to 65 % of the hits found by PSI-BLAST are true homologs.
Because hhpred was searched against other databases, which contain far fewer entries than the NR database used for the other methods, it is not directly comparable to the other results. It performs much worse when its hits are compared to all HSSP homologs. But if we consider only the HSSP homologs that are contained in the PDB, hhpred performs best. The search against PDB even recalls all 12 HSSP "pdb-homologs".
Method | Recall (GI) | Recall (pdb) | Precision (GI) |
---|---|---|---|
FASTA | 0.92 | 0.67 | 0.23 |
BLAST | 0.11 | 0.42 | 0.54 |
Psi-blast1 | 0.21 | 0.42 | 0.65 |
Psi-blast2 | 0.23 | 0.5 | 0.62 |
Psi-blast3 | 0.21 | 0.42 | 0.65 |
Psi-blast4 | 0.23 | 0.5 | 0.62 |
hhpred (pdb) | 0.01 | 1 | 0.11 |
hhpred (interpro) | 0.01 | 0.92 | 0.12 |
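Recall and precision in the table follow the usual definitions: recall is the fraction of HSSP homologs that a method reports, precision is the fraction of reported hits that are HSSP homologs. A minimal sketch with placeholder identifiers (the function name `recall_precision` is hypothetical, and both ID sets are assumed to be already mapped to the same identifier type):

```python
def recall_precision(hits, homologs):
    """Compute (recall, precision) of a hit list against a reference set."""
    hits, homologs = set(hits), set(homologs)
    true_positives = len(hits & homologs)
    recall = true_positives / len(homologs) if homologs else 0.0
    precision = true_positives / len(hits) if hits else 0.0
    return recall, precision


# Toy data: 3 reference homologs, 4 reported hits, 2 of them correct.
reference = {"a", "b", "e"}
reported = {"a", "b", "c", "d"}
recall, precision = recall_precision(reported, reference)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

A method that reports very many hits, as FASTA does here, can reach a high recall while its precision stays low, which is why both values are listed.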
Multiple Alignments
The following sequences were chosen for building the multiple alignments:
SeqIdentifier | Seq Identity | source | Protein function |
---|---|---|---|
99-90% Sequence Identity | |||
gi109094666 | 96.6% | Macaca mulatta | Arylsulfatase A isoform 2 |
gi281339526 | 90.8% | Ailuropoda melanoleuca | unknown |
gi47522624 | 91.5% | Sus scrofa | Arylsulfatase A |
gi149759319 | 89.7% | Equus caballus | Arylsulfatase A |
gi301763795 | 90.8% | Ailuropoda melanoleuca | Arylsulfatase A |
89-60% Sequence Identity | |||
gi115497982 | 87.3% | Bos taurus | Arylsulfatase A precursor |
gi118081865 | 63.4% | Gallus gallus | Arylsulfatase A |
gi126339031 | 74.3% | Monodelphis domestica | Arylsulfatase A |
gi114326188 | 88.4% | Canis lupus familiaris | Arylsulfatase A |
gi164519052 | 85.6% | Rattus norvegicus | Arylsulfatase A |
59-40% Sequence Identity | |||
gi223936859 | 43.9% | Bacterium Ellin514 | Sulfatase |
1P49 | 39.0% | Homo sapiens | Steryl-Sulfatase |
gi120537984 | 56.0% | Xenopus laevis | unknown |
gi301625378 | 55.5% | Xenopus (Silurana) tropicalis | Arylsulfatase A |
gi86142609 | 40.0% | Leeuwenhoekiella blandensis MED217 | Arylsulfatase A |
39-20% Sequence Identity | |||
1FSU | 28.0% | Homo sapiens | N-Acetylgalactosamine-4-Sulfatase |
2VQR | 20.0% | Rhizobium leguminosarum | Sulfatase |
3ED4 | 32.0% | Escherichia coli | Arylsulfatase |
gi113971721 | 29.0% | Shewanella sp. MR-4 | Sulfatase |
gi310635680 | 36.0% | Planctomyces brasiliensis DSM 5305 | Sulfatase |
Sequences with <20% or >99% sequence identity were ignored, and 5 samples were randomly picked from each of the other ranges, so 20 sequences were available for the multiple alignments. Unfortunately, no sequences with known 3D structure were found in the 99-90% range, so only sequences without known structure were used there. No PDB structure was found in the 59-40% range either, so we included a sequence with 39% sequence identity to have at least one PDB structure in that group.
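The binning and random selection described above can be sketched as follows. The exact bin boundaries (for example where a hit with 89.7% identity belongs) are not stated in the text, so the half-open boundaries used here are an assumption, as are the function names `bin_label` and `sample_per_bin` and the placeholder hits.

```python
import random

# Lower bounds of the identity ranges, checked from highest to lowest.
# Assumed boundaries: [90, 99], [60, 90), [40, 60), [20, 40).
BIN_LOWER_BOUNDS = (("99-90%", 90.0), ("89-60%", 60.0),
                    ("59-40%", 40.0), ("39-20%", 20.0))


def bin_label(identity):
    """Return the identity range for a percentage, or None if it is ignored."""
    if identity > 99.0 or identity < 20.0:
        return None
    for label, lower in BIN_LOWER_BOUNDS:
        if identity >= lower:
            return label
    return None


def sample_per_bin(hits, per_bin=5, seed=0):
    """Randomly pick up to per_bin sequence IDs from each identity range."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    binned = {}
    for seq_id, identity in hits:
        label = bin_label(identity)
        if label is not None:
            binned.setdefault(label, []).append(seq_id)
    return {label: rng.sample(ids, min(per_bin, len(ids)))
            for label, ids in binned.items()}


# Placeholder hits: (identifier, percent identity).
toy_hits = [(f"seq{i}", float(identity))
            for i, identity in enumerate(range(15, 101, 5))]
for label, chosen in sample_per_bin(toy_hits, per_bin=2).items():
    print(label, chosen)
```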
Cobalt
Command
time /home/student/Downloads/ncbi-cobalt-2.0.1/cobalt -i MSA_seqs.fasta -norps T > alignments/MSA_cobalt.aln
time | |
real | 0m6.691s |
user | 0m3.550s |
sys | 0m0.240s |
ClustalW
Command
time clustalw
time | |
real | 3m28.533s |
user | 0m9.500s |
sys | 0m0.110s |
Muscle
Command
time muscle -in MSA_seqs.fasta -out MSA_muscle.aln
time | |
real | 0m4.236s |
user | 0m2.550s |
sys | 0m0.090s |
T-Coffee
standard parameters
Command
time t_coffee MSA_seqs.fasta
time | |
real | 1m20.451s |
user | 0m57.410s |
sys | 0m1.580s |
3d-coffee
Command
time t_coffee MSA_seqs.fasta -mode expresso -pdb_type dn
time | |
real | 21m38.094s |
user | 11m27.690s |
sys | 1m32.880s |
Gaps and conservation
 | Cobalt | Muscle | ClustalW | T-Coffee |
#gaps | 415 | 411 | | |
#conserved columns | 24 | 26 | | |
length(alignment) | 715 | 753 | | |
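The quantities in this table can be derived from the aligned sequences alone. The text does not say whether "#gaps" counts gap characters or columns that contain a gap, so the sketch below reports both; it also assumes that a conserved column is a gap-free column in which all sequences carry the same residue. The function name `alignment_stats` and the toy alignment are hypothetical.

```python
def alignment_stats(rows):
    """Summarize an alignment given as equally long strings with '-' as gap."""
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    columns = list(zip(*rows))
    return {
        "length": lengths.pop(),
        # total number of gap characters over all sequences
        "gap_characters": sum(row.count("-") for row in rows),
        # number of columns in which at least one sequence has a gap
        "gapped_columns": sum(1 for col in columns if "-" in col),
        # gap-free columns with one identical residue in every sequence
        "conserved_columns": sum(
            1 for col in columns if "-" not in col and len(set(col)) == 1
        ),
    }


# Toy alignment, not taken from the real output files.
toy_alignment = ["MGA-PR", "MGAAPR", "MG--PK"]
print(alignment_stats(toy_alignment))
```

Reading the aligned rows out of the individual output formats is left out here, since each program writes its own format.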
References
<references />