Difference between revisions of "Metachromatic leukodystrophy reference aminoacids"

Revision as of 17:54, 28 August 2011

Sequence

>sp|P15289|ARSA_HUMAN Arylsulfatase A OS=Homo sapiens GN=ARSA PE=1 SV=3 MGAPRSLLLALAAGLAVARPPNIVLIFADDLGYGDLGCYGHPSSTTPNLDQLAAGGLRFT DFYVPVSLCTPSRAALLTGRLPVRMGMYPGVLVPSSRGGLPLEEVTVAEVLAARGYLTGM AGKWHLGVGPEGAFLPPHQGFHRFLGIPYSHDQGPCQNLTCFPPATPCDGGCDQGLVPIP LLANLSVEAQPPWLPGLEARYMAFAHDLMADAQRQDRPFFLYYASHHTHYPQFSGQSFAE RSGRGPFGDSLMELDAAVGTLMTAIGDLGLLEETLVIFTADNGPETMRMSRGGCSGLLRC GKGTTYEGGVREPALAFWPGHIAPGVTHELASSLDLLPTLAALAGAPLPNVTLDGFDLSP LLLGTGKSPRQSLFFYPSYPDEVRGVFAVRTGKYKAHFFTQGSAHSDTTADPACHASSSL TAHEPPLLYDLSKDPGENYNLLGGVAGATPEVLQALKQLQLLKAQLDAAVTFGPSQVARG EDPALQICCHPGCTPRPACCHCPDPHA

Source

Uniprot

Database Searches

FASTA, BLAST and PSI-BLAST were run against the non-redundant database (NR). HHsearch was run through the web interface<ref>http://toolkit.lmb.uni-muenchen.de/hhpred</ref> against the PDB and Interpro database. The following parameter settings were used:

BLAST: blastall -p blastp -i refSeq.fasta -d /data/blast/nr/nr > blastp with refSeq.fasta being the file containing the reference sequence.
PSI-BLAST was run with the following parameter settings:
- e-value cutoff 0.005, 3 iterations (Psi-blast1)
- e-value cutoff 0.005, 5 iterations (Psi-blast2)
- e-value cutoff 10E-6, 3 iterations (Psi-blast3)
- e-value cutoff 10E-6, 5 iterations (Psi-blast4)
- blastpgp -i refSeq.fasta -d /data/blast/nr/nr -e"e-value" -j "#iterations" > psiblast_"e-value"_"#iterations"
HHsearch: We used the online version of hhPred <ref>http://toolkit.lmb.uni-muenchen.de/hhpred</ref> with default parameters. One search was performed against PDB and one against Interpro.

Alignment results

We wrote a perl script to parse the output files of the individual programs and extracted identifier, alignment score and the percentage of identical residues within the alignment.

Mapping of identifier

The non-redundant database contains entries from various databases, including RefSeq, PDB, PIR, PRF, GenBank and Swiss-Prot. In order to compare results of NR database searches with the results of the HHpred searches, a mapping of the IDs is necessary. Furthermore, the entries in HSSP - which is used later to benchmark the alignment results - contains only references to the UniProtKB accession number (ACCNUM). To overcome this problem we downloaded a mapping table between the IDs from <ref>http://pir.georgetown.edu/pirwww/search/idmapping.shtml</ref>. This table was used - together with some short perl scripts - to map IDs between the databases and compare the results.

Summary of database searches

In this section, we give a short summary description of the search results of the individual programs and the compare them to each other.

Comparison of the methods

FASTA yielded with 4733 alignments the highest number of hits.
BLAST produced 252 alignments.
PSI-BLAST
- Using an E-value cutoff of 0.005, PSI-BLAST produced 756 alignments for 3 iterations and 1257 for 5 iterations.
- Using an E-value cutoff of 10E-6, PSI-BLAST produced 756 alignments for 3 iterations and 1257 for 5 iterations.
HHsearch produced 33 alignments for the search against PDB and 74 alignments for search against Interpro.

FASTA shows the highest number of alignments, probably due to the fact, that no e-value cutoff was chosen. Contrary, hhsearch has very few alignments. This could be ascribed to the fact that completely different databases were used for the alignments and Interpro and pdb just did not have as much homolguous sequences as the nr database. This is also supported by the benchmark with HSSP (see next section). Another interesting fact is that the results of PSI-BLAST depended for our parameter setting only on the number of iterations. Ragarding the results for the number of iterations, both e-value cutoffs yielded - except of some single exceptions the same aligned target sequences from the database - the same hits.

Overlap between methods

The number of shared target sequences between two methods is shown in the upper panel (self overlaps not shown). The lower panel depicts how many percent of the aligned target sequences of a given method (x-axis) are shared with the other methods.

The results of the four different PSI-BLAST runs show the highest overlap. The additional iterations find more related sequences and yield a higher number of alignments. Interestingly, almost all BLAST hits overlap with the PSI-BLAST and FASTA results. The overlaps of the searches show that the number of hits highly depends on the database. hhpred1 does not have a good overlap with any of the other methods, but hhpred2 shares a significant part of its results with PSI-BLAST.

Scores and identity of aligned residues

In general, the identity of aligned residues is very low, i.e. only very few highly similar hits are detected by the methods. These high scoring matches mainly represent homologs of Arylsulfatase A. This can be seen in the table of sequences which were chosen for the multiple sequence alignment. The majority of hits from an identity range of 99-90% and 89-60% are annotated as Arylsulfatases or Arylsulfatase A. Lowering the identity cutoff yields an increasing number of Sulfatases, which might be more distantly related to our query sequence.
The alignment scores show a very similar distribution. Lowest scores are produced by FASTA, which reflects the low sensitivity of the method for the detection of true homologs. A lot of these matches might be classified as false positive, i.e. they are not evolutionary related to the query sequence. The BLAST scores are a bit elevated compared to FASTA. The highest scores are derived from the PSI-BLAST searches and an increase can be seen when the number of iterations is raised.

The density of alignment scores and percentage of identical residues within the alignments are plotted

HSSP

HSSP is a database which contains information about homologuous protein sequences for each protein in the PDB database, which are very likely to have a similar structure as the query. HSSP uses a position-weighted dynamic programming method for sequence profile alignment of PDB entries against the Swissprot database. Therefore the database can also be used to infer homology based secondary structure predictions.
Now we want to use the homology information in HSSP to benchmark our alignment results. The table below depicts the recall and precision of homologs in HSSP of our query by our alignments. FASTA shows the highest recall (92 %), which could lead to the misleading interpretation, that this method performs best. The high value can be ascribed to the fact, that FASTA reports a very high number of alignments - with a lot of false positive - and a selection of true homologs from this results without prior knowledge is quite challenging. This is reflected by the precision of the method, which is only 23 %, i.e. only 23 % of the hits in fasta are true homologs regarding the HSSP annotation.
BLAST perfoms a little bit better than FASTA and PSI-BLAST perfoms best. The latter shows a recall of around 20 % - which is still quite low - but a precision of 65 %, i.e. 64 % of the hits using the PSI-BLAST method are true homologs.
As hhpred was searched against other databases, which contained much less entries than the NR-database used for the other methods, it is not directly comparable to the other results. It performs much worse when hits are compared to all HSSP homologs. But if we only take HSSP homologs, that are contained in the PDB hhpred perfoms best. It even recalls all 12 HSSP "pdb-homologs".

Method	Recall (GI)	Recall (pdb)	Precision (GI)
FASTA	0.92	0.67	0.23
BLAST	0.11	0.42	0.54
Psi-blast1	0.21	0.42	0.65
Psi-blast2	0.23	0.5	0.62
Psi-blast3	0.21	0.42	0.65
Psi-blast4	0.23	0.5	0.62
hhpred (pdb)	0.01	1	0.11
hhpred (interpro)	0.01	0.92	0.12

Multiple Alignments

For building the multiple Alignments the following sequences were chosen:

SeqIdentifier	Seq Identity	source	Protein function
99-90% Sequence Identity
gi109094666	96.6%	Macaca mulatta	Arylsulfatase A isoform 2
gi281339526	90.8%	Ailuropoda melanoleuca	unknown
gi47522624	91.5%	Sus scrofa	Arylsulfatase A
gi149759319	89.7%	Equus caballus	Arylsulfatase A
gi301763795	90.8%	Ailuropoda melanoleuca	Arylsulfatase A
89-60% Sequence Identity
115497982	87.3%	Bos taurus	Arylsulfatase A precursor
gi118081865	63.4%	Gallus gallus	Arylsulfatase A
gi126339031	74.3%	Monodelphis domestica	Arylsulfatase A
gi114326188	88.4%	Canis lupus familiaris	Arylsulfatase A
gi164519052	85.6%	Rattus norvegicus	Arylsulfatase A
59-40% Sequence Identity
gi223936859	43.9%	Bacterium Ellin514	Sulfatase
1P49	39.0%	Homo Sapiens	Steryl-Sulfatase
gi120537984	56.0%	Xenopus laevis	unknown
gi301625378	55.5%	Xenopus (Silurana) tropicalis	Arylsulfatase A
gi86142609	40.0%	Leeuwenhoekiella blandensis MED217	Arylsulfatase A
39-20% Sequence Identity
1FSU	28.0%	Homo Sapiens	N-Acetylgalactosamine-4-Sulfatase
2VQR	20.0%	Rhizobium leguminosarum	Sulfatase
3ED4	32.0%	Escherichia coli	Arylsulfatase
gi113971721	29.0%	Shewanella sp. MR-4	Sulfatase
gi310635680	36.0%	Planctomyces brasiliensis DSM 5305	Sulfatase

The sequences with <20% and >99% sequence identitiy were ignored and 5 samples were randomly picked from the other ranges. So 20 sequences were available for the multiple alignments. Unfortunately no sequences in the range between 99-90% with known 3D-structure were found, so only sequences without known structure were used here. For the range between 59-40% also no pdb-structure was found, so we used a sequence with 39% sequence identity to have at least one pdb-structure.

Cobalt

Cobalt is a progressive Multiple Alignment Tool which uses a collection of pairwise constraints which are derived from different databases. To reduce computation time, sequence clusters are formed and then only one sequence out of the cluster is used for finding conserverd domains and motif matches. <ref name="papadopoulos2007">Papadopoulos JS and Agarwala R (2007) COBALT: constraint-based alignment tool for multiple protein sequences, Bioinformatics 23:1073-79</ref>

Command

time /home/student/Downloads/ncbi-cobalt-2.0.1/cobalt -i MSA_seqs.fasta -norps T > alignments/MSA_cobalt.aln

	time
real	0m6.691s
user	0m3.550s
sys	0m0.240s

Gaps in secondary structure & JalView

View of Alignment built by Cobalt, shown in Jalview

	Position in reference
HELIX	39, 70, 210, 290, 291, 451
STRAND	29, 154, 190, 242, 244, 327, 387, 427
TURN	163, 320

ClustalW

ClustalW is a Multiple Alignment Tool that uses a guide tree to build a multiple sequence alignment. The guide tree is constructed using Neighbor-Joining. <ref name="larkin2007">Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG. (2007). Clustal W and Clustal X version 2.0. Bioinformatics, 23, 2947-2948. </ref>

Command

time clustalw

	time
real	3m28.533s
user	0m9.500s
sys	0m0.110s

Gaps in secondary structure & JalView

View of Alignment built by ClustalW, shown in Jalview

	Position in reference
HELIX	39, 70, 290
STRAND	190, 325
TURN	240, 321

Muscle

Muscle is a Multiple Alignment Tool that builds a guide tree by an approximation algorithm. Then the sequences are progressively aligned. The alignment is then refined using the calculated pairwise distances. As final step a tree-dependent refinement takes place where the tree is partitioned in two partitions and the profiles for the two subsets are again aligned. <ref name="edgar2004">Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput.Nucleic Acids Res. 32(5):1792-1797</ref>

Command

time muscle -in MSA_seqs.fasta -out MSA_muscle.aln

	time
real	0m4.236s
user	0m2.550s
sys	0m0.090s

Gaps in secondary structure & JalView

View of Alignment built by muscle, shown in Jalview

	Position in reference
HELIX	39, 70, 132, 139, 197, 246, 291
STRAND	122, 154, 181, 189, 243, 244, 327, 375, 387, 426
TURN	164, 320

T-Coffee

standard parameters

Command

time t_coffee MSA_seqs.fasta

	time
real	1m20.451s
user	0m57.410s
sys	0m1.580s

Gaps in secondary structure & JalView

View of Alignment built by T-Coffee with standard parameters, shown in Jalview

	Position in reference
HELIX	69, 131, 139, 197, 288, 289, 414, 453
STRAND	153, 160, 188, 189, 191, 327, 374, 426
TURN	236, 239, 241, 384

3d-coffee

Command

time t_coffee MSA_seqs.fasta -mode expresso -pdb_type dn

	time
real	21m38.094s
user	11m27.690s
sys	1m32.880s

Gaps in secondary structure & JalView

View of Alignment built by T-Coffee with 3D-coffee, shown in Jalview

	Position in reference
HELIX	69, 131, 137, 194, 205, 211, 246, 290, 450, 452, 453, 459, 466
STRAND	66, 183, 187, 189, 190, 231, 242, 317, 324, 325, 373, 387
TURN	241, 321, 496, 499

Summary

Gaps

For each method used in this section, the number of gapped columns (a column is defined as gapped, if it contains at least one gap), the number of conserved columns (all entries in the row have the same amino acid) and the alignment length were calculated.

	Cobalt	Muscle	ClustalW	T-Coffee	T-Coffee 3D
#gapped columns	415	411	286	523	600
#conserved columns	24	26	22	27	25
length(alignment)	715	753	668	862	896

ClustalW yields the most compact, i.e. shortest multiple sequence alignment (MSA). It exhibits the lowermost overal number of gaps and number of gaps within the secondary structure of the human Arylsulfatase A. The number of 100% conserved columns consequently decreases a bit in comparison to the other methods. Further on the JalView representation shows the most homogenuous alignment of conserved blocks.

Conservation of active sites

As defined before, we only call a Position conserved if has the same amino acid in all sequences.

	Cobalt	Muscle	ClustalW	T-Coffee	T-Coffee 3D
active site (Pos. 125)	20 / 21	1 / 21	20 / 21	20 / 21	20 / 21
Metal binding site (Pos. 29)	21 / 21	3 / 21	21 / 21	21 / 21	21 / 21
Metal binding site (Pos. 30)	21 / 21	1 / 21	20 / 21	20 / 21	20 / 21
Metal binding site (Pos. 69)	16 / 21	13 / 21	16 / 21	16 / 21	16 / 21
Metal binding site (Pos. 281)	21 / 21	1 / 21	21 / 21	21 / 21	21 / 21
Metal binding site (Pos. 282)	19 / 21	1 / 21	19 / 21	19 / 21	19 / 21
Binding site (Pos. 123)	20 / 21	1 / 21	20 / 21	20 / 21	20 / 21
Binding site (Pos. 150)	17 / 21	1 / 21	16 / 21	17 / 21	16 / 21
Binding site (Pos. 229)	20 / 21	1 / 21	21 / 21	21 / 21	21 / 21
Binding site (Pos. 302)	21 / 21	2 ( 21	21 / 21	21 / 21	20 / 21

As shown in the table above, by only looking at important sites, the alignments are very similar, except the muscle-alignment. So the important sites are really well-conserved also in more distant related proteins.

Computation Time

The computation time differed significantelly between the different programs. 3d-coffee took the most time (as expected) while Cobalt and Muscle where the fastest programs on our testset.

Plot of computation times by differenet MSA-programs on ARSA

References

@@ Line 212: / Line 212: @@
 === ClustalW ===
-ClustalW is a Multiple Alignment Tool that uses a guide tree to build a multiple sequence alignment.
+ClustalW is a Multiple Alignment Tool that uses a guide tree to build a multiple sequence alignment. The guide tree is constructed using Neighbor-Joining.
 <ref name="larkin2007">Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG.
 (2007). Clustal W and Clustal X version 2.0. Bioinformatics, 23, 2947-2948. </ref>

Difference between revisions of "Metachromatic leukodystrophy reference aminoacids"

Revision as of 17:54, 28 August 2011

Contents

Sequence

Source

Database Searches

Alignment results

Mapping of identifier

Summary of database searches

Comparison of the methods

Overlap between methods

Scores and identity of aligned residues

HSSP

Multiple Alignments

Cobalt

Command

Gaps in secondary structure & JalView

ClustalW

Command

Gaps in secondary structure & JalView

Muscle

Command

Gaps in secondary structure & JalView

T-Coffee

standard parameters

Command

Gaps in secondary structure & JalView

3d-coffee

Command

Gaps in secondary structure & JalView

Summary

Gaps

Conservation of active sites

Computation Time

References

Navigation menu

Search