Difference between revisions of "Sequence Alignments BCKDHA"

Revision as of 19:07, 23 May 2011

Sequence Alignments

Sequence searches

FASTA

../bin/fasta36 sequence.fasta database > FastaOutput.txt

BLAST

blastall -p blastp -d database -i sequence.fasta > BlastOutput.txt

PSIBLAST

blastpgp -i sequence.fasta -j iterations -h evalueCutoff -d database > PsiblastOutput.txt

HHSearch

hhsearch -i query -d database -o output.txt

database = /data/blast/nr/nr
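The four PSI-BLAST runs compared in the result statistics differ only in the number of iterations and the E-value cutoff. The following is a minimal sketch of how the corresponding command lines could be generated. It only builds the strings and does not run anything. The flags and the database path are taken from this page; the output file naming scheme is an assumption, and the cutoff is passed through as the literal text "10E-6" used in the sequence table.

```python
import shlex

DATABASE = "/data/blast/nr/nr"  # database path as given on this page

# Run labels with (iterations, E-value cutoff) as listed in the overlap section.
PSIBLAST_RUNS = {
    "PSI1": (3, "0.005"),
    "PSI2": (5, "0.005"),
    "PSI3": (3, "10E-6"),
    "PSI4": (5, "10E-6"),
}

def psiblast_command(label, query="sequence.fasta"):
    """Build the blastpgp command line for one of the four runs."""
    iterations, cutoff = PSIBLAST_RUNS[label]
    args = ["blastpgp", "-i", query, "-j", str(iterations),
            "-h", cutoff, "-d", DATABASE]
    # Assumption: one output file per run, named after the run label.
    return " ".join(shlex.quote(a) for a in args) + f" > PsiblastOutput_{label}.txt"

for run in sorted(PSIBLAST_RUNS):
    print(psiblast_command(run))
```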

Result Statistics

Overlap

Venn diagrams were drawn to illustrate how the sequences returned by the different sequence searches overlap.

Comparison of the results of the two PSI-BLAST runs with 3 iterations each. (PSI1 = PSI-BLAST run with 3 iterations, E-value cutoff 0.005; PSI3 = PSI-BLAST run with 3 iterations, E-value cutoff 10E-6)

Comparison of the results of the two PSI-BLAST runs with 5 iterations each. (PSI2 = PSI-BLAST run with 5 iterations, E-value cutoff 0.005; PSI4 = PSI-BLAST run with 5 iterations, E-value cutoff 10E-6)

The two PSI-BLAST runs with 3 iterations returned exactly the same sequences matching our query BCKDHA. The same is true for the two PSI-BLAST runs with 5 iterations. Their results were therefore combined for the following Venn diagram (created with [6]):

When BLAST and FASTA are included, the most striking observation is that FASTA found more than 2600 additional results that were not returned by any of the other search algorithms. This may be because the default FASTA search places no restriction on the E-value. BLAST returned only one hit that was not found in any of the PSI-BLAST searches, and this hit is also included in the FASTA results. All PSI-BLAST results were also detected by FASTA. PSI-BLAST with 5 iterations returned 6 more hits than PSI-BLAST with only 3 iterations. This may be due to the additional iterations, which keep refining the search profile. Overall, apart from the extra FASTA hits, the search algorithms returned nearly the same sequences matching our query.

HHSearch is the exception: it returned very few aligned sequences (10), and among them were sequences that were not found by any of the other algorithms. The corresponding Venn diagram is shown below.
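The numbers behind such Venn diagrams are plain set operations on the hit identifiers. The following is a minimal sketch, assuming the identifiers of each search have already been extracted into lists. The function name and the toy identifiers are hypothetical and are not the real search results.

```python
from itertools import combinations

def overlap_report(hits_by_method):
    """Return pairwise intersection sizes and the number of hits unique to each method."""
    sets = {name: set(ids) for name, ids in hits_by_method.items()}
    shared = {
        (a, b): len(sets[a] & sets[b])
        for a, b in combinations(sorted(sets), 2)
    }
    unique = {}
    for name, ids in sets.items():
        # Union of the hits of all other methods.
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = len(ids - others)
    return shared, unique

# Toy identifiers for illustration only.
example = {
    "BLAST": ["a", "b", "c"],
    "PSI-BLAST": ["a", "b"],
    "FASTA": ["a", "b", "c", "d", "e"],
}
shared, unique = overlap_report(example)
print(shared)
print(unique)
```

In the toy data FASTA contains every hit of the other two methods and has two hits of its own, which mirrors the pattern described above on a much smaller scale.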

Identity Distribution

Identity Distribution for the results from the Sequence Searches

As the two PSI-BLAST runs with 3 iterations returned the same hits, and likewise the two runs with 5 iterations, the identity distributions of each pair were pooled and drawn in the same colour. The identity distribution of the FASTA run is remarkable: it contains many more hits with low identity than the other runs. In total, FASTA returned almost 3000 hits (default parameter search), while each of the BLAST/PSI-BLAST runs returned no more than 300 hits. Many of the FASTA 'hits' therefore have low identity and a rather high E-value (see below), but FASTA still returned some good results.
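An identity distribution of this kind can be obtained by sorting the percent identities of all hits into fixed-width bins. The following is a minimal sketch, assuming the identities have already been parsed from the search output into a list of percentages. The function name and the 10% bin width are assumptions, and the bin width is expected to divide 100 evenly.

```python
def identity_histogram(identities, bin_width=10):
    """Count hits per identity bin; identities are percentages in [0, 100]."""
    n_bins = 100 // bin_width
    counts = [0] * n_bins
    for value in identities:
        # 100% identity is placed in the top bin instead of opening a new one.
        index = min(int(value // bin_width), n_bins - 1)
        counts[index] += 1
    return counts

# Toy values for illustration only.
print(identity_histogram([99, 95, 40, 38.1, 100, 5]))
```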

Evalue Distribution

E-value Distribution for the results from the Sequence Searches

The E-value distributions of the BLAST and PSI-BLAST results are quite similar. Only FASTA covers a much wider E-value range, which can be explained by the fact that FASTA returned about 10 times more hits, many of which have low identity and therefore a rather high E-value.
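Because E-values span many orders of magnitude, they are easier to compare when grouped by decade. The following is a minimal sketch, assuming the E-values are available as floats. The function name and the floor used for E-values reported as 0 are assumptions.

```python
import math
from collections import Counter

def evalue_decades(evalues, floor=1e-200):
    """Group E-values by order of magnitude, i.e. by floor(log10(E))."""
    decades = Counter()
    for e in evalues:
        # E-values reported as 0 are clamped to `floor` so the logarithm is defined.
        decades[math.floor(math.log10(max(e, floor)))] += 1
    return dict(sorted(decades.items()))

# Toy values for illustration only.
print(evalue_decades([3e-50, 4e-50, 5e-3, 2.0]))
```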

HSSP recall

The comparison with the HSSP alignment for our protein BCKDHA posed another challenge: HSSP alignment files use Swissprot/EMBL identifiers, which had to be mapped to GI numbers to be comparable to our sequence files. The mapping was performed using the ID Mapping tool provided by NCBI[7]. The 3350 Swissprot identifiers were mapped to almost 6000 GI numbers, but about 500 identifiers could not be mapped at all. The same problem arose when the mapping was done the other way round, converting all GI numbers from the BLAST and FASTA runs. The second mapping was chosen for the recall evaluation, and the corresponding plot can be seen below.

As expected from the large number of hits in the HSSP file, most of the related proteins listed there were not identified by the other alignment tools. The best overlap is observed for PSI-BLAST (5 iterations; the same is true for the 3-iteration runs and BLAST, as their outputs are nearly identical), where 90% of the converted IDs are also contained in HSSP. A large fraction of the FASTA hits is also covered by HSSP. However, as FASTA was run without any E-value restriction, it also returned many sequences with low identity, which are unlikely to have the same structure and are therefore not found in the structure-based HSSP alignment.
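Two different fractions are involved in this comparison: the share of HSSP entries that a search method recovers, and the share of a method's hits that also appear in HSSP (the 90% figure for PSI-BLAST is of the second kind). The following is a minimal sketch, assuming both ID lists have already been mapped to the same identifier type. The function name and the toy identifiers are hypothetical.

```python
def recall_and_coverage(method_ids, hssp_ids):
    """Return (recall, coverage) of a search method relative to the HSSP entries."""
    method, hssp = set(method_ids), set(hssp_ids)
    common = method & hssp
    # Recall: fraction of HSSP entries that the method also found.
    recall = len(common) / len(hssp) if hssp else 0.0
    # Coverage: fraction of the method's hits that are contained in HSSP.
    coverage = len(common) / len(method) if method else 0.0
    return recall, coverage

# Toy identifiers for illustration only.
print(recall_and_coverage(["a", "b", "c", "d"],
                          ["a", "b", "c", "v", "w", "x", "y", "z"]))
```

A method with few hits can reach a high coverage while its recall stays low, which is the situation described above for the BLAST and PSI-BLAST runs.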

Afterwards, the overlap of the sequences found by the different alignment methods was compared to the HSSP sequences.

Sequences chosen for the multiple Alignment:

Sequence identifier	Sequence identity	Source
99-90% Sequence Identity
56967006|pdb|1X7Z	99%	PSI BLAST, 3 iterations, E-value cutoff 0.005
7546384|pdb|1DTW	95%	BLAST
34810149|pdb|1OLU	99%	PSI BLAST, 3 iterations, E-value cutoff 10E-6
13277798|gb|AAH03787.1	95%	PSI BLAST, 3 iterations, E-value cutoff 10E-6
148727347|ref|NP_001092034.1	95%	BLAST
89-60% Sequence Identity
196011048|ref|XP_002115388.1	66%	PSI BLAST, 3 iterations, E-value cutoff 0.005
149543950|ref|XP_001517857.1	67%	BLAST
47227873|emb|CAG09036.1	82.5%	FASTA
47196273|emb|CAF88112.1	81%	PSI BLAST, 5 iterations, E-value cutoff 0.005
12964598|dbj|BAB32665.1	88%	PSI BLAST, 5 iterations, E-value cutoff 10E-6
59-40% Sequence Identity
193290664|gb|ACF17640.1	47%	BLAST
215431443|ref|ZP_03429362.1	40%	FASTA
225557347|gb|EEH05633.1	51%	PSI BLAST, 3 iterations, E-value cutoff 10E-6
58267618|ref|XP_570965.1	50%	PSI BLAST, 5 iterations, E-value cutoff 0.005
162449842|ref|YP_001612209.1	41%	PSI BLAST, 5 iterations, E-value cutoff 10E-6
39-20% Sequence Identity
56966700|pdb|1W85	31%	PSI BLAST, 3 iterations, E-value cutoff 0.005
5822330|pdb|1QS0	38.1%	FASTA
13516864|dbj|BAB40585.1	33%	PSI BLAST, 3 iterations, E-value cutoff 10E-6
284166853|ref|YP_003405132.1	35%	PSI BLAST, 5 iterations, E-value cutoff 0.005
76800932|ref|YP_325940.1	34%	PSI BLAST, 5 iterations, E-value cutoff 10E-6

The sequences for the multiple sequence alignment were downloaded from NCBI. To retrieve another sequence in FASTA format, change the sequence ID in the link: http://www.ncbi.nlm.nih.gov/protein/76800932?report=fasta
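The download links for all chosen sequences follow from the identifiers in the table, since the leading number of each identifier is the GI number used in the link. The following is a minimal sketch that only builds the URLs and performs no download. The function name is hypothetical, and the URL pattern is the one given on this page.

```python
def fasta_url(seq_identifier):
    """Build the NCBI FASTA link from an identifier such as '76800932|ref|YP_325940.1'."""
    # Tolerate backslash-escaped separators ('\|') that may occur in copied tables.
    cleaned = seq_identifier.replace("\\", "")
    gi = cleaned.split("|", 1)[0].strip()
    if not gi.isdigit():
        raise ValueError(f"no GI number at the start of {seq_identifier!r}")
    return f"http://www.ncbi.nlm.nih.gov/protein/{gi}?report=fasta"

for identifier in ["76800932|ref|YP_325940.1", "5822330|pdb|1QS0"]:
    print(fasta_url(identifier))
```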

Multiple Alignments

ClustalW

clustalw sequences.fasta

T-Coffee

t_coffee -seq sequences.fasta

T-Coffee(3D)

t_coffee -seq sequences.fasta -mode expresso

Muscle

muscle -in sequences.fasta -out output.aln

Cobalt

download cobalt

./cobalt -i sequences.fasta -norps T > output.aln

Conservation and Gaps

Alignment method	Gaps	Avg gap length	100% conserved columns	>90% conserved columns	>80% conserved columns	>70% conserved columns	>60% conserved columns	>50% conserved columns
ClustalW	12	3.75	24	50	31	49	54	72
T-Coffee	25	4.56	24	50	31	49	54	72
T-Coffee (3D)	56	4.75	21	45	34	49	64	71
Cobalt	19	3.26	24	55	31	45	60	71
Muscle	17	4.76	26	46	22	31	14	8
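Statistics of this kind can be derived directly from the aligned sequences. The following is a minimal sketch under stated assumptions: a gap is a run of '-' characters in one aligned row, and the conservation of a column is the frequency of its most common residue, with gap characters counted as non-matching. These definitions and the function names are assumptions and are not necessarily the ones used for the table above.

```python
import re
from collections import Counter

def gap_stats(aligned_row):
    """Return (number of gaps, average gap length) for one aligned sequence."""
    runs = re.findall(r"-+", aligned_row)
    count = len(runs)
    average = sum(len(run) for run in runs) / count if count else 0.0
    return count, average

def column_conservation(rows):
    """Fraction of the most frequent residue in every alignment column."""
    fractions = []
    for column in zip(*rows):
        residues = Counter(c for c in column if c != "-")
        top = residues.most_common(1)[0][1] if residues else 0
        fractions.append(top / len(column))
    return fractions

def count_conserved(rows, threshold):
    """Number of columns whose conservation is at least `threshold` (0 to 1)."""
    return sum(1 for f in column_conservation(rows) if f >= threshold)

# Toy alignment for illustration only.
rows = ["MKT-AL--G", "MKTQAL--G", "MRT-ALSPG"]
print(gap_stats(rows[0]))
print(count_conserved(rows, 1.0), count_conserved(rows, 0.6))
```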

Gaps in secondary structure

ClustalW

ClustalW gaps and structure [1]

Gap position	Gap length	Secondary structure
109-110	4	Helix
142-143	1	Helix
235-236	1	Beta strand
276-277	11	Helix
294-295	1	Beta strand
394-395	5	Helix

T-Coffee

T-Coffee gaps and structure [2]

Gap position	Gap length	Secondary structure
141-142	1	Helix
232-233	1	Beta strand
275-276	11	Helix
310-311	1	Helix
369-370	5	Turn
395-396	18	Helix
398-399	5	Helix

T-Coffee 3d

T-Coffee (3D) gaps and structure [3]

Gap position	Gap length	Secondary structure
101-102	1	Helix
108-109	4	Helix
115-116	1	Helix
116-117	1	Helix
141-142	1	Helix
153-154	1	Beta strand
163-164	1	Helix
177-178	3	Helix
234-235	1	Beta strand
263-264	4	Beta strand
265-266	1	Beta strand
276-277	2	Helix
308-309	8	Helix
309-310	5	Helix
314-315	6	Helix
362-363	6	Helix
371-372	4	Turn
376-377	1	Helix
380-381	1	Helix
382-383	7	Helix
383-384	3	Helix
384-385	2	Helix
387-388	2	Helix
394-395	5	Helix

Cobalt

Cobalt gaps and structure [4]

Gap position	Gap length	Secondary structure
108-109	4	Helix
141-142	1	Helix
177-178	3	Helix
276-277	11	Helix
294-295	1	Beta strand
305-306	1	Helix
311-312	2	Helix
387-388	1	Helix
388-389	1	Helix
395-396	13	Helix

Muscle

Muscle gaps and structure [5]

Gap position	Gap length	Secondary structure
109-110	4	Helix
141-142	1	Helix
177-178	3	Helix
276-277	11	Helix
294-295	1	Beta strand
305-306	1	Helix
394-395	5	Helix
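Tables like the ones above can be produced by looking up each gap position in a list of secondary structure segments. The following is a minimal sketch, assuming the segments are given as (start, end, type) tuples in reference sequence numbering. The function names and the segment coordinates in the example are made up for illustration and are not the annotation of BCKDHA.

```python
def structure_at(position, segments, default="Coil"):
    """Return the secondary structure type that covers a residue position."""
    for start, end, kind in segments:
        if start <= position <= end:
            return kind
    return default

def annotate_gaps(gaps, segments):
    """Annotate gaps given as ('109-110', length): the gap lies between the two residues."""
    annotated = []
    for span, length in gaps:
        left, right = (int(p) for p in span.split("-"))
        kinds = {structure_at(left, segments), structure_at(right, segments)}
        # A gap whose flanking residues differ in structure sits on a segment boundary.
        label = kinds.pop() if len(kinds) == 1 else "Boundary"
        annotated.append((span, length, label))
    return annotated

# Hypothetical segments for illustration only.
segments = [(100, 120, "Helix"), (230, 240, "Beta strand")]
for row in annotate_gaps([("109-110", 4), ("235-236", 1), ("120-121", 2)], segments):
    print(*row, sep="\t")
```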

Functionally important residues

According to [8], the functionally important sites are:

Metal binding site, 206
Metal binding site, 211
Metal binding site, 212

References

Secondary structure information



@@ Line 51: / Line 51: @@
 Doing the mapping the other way round and converting all GI numbers from the BLAST and FASTA runs, the same problem arised.
 For the recall evaluation the second mapping was chosen to display and the corresponding plot can be seen below.
-[[File:BCKDHA_ValidateHSSP.png]]
+[[File:BCDKHA_ValidateHSSP.png]]
 As expected from the large number of Hits in the HSSP file, most of the listed related proteins were not identified by the other alignment tools. The best overlap can be observed with PSIBLAST (5 iterations, but the same is true for the 3 iteration runs and BLAST, as their outputs are nearly identical), where 90% of the converted IDs are also returned by HSSP. A large fraction of FASTA is also covered in HSSP. But as FASTA was run without any E-value restriction it returned also a lot of sequences with low identity which are not likely to have the same structure and are therefore not found using the structural alignment tool HSSP.