Sequence Alignments BCKDHA
Sequence Alignments
Sequence searches
To find sequences homologous to our query protein BCKDHA, we used the following tools:
- FASTA
../bin/fasta36 sequence.fasta database > FastaOutput.txt
- BLAST
blastall -p blastp -d database -i sequence.fasta > BlastOutput.txt
- PSIBLAST
blastpgp -i sequence.fasta -j iterations -h evalueCutoff -d database > PsiblastOutput.txt
- HHSearch
hhsearch -i query -d database -o output.txt
database = /data/blast/nr/nr
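The PSI-BLAST runs referred to further down this page differ only in the number of iterations (3 or 5) and the E-value cutoff (0.005 or 10E-6). As a minimal sketch, assuming the blastpgp flags shown above, the four command lines could be assembled as follows. The helper name and the output file naming scheme are hypothetical, and the commands are only built, not executed.

```python
from itertools import product

DATABASE = "/data/blast/nr/nr"  # database path given on this page


def psiblast_commands(query="sequence.fasta"):
    """Build one blastpgp argument list per (iterations, cutoff) combination."""
    iterations = ["3", "5"]
    cutoffs = ["0.005", "10E-6"]  # written exactly as on this page
    commands = []
    for j, h in product(iterations, cutoffs):
        # Hypothetical naming scheme for the output files.
        out = f"PsiblastOutput_j{j}_h{h}.txt"
        args = ["blastpgp", "-i", query, "-j", j, "-h", h, "-d", DATABASE]
        commands.append((args, out))
    return commands


for args, out in psiblast_commands():
    print(" ".join(args), ">", out)
```

Keeping the arguments as a list makes it straightforward to pass them to a process launcher later without shell quoting problems.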
Result Statistics
- Overlap
To illustrate the overlap of the returned sequences for the sequence searches, Venn Diagrams were drawn.
The PSI-BLAST runs with 3 iterations returned exactly the same sequences matching our query BCKDHA, and the same is true for the PSI-BLAST runs with 5 iterations (see Figure 1 and Figure 2). Their results were therefore combined for the Venn diagram given in Figure 3 (created with [6]):
Including BLAST and FASTA, the most striking result is that FASTA returned more than 2600 hits that were not returned by any of the other search algorithms. This may be because the default FASTA search does not restrict the E-value. BLAST returned only one hit that was not found in any of the PSIBLAST searches, and this hit is also included in the FASTA results. All PSIBLAST results were also detected by FASTA. PSIBLAST with 5 iterations returned 6 more hits than PSIBLAST with only 3 iterations. This may be due to the two additional iterations, in which new sequences could be added to the search profile. Overall, the search algorithms returned a large number of identical sequences matching our query, and the additional hits found by some of them result from their different search strategies.
HHSearch is the exception: it returned only a few aligned sequences (10), and these included sequences that were not found by any of the other algorithms. The corresponding Venn diagram is shown in Figure 4.
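The overlap counts behind such Venn diagrams can be derived with plain set operations. The following is only a sketch: the function name is hypothetical and the identifiers are made-up placeholders, not hits from our searches.

```python
def overlap_summary(hits_by_tool):
    """Return the hits shared by all tools and, per tool, the hits found by it alone."""
    all_sets = {tool: set(ids) for tool, ids in hits_by_tool.items()}
    shared = set.intersection(*all_sets.values())
    unique = {}
    for tool, ids in all_sets.items():
        # Union of the hits of every other tool.
        others = set().union(*(s for t, s in all_sets.items() if t != tool))
        unique[tool] = ids - others
    return shared, unique


# Hypothetical identifiers, for illustration only.
example = {
    "BLAST": ["id1", "id2", "id3"],
    "PSIBLAST": ["id1", "id2"],
    "FASTA": ["id1", "id2", "id3", "id4"],
}
shared, unique = overlap_summary(example)
print(sorted(shared))
print({tool: sorted(ids) for tool, ids in unique.items()})
```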
- Identity Distribution
Figure 5 shows the Identity Distribution of the results from the different search tools.
Because the PSIBLAST runs with 3 iterations returned the same hits, as did the PSIBLAST runs with 5 iterations, the identity distributions of those runs were pooled and drawn in the same colour. The identity distribution of the FASTA run is remarkable: it contains many more hits with low identity than the other runs. In total, FASTA returned almost 3000 hits (default parameter search), while none of the BLAST/PSIBLAST runs returned more than 300 hits. Many of the FASTA 'hits' therefore have low identity and a rather high E-value (see below), but FASTA still returned some good results.
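An identity distribution of this kind amounts to counting hits per identity bin. Here is a minimal sketch, assuming the percent identities have already been parsed from the tool outputs; the bin width of 10 and the example values are illustrative choices, not the settings used for Figure 5.

```python
from collections import Counter


def identity_histogram(identities, bin_width=10):
    """Count hits per identity bin; keys are the lower bin edges in percent."""
    counts = Counter()
    for value in identities:
        # Clamp 100% into the top bin so it does not open a bin of its own.
        lower = min(int(value // bin_width) * bin_width, 100 - bin_width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Illustrative identity values in percent.
print(identity_histogram([99, 95, 82.5, 47, 38.1, 31, 100]))
```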
- Evalue Distribution
In Figure 6 the E-values for the sequence search results are displayed.
The E-value distributions of the BLAST and PSIBLAST results are quite similar. Only FASTA has a much wider E-value range, which can be explained by the fact that FASTA returned about 10 times more hits, among them many sequences with low identity and therefore a rather high E-value.
- HSSP recall
To evaluate the outputs of the different alignment tools against the HSSP database, a preprocessing step was needed first: HSSP uses Uniprot identifiers, whereas all other alignment programs were run on the nr database, which includes identifiers from PDB, RefSeq, Swissprot, PIR, PRF and EMBL. The mapping was performed using the ID Mapping tool provided by NCBI[7]. Figure 7 shows the overlap of sequences found by the sequence search tools and sequences obtained from HSSP.
As expected from the large number of hits in the HSSP file, most of the listed related proteins were not identified by the other alignment tools. The best overlap is observed for PSIBLAST with 5 iterations (the same holds for the 3-iteration runs and BLAST, as their outputs are nearly identical), where 90% of the converted IDs are also returned by HSSP. A large fraction of the FASTA hits is also covered by HSSP. However, because FASTA was run without any E-value restriction, it also returned many sequences with low identity, which are unlikely to have the same structure and are therefore not found using the structural alignment tool HSSP.
Precision and recall were calculated for the different alignment methods:
Method | Blast | Fasta | HHSearch | Psiblast,3,0.005 | Psiblast,5,0.005 | Psiblast,3,10E-6 | Psiblast,5,10E-6 |
---|---|---|---|---|---|---|---|
Precision | 0.97 | 0.37 | 0.20 | 0.89 | 0.87 | 0.89 | 0.87 |
Recall | 0.07 | 0.24 | 0 | 0.07 | 0.07 | 0.07 | 0.07 |
The highest precision is reached by BLAST, where 97% of all found sequences are also obtained via HSSP. The recall, on the other hand, is very low: only 7% of the possible homologous sequences are found. The highest recall is reached by FASTA with 24%, which was expected given the large number of returned results.
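Following the description above, precision is the fraction of a tool's hits that also occur in HSSP, and recall is the fraction of HSSP entries that the tool found. A minimal sketch, assuming both inputs are lists of identifiers already mapped to the same namespace; the identifiers in the example are placeholders.

```python
def precision_recall(found_ids, reference_ids):
    """Return (precision, recall) of found_ids measured against reference_ids."""
    found = set(found_ids)
    reference = set(reference_ids)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Placeholder identifiers: 3 of 4 hits are in the reference set of 10.
found = ["a", "b", "c", "x"]
reference = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
precision, recall = precision_recall(found, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A tool that returns few but reliable hits scores like BLAST here (high precision, low recall), while a tool that returns many hits can gain recall at the cost of precision.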
Sequences chosen for the multiple Alignment:
SeqIdentifier | Seq Identity | source |
---|---|---|
99-90% Sequence Identity | ||
56967006|pdb|1X7Z | 99% | PSI BLAST, 3 iterations, E-value cutoff 0.005 |
7546384|pdb|1DTW | 95% | BLAST |
34810149|pdb|1OLU | 99% | PSI BLAST, 3 iterations, E-value cutoff 10E-6 |
13277798|gb|AAH03787.1 | 95% | PSI BLAST, 3 iterations, E-value cutoff 10E-6 |
148727347|ref|NP_001092034.1 | 95% | BLAST |
89-60% Sequence Identity | ||
196011048|ref|XP_002115388.1 | 66% | PSI BLAST, 3 iterations, E-value cutoff 0.005 |
149543950|ref|XP_001517857.1 | 67% | BLAST |
47227873|emb|CAG09036.1 | 82.5% | FASTA |
47196273|emb|CAF88112.1 | 81% | PSI BLAST, 5 iterations, E-value cutoff 0.005 |
12964598|dbj|BAB32665.1 | 88% | PSI BLAST, 5 iterations, E-value cutoff 10E-6 |
59-40% Sequence Identity | ||
193290664|gb|ACF17640.1 | 47% | BLAST |
215431443|ref|ZP_03429362.1 | 40% | FASTA |
225557347|gb|EEH05633.1 | 51% | PSI BLAST, 3 iterations, E-value cutoff 10E-6 |
58267618|ref|XP_570965.1 | 50% | PSI BLAST, 5 iterations, E-value cutoff 0.005 |
162449842|ref|YP_001612209.1 | 41% | PSI BLAST, 5 iterations, E-value cutoff 10E-6 |
39-20% Sequence Identity | ||
56966700|pdb|1W85 | 31% | PSI BLAST, 3 iterations, E-value cutoff 0.005 |
5822330|pdb|1QS0 | 38.1% | FASTA |
13516864|dbj|BAB40585.1 | 33% | PSI BLAST, 3 iterations, E-value cutoff 10E-6 |
284166853|ref|YP_003405132.1 | 35% | PSI BLAST, 5 iterations, E-value cutoff 0.005 |
76800932|ref|YP_325940.1 | 34% | PSI BLAST, 5 iterations, E-value cutoff 10E-6 |
The sequences for the multiple sequence alignment were downloaded from NCBI. To retrieve a sequence in FASTA format, replace the sequence ID in the following link: http://www.ncbi.nlm.nih.gov/protein/76800932?report=fasta
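The link pattern can be filled in programmatically for every chosen sequence. The sketch below only builds the URLs and does not download anything. It assumes that the number before the first `|` in the identifiers of the table above is the ID expected in the link, as is the case for 76800932. Both helper names are hypothetical.

```python
def sequence_id(identifier):
    """Return the leading numeric ID of an identifier such as '76800932|ref|YP_325940.1'."""
    return identifier.split("|")[0]


def ncbi_fasta_url(identifier):
    """Build the FASTA report link for one identifier, following the pattern above."""
    return f"http://www.ncbi.nlm.nih.gov/protein/{sequence_id(identifier)}?report=fasta"


for identifier in ["76800932|ref|YP_325940.1", "5822330|pdb|1QS0"]:
    print(ncbi_fasta_url(identifier))
```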
Multiple Alignments
The following tools were used to create a multiple sequence alignment:
- ClustalW
clustalw sequences.fasta
- T-Coffee
t_coffee -seq sequences.fasta
- T-Coffee(3D)
t_coffee -seq sequences.fasta -mode expresso
- Muscle
muscle -in sequences.fasta -out output.aln
- Cobalt
download cobalt
./cobalt -i sequences.fasta -norps T > output.aln
Conservation and Gaps
Alignment method | Gaps: number | Gaps: avg length | Conserved columns: 100% | Conserved columns: >90% | Conserved columns: >80% | Conserved columns: >70% | Conserved columns: >60% | Conserved columns: >50% |
---|---|---|---|---|---|---|---|---|
ClustalW | 12 | 3.75 | 24 | 50 | 31 | 49 | 54 | 72 |
T-Coffee | 25 | 4.56 | 24 | 50 | 31 | 49 | 54 | 72 |
T-Coffee (3D) | 56 | 4.75 | 21 | 45 | 34 | 49 | 64 | 71 |
Cobalt | 19 | 3.26 | 24 | 55 | 31 | 45 | 60 | 71 |
Muscle | 17 | 4.76 | 26 | 46 | 22 | 31 | 14 | 8 |
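This page does not state how the gap numbers were obtained. One plausible way, shown here purely as an assumption, is to count the runs of `-` characters in one aligned row (for example the query) and to average their lengths. The function name and the example row are hypothetical.

```python
import re


def gap_stats(aligned_sequence):
    """Return (number of gap runs, average run length) for one aligned row."""
    runs = [len(match.group()) for match in re.finditer(r"-+", aligned_sequence)]
    if not runs:
        return 0, 0.0
    return len(runs), sum(runs) / len(runs)


# Made-up aligned row with gap runs of length 2, 4 and 1.
number, average = gap_stats("MK--LV----A-G")
print(number, round(average, 2))
```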
Gaps in secondary structure
ClustalW
The gaps in secondary structure elements for the ClustalW alignment are shown in Figure 8.
Gap position | Gap length | Secondary structure |
---|---|---|
109-110 | 4 | Helix |
142-143 | 1 | Helix |
235-236 | 1 | Beta strand |
276-277 | 11 | Helix |
294-295 | 1 | Beta strand |
394-395 | 5 | Helix |
T-Coffee
The gaps in secondary structure elements for the T-Coffee alignment are shown in Figure 9.
Gap position | Gap length | Secondary structure |
---|---|---|
141-142 | 1 | Helix |
232-233 | 1 | Beta strand |
275-276 | 11 | Helix |
310-311 | 1 | Helix |
369-370 | 5 | Turn |
395-396 | 18 | Helix |
398-399 | 5 | Helix |
T-Coffee 3D
The gaps in secondary structure elements for the T-Coffee (3D) alignment are shown in Figure 10.
Gap position | Gap length | Secondary structure |
---|---|---|
101-102 | 1 | Helix |
108-109 | 4 | Helix |
115-116 | 1 | Helix |
116-117 | 1 | Helix |
141-142 | 1 | Helix |
153-154 | 1 | Beta strand |
163-164 | 1 | Helix |
177-178 | 3 | Helix |
234-235 | 1 | Beta strand |
263-264 | 4 | Beta strand |
265-266 | 1 | Beta strand |
276-277 | 2 | Helix |
308-309 | 8 | Helix |
309-310 | 5 | Helix |
314-315 | 6 | Helix |
362-363 | 6 | Helix |
371-372 | 4 | Turn |
376-377 | 1 | Helix |
380-381 | 1 | Helix |
382-383 | 7 | Helix |
383-384 | 3 | Helix |
384-385 | 2 | Helix |
387-388 | 2 | Helix |
394-395 | 5 | Helix |
Cobalt
The gaps in secondary structure elements for the Cobalt alignment are shown in Figure 11.
Gap position | Gap length | Secondary structure |
---|---|---|
108-109 | 4 | Helix |
141-142 | 1 | Helix |
177-178 | 3 | Helix |
276-277 | 11 | Helix |
294-295 | 1 | Beta strand |
305-306 | 1 | Helix |
311-312 | 2 | Helix |
387-388 | 1 | Helix |
388-389 | 1 | Helix |
395-396 | 13 | Helix |
Muscle
The gaps in secondary structure elements for the Muscle alignment are shown in Figure 12.
Gap position | Gap length | Secondary structure |
---|---|---|
109-110 | 4 | Helix |
141-142 | 1 | Helix |
177-178 | 3 | Helix |
276-277 | 11 | Helix |
294-295 | 1 | Beta strand |
305-306 | 1 | Helix |
394-395 | 5 | Helix |
Discussion
ClustalW, T-Coffee, Cobalt and Muscle all produced about the same number of gaps in secondary structure elements. The introduced gap lengths vary between 1 and 18 residues.
T-Coffee 3D inserted many more gaps than the other multiple alignment tools, and these gaps were all quite short (1-8 residues). Such a large number of short gaps is unlikely to be evolutionarily meaningful.
Functionally important residues
According to Uniprot[8], the functionally important sites are the following:
- Metal binding site, S: 206 (161)
- Metal binding site, Q: 211 (166)
- Metal binding site, I: 212 (167)
The Uniprot sequence is 445 aa long, whereas the PDB sequence (1U5B) has only 400 aa because it lacks the transit sequence, so an offset of 45 residues has to be taken into account. The positions in brackets are the functional site positions in the PDB sequence; they are used to determine the conservation of these sites in the multiple sequence alignments.
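The conversion between the two numberings is a constant shift. A minimal sketch, assuming that all 45 missing residues belong to the N-terminal transit sequence, so that every later residue is shifted by the same amount; the function name is hypothetical.

```python
UNIPROT_LENGTH = 445
PDB_LENGTH = 400
OFFSET = UNIPROT_LENGTH - PDB_LENGTH  # 45 residues, assumed to be N-terminal


def uniprot_to_pdb(position, offset=OFFSET):
    """Convert a Uniprot position to the numbering of the shorter PDB sequence."""
    if position <= offset:
        raise ValueError("position lies inside the removed transit sequence")
    return position - offset


for residue, position in [("S", 206), ("Q", 211), ("I", 212)]:
    print(residue, position, "->", uniprot_to_pdb(position))
```

For the three metal binding sites this reproduces the bracketed positions 161, 166 and 167.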
Conservation | ClustalW | Cobalt | Muscle | T-Coffee | T-Coffee 3D |
---|---|---|---|---|---|
Site 161 (S) | 20/21 | 16/21 | 16/21 | 16/21 | 15/21 |
Site 166 (Q) | 21/21 | 21/21 | 21/21 | 21/21 | 21/21 |
Site 167 (I) | 18/21 | 14/21 | 14/21 | 14/21 | 14/21 |
Figures 13-17 show parts (including the functional sites) of the multiple sequence alignments, visualized with Jalview.
All multiple alignment tools preserved the functional residues overall, but the degree of conservation varies with the functional site and the alignment tool. Glutamine at position 166 is 100% conserved in all alignments. ClustalW also conserves serine at position 161 and isoleucine at position 167 quite well (95% and 86%, respectively). Cobalt, Muscle, T-Coffee and T-Coffee 3D conserve serine in 71-76% and isoleucine in 67% of the sequences. Note that although the degree of conservation for isoleucine is the same for these four tools, the sequences over which isoleucine is conserved differ slightly, so the tools did not align this position identically.
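Counts such as 20/21 can be read off an alignment column by taking the most frequent symbol and dividing by the number of sequences. The sketch below assumes the column has already been extracted as a string with one character per sequence. It treats a gap character like any other symbol, which is a simplification, and the example column is made up.

```python
from collections import Counter


def column_conservation(column):
    """Return (most common symbol, fraction of sequences carrying it) for one column."""
    symbol, count = Counter(column).most_common(1)[0]
    return symbol, count / len(column)


# Made-up column: 20 sequences with serine, one with alanine.
symbol, fraction = column_conservation("S" * 20 + "A")
print(symbol, f"{fraction:.0%}")
```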
References
Secondary structure information