Difference between revisions of "Sequence Alignments BCKDHA"

From Bioinformatikpedia
Latest revision as of 15:56, 24 August 2011

Sequence Alignments

Sequence searches

To find sequences homologous to our query protein BCKDHA, we used the following tools:

  • FASTA

../bin/fasta36 sequence.fasta database > FastaOutput.txt

  • BLAST

blastall -p blastp -d database -i sequence.fasta > BlastOutput.txt

  • PSIBLAST

blastpgp -i sequence.fasta -j iterations -h evalueCutoff -d database > PsiblastOutput.txt

  • HHSearch

hhsearch -i query -d database -o output.txt

database = /data/blast/nr/nr


Result Statistics

  • Overlap

To illustrate the overlap of the returned sequences for the sequence searches, Venn Diagrams were drawn.

Figure 1: Comparison of the results of the two PSIBLAST runs with 3 iterations each. (PSI1 = PSI-BLAST run with 3 iterations, E-value cutoff 0.005, PSI3 = PSI-BLAST run with 3 iterations, E-value cutoff 10E-6)
Figure 2: Comparison of the results of the two PSIBLAST runs with 5 iterations each. (PSI2 = PSI-BLAST run with 5 iterations, E-value cutoff 0.005, PSI4 = PSI-BLAST run with 5 iterations, E-value cutoff 10E-6)

The two PSI-BLAST runs with 3 iterations each returned exactly the same sequences matching our query BCKDHA. The same is true for the two PSI-BLAST runs with 5 iterations each (see Figure 1 and Figure 2). This fact was used to combine their results in the Venn diagram given in Figure 3 (created with Venny, http://bioinfogp.cnb.csic.es/tools/venny/index.html):

Figure 3: Venn diagram showing the overlap of found sequences for the PSI-BLAST runs, BLAST and FASTA

Including BLAST and FASTA, the most striking observation is that FASTA found more than 2600 additional results that were not returned by any of the other search algorithms. This may be because the default FASTA search was run without an E-value restriction. BLAST returned only one additional hit that was not found in any of the PSIBLAST searches, and this hit is also included in the FASTA results. All PSIBLAST results were also detected by FASTA. PSIBLAST with 5 iterations returned 6 more hits than PSIBLAST with only 3 iterations, probably because of the two additional iterations in which new sequences could be added to the search profile. Overall, the search algorithms returned a large number of identical sequences matching our query, and some of them found additional hits only because of their different search strategies.

HHSearch is the exception: it returned only a few aligned sequences (10), including sequences that were not found by any of the other algorithms. The corresponding Venn diagram is shown in Figure 4.

Figure 4: Venn Diagram showing the overlap of found sequences for HHSearch, Blast, Fasta and PSIBLAST
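
The overlaps summarised in the Venn diagrams can also be computed directly from the sets of hit identifiers. The following is a minimal sketch, assuming the identifiers of each search have already been parsed from the tool outputs into Python sets. The function name `overlap_summary` and the toy identifiers are illustrative assumptions and not part of the original analysis.

```python
def overlap_summary(hits_by_tool):
    """Return the hits shared by all tools and, per tool, the hits no other tool reported."""
    tools = list(hits_by_tool)
    shared = set.intersection(*(set(hits_by_tool[t]) for t in tools))
    unique = {}
    for tool in tools:
        # Union of the hits of every other tool.
        others = set().union(*(set(hits_by_tool[t]) for t in tools if t != tool))
        unique[tool] = set(hits_by_tool[tool]) - others
    return shared, unique


# Toy identifiers for illustration only, not real search results.
example_hits = {
    "BLAST": {"id1", "id2", "id3"},
    "PSIBLAST_5": {"id1", "id2", "id4"},
    "FASTA": {"id1", "id2", "id3", "id4", "id5"},
}
shared, unique = overlap_summary(example_hits)
print(sorted(shared))
print(sorted(unique["FASTA"]))
```

In this toy input every BLAST and PSIBLAST hit is also contained in the FASTA set, which mirrors the situation described above where only FASTA has hits of its own.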


  • Identity Distribution

Figure 5 shows the Identity Distribution of the results from the different search tools.

Figure 5: Identity Distribution for the results from the Sequence Searches

Because the two PSIBLAST runs with 3 iterations each produced the same hits, as did the two PSIBLAST runs with 5 iterations, the identity distributions of those runs were pooled and drawn in the same colour. The identity distribution of the FASTA run is remarkable: it returned many more low-identity hits than the other runs. In total, FASTA returned almost 3000 hits (default parameter search), while each of the BLAST/PSIBLAST runs returned no more than 300 hits. Therefore many of the FASTA 'hits' have low identity and a rather high E-value (see below), but FASTA still returned some good results.


  • Evalue Distribution

In Figure 6 the E-values for the sequence search results are displayed.

Figure 6: E-value Distribution for the results from the Sequence Searches

The E-value distributions of the BLAST and PSIBLAST results are quite similar. Only FASTA has a much wider E-value range, which can be explained by the fact that FASTA returned about 10 times more hits, many of which have low identity and therefore a rather high E-value.


  • HSSP recall

To evaluate the outputs of the different alignment tools against the HSSP database, a preprocessing step was needed first: HSSP uses Uniprot identifiers, whereas all other alignment programs were run on the nr database, which includes identifiers from PDB, RefSeq, Swissprot, PIR, PRF and EMBL. The mapping was performed using the ID Mapping tool provided by Uniprot (http://www.uniprot.org/jobs/). Figure 7 shows the overlap of sequences found by the sequence search tools and sequences obtained from HSSP.

Figure 7: Overlap of sequences with HSSP

As expected from the large number of hits in the HSSP file, most of the listed related proteins were not identified by the other alignment tools. The best overlap can be observed with PSIBLAST (5 iterations, but the same is true for the 3 iteration runs and BLAST, as their outputs are nearly identical), where 90% of the converted IDs are also returned by HSSP. A large fraction of the FASTA results is also covered by HSSP. However, as FASTA was run without any E-value restriction, it also returned many sequences with low identity, which are unlikely to have the same structure and are therefore not found using the structural alignment tool HSSP.

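Precision and recall were calculated for the different alignment methods:

Method                                          Precision   Recall
BLAST                                           0.97        0.07
FASTA                                           0.37        0.24
HHSearch                                        0.20        0
PSIBLAST, 3 iterations, E-value cutoff 0.005    0.89        0.07
PSIBLAST, 5 iterations, E-value cutoff 0.005    0.87        0.07
PSIBLAST, 3 iterations, E-value cutoff 10E-6    0.89        0.07
PSIBLAST, 5 iterations, E-value cutoff 10E-6    0.87        0.07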

The highest precision is reached by BLAST, where 97% of all found sequences are also obtained via HSSP. On the other hand, its recall is very low: only 7% of the possible homologous sequences are found. The highest recall is reached by FASTA with 24%, which was expected given the large number of returned results.
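
As a hedged illustration of how such values can be obtained: precision is the fraction of a tool's hits that also occur in HSSP, and recall is the fraction of HSSP entries that the tool found. The sketch below assumes both hit lists have already been mapped to the same identifier namespace. The function name `precision_recall` and the toy identifiers are assumptions for illustration.

```python
def precision_recall(found, reference):
    """Precision and recall of a set of found IDs against a reference set (here: HSSP)."""
    found, reference = set(found), set(reference)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy identifiers for illustration only.
tool_hits = {"a", "b", "c", "d"}
hssp_hits = {"a", "b", "c", "e", "f", "g", "h", "i", "j", "k"}
precision, recall = precision_recall(tool_hits, hssp_hits)
print(precision, recall)
```

A tool with few but reliable hits, such as BLAST here, ends up with high precision and low recall because the reference set is much larger than its result list.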

Sequences chosen for the multiple alignment:

SeqIdentifier Seq Identity source
99-90% Sequence Identity
56967006|pdb|1X7Z 99% PSI BLAST, 3 iterations, E-value cutoff 0.005
7546384|pdb|1DTW 95% BLAST
34810149|pdb|1OLU 99% PSI BLAST, 3 iterations, E-value cutoff 10E-6
13277798|gb|AAH03787.1 95% PSI BLAST, 3 iterations, E-value cutoff 10E-6
148727347|ref|NP_001092034.1 95% BLAST
89-60% Sequence Identity
196011048|ref|XP_002115388.1 66% PSI BLAST, 3 iterations, E-value cutoff 0.005
149543950|ref|XP_001517857.1 67% BLAST
47227873|emb|CAG09036.1 82.5% FASTA
47196273|emb|CAF88112.1 81% PSI BLAST, 5 iterations, E-value cutoff 0.005
12964598|dbj|BAB32665.1 88% PSI BLAST, 5 iterations, E-value cutoff 10E-6
59-40% Sequence Identity
193290664|gb|ACF17640.1 47% BLAST
215431443|ref|ZP_03429362.1 40% FASTA
225557347|gb|EEH05633.1 51% PSI BLAST, 3 iterations, E-value cutoff 10E-6
58267618|ref|XP_570965.1 50% PSI BLAST, 5 iterations, E-value cutoff 0.005
162449842|ref|YP_001612209.1 41% PSI BLAST, 5 iterations, E-value cutoff 10E-6
39-20% Sequence Identity
56966700|pdb|1W85 31% PSI BLAST, 3 iterations, E-value cutoff 0.005
5822330|pdb|1QS0 38.1% FASTA
13516864|dbj|BAB40585.1 33% PSI BLAST, 3 iterations, E-value cutoff 10E-6
284166853|ref|YP_003405132.1 35% PSI BLAST, 5 iterations, E-value cutoff 0.005
76800932|ref|YP_325940.1 34% PSI BLAST, 5 iterations, E-value cutoff 10E-6

Sequences for the multiple sequence alignment were downloaded from NCBI. The sequence ID in the following link can be changed to retrieve other sequences in FASTA format: http://www.ncbi.nlm.nih.gov/protein/76800932?report=fasta
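
A small sketch of how the download links could be generated from the identifiers in the table above, assuming the numeric ID before the first `|` is the one used in the link (as it is for 76800932). The helper name `fasta_url` is hypothetical, and the sketch only builds the URLs without downloading anything.

```python
def fasta_url(seq_identifier):
    """Build the NCBI protein link for an identifier such as '76800932|ref|YP_325940.1'."""
    sequence_id = seq_identifier.split("|")[0].strip()
    if not sequence_id.isdigit():
        raise ValueError(f"no numeric sequence id in {seq_identifier!r}")
    return f"http://www.ncbi.nlm.nih.gov/protein/{sequence_id}?report=fasta"


# Two identifiers taken from the table of chosen sequences.
chosen = ["76800932|ref|YP_325940.1", "5822330|pdb|1QS0"]
for identifier in chosen:
    print(fasta_url(identifier))
```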

Multiple Alignments

The following tools were used to create a multiple sequence alignment:

  • ClustalW

clustalw sequences.fasta

  • T-Coffee

t_coffee -seq sequences.fasta

  • T-Coffee(3D)

t_coffee -seq sequences.fasta -mode expresso

  • Muscle

muscle -in sequences.fasta -out output.aln

  • Cobalt

download cobalt

./cobalt -i sequences.fasta -norps T > output.aln

Conservation and Gaps

Alignment method   Gaps   Avg gap length   100% cons   >90% cons   >80% cons   >70% cons   >60% cons   >50% cons
ClustalW           12     3.75             24          50          31          49          54          72
T-Coffee           25     4.56             24          50          31          49          54          72
T-Coffee (3D)      56     4.75             21          45          34          49          64          71
Cobalt             19     3.26             24          55          31          45          60          71
Muscle             17     4.76             26          46          22          31          14          8
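
The page does not state how the conserved columns were counted. The following is one possible sketch under explicit assumptions: the conservation of a column is taken as the fraction of all sequences that carry its most common residue (gaps never count as a residue), and each column is assigned to the highest threshold it exceeds. The function names and the toy alignment are illustrative only.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences carrying the most common residue, for every alignment column."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]
        if not residues:
            scores.append(0.0)  # column consisting of gaps only
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / n_seqs)
    return scores


def bin_columns(scores, thresholds=(0.9, 0.8, 0.7, 0.6, 0.5)):
    """Count columns per conservation class; each column falls into exactly one class or none."""
    counts = {"100%": 0}
    counts.update({f">{round(t * 100)}%": 0 for t in thresholds})
    for score in scores:
        if score == 1.0:
            counts["100%"] += 1
            continue
        for t in thresholds:
            if score > t:
                counts[f">{round(t * 100)}%"] += 1
                break
    return counts


# Toy alignment of four equally long rows, for illustration only.
toy_alignment = ["MKV-A", "MKVLA", "MRV-A", "MKI-G"]
scores = column_conservation(toy_alignment)
counts = bin_columns(scores)
print(scores)
print(counts)
```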

Gaps in secondary structure

ClustalW

The gaps in secondary structure elements for the ClustalW alignment are shown in Figure 8

Figure 8: ClustalW gaps in secondary structure elements [1]
Gap position Gap length Secondary structure
109-110 4 Helix
142-143 1 Helix
235-236 1 Beta strand
276-277 11 Helix
294-295 1 Beta strand
394-395 5 Helix

T-Coffee

The gaps in secondary structure elements for the T-Coffee alignment are shown in Figure 9

Figure 9: T-Coffee gaps in secondary structure elements [2]
Gap position Gap length Secondary structure
141-142 1 Helix
232-233 1 Beta strand
275-276 11 Helix
310-311 1 Helix
369-370 5 Turn
395-396 18 Helix
398-399 5 Helix

T-Coffee 3D

The gaps in secondary structure elements for the T-Coffee (3D) alignment are shown in Figure 10

Figure 10: T-Coffee (3D) gaps in secondary structure elements [3]
Gap position Gap length Secondary structure
101-102 1 Helix
108-109 4 Helix
115-116 1 Helix
116-117 1 Helix
141-142 1 Helix
153-154 1 Beta strand
163-164 1 Helix
177-178 3 Helix
234-235 1 Beta strand
263-264 4 Beta strand
265-266 1 Beta strand
276-277 2 Helix
308-309 8 Helix
309-310 5 Helix
314-315 6 Helix
362-363 6 Helix
371-372 4 Turn
376-377 1 Helix
380-381 1 Helix
382-383 7 Helix
383-384 3 Helix
384-385 2 Helix
387-388 2 Helix
394-395 5 Helix

Cobalt

The gaps in secondary structure elements for the Cobalt alignment are shown in Figure 11

Figure 11: Cobalt gaps in secondary structure elements [4]
Gap position Gap length Secondary structure
108-109 4 Helix
141-142 1 Helix
177-178 3 Helix
276-277 11 Helix
294-295 1 Beta strand
305-306 1 Helix
311-312 2 Helix
387-388 1 Helix
388-389 1 Helix
395-396 13 Helix

Muscle

The gaps in secondary structure elements for the Muscle alignment are shown in Figure 12

Figure 12: Muscle gaps and structure [5]
Gap position Gap length Secondary structure
109-110 4 Helix
141-142 1 Helix
177-178 3 Helix
276-277 11 Helix
294-295 1 Beta strand
305-306 1 Helix
394-395 5 Helix

Discussion

ClustalW, T-Coffee, Cobalt and Muscle all produced about the same number of gaps in secondary structure elements. The introduced gap lengths vary between 1 and 13 residues.

T-Coffee 3D inserted many more gaps than the other multiple alignment tools, and these gaps were all quite short (~1-6 residues). Many short gaps are not evolutionarily meaningful.
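
Gap positions of the form "109-110" with a gap length, as listed in the tables above, can be read off the aligned row of the reference sequence. The sketch below assumes that a gap position names the two flanking residues of the reference sequence and that leading and trailing gaps are ignored. The function name and the toy row are assumptions, and the assignment to helices or strands would still require the secondary structure annotation of 1U5B.

```python
def gaps_in_reference(aligned_row):
    """Return (left_residue, right_residue, gap_length) for each internal gap run."""
    gaps = []
    residue_index = 0  # number of residues seen so far (1-based position of the last residue)
    run = 0            # length of the current gap run
    for char in aligned_row:
        if char == "-":
            run += 1
        else:
            if run and residue_index > 0:
                gaps.append((residue_index, residue_index + 1, run))
            run = 0
            residue_index += 1
    return gaps


# Toy aligned row, for illustration only.
toy_gaps = gaps_in_reference("--AC--D-EF-")
print(toy_gaps)
```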

Functionally important residues

According to Uniprot[8], the functionally important sites are the following:

  • Metal binding site, S: 206 (161)
  • Metal binding site, Q: 211 (166)
  • Metal binding site, I: 212 (167)

As the Uniprot sequence is 445 aa long and the PDB sequence (1U5B) only 400 aa (without transit sequence), one has to consider the offset. The functional site positions in brackets are used to determine their conservation in the multiple sequence alignments.
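
The position bookkeeping can be sketched as follows, assuming that all 45 missing residues (445 - 400) lie at the N-terminus, which is consistent with 206 - 161 = 45 for the first site. The second helper maps a position in the ungapped sequence to its column in an alignment row. Both function names and the toy row are hypothetical.

```python
TRANSIT_OFFSET = 445 - 400  # residues absent from the PDB sequence, assumed to be N-terminal


def uniprot_to_pdb(position, offset=TRANSIT_OFFSET):
    """Convert a 1-based Uniprot position to the corresponding PDB sequence position."""
    return position - offset


def sequence_to_msa_column(aligned_row, position):
    """1-based alignment column that holds the position-th residue (1-based) of the row."""
    seen = 0
    for column, char in enumerate(aligned_row, start=1):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError("position exceeds the ungapped sequence length")


pdb_sites = [uniprot_to_pdb(p) for p in (206, 211, 212)]
print(pdb_sites)
# Toy aligned row, for illustration only.
toy_column = sequence_to_msa_column("A--CD-E", 3)
print(toy_column)
```

The same lookup, applied to the BCKDHA row of each alignment, would give the alignment columns quoted in the figure captions below.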

Conservation ClustalW Cobalt Muscle T-Coffee T-Coffee 3D
Site 161 (S) 20/21 16/21 16/21 16/21 15/21
Site 166 (Q) 21/21 21/21 21/21 21/21 21/21
Site 167 (I) 18/21 14/21 14/21 14/21 14/21

Figures 13-17 show parts (including the functional sites) of the multiple sequence alignments, visualized with Jalview.

Figure 13: ClustalW multiple sequence alignment: The conserved sites 161, 166, 167 in the amino acid sequence correspond to the positions 268, 274, 275 in the MSA.
Figure 14: Cobalt multiple sequence alignment: The conserved sites 161, 166, 167 in the amino acid sequence correspond to the positions 268, 274, 275 in the MSA.
Figure 15: Muscle multiple sequence alignment: The conserved sites 161, 166, 167 in the amino acid sequence correspond to the positions 306, 312, 313 in the MSA.
Figure 16: T-Coffee multiple sequence alignment: The conserved sites 161, 166, 167 in the amino acid sequence correspond to the positions 268, 274, 275 in the MSA.
Figure 17: T-Coffee 3D multiple sequence alignment: The conserved sites 161, 166, 167 in the amino acid sequence correspond to the positions 423, 430, 431 in the MSA.

All multiple alignment tools thus preserved the overall conservation of the functional residues, but the degree of conservation varies with the functional site and the alignment tool. Glutamine at position 166 in the sequence is 100% conserved in all alignments. ClustalW also aligns serine at position 161 and isoleucine at position 167 well (95% and 86% conserved, respectively). In the Cobalt, Muscle, T-Coffee and T-Coffee 3D alignments, serine is conserved in 71-76% and isoleucine in 67% of the sequences. Note that although the degree of conservation for isoleucine is the same, the sequences over which isoleucine is conserved differ slightly, so the tools did not align this position identically.

References

Secondary structure information

back to Reference Sequence of BCKDHA

back to Maple syrup urine disease main page

go to Task 3 Secondary Structure Prediction BCKDHA