Difference between revisions of "Sequence Alignments BCKDHA"
Latest revision as of 15:56, 24 August 2011
Sequence Alignments
Sequence searches
To find sequences homologous to our query protein BCKDHA, we used the following tools:
- FASTA
../bin/fasta36 sequence.fasta database > FastaOutput.txt
- BLAST
blastall -p blastp -d database -i sequence.fasta > BlastOutput.txt
- PSIBLAST
blastpgp -i sequence.fasta -j iterations -h evalueCutoff -d database > PsiblastOutput.txt
- HHSearch
hhsearch -i query -d database -o output.txt
database = /data/blast/nr/nr
Result Statistics
- Overlap
To illustrate the overlap of the sequences returned by the different sequence searches, Venn diagrams were drawn.
The two PSI-BLAST runs with 3 iterations each returned exactly the same sequences matching our query BCKDHA, and the same is true for the two PSI-BLAST runs with 5 iterations (see Figure 1 and Figure 2). Their results were therefore combined for the Venn diagram given in Figure 3 (created with the tool at http://bioinfogp.cnb.csic.es/tools/venny/index.html):
Including BLAST and FASTA, the most interesting observation is that FASTA found more than 2600 additional hits that were not returned by any of the other search algorithms. This may be because the default FASTA search was run without any restriction on the E-value. BLAST returned only one hit that was not found in any of the PSIBLAST searches, and this hit is also included in the FASTA results. All PSIBLAST results were also detected by FASTA. PSIBLAST with 5 iterations returned 6 more hits than PSIBLAST with only 3 iterations, probably because new sequences could be added to the search profile during the two additional iterations. Overall, the search algorithms returned a large number of identical sequences matching our query, and the additional hits found by some of them can be attributed to their different search strategies.
HHSearch is the exception: it returned only a few aligned sequences (10), and among them are sequences that were not found by any of the other algorithms. The corresponding Venn diagram is shown in Figure 4.
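The overlap counts behind such a Venn diagram can be reproduced with plain set operations. The following is a minimal sketch, assuming the hit identifiers of each tool have already been parsed into sets; the function name `venn_regions` and the toy identifiers are hypothetical and are not taken from the actual search results.

```python
def venn_regions(hit_sets):
    """Map each combination of tool names to the IDs found by exactly those tools."""
    regions = {}
    all_ids = set().union(*hit_sets.values())
    for seq_id in all_ids:
        # Collect the names of all tools that returned this identifier.
        tools = frozenset(name for name, ids in hit_sets.items() if seq_id in ids)
        regions.setdefault(tools, set()).add(seq_id)
    return regions


# Toy identifiers, purely illustrative (not real search results).
hits = {
    "BLAST": {"a", "b", "c"},
    "PSIBLAST": {"a", "b"},
    "FASTA": {"a", "b", "c", "d", "e"},
}

regions = venn_regions(hits)
for tools, ids in sorted(regions.items(), key=lambda item: sorted(item[0])):
    print(sorted(tools), len(ids))
```

Each key of `regions` corresponds to one area of the diagram, so hits found by a single tool only (as described for FASTA above) appear under a key that contains just that tool's name.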
- Identity Distribution
Figure 5 shows the identity distribution of the results from the different search tools.
Because the PSIBLAST runs with 3 iterations each produced the same hits, as did the PSIBLAST runs with 5 iterations, the identity distributions of those runs were pooled and drawn in the same colour. The identity distribution of the FASTA run is remarkable: it contains many more hits with low identity than the other runs. In total, FASTA returned almost 3000 hits (default parameter search), while each of the BLAST/PSIBLAST runs returned no more than 300 hits. Many of the FASTA 'hits' therefore have low identity and a quite high E-value (see below), but FASTA still returned some good results.
- Evalue Distribution
Figure 6 displays the E-values of the sequence search results.
The E-value distributions of the BLAST and PSIBLAST results are quite similar. Only FASTA covers a much wider E-value range, which can be explained by the fact that FASTA returned about 10 times more hits, many of which have low identity and therefore a quite high E-value.
- HSSP recall
To evaluate the outputs of the different alignment tools against the HSSP database, a preprocessing step was necessary: HSSP uses UniProt identifiers, whereas all other alignment programs were run on the nr database, which includes identifiers from PDB, RefSeq, Swiss-Prot, PIR, PRF and EMBL. The mapping was performed using the ID Mapping tool provided by UniProt (http://www.uniprot.org/jobs/). Figure 7 shows the overlap of sequences found by the sequence search tools and sequences obtained from HSSP.
As expected from the large number of hits in the HSSP file, most of the related proteins listed there were not identified by the other alignment tools. The best overlap is observed for PSIBLAST (5 iterations; the same holds for the 3-iteration runs and BLAST, as their outputs are nearly identical), where 90% of the converted IDs are also returned by HSSP. A large fraction of the FASTA hits is also covered by HSSP. However, as FASTA was run without any E-value restriction, it also returned many sequences with low identity, which are unlikely to share the same structure and are therefore not found by the structural alignment tool HSSP.
Precision and recall were calculated for the different alignment methods:
Measure | Blast | Fasta | HHSearch | Psiblast,3,0.005 | Psiblast,5,0.005 | Psiblast,3,10E-6 | Psiblast,5,10E-6 |
---|---|---|---|---|---|---|---|
Precision | 0.97 | 0.37 | 0.20 | 0.89 | 0.87 | 0.89 | 0.87 |
Recall | 0.07 | 0.24 | 0 | 0.07 | 0.07 | 0.07 | 0.07 |
The highest precision is reached by BLAST, where 97% of all sequences found are also obtained via HSSP. Its recall, on the other hand, is very low: only 7% of the possible homologous sequences are found. The highest recall is reached by FASTA with 24%, which was expected given the large number of returned results.
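The two measures can be written down compactly. The sketch below assumes that the HSSP entries are treated as the reference set and that all identifiers have already been mapped to a common namespace; the function name and the toy identifiers are hypothetical.

```python
def precision_recall(found_ids, reference_ids):
    """Precision: share of found IDs that are in the reference.
    Recall: share of reference IDs that were found."""
    found, reference = set(found_ids), set(reference_ids)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy identifiers, purely illustrative (not the real BCKDHA hit lists).
found = {"P1", "P2", "P3", "P4"}
reference = {"P1", "P2", "P3", "P5", "P6", "P7", "P8", "P9", "P10", "P11"}

precision, recall = precision_recall(found, reference)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

A small hit list that lies almost entirely inside a much larger reference set yields high precision but low recall, which is the pattern reported for BLAST above.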
Sequences chosen for the multiple alignment:
SeqIdentifier | Seq Identity | source |
---|---|---|
99-90% Sequence Identity | ||
56967006|pdb|1X7Z | 99% | PSI BLAST, 3 iterations, E-value cutoff 0.005 |
7546384|pdb|1DTW | 95% | BLAST |
34810149|pdb|1OLU | 99% | PSI BLAST, 3 iterations, E-value cutoff 10E-6 |
13277798|gb|AAH03787.1 | 95% | PSI BLAST, 3 iterations, E-value cutoff 10E-6 |
148727347|ref|NP_001092034.1 | 95% | BLAST |
89-60% Sequence Identity | ||
196011048|ref|XP_002115388.1 | 66% | PSI BLAST, 3 iterations, E-value cutoff 0.005 |
149543950|ref|XP_001517857.1 | 67% | BLAST |
47227873|emb|CAG09036.1 | 82.5% | FASTA |
47196273|emb|CAF88112.1 | 81% | PSI BLAST, 5 iterations, E-value cutoff 0.005 |
12964598|dbj|BAB32665.1 | 88% | PSI BLAST, 5 iterations, E-value cutoff 10E-6 |
59-40% Sequence Identity | ||
193290664|gb|ACF17640.1 | 47% | BLAST |
215431443|ref|ZP_03429362.1 | 40% | FASTA |
225557347|gb|EEH05633.1 | 51% | PSI BLAST, 3 iterations, E-value cutoff 10E-6 |
58267618|ref|XP_570965.1 | 50% | PSI BLAST, 5 iterations, E-value cutoff 0.005 |
162449842|ref|YP_001612209.1 | 41% | PSI BLAST, 5 iterations, E-value cutoff 10E-6 |
39-20% Sequence Identity | ||
56966700|pdb|1W85 | 31% | PSI BLAST, 3 iterations, E-value cutoff 0.005 |
5822330|pdb|1QS0 | 38.1% | FASTA |
13516864|dbj|BAB40585.1 | 33% | PSI BLAST, 3 iterations, E-value cutoff 10E-6 |
284166853|ref|YP_003405132.1 | 35% | PSI BLAST, 5 iterations, E-value cutoff 0.005 |
76800932|ref|YP_325940.1 | 34% | PSI BLAST, 5 iterations, E-value cutoff 10E-6 |
The sequences for the multiple sequence alignment were downloaded from NCBI. To retrieve another sequence in FASTA format, replace the sequence ID in the following link: http://www.ncbi.nlm.nih.gov/protein/76800932?report=fasta
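Building these links for all selected sequences can be scripted. This is only a sketch: it assumes that the first field of each identifier in the table above is the numeric ID expected by the link, it reuses the URL pattern exactly as given in the text (which may no longer resolve), and the helper names are hypothetical. No download is performed.

```python
def gi_from_identifier(identifier):
    """Return the leading numeric ID of an nr-style identifier such as '76800932|ref|YP_325940.1'."""
    return identifier.split("|")[0]


def fasta_url(sequence_id):
    """Fill the sequence ID into the URL pattern quoted in the text."""
    return f"http://www.ncbi.nlm.nih.gov/protein/{sequence_id}?report=fasta"


# Three identifiers copied from the table of chosen sequences.
identifiers = [
    "56967006|pdb|1X7Z",
    "7546384|pdb|1DTW",
    "76800932|ref|YP_325940.1",
]

urls = [fasta_url(gi_from_identifier(identifier)) for identifier in identifiers]
for url in urls:
    print(url)
```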
Multiple Alignments
The following tools were used to create a multiple sequence alignment:
- ClustalW
clustalw sequences.fasta
- T-Coffee
t_coffee -seq sequences.fasta
- T-Coffee (3D)
t_coffee -seq sequences.fasta -mode expresso
- Muscle
muscle -in sequences.fasta -out output.aln
- Cobalt
download cobalt from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt
./cobalt -i sequences.fasta -norps T > output.aln
Conservation and Gaps
Alignment method | Gaps | Avg gap length | 100% cons | >90% cons | >80% cons | >70% cons | >60% cons | >50% cons |
---|---|---|---|---|---|---|---|---|
ClustalW | 12 | 3.75 | 24 | 50 | 31 | 49 | 54 | 72 |
T-Coffee | 25 | 4.56 | 24 | 50 | 31 | 49 | 54 | 72 |
T-Coffee (3D) | 56 | 4.75 | 21 | 45 | 34 | 49 | 64 | 71 |
Cobalt | 19 | 3.26 | 24 | 55 | 31 | 45 | 60 | 71 |
Muscle | 17 | 4.76 | 26 | 46 | 22 | 31 | 14 | 8 |
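The page does not state how the conserved columns were counted. One possible definition is sketched below: the conservation of a column is taken to be the fraction of sequences that carry its most frequent residue, with gaps counting against conservation. This definition, the function names and the toy alignment are assumptions made for illustration and may differ from what was used for the table.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue, per alignment column."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [char for char in column if char != "-"]
        # A column consisting only of gaps gets a conservation of 0.
        top_count = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top_count / n_seqs)
    return scores


def count_conserved(scores, threshold):
    """Number of columns whose conservation is at least the threshold."""
    return sum(1 for score in scores if score >= threshold)


# Toy alignment of three equally long rows, purely illustrative.
toy_alignment = ["MKV-A", "MKI-A", "MRVLA"]
scores = column_conservation(toy_alignment)
print(count_conserved(scores, 1.0), count_conserved(scores, 0.6))
```

Whether the thresholds in the table are cumulative or describe separate bins is not stated in the text, so `count_conserved` should be adapted accordingly.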
Gaps in secondary structure
ClustalW
The gaps in secondary structure elements for the ClustalW alignment are shown in Figure 8.
Gap position | Gap length | Secondary structure |
109-110 | 4 | Helix |
142-143 | 1 | Helix |
235-236 | 1 | Beta strand |
276-277 | 11 | Helix |
294-295 | 1 | Beta strand |
394-395 | 5 | Helix |
T-Coffee
The gaps in secondary structure elements for the T-Coffee alignment are shown in Figure 9.
Gap position | Gap length | Secondary structure |
141-142 | 1 | Helix |
232-233 | 1 | Beta strand |
275-276 | 11 | Helix |
310-311 | 1 | Helix |
369-370 | 5 | Turn |
395-396 | 18 | Helix |
398-399 | 5 | Helix |
T-Coffee 3D
The gaps in secondary structure elements for the T-Coffee (3D) alignment are shown in Figure 10.
Gap position | Gap length | Secondary structure |
101-102 | 1 | Helix |
108-109 | 4 | Helix |
115-116 | 1 | Helix |
116-117 | 1 | Helix |
141-142 | 1 | Helix |
153-154 | 1 | Beta strand |
163-164 | 1 | Helix |
177-178 | 3 | Helix |
234-235 | 1 | Beta strand |
263-264 | 4 | Beta strand |
265-266 | 1 | Beta strand |
276-277 | 2 | Helix |
308-309 | 8 | Helix |
309-310 | 5 | Helix |
314-315 | 6 | Helix |
362-363 | 6 | Helix |
371-372 | 4 | Turn |
376-377 | 1 | Helix |
380-381 | 1 | Helix |
382-383 | 7 | Helix |
383-384 | 3 | Helix |
384-385 | 2 | Helix |
387-388 | 2 | Helix |
394-395 | 5 | Helix |
Cobalt
The gaps in secondary structure elements for the Cobalt alignment are shown in Figure 11.
Gap position | Gap length | Secondary structure |
108-109 | 4 | Helix |
141-142 | 1 | Helix |
177-178 | 3 | Helix |
276-277 | 11 | Helix |
294-295 | 1 | Beta strand |
305-306 | 1 | Helix |
311-312 | 2 | Helix |
387-388 | 1 | Helix |
388-389 | 1 | Helix |
395-396 | 13 | Helix |
Muscle
The gaps in secondary structure elements for the Muscle alignment are shown in Figure 12.
Gap position | Gap length | Secondary structure |
109-110 | 4 | Helix |
141-142 | 1 | Helix |
177-178 | 3 | Helix |
276-277 | 11 | Helix |
294-295 | 1 | Beta strand |
305-306 | 1 | Helix |
394-395 | 5 | Helix |
Discussion
ClustalW, T-Coffee, Cobalt and Muscle all produced a similar number of gaps in secondary structure elements (6 to 10). The lengths of the introduced gaps vary between 1 and 18 residues.
T-Coffee 3D inserted many more gaps (24) than the other multiple alignment tools, and these gaps were all quite short (1-8 residues). Such a large number of short gaps is unlikely to be evolutionarily meaningful.
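The gap tables above list, for each gap, the two reference residues it lies between, its length and the affected secondary structure element. How the tables were produced is not described; the following sketch shows one way to derive such rows, assuming the aligned reference sequence is available as a string with `-` for gaps and the secondary structure elements as 1-based residue ranges. All names and the toy data are hypothetical.

```python
def reference_gaps(aligned_reference):
    """Return (left_residue, right_residue, gap_length) for each internal gap run."""
    gaps = []
    residue_index = 0  # 1-based index of the last residue seen so far
    run = 0            # length of the current gap run
    for char in aligned_reference:
        if char == "-":
            run += 1
            continue
        # Leading gaps (before the first residue) are skipped.
        if run and residue_index > 0:
            gaps.append((residue_index, residue_index + 1, run))
        run = 0
        residue_index += 1
    return gaps


def annotate(gaps, elements):
    """Keep gaps whose flanking residues both lie inside one structure element."""
    annotated = []
    for left, right, length in gaps:
        for start, end, kind in elements:
            if start <= left and right <= end:
                annotated.append((f"{left}-{right}", length, kind))
    return annotated


# Toy data, purely illustrative (not the 1U5B sequence or annotation).
aligned = "--MK--VLA-G"
elements = [(1, 4, "Helix")]

gaps = reference_gaps(aligned)
annotated = annotate(gaps, elements)
for position, length, kind in annotated:
    print(position, length, kind)
```

Gaps at the very start or end of the reference row are ignored here because they do not interrupt a structure element.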
Functionally important residues
According to UniProt (http://www.uniprot.org/uniprot/P12694), the functionally important sites are:
- Metal binding site, S: 206 (161)
- Metal binding site, Q: 211 (166)
- Metal binding site, I: 212 (167)
Because the UniProt sequence is 445 aa long and the PDB sequence (1U5B) only 400 aa (it lacks the transit sequence), an offset of 45 residues has to be taken into account. The positions in brackets are the corresponding PDB positions; they are used to determine the conservation of the functional sites in the multiple sequence alignments.
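The position conversion is a simple subtraction. The sketch below assumes that all 45 residues missing from the PDB sequence belong to the N-terminal transit sequence, which is consistent with the bracketed positions listed above; the constant and function names are hypothetical.

```python
UNIPROT_LENGTH = 445  # length of the UniProt sequence, as stated in the text
PDB_LENGTH = 400      # length of the PDB sequence (1U5B), as stated in the text
OFFSET = UNIPROT_LENGTH - PDB_LENGTH  # residues missing at the N-terminus (assumption)


def uniprot_to_pdb(position):
    """Convert a UniProt residue position to the corresponding PDB position."""
    return position - OFFSET


pdb_positions = [uniprot_to_pdb(position) for position in (206, 211, 212)]
print(pdb_positions)
```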
Conservation | ClustalW | Cobalt | Muscle | T-Coffee | T-Coffee 3D |
---|---|---|---|---|---|
Site 161 (S) | 20/21 | 16/21 | 16/21 | 16/21 | 15/21 |
Site 166 (Q) | 21/21 | 21/21 | 21/21 | 21/21 | 21/21 |
Site 167 (I) | 18/21 | 14/21 | 14/21 | 14/21 | 14/21 |
Figures 13-17 show parts (including the functional sites) of the multiple sequence alignments, visualized with Jalview.
All multiple alignment tools thus preserved the overall conservation of the functional residues, but the degree of conservation varies with the functional site and the alignment tool. Glutamine at position 166 is 100% conserved with all tools. ClustalW also conserves serine at position 161 and isoleucine at position 167 quite well (95% and 86%, respectively). Cobalt, Muscle, T-Coffee and T-Coffee 3D conserve serine in 71-76% and isoleucine in 67% of the sequences. Note that although the degree of conservation for isoleucine is the same for these four tools, the sequences in which isoleucine is conserved differ slightly, so the tools did not align this position identically.
References
Secondary structure information