Sequence Alignments BCKDHA

From Bioinformatikpedia
Revision as of 13:29, 9 August 2011 by Demel (talk | contribs) (Muscle)

Sequence Alignments

Sequence searches

In order to find homologous sequences for our query protein BCKDHA, we used the following tools:

  • FASTA

../bin/fasta36 sequence.fasta database > FastaOutput.txt

  • BLAST

blastall -p blastp -d database -i sequence.fasta > BlastOutput.txt

  • PSIBLAST

blastpgp -i sequence.fasta -j iterations -h evalueCutoff -d database > PsiblastOutput.txt

  • HHSearch

hhsearch -i query -d database -o output.txt

database = /data/blast/nr/nr
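
The PSI-BLAST runs discussed below combine two iteration counts (3 and 5) with two E-value cutoffs. As a minimal sketch, the four command lines could be generated as follows. It reuses the blastpgp flags shown above; the output file naming is a hypothetical scheme, and the second cutoff is written as it appears in the table of chosen sequences (10E-6).

```python
from itertools import product

DATABASE = "/data/blast/nr/nr"   # database path given above
ITERATIONS = (3, 5)              # iteration counts used in the runs
CUTOFFS = ("0.005", "10E-6")     # E-value cutoffs used in the runs


def psiblast_commands(query="sequence.fasta"):
    """Return one blastpgp command line per (iterations, cutoff) pair."""
    commands = []
    for iterations, cutoff in product(ITERATIONS, CUTOFFS):
        # Hypothetical output file naming, not taken from the original runs.
        outfile = f"Psiblast_{iterations}_{cutoff}.txt"
        commands.append(
            f"blastpgp -i {query} -j {iterations} -h {cutoff} "
            f"-d {DATABASE} > {outfile}"
        )
    return commands


if __name__ == "__main__":
    for command in psiblast_commands():
        print(command)
```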


Result Statistics

  • Overlap

To illustrate the overlap between the sequences returned by the different sequence searches, Venn diagrams were drawn.

Figure 1: Comparison of the results for the PSI-BLAST runs with 3 iterations each. (PSI1 = PSI-BLAST run with 3 iterations, E-value cutoff 0.005; PSI3 = PSI-BLAST run with 3 iterations, E-value cutoff 10E-6)
Figure 2: Comparison of the results for the PSI-BLAST runs with 5 iterations each. (PSI2 = PSI-BLAST run with 5 iterations, E-value cutoff 0.005; PSI4 = PSI-BLAST run with 5 iterations, E-value cutoff 10E-6)

The two PSI-BLAST runs with 3 iterations returned exactly the same sequences matching our query BCKDHA, and the same is true for the two runs with 5 iterations (see Figure 1 and Figure 2). We therefore combined their results for the Venn diagram in Figure 3 (created with [6]):

Figure 3: Venn diagram showing the overlap of found sequences for the PSI-BLAST runs, BLAST and FASTA

When BLAST and FASTA are included, the most striking result is that FASTA found more than 2600 hits that were not returned by any of the other search algorithms. This may be because the default FASTA search was run without any restriction on the E-value. BLAST returned only one hit that was not found in any of the PSI-BLAST searches, and this hit is also included in the FASTA results. All PSI-BLAST results were also detected by FASTA. PSI-BLAST with 5 iterations returned 6 more hits than PSI-BLAST with 3 iterations, possibly because new sequences could be added to the search profile during the two additional iterations. Overall, the search algorithms returned a large number of identical sequences matching our query, and the additional hits found by some of them can be attributed to their different search strategies.
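
The region counts of such a Venn diagram can be derived with plain set operations. The following is a minimal sketch, assuming the hit identifiers of each tool have already been parsed into sets; the identifiers used here are toy values, not real search results.

```python
def venn_regions(hits_by_tool):
    """Map each combination of tools to the IDs found by exactly those tools."""
    regions = {}
    all_ids = set().union(*hits_by_tool.values())
    for seq_id in all_ids:
        # The set of tools that reported this ID identifies its Venn region.
        tools = frozenset(t for t, ids in hits_by_tool.items() if seq_id in ids)
        regions.setdefault(tools, set()).add(seq_id)
    return regions


# Toy identifiers, not real search results.
hits = {
    "FASTA": {"a", "b", "c", "d"},
    "BLAST": {"a", "b"},
    "PSI-BLAST": {"a"},
}
regions = venn_regions(hits)
for tools, ids in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(tools), len(ids))
```

Each key is the exact set of tools that reported an ID, so the FASTA-only region corresponds to the key containing only "FASTA".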

HHSearch is the exception: it returned only a few aligned sequences (10), and these include sequences that were not found by any of the other algorithms. The corresponding Venn diagram is shown in Figure 4.

Figure 4: Venn Diagram showing the overlap of found sequences for HHSearch, Blast, Fasta and Psiblast


  • Identity Distribution

Figure 5 shows the Identity Distribution of the results from the different search tools.

Figure 5: Identity Distribution for the results from the Sequence Searches

Because the two PSI-BLAST runs with 3 iterations returned the same hits, as did the two runs with 5 iterations, the identity distributions of each pair were pooled and are shown in the same colour. The identity distribution of the FASTA run is remarkable: it contains far more hits with low identity than the other runs. In total, FASTA returned almost 3000 hits (default parameter search), while the BLAST and PSI-BLAST runs returned no more than 300 hits each. Many of the FASTA hits therefore have low identity and a rather high E-value (see below), but FASTA still returned some good results.
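
An identity distribution like the one in Figure 5 amounts to counting hits per identity bin. The sketch below assumes the percent identities have already been extracted from the tool outputs; the bin width of 10 percentage points and the example values are illustrative assumptions, not the settings used for the figure.

```python
from collections import Counter


def identity_histogram(identities, bin_width=10):
    """Count hits per identity bin; keys are the lower bin edges in percent."""
    counts = Counter()
    for identity in identities:
        # Clamp 100% into the top bin so it does not form a bin of its own.
        lower = min(int(identity // bin_width) * bin_width, 100 - bin_width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Toy identity values in percent, not taken from the actual search output.
example = [99.0, 95.0, 82.5, 47.0, 38.1, 31.0, 100.0]
print(identity_histogram(example))
```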


  • E-value Distribution

In Figure 6 the E-values for the sequence search results are displayed.

Figure 6: E-value Distribution for the results from the Sequence Searches

The E-value distributions of the BLAST and PSI-BLAST results are quite similar. Only FASTA covers a much wider E-value range, which can be explained by the fact that FASTA returned about 10 times more hits, many of which have low identity and therefore a rather high E-value.


  • HSSP recall

To evaluate the outputs of the different search tools against the HSSP database, a preprocessing step was needed first: HSSP uses UniProt identifiers, whereas all other programs were run on the nr database, which includes identifiers from PDB, RefSeq, Swiss-Prot, PIR, PRF and EMBL. The mapping was performed using the ID Mapping tool provided by NCBI[7]. Figure 7 shows the overlap of sequences found by the sequence search tools and sequences obtained from HSSP.

Figure 7: Overlap of sequences with HSSP

As expected from the large number of hits in the HSSP file, most of the related proteins listed there were not identified by the other tools. The best overlap is observed for PSI-BLAST (5 iterations; the same holds for the runs with 3 iterations and for BLAST, as their outputs are nearly identical), where 90% of the converted IDs are also listed in HSSP. A large fraction of the FASTA hits is also covered by HSSP. However, as FASTA was run without any E-value restriction, it also returned many sequences with low identity, which are unlikely to share the same structure and are therefore not found in the structure-based HSSP alignment.

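Precision and recall were calculated for the different search methods:

Method                                              Precision   Recall
Blast                                               0.97        0.07
Fasta                                               0.37        0.24
HHSearch                                            0.20        0
Psiblast (3 iterations, E-value cutoff 0.005)       0.89        0.07
Psiblast (5 iterations, E-value cutoff 0.005)       0.87        0.07
Psiblast (3 iterations, E-value cutoff 10E-6)       0.89        0.07
Psiblast (5 iterations, E-value cutoff 10E-6)       0.87        0.07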

The highest precision is reached by BLAST: 97% of all sequences it found are also listed in HSSP. Its recall, however, is very low, as only 7% of the possible homologous sequences are found. The highest recall is reached by FASTA with 24%, which was expected given the large number of returned results.
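
The two measures can be computed directly from the ID sets. This is a minimal sketch, assuming precision is the fraction of a tool's hits that are also listed in HSSP and recall is the fraction of HSSP entries recovered by the tool, with all identifiers already mapped to the same namespace. The IDs below are toy values.

```python
def precision_recall(found_ids, hssp_ids):
    """Return (precision, recall) of the found IDs relative to the HSSP IDs."""
    found, reference = set(found_ids), set(hssp_ids)
    true_positives = len(found & reference)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy identifiers: 3 of the 4 hits are among the 10 reference entries.
found = ["P1", "P2", "P3", "X1"]
hssp = [f"P{i}" for i in range(1, 11)]
print(precision_recall(found, hssp))
```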

Sequences chosen for the multiple alignment:

SeqIdentifier                   Seq Identity   Source

99-90% Sequence Identity
56967006|pdb|1X7Z               99%            PSI-BLAST, 3 iterations, E-value cutoff 0.005
7546384|pdb|1DTW                95%            BLAST
34810149|pdb|1OLU               99%            PSI-BLAST, 3 iterations, E-value cutoff 10E-6
13277798|gb|AAH03787.1          95%            PSI-BLAST, 3 iterations, E-value cutoff 10E-6
148727347|ref|NP_001092034.1    95%            BLAST

89-60% Sequence Identity
196011048|ref|XP_002115388.1    66%            PSI-BLAST, 3 iterations, E-value cutoff 0.005
149543950|ref|XP_001517857.1    67%            BLAST
47227873|emb|CAG09036.1         82.5%          FASTA
47196273|emb|CAF88112.1         81%            PSI-BLAST, 5 iterations, E-value cutoff 0.005
12964598|dbj|BAB32665.1         88%            PSI-BLAST, 5 iterations, E-value cutoff 10E-6

59-40% Sequence Identity
193290664|gb|ACF17640.1         47%            BLAST
215431443|ref|ZP_03429362.1     40%            FASTA
225557347|gb|EEH05633.1         51%            PSI-BLAST, 3 iterations, E-value cutoff 10E-6
58267618|ref|XP_570965.1        50%            PSI-BLAST, 5 iterations, E-value cutoff 0.005
162449842|ref|YP_001612209.1    41%            PSI-BLAST, 5 iterations, E-value cutoff 10E-6

39-20% Sequence Identity
56966700|pdb|1W85               31%            PSI-BLAST, 3 iterations, E-value cutoff 0.005
5822330|pdb|1QS0                38.1%          FASTA
13516864|dbj|BAB40585.1         33%            PSI-BLAST, 3 iterations, E-value cutoff 10E-6
284166853|ref|YP_003405132.1    35%            PSI-BLAST, 5 iterations, E-value cutoff 0.005
76800932|ref|YP_325940.1        34%            PSI-BLAST, 5 iterations, E-value cutoff 10E-6

The sequences for the multiple sequence alignment were downloaded from NCBI. To retrieve a sequence in FASTA format, replace the sequence ID in the following link: http://www.ncbi.nlm.nih.gov/protein/76800932?report=fasta
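
The download links can be built from the identifiers in the table above. The sketch below assumes that the leading number of each identifier is the sequence ID expected by the link, as in the example (76800932); it only constructs the URLs and does not download anything.

```python
def gi_number(identifier):
    """Extract the leading number from an identifier like '76800932|ref|YP_325940.1'."""
    gi = identifier.split("|", 1)[0].strip()
    if not gi.isdigit():
        raise ValueError(f"no leading sequence ID in {identifier!r}")
    return gi


def ncbi_fasta_url(identifier):
    """Build the FASTA download link following the URL pattern given in the text."""
    return f"http://www.ncbi.nlm.nih.gov/protein/{gi_number(identifier)}?report=fasta"


for identifier in ("76800932|ref|YP_325940.1", "5822330|pdb|1QS0"):
    print(ncbi_fasta_url(identifier))
```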

Multiple Alignments

The following tools were used to create a multiple sequence alignment:

  • ClustalW

clustalw sequences.fasta

  • T-Coffee

t_coffee -seq sequences.fasta

  • T-Coffee (3D, Expresso mode)

t_coffee -seq sequences.fasta -mode expresso

  • Muscle

muscle -in sequences.fasta -out output.aln

  • Cobalt (downloaded separately)

./cobalt -i sequences.fasta -norps T > output.aln

Conservation and Gaps

                   Gaps                     Conserved columns
Alignment method   Count   Avg gap length   100% cons   >90% cons   >80% cons   >70% cons   >60% cons   >50% cons
ClustalW           12      3.75             24          50          31          49          54          72
T-Coffee           25      4.56             24          50          31          49          54          72
T-Coffee (3D)      56      4.75             21          45          34          49          64          71
Cobalt             19      3.26             24          55          31          45          60          71
Muscle             17      4.76             26          46          22          31          14          8
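
Statistics of this kind can be derived from an alignment read into a list of equally long strings. The following is only a sketch under assumptions: gaps are counted as runs of '-' in a single aligned sequence, and the conservation of a column is the fraction of all rows that carry its most common residue. The definitions actually used for the table are not stated, and the alignment below is a toy example, not the BCKDHA alignment.

```python
from collections import Counter
import re


def gap_runs(aligned_sequence):
    """Return the lengths of all runs of '-' in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned_sequence)]


def column_conservation(alignment):
    """Return, per column, the fraction of rows carrying the most common residue."""
    n_rows = len(alignment)
    fractions = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        # Columns consisting only of gaps get a conservation of 0.
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        fractions.append(top / n_rows)
    return fractions


# Toy alignment, not the BCKDHA alignment.
alignment = ["MK--LV", "MKA-LV", "MR--LI", "MKA-LV"]
runs = gap_runs(alignment[0])
print("gaps:", len(runs), "average length:", sum(runs) / len(runs))
conservation = column_conservation(alignment)
print("fully conserved columns:", sum(f == 1.0 for f in conservation))
```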

Gaps in secondary structure

ClustalW

The gaps in secondary structure elements for the ClustalW alignment are shown in Figure 8

Figure 8: ClustalW gaps in secondary structure elements [1]
Gap position Gap length Secondary structure
109-110 4 Helix
142-143 1 Helix
235-236 1 Beta strand
276-277 11 Helix
294-295 1 Beta strand
394-395 5 Helix

T-Coffee

The gaps in secondary structure elements for the T-Coffee alignment are shown in Figure 9

Figure 9: T-Coffee gaps in secondary structure elements [2]
Gap position Gap length Secondary structure
141-142 1 Helix
232-233 1 Beta strand
275-276 11 Helix
310-311 1 Helix
369-370 5 Turn
395-396 18 Helix
398-399 5 Helix

T-Coffee 3d

The gaps in secondary structure elements for the T-Coffee (3D) alignment are shown in Figure 10

Figure 10: T-Coffee (3D) gaps in secondary structure elements [3]
Gap position Gap length Secondary structure
101-102 1 Helix
108-109 4 Helix
115-116 1 Helix
116-117 1 Helix
141-142 1 Helix
153-154 1 Beta strand
163-164 1 Helix
177-178 3 Helix
234-235 1 Beta strand
263-264 4 Beta strand
265-266 1 Beta strand
276-277 2 Helix
308-309 8 Helix
309-310 5 Helix
314-315 6 Helix
362-363 6 Helix
371-372 4 Turn
376-377 1 Helix
380-381 1 Helix
382-383 7 Helix
383-384 3 Helix
384-385 2 Helix
387-388 2 Helix
394-395 5 Helix

Cobalt

The gaps in secondary structure elements for the Cobalt alignment are shown in Figure 11

Figure 11: Cobalt gaps in secondary structure elements [4]
Gap position Gap length Secondary structure
108-109 4 Helix
141-142 1 Helix
177-178 3 Helix
276-277 11 Helix
294-295 1 Beta strand
305-306 1 Helix
311-312 2 Helix
387-388 1 Helix
388-389 1 Helix
395-396 13 Helix

Muscle

The gaps in secondary structure elements for the Muscle alignment are shown in Figure 12

Figure 12: Muscle gaps and structure [5]
Gap position Gap length Secondary structure
109-110 4 Helix
141-142 1 Helix
177-178 3 Helix
276-277 11 Helix
294-295 1 Beta strand
305-306 1 Helix
394-395 5 Helix

Discussion

Functionally important residues

According to UniProt[8], the functionally important sites are:

  • Metal binding site, S: 206 (161)
  • Metal binding site, Q: 211 (166)
  • Metal binding site, I: 212 (167)

Because the UniProt sequence is 445 aa long and the PDB sequence (1U5B) only 400 aa (it lacks the transit sequence), an offset of 45 residues has to be taken into account. The positions in brackets are the offset-corrected positions that were used to determine the conservation of the functional sites in the multiple sequence alignments.
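
The offset correction is a simple subtraction. The sketch below assumes that the whole length difference between the two sequences lies at the N-terminus (the transit sequence), which is what the positions in brackets imply.

```python
UNIPROT_LENGTH = 445                    # full-length UniProt sequence
PDB_LENGTH = 400                        # PDB sequence without transit sequence
OFFSET = UNIPROT_LENGTH - PDB_LENGTH    # 45 residues


def uniprot_to_pdb(position):
    """Convert a 1-based UniProt position to the PDB sequence numbering."""
    if position <= OFFSET:
        raise ValueError("position lies within the transit sequence")
    return position - OFFSET


for residue, position in (("S", 206), ("Q", 211), ("I", 212)):
    print(residue, position, uniprot_to_pdb(position))
```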

Conservation   ClustalW   Cobalt   Muscle   T-Coffee   T-Coffee 3D
Site 161 (S)   20/21      16/21    16/21    16/21      15/21
Site 166 (Q)   21/21      21/21    21/21    21/21      21/21
Site 167 (I)   18/21      14/21    14/21    14/21      14/21

Figures 13-17 show parts (including the functional sites) of the multiple sequence alignments, visualized with Jalview.

Figure 13: ClustalW multiple sequence alignment: The conserved sites 161, 166, 167 in the amino acid sequence correspond to the positions 268, 274, 275 in the msa.
Figure 14: Cobalt multiple sequence alignment: The conserved sites 161, 166, 167 in the amino acid sequence correspond to the positions 268, 274, 275 in the msa.
Figure 15: Muscle multiple sequence alignment: The conserved sites 161, 166, 167 in the amino acid sequence correspond to the positions 306, 312, 313 in the msa.
Figure 16: T-Coffee multiple sequence alignment: The conserved sites 161, 166, 167 in the amino acid sequence correspond to the positions 268, 274, 275 in the msa.
Figure 17: T-Coffee 3D multiple sequence alignment: The conserved sites 161, 166, 167 in the amino acid sequence correspond to the positions 423, 430, 431 in the msa.
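
The correspondence between a position in the amino acid sequence and a column of the alignment can be found by counting the non-gap characters of the aligned reference sequence. The sketch below also counts how many rows share the reference residue in that column, which is one possible way to obtain values in the x/21 form shown in the conservation table. The alignment is a toy example and the function names are illustrative.

```python
def residue_to_column(aligned_sequence, residue_position):
    """Return the 1-based alignment column of a 1-based residue position."""
    seen = 0
    for column, char in enumerate(aligned_sequence, start=1):
        if char != "-":
            seen += 1
            if seen == residue_position:
                return column
    raise ValueError("residue position exceeds sequence length")


def site_conservation(alignment, reference_index, residue_position):
    """Count the rows that share the reference residue at the given site."""
    reference = alignment[reference_index]
    column = residue_to_column(reference, residue_position)
    residue = reference[column - 1]
    matches = sum(1 for row in alignment if row[column - 1] == residue)
    return matches, len(alignment)


# Toy alignment; the first row plays the role of the reference sequence.
alignment = ["M-KS-Q", "MAKS-Q", "M-RT-Q"]
print(residue_to_column(alignment[0], 3))
print(site_conservation(alignment, 0, 3))
```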

References

Secondary structure information
