Difference between revisions of "Task 2 (MSUD)"

From Bioinformatikpedia
(BCKDHB)
(Discussion)
 
(42 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
== Sequence searches ==
 
== Sequence searches ==
   
[[Task 2 lab journal (MSUD)#Sequence searches|lab journal]]
+
[[Task 2 lab journal (MSUD)#Sequence searches|Lab journal]]
   
 
=== Results ===
 
=== Results ===
  +
We have performed sequence search experiments for all of the 4 subunits of BCKDC. In this page, we mainly describe and discuss the results for the subunit BCKDHA. Results and discussions for other 3 subunits are covered in this page: [[Task 2 (MUSD) Additional Results|Additional Results]].
   
  +
==== Distributions of E-value and sequence identity ====
The query sequences for the 4 subunits of BCKDC locate at <nowiki>/mnt/home/student/weish/master-practical-2013/task01/</nowiki>.
 
  +
<gallery widths=500px heights=411px caption="E-value and identity distribution for different sequence search methods">
 
Results for sequence search locate in the directory <nowiki>/mnt/home/student/weish/master-practical-2013/task02/01-seq-search/results</nowiki>.
 
For BLAST and PSI-BLAST, statistics (such as E-value, probability and identity) are stored in <nowiki>*.tsv</nowiki> files. Detailed results are shown in xml files. For HHBlits, the <nowiki>*.hhr</nowiki> files contain information about statistics and hits.
 
 
==== BCKDHA ====
 
<gallery widths=390px heights=320px caption="E-value and identity distribution for different sequence search methods">
 
 
File:E-value-distribution BCKDHA.png|E-value distribution of sequence search methods. (Query sequence is RefSeq of BCKDHA)
 
File:E-value-distribution BCKDHA.png|E-value distribution of sequence search methods. (Query sequence is RefSeq of BCKDHA)
 
File:Identity distribution BCKDHA.png|Identity distribution of sequence search methods. (Query sequence is RefSeq of BCKDHA)
 
File:Identity distribution BCKDHA.png|Identity distribution of sequence search methods. (Query sequence is RefSeq of BCKDHA)
 
</gallery>
 
</gallery>
   
<gallery widths=358px heights=293px perrow=2 caption="Intersection of hits between different sequence search methods">
+
==== Intersection of hits ====
  +
<gallery widths=500px heights=400px perrow=2 caption="Intersection of hits between different sequence search methods">
 
File:Intersection to blast BCKDHA.png|Relative intersection of hits between BLAST and other sequence search methods.
 
File:Intersection to blast BCKDHA.png|Relative intersection of hits between BLAST and other sequence search methods.
 
File:Intersection to psiblast(iter. 2, e-val. 0.002) BCKDHA.png|Relative intersection between PSI-BLAST(iter. 2, E-value 0.002) and other sequence search methods.
 
File:Intersection to psiblast(iter. 2, e-val. 0.002) BCKDHA.png|Relative intersection between PSI-BLAST(iter. 2, E-value 0.002) and other sequence search methods.
Line 25: Line 22:
 
</gallery>
 
</gallery>
   
  +
==== Evaluation through structure and function ====
==== BCKDHB ====
 
<gallery widths=390px heights=320px caption="E-value and identity distribution for different sequence search methods">
+
<gallery widths=500px heights=400px perrow=2 caption="Distribution of SCOP folds">
  +
File:SCOP histogram blast BCKDHA.png|Distribution of SCOP fold in BLAST hits(only one classified PDB structure was found in SCOP)
File:E-value-distribution BCKDHB.png|E-value distribution of sequence search methods. (Query sequence is RefSeq of BCKDHB)
 
  +
File:SCOP histogram psiblast(iter. 2, e-val. 0.002) BCKDHA.png|Distribution of SCOP fold in hits of PSI-BLAST(iter. 2, E-value 0.002)
File:Identity distribution BCKDHB.png|Identity distribution of sequence search methods. (Query sequence is RefSeq of BCKDHB)
 
 
</gallery>
 
</gallery>
   
<gallery widths=358px heights=293px perrow=2 caption="Intersection of hits between different sequence search methods">
+
<gallery widths=335px heights=267px perrow=3 caption="Top-5 common GO terms in hits with GO annotation">
File:Intersection to blast BCKDHB.png|Relative intersection of hits between BLAST and other sequence search methods.
+
File:GO histogram blast BCKDHA.png|Top-5 common GO terms in BLAST hits
File:Intersection to psiblast(iter. 2, e-val. 0.002) BCKDHB.png|Relative intersection between PSI-BLAST(iter. 2, E-value 0.002) and other sequence search methods.
+
File:GO histogram psiblast(iter. 2, e-val. 0.002) BCKDHA.png|Top-5 common GO terms in hits of PSI-BLAST(iter. 2, E-value 0.002)
File:Intersection to psiblast(iter. 2, e-val. 10e-10) BCKDHB.png|Relative intersection between PSI-BLAST(iter. 2, E-value 10e-10) and other sequence search methods.
+
File:GO histogram psiblast(iter. 10, e-val. 10e-10) BCKDHA.png|Top-5 common GO terms in hits of PSI-BLAST(iter. 10, E-value 10e-10)
File:Intersection to psiblast(iter. 10, e-val. 0.002) BCKDHB.png|Relative intersection between PSI-BLAST(iter. 10, E-value 0.002) and other sequence search methods.
 
File:Intersection to psiblast(iter. 10, e-val. 10e-10) BCKDHB.png|Relative intersection between PSI-BLAST(iter. 10, E-value 10e-10) and other sequence search methods.
 
File:Intersection to hhblits BCKDHB.png|Relative intersection between HHBlits and other sequence search methods.
 
 
</gallery>
 
</gallery>
   
==== DBT ====
+
=== Discussion ===
  +
* E-value distribution:
  +
** Very few hits were found with very low E-values. These hits show high statistical significance.
  +
** Because that different databases were used for BLAST/PSI-BLAST and HHBLits, hhblits has a set of hits with larger range of e-value.
  +
*** E-value distribution of PSI-BLAST shift to low E-value side with more iterations. Although the hits are statistically more significant, but the biological significance should be tested. If more iterations were used the shift could be even larger, so the overlap between statistical hits and biological homologs must be evaluated. A proper number of iterations should be selected.
   
  +
* Identity distribution
==== DLD ====
 
  +
** Results show that BLAST depends mostly on sequence identity. Homologs with low sequence identity but high biological similarity could be lost.
   
  +
* Intersection of hits
=== Discussion ===
 
  +
** PSI-BLAST with 2 iterations has bigger intersection with BLAST.
  +
** Two PSI-BLAST run with 2 iterations and different E-value cutoffs have very similar set of hits.
  +
** PSI-BLAST with 10 iterations has smaller intersection with BLAST.
  +
** Two PSI-BLAST runs with 10 iterations and different E-value cutoffs share the fewest common hits. The explanation could be, the E-value cutoff may have higher influence than the number of iterations.
  +
  +
* SCOP of hit sequences
  +
** Both BLAST and PSI-BLAST find the right fold class for BCKDHA.
  +
** PSI-BLAST finds more hits in the fold class that describes the query protein best. Most hits have c.36 which is for Thiamin diphosphate-binding fold. This fold classification is just the main binding function of BCKDHA.
  +
** PSI-BLAST also find hits in more fold classes which may describe biological similarities of domains and motives between hits and query protein.
  +
  +
* Gene Ontology of hit proteins
  +
** Top-5 GO terms in hits of PSI-BLAST with different iterations are more conserved. They also have similar ranking of frequency.
  +
** PSI-BLAST finds out hits with more GO terms. It may be more sensitive to functional patterns in sequence.
   
 
== Multiple sequence alignments ==
 
== Multiple sequence alignments ==
   
[[Task 2 lab journal (MSUD)#Multiple sequence alignments|lab journal]]
+
[[Task 2 lab journal (MSUD)#Multiple sequence alignments|Lab journal]]
   
 
=== Results ===
 
=== Results ===
   
  +
We have created MSAs for all of the 4 subunits of BCKDC. In this page, we mainly describe and discuss the results for the subunit BCKDHA. Results and discussions for other 3 subunits are covered in this page: [[Task 2 (MUSD) Additional Results|Additional Results]].
The datsets can be found in <code>/mnt/home/student/schillerl/MasterPractical/task2/datasets/</code>.
 
 
The MSAs are located at <code>/mnt/home/student/schillerl/MasterPractical/task2/MSAs/</code>.
 
   
 
In the following sections the MSAs, visualised with [http://www.jalview.org/ Jalview], are shown.
 
In the following sections the MSAs, visualised with [http://www.jalview.org/ Jalview], are shown.
   
==== BCKDHA ====
+
==== Low sequence identity ====
 
===== Low sequence identity =====
 
   
 
Mafft:
 
Mafft:
Line 71: Line 79:
 
[[Image:MSUD_BCKDHA_low_seq_ident_tcoffee.png]]
 
[[Image:MSUD_BCKDHA_low_seq_ident_tcoffee.png]]
   
  +
Espresso:
===== High sequence identity =====
 
  +
[[Image:MSUD_BCKDHA_low_seq_ident_espresso.png]]
  +
  +
==== High sequence identity ====
   
 
Mafft:
 
Mafft:
Line 82: Line 93:
 
[[Image:MSUD_BCKDHA_high_seq_ident_tcoffee.png]]
 
[[Image:MSUD_BCKDHA_high_seq_ident_tcoffee.png]]
   
  +
Espresso:
===== Whole range sequence identity =====
 
  +
[[Image:MSUD_BCKDHA_high_seq_ident_espresso.png]]
  +
  +
==== Whole range sequence identity ====
   
 
Mafft:
 
Mafft:
Line 93: Line 107:
 
[[Image:MSUD_BCKDHA_whole_range_seq_ident_tcoffee.png]]
 
[[Image:MSUD_BCKDHA_whole_range_seq_ident_tcoffee.png]]
   
  +
Espresso:
==== BCKDHB ====
 
  +
[[Image:MSUD_BCKDHA_whole_range_seq_ident_espresso.png]]
 
===== Low sequence identity =====
 
 
Mafft:
 
[[Image:MSUD_BCKDHB_low_seq_ident_mafft.png]]
 
 
Muscle:
 
[[Image:MSUD_BCKDHB_low_seq_ident_muscle.png]]
 
 
T-Coffee:
 
[[Image:MSUD_BCKDHB_low_seq_ident_tcoffee.png]]
 
 
===== High sequence identity =====
 
 
Mafft:
 
[[Image:MSUD_BCKDHB_high_seq_ident_mafft.png]]
 
 
Muscle:
 
[[Image:MSUD_BCKDHB_high_seq_ident_muscle.png]]
 
 
T-Coffee:
 
[[Image:MSUD_BCKDHB_high_seq_ident_tcoffee.png]]
 
 
===== Whole range sequence identity =====
 
 
Mafft:
 
[[Image:MSUD_BCKDHB_whole_range_seq_ident_mafft.png]]
 
 
Muscle:
 
[[Image:MSUD_BCKDHB_whole_range_seq_ident_muscle.png]]
 
 
T-Coffee:
 
[[Image:MSUD_BCKDHB_whole_range_seq_ident_tcoffee.png]]
 
 
==== DBT ====
 
 
===== Low sequence identity =====
 
 
===== High sequence identity =====
 
 
===== Whole range sequence identity =====
 
 
==== DLD ====
 
 
===== Low sequence identity =====
 
 
Mafft:
 
[[Image:MSUD_DLD_low_seq_ident_mafft.png]]
 
 
Muscle:
 
[[Image:MSUD_DLD_low_seq_ident_muscle.png]]
 
 
T-Coffee:
 
[[Image:MSUD_DLD_low_seq_ident_tcoffee.png]]
 
 
===== High sequence identity =====
 
 
Mafft:
 
[[Image:MSUD_DLD_high_seq_ident_mafft.png]]
 
 
Muscle:
 
[[Image:MSUD_DLD_high_seq_ident_muscle.png]]
 
 
T-Coffee:
 
[[Image:MSUD_DLD_high_seq_ident_tcoffee.png]]
 
 
===== Whole range sequence identity =====
 
 
Mafft:
 
[[Image:MSUD_DLD_whole_range_seq_ident_mafft.png]]
 
 
Muscle:
 
[[Image:MSUD_DLD_whole_range_seq_ident_muscle.png]]
 
 
T-Coffee:
 
[[Image:MSUD_DLD_whole_range_seq_ident_tcoffee.png]]
 
   
 
=== Discussion ===
 
=== Discussion ===
Line 175: Line 114:
 
For the datasets with high sequence identity the three MSA programs Mafft, Muscle and T-Coffee come to similar results and find almost the same conserved blocks. Sometimes T-Coffee arranges gaps differently than the others and so does not find as much conserved columns. Especially at the ends of the sequences, the results of the programs differ a little. This is due to different scoring schemes that are used in the programs.
 
For the datasets with high sequence identity the three MSA programs Mafft, Muscle and T-Coffee come to similar results and find almost the same conserved blocks. Sometimes T-Coffee arranges gaps differently than the others and so does not find as much conserved columns. Especially at the ends of the sequences, the results of the programs differ a little. This is due to different scoring schemes that are used in the programs.
   
For low sequence identity, the programs have problems to find the right alignment. They don't agree in the position of gaps and also sometimes find different conserved columns. They don't cope with low similarity and so one can't really rely on these results.
+
For low sequence identity, the programs have problems to find the right alignment. They do not agree in the position of gaps and also sometimes find different conserved columns. They do not cope with low similarity and so one cannot really rely on these results. Here structural information, as it is used in Espresso (which belongs to T-Coffee), can help to find the right alignment: Espresso can align more residues than T-Coffee. For whole range sequence identity the results are similar w. r. t. many and different gaps at the ends of the sequences, but the programs agree more in the conserved columns that they find.
 
For whole range sequence identity the results are similar w. r. t. many and different gaps at the ends of the sequences, but the programs agree more in the conserved columns that they find.
 
   
The results of Muscle and Mafft seem more similar to each other than to those of T-Coffee. T-Coffee often treats the ends of the sequences, which have low sequence identity, differently than the others. It is striking that almost always the alignment of Muscle has the shortest and the one of T-Coffee has the highest length, especially in cases with low sequence identity. If an alignment is very long, this means there are many gaps and less aligned residues, this might be a sign of bad alignment quality.
+
The results of Muscle and Mafft seem more similar to each other than to those of T-Coffee. T-Coffee often treats the ends of the sequences, which have low sequence identity, differently than the others. It is striking that almost always the alignment of Muscle has the shortest length, especially in cases with low sequence identity. If an alignment is very long, this means there are many gaps and less aligned residues, this might be a sign of bad alignment quality.
   
Altogether, there appear regions with many conserved columns and those with many gaps. The conserved blocks or columns correspond to secondary structure elements and functionally important residues, respectively. Gaps in the alignment appear in regions where there are loops in the structure of the protein, so that insertions or deletions that occur during evolution don't alter the overall structure or function of the protein.
+
Altogether, there appear regions with many conserved columns and those with many gaps. The conserved blocks or columns correspond to secondary structure elements and functionally important residues, respectively. Gaps in the alignment appear in regions where there are loops in the structure of the protein, so that insertions or deletions that occur during evolution do not alter the overall structure or function of the protein. We compared the alignments with the DSSP secondary structure assignment (see [[Task 3 (MSUD)#Secondary structure|Task 3]]) and found that in fact most gaps do not lie in secondary structure elements - there are only some exceptions for alignments with distant relatives and near the end of the protein sequence. Also, the region around residues 157-159 (corresponding to positions 165-167 in the low sequence identity MSA derived from Mafft), which is a thiamine disphosphate binding region according to [http://www.uniprot.org/uniprot/P12694 Uniprot], is conserved, especially ARG159.
   
 
As criteria for a good alignment one could run different alignment algorithms like in this task and compare the results. If one of them finds more conserved columns, this might be better than another. Different programs can be better than others if different datasets are used, so it is always a good idea to try more than one algorithm and pick out the best result. Mafft is often a good choice because it generated relatively precise results but still is very fast.
 
As criteria for a good alignment one could run different alignment algorithms like in this task and compare the results. If one of them finds more conserved columns, this might be better than another. Different programs can be better than others if different datasets are used, so it is always a good idea to try more than one algorithm and pick out the best result. Mafft is often a good choice because it generated relatively precise results but still is very fast.

Latest revision as of 11:53, 28 August 2013

Sequence searches

Lab journal

Results

We performed sequence searches for all four subunits of BCKDC. On this page, we mainly describe and discuss the results for the subunit BCKDHA. Results and discussion for the other three subunits are covered on a separate page: Additional Results.

Distributions of E-value and sequence identity

Intersection of hits

Evaluation through structure and function

Discussion

  • E-value distribution
    • Very few hits were found with very low E-values. These hits show high statistical significance.
    • Because different databases were used for BLAST/PSI-BLAST and HHblits, the HHblits hits span a larger range of E-values.
    • The E-value distribution of PSI-BLAST shifts toward lower E-values with more iterations. Although these hits are statistically more significant, their biological significance still has to be tested. With even more iterations the shift could become larger, so the overlap between statistically significant hits and true biological homologs must be evaluated, and a suitable number of iterations should be chosen.
  • Identity distribution
    • The results show that BLAST depends mostly on sequence identity, so homologs with low sequence identity but high biological similarity could be missed.
  • Intersection of hits
    • PSI-BLAST with 2 iterations has a larger intersection with BLAST.
    • The two PSI-BLAST runs with 2 iterations and different E-value cutoffs have very similar sets of hits.
    • PSI-BLAST with 10 iterations has a smaller intersection with BLAST.
    • The two PSI-BLAST runs with 10 iterations and different E-value cutoffs share the fewest hits. A possible explanation is that the E-value cutoff has a stronger influence than the number of iterations.
  • SCOP of hit sequences
    • Both BLAST and PSI-BLAST find the right fold class for BCKDHA.
    • PSI-BLAST finds more hits in the fold class that describes the query protein best. Most hits are classified as c.36, the thiamin diphosphate-binding fold, which matches the main binding function of BCKDHA.
    • PSI-BLAST also finds hits in additional fold classes, which may reflect biological similarities in domains and motifs between the hits and the query protein.
  • Gene Ontology of hit proteins
    • The top-5 GO terms in the PSI-BLAST hits are largely conserved across different numbers of iterations and have a similar frequency ranking.
    • PSI-BLAST finds hits with more GO terms, so it may be more sensitive to functional patterns in the sequence.
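
The intersection and GO comparisons discussed above can be illustrated with a minimal sketch. It assumes that "relative intersection" means the fraction of one method's hits that another method also finds, which the page does not define explicitly. The function names, hit identifiers and data layout are hypothetical and are not taken from the result files.

```python
from collections import Counter


def relative_intersection(reference_hits, other_hits):
    """Fraction of the reference hit set that also appears in the other set."""
    reference = set(reference_hits)
    if not reference:
        return 0.0
    return len(reference & set(other_hits)) / len(reference)


def top_terms(annotations, k=5):
    """Return the k most frequent GO terms over all annotated hits.

    `annotations` maps a hit identifier to an iterable of GO term IDs.
    Each term is counted at most once per hit.
    """
    counts = Counter()
    for terms in annotations.values():
        counts.update(set(terms))
    return counts.most_common(k)


# Hypothetical hit identifiers, for illustration only.
blast_hits = {"P1", "P2", "P3", "P4"}
psiblast_hits = {"P2", "P3", "P4", "P5", "P6", "P7"}

print(relative_intersection(blast_hits, psiblast_hits))  # 0.75
print(relative_intersection(psiblast_hits, blast_hits))  # 0.5
```

Under this definition the measure is asymmetric: a method that returns many more hits can cover most of the BLAST hits while BLAST covers only a small part of its hits.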

Multiple sequence alignments

Lab journal

Results

We created MSAs for all four subunits of BCKDC. On this page, we mainly describe and discuss the results for the subunit BCKDHA. Results and discussion for the other three subunits are covered on a separate page: Additional Results.

In the following sections the MSAs, visualised with Jalview, are shown.

Low sequence identity

Mafft: MSUD BCKDHA low seq ident mafft.png

Muscle: MSUD BCKDHA low seq ident muscle.png

T-Coffee: MSUD BCKDHA low seq ident tcoffee.png

Espresso: MSUD BCKDHA low seq ident espresso.png

High sequence identity

Mafft: MSUD BCKDHA high seq ident mafft.png

Muscle: MSUD BCKDHA high seq ident muscle.png

T-Coffee: MSUD BCKDHA high seq ident tcoffee.png

Espresso: MSUD BCKDHA high seq ident espresso.png

Whole range sequence identity

Mafft: MSUD BCKDHA whole range seq ident mafft.png

Muscle: MSUD BCKDHA whole range seq ident muscle.png

T-Coffee: MSUD BCKDHA whole range seq ident tcoffee.png

Espresso: MSUD BCKDHA whole range seq ident espresso.png

Discussion

For the datasets with high sequence identity, the three MSA programs Mafft, Muscle and T-Coffee produce similar results and find almost the same conserved blocks. Sometimes T-Coffee arranges gaps differently from the other programs and therefore does not find as many conserved columns. The results differ a little, especially at the ends of the sequences. This is probably due to the different scoring schemes used by the programs.

For low sequence identity, the programs have difficulty finding the right alignment. They disagree on the positions of gaps and sometimes find different conserved columns. They do not cope well with low similarity, so one cannot really rely on these results. Here structural information, as used in Espresso (which belongs to T-Coffee), can help to find the right alignment: Espresso can align more residues than T-Coffee. For whole range sequence identity the results are similar with respect to the many and differing gaps at the ends of the sequences, but the programs agree more on the conserved columns that they find.

The results of Muscle and Mafft seem more similar to each other than to those of T-Coffee. T-Coffee often treats the ends of the sequences, which have low sequence identity, differently from the other programs. Strikingly, the Muscle alignment is almost always the shortest, especially in cases with low sequence identity. A very long alignment contains many gaps and fewer aligned residues, which might be a sign of poor alignment quality.
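
The relation between alignment length and gap content can be made concrete with a small sketch. It assumes the alignment is available as a list of equally long gapped strings with `-` as the gap symbol; the function name and the toy rows are hypothetical and do not come from the BCKDHA datasets.

```python
def alignment_stats(aligned_sequences, gap="-"):
    """Return (alignment length, gap fraction) for equally long gapped rows."""
    rows = list(aligned_sequences)
    if not rows:
        raise ValueError("alignment is empty")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    if length == 0:
        return 0, 0.0
    gaps = sum(row.count(gap) for row in rows)
    return length, gaps / (length * len(rows))


# The same three toy sequences aligned compactly and in a stretched way.
compact = ["MKV-LA", "MKVQLA", "MRV-LA"]
stretched = ["MKV---LA--", "MKVQ--LA--", "MRV---L--A"]

for name, rows in (("compact", compact), ("stretched", stretched)):
    length, gap_fraction = alignment_stats(rows)
    print(f"{name}: length={length}, gap fraction={gap_fraction:.2f}")
```

Because the residues are the same in both versions, every additional alignment column can only add gaps, so a longer alignment of the same dataset always has a higher gap fraction.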

Overall, the alignments contain regions with many conserved columns and regions with many gaps. Conserved blocks correspond to secondary structure elements, and individual conserved columns to functionally important residues. Gaps appear mainly in regions where the protein structure has loops, where insertions or deletions that occur during evolution do not alter the overall structure or function of the protein. We compared the alignments with the DSSP secondary structure assignment (see Task 3) and found that most gaps indeed do not lie within secondary structure elements; the only exceptions occur in alignments with distant relatives and near the end of the protein sequence. In addition, the region around residues 157-159 (corresponding to positions 165-167 in the low sequence identity MSA derived from Mafft), which is a thiamine diphosphate binding region according to UniProt, is conserved, especially Arg159.
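
Translating residue numbers of the query protein into alignment positions, as done for residues 157-159 above, amounts to counting non-gap symbols in the query row. The following is a minimal sketch of that bookkeeping. It assumes the row holds the full-length query starting at residue 1 and uses `-` as the gap symbol; the function name and the toy row are hypothetical.

```python
def residue_to_column(aligned_sequence, residue_number, gap="-"):
    """Map a 1-based residue number to its 1-based alignment column."""
    if residue_number < 1:
        raise ValueError("residue numbers start at 1")
    seen = 0
    for column, symbol in enumerate(aligned_sequence, start=1):
        if symbol != gap:
            seen += 1
            if seen == residue_number:
                return column
    raise ValueError("residue number exceeds the sequence length")


# Toy query row: two gap columns precede the third residue (V).
query_row = "MA--VRK-L"
print(residue_to_column(query_row, 3))  # 5
print(residue_to_column(query_row, 6))  # 9
```

The offset between residue number and column equals the number of gap columns inserted before that residue in the query row.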

As a criterion for a good alignment, one can run different alignment algorithms, as in this task, and compare the results. If one of them finds more conserved columns, its alignment might be better than the others. Which program performs best can depend on the dataset, so it is always a good idea to try more than one algorithm and pick the best result. Mafft is often a good choice because it generates relatively precise results and is still very fast.
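
The comparison by number of conserved columns can be sketched as follows. The sketch assumes that a conserved column is one in which every sequence carries the same residue and no gap. This is stricter than the conservation score shown in Jalview and is only meant as an illustration. The function name, program labels and toy alignments are hypothetical.

```python
def conserved_columns(aligned_sequences, gap="-"):
    """Return 1-based indices of columns where all rows share one residue."""
    rows = list(aligned_sequences)
    if not rows:
        return []
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for index, column in enumerate(zip(*rows), start=1):
        symbols = set(column)
        # A column is conserved if it holds exactly one symbol and no gap.
        if len(symbols) == 1 and gap not in symbols:
            conserved.append(index)
    return conserved


# Two toy alignments of the same three sequences.
alignments = {
    "program_a": ["MKVLA", "MKVLA", "MRVLA"],
    "program_b": ["MKV-LA", "MK-VLA", "MRV-LA"],
}

for program, rows in alignments.items():
    print(program, len(conserved_columns(rows)))
# program_a 4
# program_b 3
```

A higher count alone does not prove that an alignment is better, so it is best read together with the gap statistics and the known structural and functional features of the protein.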