Difference between revisions of "Task 2 (MSUD)"

From Bioinformatikpedia
(BCKDHA)
(Results)
Line 54: Line 54:
File:SCOP histogram blast BCKDHB.png|Distribution of SCOP fold in BLAST hits
File:SCOP histogram psiblast(iter. 10, e-val. 10e-10) BCKDHB.png|Distribution of SCOP fold in hits of PSI-BLAST(iter. 10, E-value 10e-10)
+ </gallery>
+
+ <gallery widths=335px heights=267px perrow=3 caption="Top-5 common GO terms in hits with GO annotation">
+ File:GO histogram blast BCKDHB.png|Top-5 common GO terms in BLAST hits
+ File:GO histogram psiblast(iter. 2, e-val. 0.002) BCKDHB.png|Top-5 common GO terms in hits of PSI-BLAST(iter. 2, E-value 0.002)
+ File:GO histogram psiblast(iter. 10, e-val. 10e-10) BCKDHB.png|Top-5 common GO terms in hits of PSI-BLAST(iter. 10, E-value 10e-10)
</gallery>
   
Line 73: Line 79:
<gallery widths=500px heights=400px perrow=2 caption="Distribution of SCOP folds">
File:SCOP histogram psiblast(iter. 10, e-val. 0.002) DBT.png|Distribution of SCOP fold in hits of PSI-BLAST(iter. 10, E-value 0.002). This is the only test run which have PDB hits with SCOP classification.
+ </gallery>
+
+ <gallery widths=335px heights=267px perrow=3 caption="Top-5 common GO terms in hits with GO annotation">
+ File:GO histogram blast DBT.png|Top-5 common GO terms in BLAST hits
+ File:GO histogram psiblast(iter. 2, e-val. 0.002) DBT.png|Top-5 common GO terms in hits of PSI-BLAST(iter. 2, E-value 0.002)
+ File:GO histogram psiblast(iter. 10, e-val. 10e-10) DBT.png|Top-5 common GO terms in hits of PSI-BLAST(iter. 10, E-value 10e-10)
</gallery>
   
Line 93: Line 105:
File:SCOP histogram blast DLD.png|Distribution of SCOP fold in BLAST hits
File:SCOP histogram psiblast(iter. 10, e-val. 10e-10) DLD.png|Distribution of SCOP fold in hits of PSI-BLAST(iter. 10, E-value 10e-10)
+ </gallery>
+
+ <gallery widths=335px heights=267px perrow=3 caption="Top-5 common GO terms in hits with GO annotation">
+ File:GO histogram blast DLD.png|Top-5 common GO terms in BLAST hits
+ File:GO histogram psiblast(iter. 2, e-val. 0.002) DLD.png|Top-5 common GO terms in hits of PSI-BLAST(iter. 2, E-value 0.002)
+ File:GO histogram psiblast(iter. 10, e-val. 10e-10) DLD.png|Top-5 common GO terms in hits of PSI-BLAST(iter. 10, E-value 10e-10)
</gallery>
   

Revision as of 23:50, 5 May 2013

Sequence searches

lab journal

Results

The query sequences for the four subunits of BCKDC are located at /mnt/home/student/weish/master-practical-2013/task01/.

The results of the sequence searches are located in the directory /mnt/home/student/weish/master-practical-2013/task02/01-seq-search/results. For BLAST and PSI-BLAST, statistics (such as E-value, probability and identity) are stored in *.tsv files, and the detailed results are stored in XML files. For HHblits, the *.hhr files contain the statistics and the hits.
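As an illustration of how such a *.tsv file could be post-processed, the following minimal Python sketch keeps only the hits below an E-value cutoff. The column names `hit_id`, `evalue` and `identity` and the example rows are assumptions made for this sketch; the actual layout of the *.tsv files is not described here and may differ.

```python
import csv
import io


def filter_hits(tsv_text, max_evalue=0.002):
    """Return the rows with E-value <= max_evalue, sorted by ascending E-value.

    Assumes a header row with the hypothetical column names
    'hit_id', 'evalue' and 'identity'.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = [row for row in reader if float(row["evalue"]) <= max_evalue]
    return sorted(hits, key=lambda row: float(row["evalue"]))


# Made-up example data, not taken from the real result files.
example = (
    "hit_id\tevalue\tidentity\n"
    "hitA\t1e-50\t0.62\n"
    "hitB\t0.5\t0.21\n"
    "hitC\t1e-12\t0.35\n"
)

for row in filter_hits(example):
    print(row["hit_id"], row["evalue"], row["identity"])
```

The default cutoff of 0.002 is only chosen to match one of the E-value settings used in the PSI-BLAST runs.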

BCKDHA

BCKDHB

DBT

DLD

Discussion

Multiple sequence alignments

lab journal

Results

The datasets can be found in /mnt/home/student/schillerl/MasterPractical/task2/datasets/.

The MSAs are located at /mnt/home/student/schillerl/MasterPractical/task2/MSAs/.

In the following sections the MSAs, visualised with Jalview, are shown.

BCKDHA

Low sequence identity

Mafft: MSUD BCKDHA low seq ident mafft.png

Muscle: MSUD BCKDHA low seq ident muscle.png

T-Coffee: MSUD BCKDHA low seq ident tcoffee.png

High sequence identity

Mafft: MSUD BCKDHA high seq ident mafft.png

Muscle: MSUD BCKDHA high seq ident muscle.png

T-Coffee: MSUD BCKDHA high seq ident tcoffee.png

Whole range sequence identity

Mafft: MSUD BCKDHA whole range seq ident mafft.png

Muscle: MSUD BCKDHA whole range seq ident muscle.png

T-Coffee: MSUD BCKDHA whole range seq ident tcoffee.png

BCKDHB

Low sequence identity

Mafft: MSUD BCKDHB low seq ident mafft.png

Muscle: MSUD BCKDHB low seq ident muscle.png

T-Coffee: MSUD BCKDHB low seq ident tcoffee.png

High sequence identity

Mafft: MSUD BCKDHB high seq ident mafft.png

Muscle: MSUD BCKDHB high seq ident muscle.png

T-Coffee: MSUD BCKDHB high seq ident tcoffee.png

Whole range sequence identity

Mafft: MSUD BCKDHB whole range seq ident mafft.png

Muscle: MSUD BCKDHB whole range seq ident muscle.png

T-Coffee: MSUD BCKDHB whole range seq ident tcoffee.png

DBT

Low sequence identity

Mafft: MSUD DBT low seq ident mafft.png

Muscle: MSUD DBT low seq ident muscle.png

T-Coffee: MSUD DBT low seq ident tcoffee.png

High sequence identity

Mafft: MSUD DBT high seq ident mafft.png

Muscle: MSUD DBT high seq ident muscle.png

T-Coffee: MSUD DBT high seq ident tcoffee.png

Whole range sequence identity

Mafft: MSUD DBT whole range seq ident mafft.png

Muscle: MSUD DBT whole range seq ident muscle.png

T-Coffee: MSUD DBT whole range seq ident tcoffee.png

DLD

Low sequence identity

Mafft: MSUD DLD low seq ident mafft.png

Muscle: MSUD DLD low seq ident muscle.png

T-Coffee: MSUD DLD low seq ident tcoffee.png

High sequence identity

Mafft: MSUD DLD high seq ident mafft.png

Muscle: MSUD DLD high seq ident muscle.png

T-Coffee: MSUD DLD high seq ident tcoffee.png

Whole range sequence identity

Mafft: MSUD DLD whole range seq ident mafft.png

Muscle: MSUD DLD whole range seq ident muscle.png

T-Coffee: MSUD DLD whole range seq ident tcoffee.png

Discussion

For the datasets with high sequence identity, the three MSA programs Mafft, Muscle and T-Coffee produce similar results and find almost the same conserved blocks. T-Coffee sometimes arranges gaps differently from the other two and therefore does not find as many conserved columns. The results differ slightly between the programs, especially at the ends of the sequences. This is likely due to the different scoring schemes used by the programs.

For low sequence identity, the programs have difficulty finding the correct alignment. They do not agree on the positions of gaps and sometimes find different conserved columns. None of them copes well with such low similarity, so these results cannot really be relied upon.
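The disagreement between two programs can also be quantified instead of judged by eye. The sketch below is one possible measure, chosen for this illustration and not used in the task itself: it collects all residue pairs that each alignment places in the same column and reports the Jaccard overlap of the two sets. It assumes that both alignments contain the same sequences in the same order and that gaps are written as `-`.

```python
from itertools import combinations


def aligned_pairs(rows):
    """Collect all residue pairs that an alignment places in the same column.

    Each residue is identified by (sequence index, residue index), so the
    result does not depend on where the gaps are.
    """
    next_residue = [0] * len(rows)  # running residue index per sequence
    pairs = set()
    for column in zip(*rows):
        present = []
        for seq_idx, char in enumerate(column):
            if char != "-":
                present.append((seq_idx, next_residue[seq_idx]))
                next_residue[seq_idx] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_agreement(rows_a, rows_b):
    """Jaccard overlap of the aligned residue pairs of two alignments."""
    pairs_a = aligned_pairs(rows_a)
    pairs_b = aligned_pairs(rows_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


# Two toy alignments of the same two sequences (MKVLA and MKLA).
alignment_a = ["MKVLA", "MK-LA"]
alignment_b = ["MKVLA", "M-KLA"]
print(pair_agreement(alignment_a, alignment_b))
```

A value of 1.0 means the two alignments pair up exactly the same residues; lower values indicate that gaps were placed differently.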

For the datasets covering the whole range of sequence identity, the results are similar with respect to the many, differently placed gaps at the ends of the sequences, but the programs agree more closely on the conserved columns they find.

The results of Muscle and Mafft seem more similar to each other than to those of T-Coffee. T-Coffee often treats the ends of the sequences, which have low sequence identity, differently from the others. Strikingly, the Muscle alignment is almost always the shortest and the T-Coffee alignment the longest, especially in cases of low sequence identity. A very long alignment contains many gaps and fewer aligned residues, which might be a sign of poor alignment quality.
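The relation between alignment length and the number of gaps can be checked with a few lines of Python. This is a minimal sketch that assumes the aligned sequences have already been read into a list of equal-length strings with `-` as the gap character; the function name and the toy alignments are made up for illustration.

```python
def alignment_stats(rows):
    """Return (alignment length, fraction of gap characters) for an MSA."""
    if not rows:
        raise ValueError("alignment is empty")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    gap_count = sum(row.count("-") for row in rows)
    return length, gap_count / (length * len(rows))


# Two toy alignments of the same two sequences (MKVLA and MKLA).
compact = ["MKVLA", "MK-LA"]
stretched = ["MKV-LA", "MK-L-A"]

for name, rows in (("compact", compact), ("stretched", stretched)):
    length, gap_fraction = alignment_stats(rows)
    print(name, length, round(gap_fraction, 3))
```

Because the number of residues is fixed by the input sequences, every additional alignment column can only be filled with gaps, so the longer alignment necessarily has the higher gap fraction.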

Overall, the alignments contain regions with many conserved columns and regions with many gaps. Conserved blocks correspond to secondary structure elements, and conserved single columns to functionally important residues. Gaps tend to appear in regions where the protein structure has loops, so that insertions or deletions that occur during evolution do not alter the overall structure or function of the protein.

One criterion for a good alignment is to run different alignment algorithms, as in this task, and compare the results. An alignment in which more conserved columns are found might be better than another. Which program performs best can depend on the dataset, so it is always a good idea to try more than one algorithm and pick the best result. Mafft is often a good choice because it generates relatively precise results and is still very fast.
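Counting conserved columns, as suggested above, can be automated. The following sketch uses a deliberately strict definition that is an assumption of this example: a column counts as conserved only if every sequence has the same residue there and no sequence has a gap. Jalview's conservation score is more fine-grained, so the numbers need not match what is visible in the screenshots.

```python
def conserved_columns(rows):
    """Return the 0-based indices of fully identical, gap-free columns."""
    if not rows:
        return []
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for index, column in enumerate(zip(*rows)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved.append(index)
    return conserved


# Toy alignment, not taken from the real MSAs.
toy_alignment = ["MKV-LA", "MK--LA", "MKVQLA"]
columns = conserved_columns(toy_alignment)
print(len(columns), columns)
```

Running the same function on the Mafft, Muscle and T-Coffee output for one dataset would give a single number per program that can be compared directly.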