Difference between revisions of "Task 2 (MUSD) Additional Results"

Latest revision as of 14:04, 27 August 2013

This page shows the results for sequence search and MSA analyses that we performed with the other three protein subunits BCKDHB, DBT and DLD.

Sequence searches

Lab journal

Results

The query sequences for the 4 subunits of BCKDC are located in /mnt/home/student/weish/master-practical-2013/task01/.

The sequence search results are located in the directory /mnt/home/student/weish/master-practical-2013/task02/01-seq-search/results. For BLAST and PSI-BLAST, statistics (such as E-value, probability and identity) are stored in *.tsv files; detailed results are stored in XML files. For HHblits, the *.hhr files contain the statistics and the hits.

BCKDHB

E-value and identity distribution for different sequence search methods
E-value distribution of sequence search methods. (Query sequence is RefSeq of BCKDHB)
Identity distribution of sequence search methods. (Query sequence is RefSeq of BCKDHB)

Intersection of hits between different sequence search methods
Relative intersection of hits between BLAST and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 10e-10) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 10e-10) and other sequence search methods.
Relative intersection between HHblits and other sequence search methods.

Distribution of SCOP folds
Distribution of SCOP fold in BLAST hits
Distribution of SCOP fold in hits of PSI-BLAST(iter. 10, E-value 10e-10)

Top-5 common GO terms in hits with GO annotation
Top-5 common GO terms in BLAST hits
Top-5 common GO terms in hits of PSI-BLAST(iter. 2, E-value 0.002)
Top-5 common GO terms in hits of PSI-BLAST(iter. 10, E-value 10e-10)

DBT

E-value and identity distribution for different sequence search methods
E-value distribution of sequence search methods. (Query sequence is RefSeq of DBT)
Identity distribution of sequence search methods. (Query sequence is RefSeq of DBT)

Intersection of hits between different sequence search methods
Relative intersection of hits between BLAST and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 10e-10) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 10e-10) and other sequence search methods.
Relative intersection between HHblits and other sequence search methods.

Distribution of SCOP folds
Distribution of SCOP fold in hits of PSI-BLAST(iter. 10, E-value 0.002). This is the only test run that has PDB hits with SCOP classification.

Top-5 common GO terms in hits with GO annotation
Top-5 common GO terms in BLAST hits
Top-5 common GO terms in hits of PSI-BLAST(iter. 2, E-value 0.002)
Top-5 common GO terms in hits of PSI-BLAST(iter. 10, E-value 10e-10)

DLD

E-value and identity distribution for different sequence search methods
E-value distribution of sequence search methods. (Query sequence is RefSeq of DLD)
Identity distribution of sequence search methods. (Query sequence is RefSeq of DLD)

Intersection of hits between different sequence search methods
Relative intersection of hits between BLAST and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 10e-10) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 10e-10) and other sequence search methods.
Relative intersection between HHblits and other sequence search methods.

Distribution of SCOP folds
Distribution of SCOP fold in BLAST hits
Distribution of SCOP fold in hits of PSI-BLAST(iter. 10, E-value 10e-10)

Top-5 common GO terms in hits with GO annotation
Top-5 common GO terms in BLAST hits
Top-5 common GO terms in hits of PSI-BLAST(iter. 2, E-value 0.002)
Top-5 common GO terms in hits of PSI-BLAST(iter. 10, E-value 10e-10)

Discussion

E-value distribution
- Very few hits were found with very low E-values. These hits are statistically highly significant.
- Because different databases were used for BLAST/PSI-BLAST and HHblits, HHblits found hits over a wider range of E-values and reported hits with high E-values more frequently.
- For the protein BCKDHB, PSI-BLAST tends to find more hits with intermediate E-values (1e-106 to 1e-25).
  - The E-value distribution of PSI-BLAST shifts towards lower E-values with more iterations. Had we chosen more iterations for PSI-BLAST, the shift of the result space might have been even larger. This may cause PSI-BLAST to find more hits that are overfitted to the statistical model but fewer hits with high biological significance, so there is a trade-off between the number of iterations and result quality.
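
The E-value distributions discussed above could be tabulated along the following lines. This is a minimal sketch, not the script used for the plots: the function name, the bin width of ten orders of magnitude, the floor used for E-values reported as 0 and the example values are all assumptions made for illustration.

```python
import math
from collections import Counter


def log10_evalue_histogram(evalues, bin_width=10, floor=1e-200):
    """Count hits per log10(E-value) bin; keys are the left bin edges."""
    counts = Counter()
    for e in evalues:
        # E-values reported as 0.0 are clamped to a floor so log10 is defined
        # (the floor value is an assumption, not taken from the original page).
        exponent = math.log10(max(e, floor))
        # Left edge of the bin that contains this exponent.
        left_edge = math.floor(exponent / bin_width) * bin_width
        counts[left_edge] += 1
    return dict(sorted(counts.items()))


# Made-up E-values, only to show the shape of the output.
example_evalues = [0.0, 3e-106, 3e-50, 2e-25, 2e-3, 5.0]
histogram = log10_evalue_histogram(example_evalues)
for left_edge, n_hits in histogram.items():
    print(f"[1e{left_edge}, 1e{left_edge + 10}): {n_hits}")
```

Binning on the log10 scale keeps the very small E-values of the best hits visible next to the weak hits near the cutoff, which makes a shift of the distribution between runs easier to see.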

Identity distribution
- The results suggest that BLAST depends mostly on sequence identity, because most of its hits lie at higher sequence identities. Hits with low sequence identity but high biological similarity may be lost in this case.

Intersection of hits
- HHblits was not comparable to the other methods because it used a different sequence database. We were not able to extract the sequence IDs from the HHblits results, which are needed to compute the intersection of results.
- The result set of PSI-BLAST with 2 iterations has a larger intersection with the BLAST result set than that of PSI-BLAST with 10 iterations.
- The two PSI-BLAST runs with 2 iterations and different E-value cutoffs have very similar sets of hits.
- PSI-BLAST with 10 iterations has a smaller intersection with BLAST. This might be explained by the shift of the result space when PSI-BLAST is run with a higher number of iterations.
- The two PSI-BLAST runs with 10 iterations and different E-value cutoffs share the fewest common hits, possibly because of the choice of E-value cutoff.
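
The comparison above depends on how the relative intersection is defined, which the page does not state. The sketch below assumes it is the fraction of one method's hits that the other method also found, |A ∩ B| / |A|. The function name and the hit IDs are hypothetical.

```python
def relative_intersection(reference_hits, other_hits):
    """Fraction of the reference hits that also occur in the other hit set.

    Assumed definition: |reference & other| / |reference|.
    """
    reference = set(reference_hits)
    if not reference:
        return 0.0
    return len(reference & set(other_hits)) / len(reference)


# Placeholder hit IDs, not taken from the actual result files.
blast_hits = {"hit_A", "hit_B", "hit_C", "hit_D"}
psiblast_2_iter_hits = {"hit_B", "hit_C", "hit_D", "hit_E"}
psiblast_10_iter_hits = {"hit_D", "hit_E", "hit_F", "hit_G", "hit_H"}

print(relative_intersection(blast_hits, psiblast_2_iter_hits))   # 0.75
print(relative_intersection(blast_hits, psiblast_10_iter_hits))  # 0.25
print(relative_intersection(psiblast_10_iter_hits, blast_hits))  # 0.2
```

Under this definition the measure is not symmetric, so it matters which method is taken as the reference. It also requires sequence IDs that are comparable between methods, which is why HHblits could not be included.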

SCOP of hit sequences
- SCOP classification requires PDB sequences, so no evaluation was possible for HHblits.
- Both BLAST and PSI-BLAST find the right fold class for the query proteins.
- PSI-BLAST generally finds more hits in the fold class that best describes the query protein (e.g. for the DLD protein, c.3 is the FAD/NAD(P)-binding domain).
- PSI-BLAST also finds hits in more fold classes, which may describe domains of the query protein.
- PSI-BLAST seems to find more hits with biological significance.

Gene Ontology of hit proteins
- The top-5 GO terms in the hits of PSI-BLAST runs with different numbers of iterations are more conserved. They also have a similar frequency ranking.
- PSI-BLAST finds hits with more GO terms. It may be more sensitive to functional patterns in the sequence. Alternatively, because of the shift of the result space, more irrelevant hits might be included, which bring additional, different GO terms.
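
A top-5 ranking of GO terms like the one discussed above could be computed as follows. This is a sketch under assumptions: the mapping from hit ID to GO annotations, the function name and the placeholder term names are all hypothetical, and each term is counted once per hit.

```python
from collections import Counter


def top_go_terms(hit_to_go_terms, n=5):
    """Return the n most frequent GO terms as (term, number_of_hits) pairs."""
    counts = Counter()
    for terms in hit_to_go_terms.values():
        # Count each term at most once per hit, even if it is listed twice.
        counts.update(set(terms))
    return counts.most_common(n)


# Placeholder annotations; hits without GO annotation are simply left out.
example_annotations = {
    "hit_A": ["term_1", "term_2"],
    "hit_B": ["term_1", "term_2", "term_3"],
    "hit_C": ["term_1", "term_1"],
}

for term, n_hits in top_go_terms(example_annotations, n=2):
    print(term, n_hits)
```

Comparing the ranked lists from two runs (for example 2 versus 10 iterations) then shows both whether the same terms appear and whether their order of frequency is preserved.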

Multiple sequence alignments

Lab journal

Results

In the following sections the MSAs, visualised with Jalview, are shown.

BCKDHB

Low sequence identity

Mafft:

Muscle:

T-Coffee:

Espresso:

High sequence identity

Mafft:

Muscle:

T-Coffee:

Espresso:

Whole range sequence identity

Mafft:

Muscle:

T-Coffee:

Espresso:

DBT

Low sequence identity

Mafft:

Muscle:

T-Coffee:

High sequence identity

Mafft:

Muscle:

T-Coffee:

Whole range sequence identity

Mafft:

Muscle:

T-Coffee:

DLD

Low sequence identity

Mafft:

Muscle:

T-Coffee:

High sequence identity

Mafft:

Muscle:

T-Coffee:

Whole range sequence identity

Mafft:

Muscle:

T-Coffee:

