Sequence searches
Lab journal
Results
The query sequences for the four subunits of BCKDC are located at /mnt/home/student/weish/master-practical-2013/task01/.
Results of the sequence searches are located in the directory /mnt/home/student/weish/master-practical-2013/task02/01-seq-search/results.
For BLAST and PSI-BLAST, statistics (such as E-value, probability and identity) are stored in *.tsv files. Detailed results are stored in XML files. For HHBlits, the *.hhr files contain the statistics and the hits.
BCKDHA
- E-value and identity distribution for different sequence search methods
E-value distribution of sequence search methods. (Query sequence is RefSeq of BCKDHA)
Identity distribution of sequence search methods. (Query sequence is RefSeq of BCKDHA)
- Intersection of hits between different sequence search methods
Relative intersection of hits between BLAST and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 10e-10) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 10e-10) and other sequence search methods.
Relative intersection between HHBlits and other sequence search methods.
- Distribution of SCOP folds
Distribution of SCOP fold in BLAST hits(only one classified PDB structure was found in SCOP)
Distribution of SCOP fold in hits of PSI-BLAST(iter. 2, E-value 0.002)
- Top-5 common GO terms in hits with GO annotation
Top-5 common GO terms in BLAST hits
Top-5 common GO terms in hits of PSI-BLAST(iter. 2, E-value 0.002)
Top-5 common GO terms in hits of PSI-BLAST(iter. 10, E-value 10e-10)
BCKDHB
- E-value and identity distribution for different sequence search methods
E-value distribution of sequence search methods. (Query sequence is RefSeq of BCKDHB)
Identity distribution of sequence search methods. (Query sequence is RefSeq of BCKDHB)
- Intersection of hits between different sequence search methods
Relative intersection of hits between BLAST and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 10e-10) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 10e-10) and other sequence search methods.
Relative intersection between HHBlits and other sequence search methods.
- Distribution of SCOP folds
Distribution of SCOP fold in BLAST hits
Distribution of SCOP fold in hits of PSI-BLAST(iter. 10, E-value 10e-10)
- Top-5 common GO terms in hits with GO annotation
Top-5 common GO terms in BLAST hits
Top-5 common GO terms in hits of PSI-BLAST(iter. 2, E-value 0.002)
Top-5 common GO terms in hits of PSI-BLAST(iter. 10, E-value 10e-10)
DBT
- E-value and identity distribution for different sequence search methods
E-value distribution of sequence search methods. (Query sequence is RefSeq of DBT)
Identity distribution of sequence search methods. (Query sequence is RefSeq of DBT)
- Intersection of hits between different sequence search methods
Relative intersection of hits between BLAST and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 10e-10) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 10e-10) and other sequence search methods.
Relative intersection between HHBlits and other sequence search methods.
- Distribution of SCOP folds
Distribution of SCOP fold in hits of PSI-BLAST(iter. 10, E-value 0.002). This is the only test run that has PDB hits with a SCOP classification.
- Top-5 common GO terms in hits with GO annotation
Top-5 common GO terms in BLAST hits
Top-5 common GO terms in hits of PSI-BLAST(iter. 2, E-value 0.002)
Top-5 common GO terms in hits of PSI-BLAST(iter. 10, E-value 10e-10)
DLD
- E-value and identity distribution for different sequence search methods
E-value distribution of sequence search methods. (Query sequence is RefSeq of DLD)
Identity distribution of sequence search methods. (Query sequence is RefSeq of DLD)
- Intersection of hits between different sequence search methods
Relative intersection of hits between BLAST and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 2, E-value 10e-10) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 0.002) and other sequence search methods.
Relative intersection between PSI-BLAST(iter. 10, E-value 10e-10) and other sequence search methods.
Relative intersection between HHBlits and other sequence search methods.
- Distribution of SCOP folds
Distribution of SCOP fold in BLAST hits
Distribution of SCOP fold in hits of PSI-BLAST(iter. 10, E-value 10e-10)
- Top-5 common GO terms in hits with GO annotation
Top-5 common GO terms in BLAST hits
Top-5 common GO terms in hits of PSI-BLAST(iter. 2, E-value 0.002)
Top-5 common GO terms in hits of PSI-BLAST(iter. 10, E-value 10e-10)
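The top-5 GO term figures for all four subunits rest on a simple frequency count over the annotated hits. The sketch below shows one way such a ranking could be computed. It is not the script used for the plots: the function name `top_go_terms`, the input layout (hit identifier mapped to a list of GO terms) and the term labels are assumptions made for illustration, and each term is counted at most once per hit.

```python
from collections import Counter


def top_go_terms(hit_annotations, n=5):
    """Return the n most frequent GO terms, counting each term once per hit."""
    counts = Counter()
    for terms in hit_annotations.values():
        # set() makes sure a term listed twice for one hit is counted only once
        counts.update(set(terms))
    return counts.most_common(n)


# Hypothetical annotations with placeholder term labels (not real GO identifiers).
annotations = {
    "hit1": ["GO:TERM_A", "GO:TERM_B"],
    "hit2": ["GO:TERM_A", "GO:TERM_C"],
    "hit3": ["GO:TERM_A", "GO:TERM_B", "GO:TERM_B"],
}

print(top_go_terms(annotations, n=2))
# [('GO:TERM_A', 3), ('GO:TERM_B', 2)]
```

Hits without any GO annotation simply contribute nothing to the counts, which matches the restriction to "hits with GO annotation" in the figures.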
Discussion
- E-value distribution:
- Only very few hits have very low E-values, i.e. very high statistical significance
- Because different databases were used for BLAST/PSI-BLAST and HHBlits, HHBlits found hits over a larger range of E-values -> higher density for HHBlits at high E-values
- For the proteins BCKDHA and BCKDHB, PSI-BLAST tends to find more hits with intermediate E-values (1e-106 to 1e-25)
- The E-value distribution of PSI-BLAST shifts towards lower E-values with more iterations -> better search result?
- Identity distribution
- The results show that BLAST depends mostly on sequence identity -> possible loss of patterns with low sequence identity but high biological similarity
- Intersection of hits
- HHBlits was not comparable to the other methods because it uses a different sequence database
- PSI-BLAST with 2 iterations has a larger intersection with BLAST
- The two PSI-BLAST runs with 2 iterations and different E-value cutoffs have very similar sets of hits
- PSI-BLAST with 10 iterations has a smaller intersection with BLAST
- The two PSI-BLAST runs with 10 iterations and different E-value cutoffs share the fewest common hits -> the E-value cutoff may have a stronger influence after more iterations
- SCOP of hit sequences
- PDB sequence required -> no evaluation for HHBlits
- Both BLAST and PSI-BLAST find the right fold class for the query protein
- PSI-BLAST generally finds more hits in the fold class that describes the query protein best (e.g. for the DLD protein, c.3 is the FAD/NAD(P)-binding domain)
- PSI-BLAST also finds hits in more fold classes, which may describe domains of the query protein
- Gene Ontology of hit proteins
- The top-5 GO terms in PSI-BLAST hits are largely conserved across different numbers of iterations. They also have a similar frequency ranking.
- PSI-BLAST finds hits with more GO terms -> it may be more sensitive to functional patterns in the sequence
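The intersection comparison discussed above can be sketched in a few lines. The definition used here is an assumption, since the page does not spell it out: the relative intersection is taken as the fraction of hits of a reference method that are also found by another method. The function name and the hit identifiers are hypothetical.

```python
def relative_intersection(reference, other):
    """Fraction of the hits in `reference` that also occur in `other`."""
    reference, other = set(reference), set(other)
    if not reference:
        # No hits in the reference set: define the overlap as 0 to avoid dividing by zero
        return 0.0
    return len(reference & other) / len(reference)


# Hypothetical hit identifiers, for illustration only.
blast_hits = {"P1", "P2", "P3", "P4"}
psiblast_hits = {"P2", "P3", "P4", "P5", "P6"}

print(relative_intersection(blast_hits, psiblast_hits))  # 0.75
print(relative_intersection(psiblast_hits, blast_hits))  # 0.6
```

Under this definition the measure is not symmetric, which is why each method is shown once as the reference in the figures. It also only makes sense when both methods search the same database, which is the reason HHBlits cannot be compared directly.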
Multiple sequence alignments
Lab journal
Results
In the following sections the MSAs, visualised with Jalview, are shown.
BCKDHA
Low sequence identity
Mafft:
Muscle:
T-Coffee:
Espresso:
High sequence identity
Mafft:
Muscle:
T-Coffee:
Espresso:
Whole range sequence identity
Mafft:
Muscle:
T-Coffee:
Espresso:
BCKDHB
Low sequence identity
Mafft:
Muscle:
T-Coffee:
Espresso:
High sequence identity
Mafft:
Muscle:
T-Coffee:
Espresso:
Whole range sequence identity
Mafft:
Muscle:
T-Coffee:
Espresso:
DBT
Low sequence identity
Mafft:
Muscle:
T-Coffee:
High sequence identity
Mafft:
Muscle:
T-Coffee:
Whole range sequence identity
Mafft:
Muscle:
T-Coffee:
DLD
Low sequence identity
Mafft:
Muscle:
T-Coffee:
High sequence identity
Mafft:
Muscle:
T-Coffee:
Whole range sequence identity
Mafft:
Muscle:
T-Coffee:
Discussion
For the datasets with high sequence identity, the three MSA programs Mafft, Muscle and T-Coffee produce similar results and find almost the same conserved blocks. Sometimes T-Coffee arranges gaps differently from the others and therefore does not find as many conserved columns. The results of the programs differ slightly, especially at the ends of the sequences. This is due to the different scoring schemes used by the programs.
For low sequence identity, the programs have difficulty finding the right alignment. They do not agree on the positions of gaps and sometimes find different conserved columns. They do not cope well with low similarity, so one cannot really rely on these results. Here structural information, as used by Espresso (which is part of T-Coffee), can help to find the right alignment: Espresso can align more residues than T-Coffee.
For the whole range of sequence identity, the results are similar with respect to the many, differently placed gaps at the ends of the sequences, but the programs agree more on the conserved columns that they find.
The results of Muscle and Mafft seem more similar to each other than to those of T-Coffee. T-Coffee often treats the ends of the sequences, which have low sequence identity, differently from the others. It is striking that the Muscle alignment is almost always the shortest, especially in cases with low sequence identity. A very long alignment contains many gaps and fewer aligned residues, which might be a sign of poor alignment quality.
Altogether, the alignments show regions with many conserved columns and regions with many gaps. The conserved blocks and columns correspond to secondary structure elements and functionally important residues, respectively. Gaps in the alignment appear in regions where the protein structure has loops, so that insertions or deletions that occur during evolution do not alter the overall structure or function of the protein.
To judge alignment quality, one can run different alignment algorithms, as in this task, and compare the results. If one of them finds more conserved columns, its alignment might be better than the others. Which program performs best can depend on the dataset, so it is always a good idea to try more than one algorithm and pick the best result. Mafft is often a good choice because it generates relatively precise results and is still very fast.
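The criteria mentioned in this discussion (alignment length, number of conserved columns, amount of gaps) can be put into numbers. The following is a minimal sketch, not the procedure used for the comparison above, which was done visually in Jalview. It assumes the alignment is given as a list of equally long strings with "-" as the gap character, and it uses a deliberately strict definition of a conserved column (all residues identical and no gap). The function name and the toy alignment are hypothetical.

```python
def alignment_summary(aligned_seqs):
    """Return length, number of fully conserved columns and gap fraction of an MSA."""
    if not aligned_seqs:
        raise ValueError("empty alignment")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")

    conserved = 0
    gaps = 0
    # zip(*...) iterates over the alignment column by column
    for column in zip(*aligned_seqs):
        gaps += column.count("-")
        if "-" not in column and len(set(column)) == 1:
            conserved += 1

    total_cells = length * len(aligned_seqs)
    return {
        "length": length,
        "conserved_columns": conserved,
        "gap_fraction": gaps / total_cells if total_cells else 0.0,
    }


# Toy alignment, for illustration only.
toy_msa = ["MKV-LA", "MKI-LA", "MRVALA"]
summary = alignment_summary(toy_msa)
print(summary["length"], summary["conserved_columns"])  # 6 3
```

Running this on the output of each program for the same dataset gives comparable numbers: a longer alignment with a higher gap fraction and fewer conserved columns would point to the weaker result under the reasoning above.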