Difference between revisions of "Task alignments 2012"

From Bioinformatikpedia
Revision as of 12:17, 19 April 2012

Most prediction methods are based on comparisons to related proteins. Therefore, the search for related sequences and the alignment to other proteins is a prerequisite for most of the analyses in this practical. Hence we will investigate the recall and alignment quality of different alignment methods.

Theoretical background talks

The introductory talks should give an overview of

  • pairwise alignments and high-throughput profile searches (e.g. Fasta, Blast, PSI-Blast, HHsearch)
  • multiple alignments (e.g. ClustalW, Probcons, Mafft, Muscle, T-Coffee, Cobalt) and MSA editors (e.g. Jalview)

with special attention to the advantages and limitations of these methods.

Sequence searches

Subsequently, for every native protein sequence of every disease, the students shall employ different tools for database searching and multiple sequence alignment in the "big_80" database. The methods to employ (minimally) are:

  • Searches of the non-redundant sequence database big_80 (to be found in /mnt/project/pracstrucfunc12/data/big/):
    • Blast
    • PSI-Blast using standard parameters with all combinations of
      • 2 iterations
      • 10 iterations
      • default E-value cutoff (0.002)
      • E-value cutoff 10E-10
    • HHblits / HHsearch using standard parameters; since there is no big_80 for HHblits, search against UniProt
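
The four PSI-Blast runs listed above (2 or 10 iterations, each with both E-value cutoffs) can be enumerated programmatically. The following is a minimal sketch, not a prescribed setup. It assumes the BLAST+ psiblast binary with its -num_iterations and -inclusion_ethresh options, and it assumes that the "E-value cutoff" means the profile inclusion threshold. The database name big_80 inside the directory given above, the query file name and the output file names are hypothetical placeholders. The script only builds the command lines; it does not run them.

```python
from itertools import product

# Assumption: the database inside the course directory is called "big_80".
DB = "/mnt/project/pracstrucfunc12/data/big/big_80"
ITERATIONS = [2, 10]
EVALUES = ["0.002", "10E-10"]  # cutoffs written exactly as in the task


def psiblast_commands(query, db=DB):
    """Return one psiblast argument list per (iterations, cutoff) combination."""
    commands = []
    for n_iter, evalue in product(ITERATIONS, EVALUES):
        out = f"psiblast_it{n_iter}_e{evalue}.out"  # placeholder output name
        commands.append([
            "psiblast",
            "-query", query,
            "-db", db,
            "-num_iterations", str(n_iter),
            "-inclusion_ethresh", evalue,
            "-out", out,
        ])
    return commands


if __name__ == "__main__":
    for cmd in psiblast_commands("query.fasta"):
        print(" ".join(cmd))
```

Keeping the commands as argument lists makes it straightforward to pass them to subprocess.run later, once the actual database name and BLAST version on the course machines are confirmed.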

Note: Check the outcome of your simple blast search. If there are many significant hits, increase the number of reported hits (-v, -b or max_target_seqs depending on blast version and output format) until no more relevant hits are found. Use that parameter also for the PSI-Blast searches and use a similar setting for HHblits / HHsearch. (Think about why we ask you to do this.)
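
One way to check whether the hit limit was too low is to look at the worst E-value in a full result list. The sketch below assumes BLAST+ tabular output (-outfmt 6 with the default columns, where the E-value is the 11th field). The significance threshold of 1e-3 is an arbitrary illustrative value, and the function names are hypothetical.

```python
def parse_evalues(tabular_lines):
    """Extract E-values (11th tab-separated field) from BLAST+ tabular output."""
    evalues = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        evalues.append(float(line.split("\t")[10]))
    return evalues


def may_be_truncated(evalues, max_hits, threshold=1e-3):
    """True if the list is full and even its worst hit is still below threshold."""
    return bool(evalues) and len(evalues) >= max_hits and max(evalues) < threshold


if __name__ == "__main__":
    # Toy lines in the default 12-column layout; all values are made up.
    demo = [
        "q\ts1\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200",
        "q\ts2\t75.0\t100\t25\t0\t1\t100\t1\t100\t1e-30\t120",
    ]
    print(may_be_truncated(parse_evalues(demo), max_hits=2))
```

If the function returns True, the list was cut off while hits were still significant, so the search is worth repeating with a higher limit.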

CAVE: If your data set gets large, the PSI-Blast searches will take a while.

For evaluating the differences of the search methods:

  • compare the result lists (e.g. how much overlap, distribution of %identity and E-values)
  • validate the result lists -- e.g.
    • using COPS (/mnt/project/pracstrucfunc12/data/COPS/) to check whether found pdb entries fall into the same fold class
    • using GO to check whether sequences have common GO classifications
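
The comparison of result lists can be sketched as follows. This is only an illustration under simple assumptions: each result list has already been reduced to a list of sequence identifiers (and, for the histogram, to a list of %identity values), and identifiers from different tools have been normalised to the same naming scheme. The function names and the toy identifiers are hypothetical.

```python
from collections import Counter


def overlap(hits_a, hits_b):
    """Summarise the overlap of two hit lists given as sequence identifiers."""
    a, b = set(hits_a), set(hits_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


def identity_histogram(identities, width=10):
    """Count hits per %identity bin; the key is the lower bound of the bin."""
    return dict(Counter(int(p) // width * width for p in identities))


if __name__ == "__main__":
    blast_hits = ["P1", "P2", "P3"]     # toy identifiers
    psiblast_hits = ["P2", "P3", "P4"]
    print(overlap(blast_hits, psiblast_hits))
    print(identity_histogram([95.0, 91.2, 35.5]))
```

The same histogram function can be applied to log-transformed E-values if their distribution is of interest as well.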

Note: Make sure that your result lists are comparable. There are a few catches:

  • HHblits searches against the clustered UniProt version. In the output the cluster representatives are listed together with the cluster members.
    • If you compare the representatives against a PSI-Blast result for big_80, you will get more hits for big_80.
    • If you compare the representatives plus the cluster members against big_80, you will get fewer hits for big_80.
    • Come up with a way to generate comparable results. (There is also a complete database "big" which you can use for searching -- reusing the profiles from your big_80 search.)
  • big_80 is generated with CD-HIT, which prefers long sequences over shorter ones as cluster representatives. Hence the number of PDB hits in your big_80 search is going to be low.

Multiple sequence alignments

Create multiple sequence alignments of 20 sequences from the database search, including sequences from these ranges:

  • 99 - 90% sequence identity
  • 89 - 60% sequence identity
  • 59 - 40% sequence identity
  • 39 - 20% sequence identity

Ideally there should be 5 sequences from each range, with at least one PDB structure in each range. This will only be possible in rare cases. In any case, generate at least three groups of sequences where

  • one contains only sequences with low sequence identity (<40%)
  • one contains only sequences with high sequence identity (>60%)
  • one contains sequences with low and high sequence identity.
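
Selecting sequences by identity range can be sketched as below. The sketch assumes the search hits are available as (identifier, %identity) pairs, and it treats the four ranges as half-open intervals (for example 90 <= identity < 100) so that fractional values fall into exactly one range. The function names, the group size and the toy data are hypothetical.

```python
# Lower bound inclusive, upper bound exclusive (an assumption for fractional values).
RANGES = [(90, 100), (60, 90), (40, 60), (20, 40)]


def bin_hits(hits, per_range=5):
    """Sort (id, %identity) pairs into the four ranges, keeping at most per_range each."""
    bins = {r: [] for r in RANGES}
    for seq_id, pident in hits:
        for low, high in RANGES:
            if low <= pident < high and len(bins[(low, high)]) < per_range:
                bins[(low, high)].append(seq_id)
                break
    return bins


def build_groups(hits, size=5):
    """Build the low (<40%), high (>60%) and mixed groups from (id, %identity) pairs."""
    low = [s for s, p in hits if p < 40][:size]
    high = [s for s, p in hits if p > 60][:size]
    return {"low": low, "high": high, "mixed": low + high}


if __name__ == "__main__":
    toy = [("a", 95.0), ("b", 72.3), ("c", 45.0), ("d", 25.0)]
    print(bin_hits(toy))
    print(build_groups(toy))
```

Hits with 100% identity or below 20% identity are left out of the bins on purpose, since the task lists no range for them.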

The alignment methods to use on each of these groups are:

  • ClustalW
  • Muscle
  • T-Coffee with
    • default parameters ("t_coffee your_sequences.fasta")
    • use of 3D-Coffee

Compare your alignments (qualitatively). Things to look for are:

  • How many conserved columns?
  • How many gaps?
  • Are functionally important residues conserved?
  • Are there gaps in secondary structure elements?
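
The first two questions can be answered by counting columns. The following minimal sketch assumes the alignment has already been read into a list of equal-length strings with "-" as the gap character; reading the alignment file itself is not shown, and the function name and toy alignment are hypothetical. A column counts as conserved here only if it is gap-free and all residues are identical, which is a deliberately strict definition.

```python
def column_stats(alignment):
    """Count conserved columns and gaps in a list of aligned sequences."""
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    conserved = 0
    columns_with_gaps = 0
    for column in zip(*alignment):
        if "-" in column:
            columns_with_gaps += 1
        elif len(set(column)) == 1:
            conserved += 1  # gap-free column with one residue type
    return {
        "columns": lengths.pop(),
        "conserved": conserved,
        "columns_with_gaps": columns_with_gaps,
        "gap_characters": sum(seq.count("-") for seq in alignment),
    }


if __name__ == "__main__":
    toy = ["MKT-LV", "MKTALV", "MRT-LV"]
    print(column_stats(toy))
```

Running the same function on the ClustalW, Muscle and T-Coffee alignments of one sequence group gives directly comparable numbers.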

Points for discussion:

  • Observe how the sequence identity in the groups of sequences influences the alignments.
  • Do all methods cope with low similarity?
  • Does the incorporation of structural information (3D-Coffee) help?