Difference between revisions of "Task alignments 2011"

From Bioinformatikpedia

Latest revision as of 13:04, 29 March 2012

Most prediction methods are based on comparisons to related proteins. Therefore, the search for related sequences and the alignment to other proteins is a prerequisite for most of the analyses in this practical. Hence we will investigate the recall and alignment quality of different alignment methods.

Theoretical background talks

The introductory talks gave an overview of

  • pairwise alignments and high-throughput profile searches (e.g. Fasta, Blast, PSI-Blast, HHsearch)
  • multiple alignments (e.g. ClustalW, Probcons, Mafft, Muscle, T-Coffee, Cobalt) and MSA editors (e.g. Jalview)

with special attention to the advantages and limitations of these methods.

Slides of the talk (or without Scribd: direct access to pdf)

At the end, there was a short summary presentation of the take-home message and the task description (see pdf of the slides).

Sequence searches

Subsequently, for the native protein sequence of every disease, the students shall employ different tools for database searching and multiple sequence alignment. The minimum set of methods to employ is:

  • Searches of the non-redundant sequence database:
    • Fasta
    • Blast
    • PSI-Blast using standard parameters with all combinations of
      • 3 iterations
      • 5 iterations
      • default E-value cutoff (0.005)
      • E-value cutoff 10E-6
    • HHsearch -- update: As some of you have discovered, HHsearch is not applicable to the NR database (if you don't understand why, think about how it works, then ask someone... ;-) ). Therefore, choose a recent version of the HMMs calculated for the PDB database to search in. (You will have to download the database of HMMs to run this on the command line, or use the online version, HHpred, for this query.)
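
The four PSI-Blast runs (two iteration counts times two E-value cutoffs) can be enumerated programmatically. The following is only a sketch under several assumptions: it targets the BLAST+ `psiblast` program, it treats the E-value cutoff as the profile inclusion threshold (`-inclusion_ethresh`), it reads "10E-6" as 10^-6, and the file names and the database name `nr` are placeholders. It only builds the argument lists and does not start any search.

```python
from itertools import product

# Values taken from the task description.
ITERATIONS = (3, 5)
# Assumption: "10E-6" in the task is read as 10^-6; adjust if 1e-5 was meant.
E_VALUES = ("0.005", "1e-6")


def psiblast_commands(query, db):
    """Return one psiblast argument list per (iterations, E-value) combination."""
    commands = []
    for n_iter, e_value in product(ITERATIONS, E_VALUES):
        # Hypothetical output naming scheme, one file per parameter combination.
        out_file = f"psiblast_it{n_iter}_e{e_value}.txt"
        commands.append([
            "psiblast",
            "-query", query,
            "-db", db,
            "-num_iterations", str(n_iter),
            # Assumption: the cutoff is the inclusion threshold for the profile.
            "-inclusion_ethresh", e_value,
            "-out", out_file,
        ])
    return commands


if __name__ == "__main__":
    # Placeholder query file and database name.
    for cmd in psiblast_commands("query.fasta", "nr"):
        print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run` once the paths have been adapted to the local installation.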

For evaluating the differences of the search methods:

  • compare the result lists (how much overlap, distribution of %identity and score)
  • check HSSP for "true positives" and see how many of these are found (-> recall)
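
Both comparisons reduce to set operations on hit identifiers. As a minimal sketch, assuming the hit lists of each method and the HSSP "true positives" have already been parsed into lists of identifiers (the parsing itself and the function names are not part of the task and are chosen here for illustration), overlap can be measured as shared hits divided by all hits, and recall as the fraction of true positives that a method found:

```python
def overlap(hits_a, hits_b):
    """Fraction of hits shared by two result lists (intersection over union)."""
    a, b = set(hits_a), set(hits_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recall(hits, true_positives):
    """Fraction of the true positives that appear in the hit list."""
    expected = set(true_positives)
    if not expected:
        return 0.0
    return len(set(hits) & expected) / len(expected)


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    blast_hits = ["P1", "P2", "P3"]
    psiblast_hits = ["P2", "P3", "P4"]
    hssp_true_positives = ["P2", "P3", "P4", "P5"]
    print(overlap(blast_hits, psiblast_hits))
    print(recall(blast_hits, hssp_true_positives))
    print(recall(psiblast_hits, hssp_true_positives))
```

Intersection over union is only one possible overlap measure; reporting the raw number of shared hits alongside it keeps the comparison easy to interpret.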

Multiple sequence alignments

Build multiple sequence alignments of 20 sequences from the database search, including sequences from these ranges:

  • 99 - 90% sequence identity
  • 89 - 60% sequence identity
  • 59 - 40% sequence identity
  • 39 - 20% sequence identity

Ideally, there should be 5 sequences from each range, with at least one PDB structure in each range.
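
One way to pick the sequences is to bin the search hits by percent identity and take up to five per bin, preferring hits with a known structure. The sketch below assumes each hit is available as a tuple `(sequence_id, percent_identity, has_pdb_structure)`; this tuple layout and the function names are hypothetical, and how the identity and structure information is obtained is left open.

```python
# Identity ranges from the task, as (low, high) in percent.
RANGES = [(90, 99), (60, 89), (40, 59), (20, 39)]


def select_by_identity(hits, per_range=5):
    """Pick up to `per_range` hits from each identity range.

    `hits` is a list of (seq_id, percent_identity, has_pdb) tuples.
    Fractional identities are binned with half-open intervals, so
    89.7 % falls into the 60-89 range; 100 % hits are left out.
    """
    selection = {}
    for low, high in RANGES:
        in_range = [h for h in hits if low <= h[1] < high + 1]
        # Hits with a PDB structure first, then by decreasing identity.
        in_range.sort(key=lambda h: (not h[2], -h[1]))
        selection[(low, high)] = in_range[:per_range]
    return selection


def ranges_without_structure(selection):
    """Return the ranges in which no selected hit has a PDB structure."""
    return [r for r, chosen in selection.items()
            if not any(h[2] for h in chosen)]
```

Ranges reported by `ranges_without_structure` need manual attention, for example by searching specifically for structures at that identity level.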

The alignment methods to use are:

  • Cobalt
  • ClustalW
  • Muscle
  • T-Coffee with
    • default parameters ("t_coffee your_sequences.fasta")
    • use of 3D-Coffee

Comparison of the alignments:

  • How many conserved columns?
  • Are functionally important residues conserved?
  • How many gaps?
  • Are there gaps in secondary structure elements?
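
The counting questions above can be answered with a few lines of code once an alignment is available in aligned FASTA format. This is a minimal sketch, not a replacement for inspecting the alignment in an editor such as Jalview. It assumes that gaps are written as `-`, that all aligned sequences have the same length, and, for the last function, that a secondary structure string with one character per alignment column (`H` for helix, `E` for strand) has been prepared separately. That string and all function names are hypothetical inputs chosen for illustration.

```python
def read_aligned_fasta(text):
    """Parse aligned FASTA text into a dict of name -> aligned sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def conserved_columns(seqs):
    """Count columns in which every sequence has the same residue (no gaps)."""
    return sum(1 for col in zip(*seqs)
               if col[0] != "-" and len(set(col)) == 1)


def gap_count(seqs):
    """Count all gap characters in the alignment."""
    return sum(s.count("-") for s in seqs)


def gaps_in_secondary_structure(seqs, ss_by_column):
    """Count gap characters in columns annotated as helix (H) or strand (E)."""
    return sum(col.count("-")
               for col, ss in zip(zip(*seqs), ss_by_column)
               if ss in "HE")


if __name__ == "__main__":
    # Toy alignment, for illustration only.
    toy = ">s1\nMKV-LA\n>s2\nMKI-LA\n>s3\nMRVGLA\n"
    seqs = list(read_aligned_fasta(toy).values())
    print(conserved_columns(seqs), gap_count(seqs),
          gaps_in_secondary_structure(seqs, "CHHHHC"))
```

Whether functionally important residues are conserved still has to be checked by hand, by looking up their positions in the alignment.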