Task alignments


Most prediction methods are based on comparisons to related proteins. The search for related sequences and their alignment to other proteins are therefore prerequisites for most of the analyses in this practical. Hence we will investigate the recall and alignment quality of different alignment methods.

Theoretical background talks

The introductory talks should give an overview of

  • pairwise alignments and high-throughput profile searches (e.g. Fasta, Blast, PSI-Blast, HHsearch)
  • multiple alignments (e.g. ClustalW, Probcons, Mafft, Muscle, T-Coffee, Cobalt) and MSA editors (e.g. Jalview)

with special attention to the advantages and limitations of these methods.

You may want to look at my (Ariane's) talk again. Here is the link: https://dl.dropboxusercontent.com/u/9441182/SequenceAlignmentPresentation.pdf

Where to run the analyses

  • You can run the analyses on your own computers.
  • You can also use the student computer pool: i12k-biolab0n.informatik.tu-muenchen.de, where n goes from 1 to 9 (or more?). The file server does not have blast etc. installed!

DATA updating finished (April 30) Tim updated the data in /mnt/project/rost_db/data/ for big and hhblits. The process finished around 5 pm on April 30th. andrea

Please note: check the page Resource software for updates about shared scripts ;)

Sequence searches

For every native protein sequence of every disease, employ different tools for database searching. The methods to employ (at a minimum) are:

  • Searches of the non-redundant sequence database big_80:
    • Blast
    • PSI-Blast using standard parameters with all combinations of
      • 2 iterations
      • 10 iterations
      • default E-value cutoff (0.002)
      • E-value cutoff 10E-10
    • HHblits (HHsearch) using standard parameters; since there is no big_80 for HHblits, search against Uniprot instead
  • Data can be found in /mnt/project/pracstrucfunc13/data/ (not updated!) or /mnt/project/rost_db/data/
  • big includes all of Uniprot and pdb, so a Blast search on big will also return pdb hits. In addition, Edda put a blastable pdb database in /mnt/home/rost/kloppmann/data/blast_db/

Note: Save intermediate files, e.g. the a3m and hhm files for the alignments and HMMs generated by HHblits, or the checkpoint files (PSSMs) for PSI-Blast. We will reuse them later. A minimal command sketch follows below.
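
The commands below are a minimal sketch of these searches, assuming BLAST+ (blastp, psiblast) and HH-suite (hhblits) are on your path. The query file name (P12345.fasta) and the database paths are placeholders that you have to adapt to your setup (e.g. to the databases under /mnt/project/rost_db/data/); flag names differ for the older blastall/blastpgp programs.

  # Plain Blast against big_80; tabular output (-outfmt 6) is easy to parse later
  blastp -query P12345.fasta -db /path/to/big_80 -evalue 0.002 -outfmt 6 -out P12345.blastp.tab

  # PSI-Blast: 10 iterations, inclusion E-value cutoff 10e-10, checkpoint (PSSM) saved for reuse;
  # run the other iteration/cutoff combinations analogously
  psiblast -query P12345.fasta -db /path/to/big_80 -num_iterations 10 -inclusion_ethresh 10e-10 \
      -out_pssm P12345.pssm -save_pssm_after_last_round -outfmt 6 -out P12345.psiblast.tab

  # HHblits against the clustered Uniprot database; keep the a3m and hhm files
  hhblits -i P12345.fasta -d /path/to/uniprot20 -o P12345.hhr -oa3m P12345.a3m -ohhm P12345.hhm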


For evaluating the differences between the search methods:

  • compare the result lists (e.g. how much overlap, distribution of %identity and E-values; see the sketch after this list)
  • validate the result lists, e.g. (you do not need to do all of these!)
    • using CATH, SCOP or COPS (/mnt/project/pracstrucfunc13/data/COPS/COPS-ChainHierarchy.txt) to check whether found pdb entries fall into the same fold class
    • using GO to check whether sequences have common GO classifications
    • any other ideas how you could validate that the hits are really related?
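
A rough sketch for the overlap and %identity comparisons, using standard Unix tools. It assumes tabular (-outfmt 6) Blast output in which column 2 is the subject identifier and column 3 the percent identity; file names are placeholders, and HHblits output needs its own parsing.

  # Unique hit identifiers per method (sort -u also removes duplicates from the PSI-Blast iterations)
  cut -f2 P12345.blastp.tab   | sort -u > blast_hits.txt
  cut -f2 P12345.psiblast.tab | sort -u > psiblast_hits.txt

  comm -12 blast_hits.txt psiblast_hits.txt | wc -l   # hits found by both methods
  comm -23 blast_hits.txt psiblast_hits.txt | wc -l   # Blast only
  comm -13 blast_hits.txt psiblast_hits.txt | wc -l   # PSI-Blast only

  # Rough %identity histogram in 10% bins
  awk '{ b = int($3/10)*10; n[b]++ } END { for (b in n) print b "-" (b+9) "%", n[b] }' P12345.blastp.tab | sort -n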


Note: When we do comparisons, the data needs to be comparable. Therefore:

  • Check the outcome of your simple blast search. If there are many significant hits, increase the number of reported hits (-v, -b or max_target_seqs, depending on blast version and output format) until no more relevant hits are found. Use that parameter also for the PSI-Blast searches, and use a similar setting for HHblits (think about why we ask you to do this); see the sketch after this list.
  • And of course: If your PSI-Blast and HHblits searches hit the limit (and your blast search didn't), also increase the number of reported hits!
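
A sketch of how the reported-hit limits can be raised; the flag names depend on the BLAST generation and output format (legacy blastall/blastpgp use -v/-b), and the values below are arbitrary examples.

  # BLAST+ with tabular output: -max_target_seqs (for plain text output use -num_descriptions/-num_alignments)
  blastp   -query P12345.fasta -db /path/to/big_80 -outfmt 6 -max_target_seqs 5000 -out P12345.blastp.tab
  psiblast -query P12345.fasta -db /path/to/big_80 -num_iterations 10 -outfmt 6 -max_target_seqs 5000 -out P12345.psiblast.tab

  # HHblits: -Z limits the hits in the summary list, -B the alignments in the output
  hhblits -i P12345.fasta -d /path/to/uniprot20 -Z 5000 -B 5000 -o P12345.hhr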

CAVE: If your data set gets large, the PSI-Blast searches will take a while.

Note: There are a few catches that arise from the differences in how the tools operate:

  • HHblits searches against the clustered Uniprot version. In the output the cluster representatives are listed together with the cluster members.
    • If you compare the representatives against a PSI-Blast result for big_80, you will get more hits for big_80.
    • If you compare the representatives plus the cluster members against big_80, you will get fewer hits for big_80.
  • Come up with a way to generate comparable results. (There is also a complete database "big" which you can use for searching, reusing the profiles from your big_80 search; see the sketch after this list. Think about why we don't ask you to start out with a search against big.)
  • big_80 is generated with CD-HIT, which prefers long sequences over shorter ones. Hence the number of pdb hits in your big_80 search is going to be low. Likewise, the Uniprot database for hhblits does not contain pdb structures. So, if you want to do the quality check using structure data, come up with a way to generate comparable results.
  • To get all hits in pdb with HHblits (not just clustered hits), you can also use the pdb_full database.
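
A sketch of how the profiles from the big_80/Uniprot searches can be reused for a second search, as suggested in the list above; database paths and file names are placeholders.

  # Reuse the PSI-Blast checkpoint (PSSM) from the big_80 run to search the complete "big" database
  psiblast -in_pssm P12345.pssm -db /path/to/big -outfmt 6 -out P12345.big.psiblast.tab

  # Reuse the HHblits a3m alignment from the Uniprot run to search pdb_full
  hhblits -i P12345.a3m -d /path/to/pdb_full -o P12345.pdb.hhr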

Multiple sequence alignments

For calculating multiple sequence alignments, create a dataset of diverse sequences (a selection sketch follows the list below). Generate groups of 10 (20 for the third group) sequences where:

  • one contains only sequences with low sequence identity (<30%), ideally also with low mutual similarity among the group members
  • one contains only sequences with high sequence identity (>60%)
  • one contains sequences covering the whole range of sequence identity.

Ideally there should be at least two sequences with pdb-structures in each group. You can use the structure of your target sequence as a second structure for 3D-Coffee.
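
One possible way to pick candidate sequences for these groups from a tabular Blast result (assumption: column 2 is the subject identifier, column 3 the percent identity to the query; file names are placeholders). The low-identity group additionally needs a check of the mutual similarity, e.g. an all-against-all comparison of the candidates.

  # Candidates with low identity (<30%) and high identity (>60%) to the query
  awk '$3 < 30' P12345.psiblast.tab | cut -f2 | sort -u | head -n 10 > group_low.ids
  awk '$3 > 60' P12345.psiblast.tab | cut -f2 | sort -u | head -n 10 > group_high.ids

  # For the mixed group, sample hits across the whole identity range, by hand or with similar filters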

The alignment methods to use on each of these groups are (a command sketch follows the list):

  • ClustalW
  • Muscle
  • T-Coffee with
    • default parameters (t_coffee your_sequences.fasta)
    • use of 3D-Coffee
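
A minimal sketch of the corresponding runs, assuming the tools are on your PATH (e.g. after adding /mnt/opt/T-Coffee/bin/); group1.fasta is a placeholder, and binary names and exact flags can differ between tool versions.

  clustalw -INFILE=group1.fasta -OUTFILE=group1.clustalw.aln
  muscle -in group1.fasta -out group1.muscle.afa
  t_coffee group1.fasta

  # 3D-Coffee: one option is the expresso mode, which maps the sequences to pdb templates
  # (assumes your T-Coffee installation supports it); alternatively supply your own
  # structures via a template file.
  t_coffee group1.fasta -mode expresso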

Note:

  • ClustalW should be on your path on the student machines; there is a version of T-Coffee in /mnt/opt/T-Coffee/bin/. If you include that directory in your path, you also have muscle.
  • You may also use MAFFT if you like. Probably you should then leave out one of the other methods. ;-)


Compare your alignments (qualitatively! You do not need to run statistics!). Things to look for (a quick counting sketch follows the list) are:

  • How many conserved columns?
  • How many gaps?
  • Are functionally important residues conserved?
  • Are there gaps in secondary structure elements?
  • Where do functionally important residues stand out most?
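
For a first, rough quantification on a FASTA-formatted alignment (file name is a placeholder; for visual inspection an MSA editor such as Jalview is more convenient):

  # Total number of gap characters in the alignment
  grep -v '^>' group1.muscle.afa | grep -o '[-]' | wc -l

  # Number of columns in which all sequences carry the same, non-gap residue
  awk '/^>/ { if (seq != "") seqs[++n] = seq; seq = ""; next }
       { seq = seq $0 }
       END { seqs[++n] = seq
             for (i = 1; i <= length(seqs[1]); i++) {
               c = substr(seqs[1], i, 1); same = (c != "-")
               for (s = 2; s <= n; s++) if (substr(seqs[s], i, 1) != c) same = 0
               cons += same }
             print cons " fully conserved columns" }' group1.muscle.afa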


Points for discussion:

  • Observe how the sequence identity in the groups of sequences influences the alignments.
    • Do all methods cope with low similarity?
    • Are residues that are aligned in the high similarity group still aligned when the low similarity sequences are added?
  • Does the incorporation of structural information (3D Coffee) help?
    • Does it make a difference how many structures you include?
  • Overall, what would be your criteria for a good alignment?
  • Based on your experience, which method would you like to use in the future?