Task 3 - Sequence-based predictions

From Bioinformatikpedia

Latest revision as of 08:31, 28 May 2013

In contrast to the vast number of known protein sequences, information about structure and function is available for only very few proteins (compare the size of TrEMBL with that of SwissProt and the Protein Data Bank). Sequence-based predictions of protein features aim to narrow this gap. Many sequence-based prediction methods use evolutionary information, so sequence alignments are often a prerequisite for the predictions.


Theoretical background talk

The talk will give an introduction to sequence-based protein predictions. In particular:

  • secondary structure
  • disorder
  • transmembrane helices
  • signal peptides
  • GO terms

The slides of the talk can be found here: https://www.dropbox.com/s/sgkulia7xhd9ewn/Keynote_SeqBasedPrediciton_v3_finished_stages.pdf (Sorry for the large file ;))

Where to run the jobs

  • You can log in to the student computer pool from outside: i12k-biolab??.informatik.tu-muenchen.de, where ?? ranges from 01 to 10.
  • Work in the student computer pool.
  • You can also install the programs on your own computer.
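
The host name pattern above can be expanded with a short loop. This is only a sketch: the function name biolab_hosts is made up for illustration, and <user> in the comment stands for your own account name.

```shell
# Print the ten pool host names, i12k-biolab01 ... i12k-biolab10.
# biolab_hosts is a hypothetical helper name, not an installed command.
biolab_hosts() {
  for i in $(seq -w 1 10); do
    printf 'i12k-biolab%s.informatik.tu-muenchen.de\n' "$i"
  done
}

biolab_hosts

# Example login from outside (replace <user> with your account name):
#   ssh <user>@i12k-biolab01.informatik.tu-muenchen.de
```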


Protein sequence databases

Up-to-date databases:

/mnt/project/pracstrucfunc13/data/swissprot/uniprot_sprot (BLAST db, May 7, 2013)

/mnt/project/pracstrucfunc13/data/pdb/pdb_seqres (BLAST db, May 7, 2013)


Questions to answer

  • What features are predicted?
  • Discuss the results for your protein and the example proteins. Using the predictions, what could you learn about your protein and the example proteins? Compare to the available knowledge in UniProt, PDB, DisProt, OPM, PDBTM, Pfam...
  • Look for other methods to get an idea of how many different tools are available to predict secondary structure, disorder, transmembrane helices, signal peptides and GO terms. You should be able to name several more methods in the discussion. (You can also try out more methods.)
  • What else can be (or already is) predicted from protein sequence alone?
  • Which predictions can be improved considerably by structure-based approaches?


Secondary structure

Use ReProf to predict secondary structure for your protein and also for these proteins (UniProt accession numbers):

  • P10775
  • Q9X0E6
  • Q08209
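
If you do not have the FASTA files yet, they can be downloaded by accession number. The sketch below assumes the current UniProt REST address (rest.uniprot.org), which is not mentioned on this page, and that curl is available; the function names are made up for illustration.

```shell
# Build the download URL for one UniProt accession.
# Assumption: UniProt serves FASTA at rest.uniprot.org/uniprotkb/<AC>.fasta
uniprot_fasta_url() {
  printf 'https://rest.uniprot.org/uniprotkb/%s.fasta\n' "$1"
}

# Download one entry to <AC>.fasta (needs network access and curl).
fetch_fasta() {
  curl -fsSL "$(uniprot_fasta_url "$1")" -o "$1.fasta"
}

# Show what would be downloaded for the three example proteins.
for ac in P10775 Q9X0E6 Q08209; do
  echo "would download $(uniprot_fasta_url "$ac") -> $ac.fasta"
done
```

Call fetch_fasta P10775 (and so on) to actually download the files.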

ReProf is installed on the student computers. Usage: reprof (see man reprof). As input, use a FASTA sequence as well as a position-specific scoring matrix (PSSM). The PSSM can be generated with PSI-BLAST or HHblits. ReProf expects the PSI-BLAST PSSM format, so HHblits output has to be converted first. As sequence databases, try big_80 and SwissProt. Run all combinations (plain sequence vs. PSSM input; PSSM from SwissProt vs. PSSM from big_80) for only one of the example proteins. Then decide on one approach and apply it to the other proteins.
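
One way to obtain a PSSM in PSI-BLAST format is sketched below. It assumes that the BLAST+ program psiblast is installed (older installations provide blastpgp instead, with -i, -d, -j and -Q as the corresponding options). The number of iterations and all file names are illustrative choices, and pssm_cmd is a made-up helper that only prints the command. How to pass the resulting file to ReProf is described in man reprof.

```shell
# Print a PSI-BLAST call that writes an ASCII PSSM for one query.
# Assumptions: BLAST+ "psiblast" is installed; 3 iterations is an
# arbitrary illustrative value; file names are hypothetical.
pssm_cmd() {
  query=$1; db=$2; out=$3
  printf 'psiblast -query %s -db %s -num_iterations 3 -out_ascii_pssm %s -out %s\n' \
    "$query" "$db" "$out" "${out%.pssm}.blast"
}

# SwissProt BLAST db of the practical (path taken from this page).
SPROT=/mnt/project/pracstrucfunc13/data/swissprot/uniprot_sprot

pssm_cmd P10775.fasta "$SPROT" P10775_sprot.pssm

# To run the command instead of printing it:
#   eval "$(pssm_cmd P10775.fasta "$SPROT" P10775_sprot.pssm)"
```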

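Compare the ReProf results to PsiPred (http://bioinf.cs.ucl.ac.uk/psipred/) and to the DSSP server (http://mrs.cmbi.ru.nl/hsspsoap/; DSSP itself: http://swift.cmbi.ru.nl/gv/dssp/). Note that DSSP does not predict anything: it assigns secondary structure from a known 3D structure. Find out more about the example proteins (and yours) using UniProt (http://www.uniprot.org) and the PDB (http://www.pdb.org).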
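
A simple way to quantify such a comparison is the percentage of positions at which two secondary structure strings agree. The sketch below assumes that both strings have the same length and already use the same three-state alphabet (for example H, E, L), so the eight DSSP states have to be reduced first. The function name and the two strings are made up for illustration.

```shell
# Percentage of identical positions between two equal-length
# secondary structure strings; prints NA if the lengths differ.
ss_agreement() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN {
    n = length(a)
    if (n == 0 || length(b) != n) { print "NA"; exit }
    same = 0
    for (i = 1; i <= n; i++)
      if (substr(a, i, 1) == substr(b, i, 1)) same++
    printf "%.1f\n", 100 * same / n
  }'
}

# Toy example: 9 of 10 positions agree.
ss_agreement HHHHEEEELL HHHLEEEELL   # prints 90.0
```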


Disorder

Use IUPred and MD (MetaDisorder) to predict disorder for your protein and for the example proteins given above.

IUPred

IUPred is installed on the student computers. Usage: iupred. Try the different prediction types: "long", "short" and "glob". You can find more information in /opt/iupred/ (README and example). Also try the IUPred server (http://iupred.enzim.hu/) and have a look at the graphical output.
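
To summarize a "long" or "short" prediction, you can count the residues whose score reaches 0.5, the cutoff commonly used to call a residue disordered. The sketch below assumes the usual IUPred text output (comment lines starting with #, then position, residue and score per line); check the README for the exact format of your version. The function name and the toy input are made up.

```shell
# Count residues with score >= 0.5 in IUPred "long"/"short" output.
# Reads files given as arguments, or standard input if none are given.
disordered_fraction() {
  LC_ALL=C awk '!/^#/ && NF >= 3 { n++; if ($3 >= 0.5) d++ }
    END { printf "%d/%d\n", d, n }' "$@"
}

# Toy input (not a real prediction): two of four residues are disordered.
printf '# toy example\n1 M 0.81\n2 A 0.62\n3 K 0.31\n4 L 0.12\n' | disordered_fraction

# With real output, for example:
#   iupred P10775.fasta long > P10775.iupred_long
#   disordered_fraction P10775.iupred_long
```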


MD (MetaDisorder)

MD is a meta-predictor that combines several prediction methods:

  • NORSnet: treats disorder as unstructured loops (regions without regular secondary structure)
  • PROFbval: predicted residue flexibility, analogous to B-values from X-ray crystallography
  • Ucon: prediction of protein disorder using predicted internal contacts

MD is part of PredictProtein, a Rostlab resource. You can run MD, for example, like this:

predictprotein --seqfile <seq.fasta> --target metadisorder -p <name of output files> -o <output-dir>

Check the man pages for predictprotein and metadisorder. The output file ending in .metadisorder gives, among other things, the final MD decision on disorder. MDrel is the reliability of the MD prediction; values range from 0 to 9, where 9 indicates a strong prediction.
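
The call shown above can be repeated for all example proteins with a loop. In this sketch, the file naming (<AC>.fasta), the output directory md_out and the helper md_cmd are illustrative assumptions; only the predictprotein options are taken from this page. The helper prints the commands so that you can check them before running anything.

```shell
# Print the MD call for one accession (options as given on this page).
# Assumptions: the sequence is stored as <AC>.fasta; md_out is a
# hypothetical output directory.
md_cmd() {
  printf 'predictprotein --seqfile %s.fasta --target metadisorder -p %s -o %s\n' \
    "$1" "$1" "$2"
}

for ac in P10775 Q9X0E6 Q08209; do
  md_cmd "$ac" md_out
done

# To run a printed command: eval "$(md_cmd P10775 md_out)"
```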


Compare the results to the information in the DisProt database (http://www.disprot.org/). Note: DisProt does not cover the complete UniProt list of sequences. If your protein is not in DisProt, you can use "Search by sequence" in DisProt to look for similar proteins. Look for at least one other (interesting!) disordered protein, for example in DisProt.


Transmembrane helices

Use PolyPhobius and MEMSAT-SVM to predict transmembrane helices for your protein and for the following proteins (UniProt accession numbers given):

  • P35462
  • Q9YDF8
  • P47863


PolyPhobius

In contrast to its precursor Phobius, PolyPhobius uses homology information for the prediction. The input is a multiple sequence alignment (MSA) in Kalign format instead of a single FASTA sequence. You can use any MSA, for example one generated during Task 2. PolyPhobius also provides scripts for the individual steps.

PolyPhobius is installed in /mnt/project/pracstrucfunc13/polyphobius/

  • Perl script blastget (/mnt/project/pracstrucfunc12/polyphobius/blastget) to run BLAST. You can use any sequence database, for example big_80. However, we recommend SwissProt: it is fast, and the results usually do not improve with a larger database. blastget needs an index of your BLAST db. You can use this index: /mnt/project/pracstrucfunc13/data/index_pp/uniprot_sprot.idx, which belongs to the SwissProt BLAST db here: /mnt/project/pracstrucfunc13/data/swissprot/uniprot_sprot, but try to generate your own (see blastget -h).
  • Kalign (/mnt/opt/T-Coffee/bin/kalign) to generate the MSA.
  • jphobius (/mnt/project/pracstrucfunc13/polyphobius/jphobius) with the MSA as input. Do not forget the -poly parameter.
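
The three steps above can be chained as sketched below. Only the paths and the -poly parameter are taken from this page; -db and -ix are the blastget parameters for database and index, and -i/-o are the Kalign input and output options. How each program takes its input file and where it writes its result (shown here as a trailing file argument and a redirect) are assumptions that you should check with blastget -h and the program documentation. All file names and the helper polyphobius_steps are made up; the helper only prints the commands.

```shell
# Paths from this page.
PP=/mnt/project/pracstrucfunc13/polyphobius
DB=/mnt/project/pracstrucfunc13/data/swissprot/uniprot_sprot
IX=/mnt/project/pracstrucfunc13/data/index_pp/uniprot_sprot.idx

# Print the three pipeline steps for one accession.
# Assumption: input as last argument, result on standard output.
polyphobius_steps() {
  ac=$1
  # 1. collect homologues with blastget
  echo "/mnt/project/pracstrucfunc12/polyphobius/blastget -db $DB -ix $IX $ac.fasta > $ac.hits.fasta"
  # 2. align them with Kalign
  echo "/mnt/opt/T-Coffee/bin/kalign -i $ac.hits.fasta -o $ac.msa"
  # 3. predict transmembrane helices from the alignment
  echo "$PP/jphobius -poly $ac.msa > $ac.polyphobius"
}

polyphobius_steps P35462
```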

Try the PolyPhobius server (http://phobius.sbc.su.se/poly.html) and have a look at the graphical output.

MEMSAT-SVM

MEMSAT-SVM uses support vector machines (SVMs) to predict transmembrane helices. The talk introduced MEMSAT-3 (same group, HMMs). We use MEMSAT-SVM because it performed better than MEMSAT-3 in an evaluation. However, it is also very slow. Use the MEMSAT server (http://bioinf.cs.ucl.ac.uk/psipred/?memsatsvm=1). MEMSAT-SVM is not yet running on the biolab computers; this may change soon (/mnt/project/pracstrucfunc13/memsat-svm/).

Compare the results to the membrane assignment of the structures for these proteins in OPM (http://opm.phar.umich.edu/) and PDBTM (http://pdbtm.enzim.hu/).
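
For this comparison it helps to count how many predicted helices overlap a helix from OPM or PDBTM. The sketch below takes both sets as comma-separated start-end ranges and counts a predicted helix as matched if it shares at least 5 residues with a reference helix. The cutoff of 5, the function name and the example ranges are illustrative assumptions, not part of the task.

```shell
# Count predicted helices that overlap a reference helix by >= 5 residues.
# Arguments: predicted ranges, reference ranges (e.g. "10-30,45-66").
tm_overlap() {
  LC_ALL=C awk -v p="$1" -v r="$2" 'BEGIN {
    np = split(p, P, ","); nr = split(r, R, ","); hit = 0
    for (i = 1; i <= np; i++) {
      split(P[i], a, "-")
      for (j = 1; j <= nr; j++) {
        split(R[j], b, "-")
        lo = (a[1] + 0 > b[1] + 0) ? a[1] + 0 : b[1] + 0   # later start
        hi = (a[2] + 0 < b[2] + 0) ? a[2] + 0 : b[2] + 0   # earlier end
        if (hi - lo + 1 >= 5) { hit++; break }
      }
    }
    printf "%d of %d predicted helices match\n", hit, np
  }'
}

# Toy example: the third predicted helix has no counterpart.
tm_overlap "10-30,45-66,80-100" "12-32,50-70"   # prints: 2 of 3 predicted helices match
```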

Signal peptides

Use SignalP to predict signal peptides for the following proteins:

  • P02768
  • P47863
  • P11279

You can look for more example proteins with different signal peptides or targeting signals.

You can run signalp in the student computer pool. The installed version is currently 3.0; we are trying to update to version 4.1. Please note which version you are using. You can also use the SignalP server (http://www.cbs.dtu.dk/services/SignalP/), which runs version 4.1. For one of the example proteins, SignalP 3.0 and 4.1 gave different predictions.

signalp4

You can use the Signal Peptide Website (http://www.signalpeptide.de/index.php) to look up the proteins. Also check for predicted transmembrane helices.

GO terms

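Use GOPET (http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar) and ProtFun2.0 (http://www.cbs.dtu.dk/services/ProtFun/) to predict GO terms for your protein. What can you learn about its function from sequence alone?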

Use Pfam (http://pfam.sanger.ac.uk/) -> SEQUENCE SEARCH to find out more about the Pfam family of your protein.