Difference between revisions of "Resource software"

From Bioinformatikpedia
(Task 3)
 
(63 intermediate revisions by 11 users not shown)
Line 1: Line 1:
 
Here, we collect descriptions of the software used in the practical. This can be software used in online portals or software installed locally on your own computers or the lab resources. In each case, please describe how to access the software and where to find manuals. Also use this site to collect scripts or HOW_TOs that could be useful for others.
 
   
== Mapping sequence identifiers ==
+
== Your own scripts ==
   
  +
If you have produced a script that does something that could be useful for others, please "publish" it here. E.g. create a page for your tool where you provide information where to find the software (path on the student cluster, git repository, ...) and how to use it. -- For users: If you use a script produced by another group, please document that (e.g. in the "lab book" part of your wiki section). And if you find bugs, please help the other group improve.
=== Protein sequence databases ===
 
Here are a number of suggestions how to map identifiers between [http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions Refseq] and other identifiers contained in NR and [http://www.uniprot.org/help/mapping Uniprot]. -- Be careful: Refseq identifiers have version numbers (.n). Not all mapping tools take these into account or are able to cope with the attached version number. So if you use a mapping tool, look for the documentation of the input and output.
 
* Use the mapping tool at [http://www.uniprot.org/help/mapping Uniprot] (see also [http://www.uniprot.org/faq/44 Uniprot FAQ]). Be careful to separate out the different types of identifiers (Genbank, Refseq, ...).
 
* Use [http://mips.gsf.de/genre/proj/cronos/index.html CRONOS] on the web or as a web service.
 
* Use SRS (sequence retrieval system) on the web (e.g. at [http://srs.embl.de/srs/ EMBL], [http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+srsq2+-noSession EBI], or any [http://www.biowisdom.com/download/srs-parser-and-software-downloads/public-srs-installations/ other installation]) to do a query for a list of sequences. You can also do this programatically, but then you have to have access to and learn how to use the [http://www.mendeley.com/research/sws-accessing-srs-sites-contents-through-web-services/ web services].
 
   
  +
=== Task 2 ===
   
  +
==== Executing blastp, blastpgp and hhblits ====
== Changing Blast output ==
 
  +
A script [[run.pl]] executes blastp; blastpgp and hhblits with different options: databases, number of iterations and E-value cutoffs. Also uses checkfiles for blastpgp and outputs PSSMs.
   
  +
==== Convert hhr to parseable tsv format ====
By default, Blast lists 500 search hits and 250 alignment details. This can be changed (see [http://www.ncbi.nlm.nih.gov/books/NBK1763/ Blast manual] for details):
 
  +
A C program for extraction of statistics results from hhblits output (hhr format) to a tsv (tab separated values) file:
* You can use a custom output format to get a table with "-m 8" (see "-help" or [http://bergmanlab.smith.man.ac.uk/?p=41 this hint on how to parse Blast output]).
 
  +
[https://github.com/uheeschen/hhr2tsv.git hhr2tsv.git]
* You can use "-b" to set the number of alignments to be shown, "-b 20000" is the maximum.
 
   
  +
* Install:
== Python ==
 
  +
<nowiki>
In the Linux version of the virtual machine seems to be a falsely configured python. Probably you have already noticed, that the interactive python session starts with several errors. If you import sys in this interactive session, you can regard sys.path where non-existing paths are listed.
 
  +
./configure --prefix=$HOME
  +
make install
  +
make clean</nowiki>
  +
* Usage:
  +
<nowiki>$HOME/bin/hhr2tsv <input_hhr_file> <output_tsv_file></nowiki>
   
  +
==== Parser of (Psi-)BLAST and HHblits hhr output files ====
By using
 
  +
A script [[parse_output.pl]] parses alignments information from (Psi-)BLAST and HHblits hhr output files into tab-separated format, suitable for plotting, calculates the number of hits and overlap of hits with same ID between (Psi-)BLAST and HHblits outputs. Moreover, there is an option to evaluate PDB hits against COPS and create files for plotting.
   
  +
==== Plotting of TPR and precision ====
<code>which python</code>
 
  +
The script [[tpr_precision.pl]] bases on the output files of [[parse_output.pl]] and it makes an R-plot of TPR and precision as a function of E-value of the hits.
   
  +
==== Script for comparing the CATH fold classes of the query and the PDB hits ====
you can see that the falsely configured python is /apps/bin/python.
 
  +
The script [[compareCath.py]] reads the output from parse_output.pl (see above) and compares the fold classes of the query domains with the fold classes of the hits and writes a histogram to stdout.
But there should be a running version /usr/bin/python, which is a symlink to /usr/bin/python2.7.
 
   
  +
==== Script for finding GOAnnotations ====
Just delete the false one by using: <code> sudo rm /apps/bin/python </code>
 
  +
This Script finds GOAnnotations for a given Protein and creates an outfile.<br>
  +
The script can be found [https://dl.dropboxusercontent.com/u/9441182/goAnnotation.py here]<br>
  +
A typical command would be: python goAnnotation.py B2JCG3 /Desktop/result.out
   
  +
=== Task 3 ===
If you open a new console the normal user should be able to run a normal python session.
 
  +
==== Script to filter the secondary structure of reprof output files ====
  +
This script reads the output of a ReProf, a PsiPred or a DSSP run and filters for the secondary structure: [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Phenylketonuria/Task3/Scripts#filter_secStruc.pl filter_secStruc.pl]
   
  +
==== Calculation precision between two secondary structure sequences ====
The problem with this procedure is, that you have to reinstall missing modules (e.g. the Modeller's modules).
 
  +
This script reads two sequences in the format given by the output of filter_secStruc.pl and calculates the precision between those: [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Phenylketonuria/Task3/Scripts#SecStrucComparison.jar SecStrucComparison.jar]
  +
==== Automatically run Polyphobius ====
  +
This script combines <tt>blastget</tt>, <tt>kalign</tt> and <tt>jphobius</tt> together. Run polyphobius should be easier for us. You can find the code [[polyphobius.pl|here]] ([[polyphobius.pl]]).
   
== Modeller ==
+
=== Task 5 ===
  +
==== Wrapper program for modeller ====
(Written for the linux virtual machine)
 
  +
This program intends to help you to run modeller automatically. Both single template modelling and multi-template modelling are supported.
Some might notice that after the proposed Python fix (and also before) python could not load the Modeller modules. Therefore it seems to be necessary to reinstall Modeller after the python fix.
 
  +
You can find the program here: [[Modeller.py]]
   
  +
Usage:
* First you have to save/note/remember the licence key of the already installed Modeller version: <br> <code>less /apps/modeller9.9/modlib/modeller/config.py</code>
 
  +
<nowiki>Usage: Modeller.py \
  +
--template template.pdb[,template2.pdb[,template3.pdb]] \
  +
--chain chain[,chain2,[chain3]] --target target.pir \
  +
--align alignment-file.ali [--has-align]</nowiki>
   
  +
It has following parameters:
* Then you can get the Modeller package from http://salilab.org/modeller/download_installation.html <br> There should be a "Linux (64-bit x86_64 Debian/Ubuntu package)" version of the package. Download it and install it with the Synaptic Package Manager.
 
  +
* <tt>--template</tt> template protein structures (multiple structures are separated by commas)
  +
* <tt>--chain</tt> selected chains from template protein structures (multiple templates are separated by commas)
  +
* <tt>--target</tt> target sequence in PIR
  +
* <tt>--align</tt> file to store sequence alignments
  +
* <tt>--has-align</tt> OPTIONAL: whether we have already have alignment file
   
  +
=== Convert table to wiki format ===
* If you installed the new Modeller you can add the licence key to this installation in: <br> <code>/usr/lib/modeller9.9/modlib/modeller/config.py</code>
 
   
  +
R script for formatting a table (e. g. in csv format) to wiki table format: [[format_to_wiki_table.r]]
Now your Modeller should be runnable.
 
   
  +
=== Task 8 ===
   
  +
==== Calculator for residue conservation in MSA ====
  +
This [[msa-conservation.py|python script]] helps you to calculate conservation of residues in a MSA in FASTA format. First entry in the FASTA file should be the query sequence.
   
  +
Usage: <nowiki>python msa-conservation.py <MSA.fasta> residue_pos1,[residue_pos2,[...]]</nowiki>
   
  +
== Molecular visualization ==
Troubleshooting
 
   
  +
To look at protein structures you can use any molecular visualization program. Here are a few options:
A very common error from Modeller is the following: ''' "Sequence difference between alignment and pdb" '''. This usually means the structure of the template available in PDB (which was experimentally solved) has missing residues, what could be a result of technical problems with the X-ray diffraction data (more frequently). To overcome this error, the first step should be to identify its source. For that, a fasta sequence can be generated from the PDB file containing the coordinates (it should be a simple script, or even a couple of command lines, but if any help needed, please write an email to bitar@rostlab.org). After this, align this sequence with the fasta sequence for the same protein and check for missing residues (gaps within the alignment). If residues are missing, simple remove those from the original sequence and generate a new alignment between template and target.
 
  +
* [http://www.pymol.org/ PyMOL] -> installed on the i12k-biolab computers
  +
* [http://jmol.sourceforge.net/ Jmol], e.g. via [http://www.pdb.org/ PDB]
  +
* [http://www.ks.uiuc.edu/Research/vmd/ VMD]
  +
* the [http://srs3d.org/ SRS 3D server] -- unfortunately not working any more. Maybe Aquaria will become publicly available within this practical.
   
  +
== Changing Blast output ==
   
  +
By default, Blast lists 500 search hits and 250 alignment details. This can be changed (see [http://www.ncbi.nlm.nih.gov/books/NBK1763/ Blast manual] for details):
== SNAP ==
 
  +
* You can use a custom output format to get a table with "-m 8" (see "-help" or [http://bergmanlab.smith.man.ac.uk/?p=41 this hint on how to parse Blast output]).
There is a very brief explanation about SNAP available here --> [[Media:SNAP.pdf]].
 
  +
* You can use "-b" to set the number of alignments to be shown, "-b 20000" is the maximum.
I am currently writing a script for automatic SNAP runs and structural comparison. This will only require the alignment between two sequences (in our case one would be the 'wild type' protein sequence and the other would be the SNP containing sequence). I am not sure I will finish today (23.6). Probably tomorrow. But meanwhile I have other scripts that may help, so feel free to write me (bitar@rostlab.org).
 
   
   
  +
== Modeller ==
   
  +
Troubleshooting
== Energy Minimization ==
 
I created a script to automatically run all steps (see below) for energy minimization with Gromacs. A snapshot of this script is available here [[Media:MutEn.png]] and the script itself is here [[MutEn.pl]] .
 
The script includes the following steps:
 
   
  +
A very common error from Modeller is the following: ''' "Sequence difference between alignment and pdb" '''. This usually means that the experimentally solved template structure in the PDB has missing residues, which can result from technical problems with the X-ray diffraction data. You therefore need to make sure that the target-template alignment uses the sequence implied by the ATOM records, not the SEQRES records. To locate the error, you could e.g. generate a FASTA sequence from the coordinates in the PDB file, align it with the FASTA sequence from the SEQRES records and check for missing residues (gaps within the alignment). If residues are missing, regenerate your target-template alignment based on the new FASTA sequence made from the coordinates.
1. Runs SCWRL to make sure there are no missing sidechains.
 
   
  +
== SNAP ==
2. Runs repairPDB to clean the PDB and extract the protein only.
 
  +
There is a very brief explanation about SNAP available here --> [[Media:SNAP.pdf]].
   
3. Runs Gromacs packages for Energy Minimization (in the following order):
+
== Energy Minimization ==
  +
There is a script to automatically run energy minimizations with Gromacs here --> [[MutEn.pl]].
 
PDB2GMX
 
 
Create MDP file
 
 
GROMPP
 
 
MDRUN
 
   
Create file for analysis
 
   
  +
== R ==
Any questions, please write me an email (bitar@rostlab.org).
 
  +
Error in hist.default(a$V2, main = "evals") : 'x' must be numeric
  +
Blast uses non-standard scientific notation and omits the leading 1 for E-values like 'e-190'. Change it to '1e-190' and R will stop complaining.

Latest revision as of 11:15, 13 August 2013

Here, we collect descriptions of the software used in the practical. This can be software used in online portals or software installed locally on your own computers or the lab resources. In each case, please describe how to access the software and where to find manuals. Also use this site to collect scripts or HOW_TOs that could be useful for others.

Your own scripts

If you have produced a script that does something that could be useful for others, please "publish" it here. E.g. create a page for your tool where you provide information where to find the software (path on the student cluster, git repository, ...) and how to use it. -- For users: If you use a script produced by another group, please document that (e.g. in the "lab book" part of your wiki section). And if you find bugs, please help the other group improve.

Task 2

Executing blastp, blastpgp and hhblits

The script run.pl executes blastp, blastpgp and hhblits with different options: databases, number of iterations and E-value cutoffs. It also uses checkpoint files for blastpgp and outputs PSSMs.

Convert hhr to parseable tsv format

A C program that extracts the statistics from hhblits output (hhr format) into a tsv (tab-separated values) file: hhr2tsv.git

  • Install:
./configure --prefix=$HOME
make install
make clean
  • Usage:
$HOME/bin/hhr2tsv <input_hhr_file> <output_tsv_file>

Parser of (Psi-)BLAST and HHblits hhr output files

The script parse_output.pl parses alignment information from (Psi-)BLAST and HHblits hhr output files into a tab-separated format suitable for plotting. It also calculates the number of hits and the overlap of hits with the same ID between the (Psi-)BLAST and HHblits outputs. Optionally, it evaluates PDB hits against COPS and creates files for plotting.

Plotting of TPR and precision

The script tpr_precision.pl is based on the output files of parse_output.pl and makes an R plot of TPR (true positive rate) and precision as a function of the E-value of the hits.
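
To make clear what such a plot shows, here is a minimal Python 3 sketch of how TPR and precision can be computed for growing E-value cutoffs. It is not taken from tpr_precision.pl. The function name, the input format (pairs of E-value and a true/false label) and the argument total_positives are assumptions made for illustration.

```python
def tpr_precision_curve(hits, total_positives):
    """Return (evalue_cutoff, tpr, precision) for each hit, sorted by E-value.

    hits: iterable of (evalue, is_true_positive) pairs (assumed input format).
    total_positives: number of true positives that could have been found.
    Hits with identical E-values each produce their own point.
    """
    curve = []
    true_pos = 0
    false_pos = 0
    for evalue, is_true_positive in sorted(hits, key=lambda hit: hit[0]):
        if is_true_positive:
            true_pos += 1
        else:
            false_pos += 1
        tpr = true_pos / total_positives
        precision = true_pos / (true_pos + false_pos)
        curve.append((evalue, tpr, precision))
    return curve


if __name__ == "__main__":
    example = [(1e-50, True), (1e-10, False), (1e-30, True), (0.5, True)]
    for point in tpr_precision_curve(example, total_positives=4):
        print("%g\t%.2f\t%.2f" % point)
```

Each output row is one point of the curve: all hits up to that E-value count as accepted.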

Script for comparing the CATH fold classes of the query and the PDB hits

The script compareCath.py reads the output from parse_output.pl (see above) and compares the fold classes of the query domains with the fold classes of the hits and writes a histogram to stdout.

Script for finding GOAnnotations

This script finds GO annotations for a given protein and writes them to an output file.
The script can be found here
A typical command would be: python goAnnotation.py B2JCG3 /Desktop/result.out

Task 3

Script to filter the secondary structure of reprof output files

This script reads the output of a ReProf, PsiPred or DSSP run and filters for the secondary structure: filter_secStruc.pl

Calculating precision between two secondary structure sequences

This script reads two sequences in the format given by the output of filter_secStruc.pl and calculates the precision between them: SecStrucComparison.jar
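
As an illustration of what a per-state precision looks like, here is a minimal Python 3 sketch. It does not reproduce SecStrucComparison.jar. It assumes that both sequences are plain strings of equal length over the states H, E and C, and it defines precision for a state as the fraction of positions predicted in that state that have the same state in the reference. The function name is made up.

```python
def per_state_precision(predicted, reference, states="HEC"):
    """Precision per secondary structure state.

    predicted, reference: equal-length strings such as "HHHEECC" (assumed format).
    Returns a dict state -> precision, or None if the state was never predicted.
    """
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    result = {}
    for state in states:
        predicted_positions = [i for i, s in enumerate(predicted) if s == state]
        if not predicted_positions:
            result[state] = None  # precision is undefined without predictions
            continue
        correct = sum(1 for i in predicted_positions if reference[i] == state)
        result[state] = correct / len(predicted_positions)
    return result


if __name__ == "__main__":
    print(per_state_precision("HHHHEECC", "HHHCEECC"))
```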

Automatically run Polyphobius

This script chains blastget, kalign and jphobius, which makes running Polyphobius easier. You can find the code here (polyphobius.pl).

Task 5

Wrapper program for modeller

This program helps you run Modeller automatically. Both single-template and multi-template modelling are supported. You can find the program here: Modeller.py

Usage:

Modeller.py \
	--template template.pdb[,template2.pdb[,template3.pdb]] \
	--chain chain[,chain2[,chain3]]	--target target.pir \
	--align alignment-file.ali [--has-align]

It has the following parameters:

  • --template template protein structures (multiple structures are separated by commas)
  • --chain selected chains from the template protein structures (multiple chains are separated by commas)
  • --target target sequence in PIR format
  • --align file to store the sequence alignment
  • --has-align OPTIONAL: set this if the alignment file already exists
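
The comma-separated options can be read with a few lines of argparse. The following Python 3 sketch is not the code of Modeller.py. It only mirrors the option names listed above, and the rule that every template needs exactly one chain is an assumption.

```python
import argparse


def parse_modeller_args(argv):
    """Parse the options listed above (hypothetical re-implementation)."""
    parser = argparse.ArgumentParser(prog="Modeller.py")
    parser.add_argument("--template", required=True,
                        help="template PDB files, separated by commas")
    parser.add_argument("--chain", required=True,
                        help="one chain per template, separated by commas")
    parser.add_argument("--target", required=True,
                        help="target sequence in PIR format")
    parser.add_argument("--align", required=True,
                        help="alignment file to write (or to read)")
    parser.add_argument("--has-align", action="store_true",
                        help="the alignment file already exists")
    args = parser.parse_args(argv)
    # Split the comma-separated lists into Python lists.
    args.template = args.template.split(",")
    args.chain = args.chain.split(",")
    # Assumption: templates and chains are paired by position.
    if len(args.template) != len(args.chain):
        parser.error("need exactly one chain per template")
    return args


if __name__ == "__main__":
    import sys
    print(parse_modeller_args(sys.argv[1:]))
```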

Convert table to wiki format

R script for formatting a table (e.g. in csv format) as a wiki table: format_to_wiki_table.r
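
For those who prefer Python over R, the conversion can be sketched as follows. This is not a port of format_to_wiki_table.r. It assumes that the first csv row holds the column headers, and the function name is made up.

```python
import csv
import io


def csv_to_wikitable(csv_text, delimiter=","):
    """Convert csv text into MediaWiki table markup.

    Assumption: the first row contains the column headers.
    """
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if not rows:
        return ""
    lines = ['{| class="wikitable"']
    lines.append("! " + " !! ".join(rows[0]))   # header cells
    for row in rows[1:]:
        lines.append("|-")                        # row separator
        lines.append("| " + " || ".join(row))    # data cells
    lines.append("|}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(csv_to_wikitable("method,hits\nblastp,120\nhhblits,95\n"))
```

The output can be pasted directly into a wiki page.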

Task 8

Calculator for residue conservation in MSA

This Python script helps you calculate the conservation of residues in an MSA in FASTA format. The first entry in the FASTA file should be the query sequence.

Usage: python msa-conservation.py <MSA.fasta> residue_pos1[,residue_pos2[,...]]
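
The conservation measure used by msa-conservation.py is not documented here. As one possible reading, the following Python 3 sketch reports, for each requested query position, the fraction of sequences in the MSA (query included) that carry the same residue as the query. The function names, the 1-based position numbering without gaps and the measure itself are assumptions.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
    return [(header, seq) for header, seq in records]


def conservation(msa, residue_positions):
    """Fraction of sequences matching the query residue at each position.

    msa: list of (header, aligned_sequence); the first entry is the query.
    residue_positions: 1-based positions in the ungapped query sequence.
    """
    query = msa[0][1]
    # Map ungapped query positions to alignment columns.
    column_of = {}
    position = 0
    for column, char in enumerate(query):
        if char != "-":
            position += 1
            column_of[position] = column
    result = {}
    for pos in residue_positions:
        column = column_of[pos]
        matches = sum(1 for _, seq in msa if seq[column] == query[column])
        result[pos] = matches / len(msa)
    return result


if __name__ == "__main__":
    msa = parse_fasta(">query\nMK-LV\n>a\nMKALV\n>b\nMR-LI\n>c\n-K-LV\n")
    print(conservation(msa, [1, 3]))
```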

Molecular visualization

To look at protein structures you can use any molecular visualization program. Here are a few options:

  • PyMOL -> installed on the i12k-biolab computers
  • Jmol, e.g. via PDB
  • VMD
  • the SRS 3D server -- unfortunately not working any more. Maybe Aquaria will become publicly available within this practical.

Changing Blast output

By default, Blast lists 500 search hits and 250 alignment details. This can be changed (see Blast manual for details):

  • You can use "-m 8" to get a tab-separated table as output (see "-help" or this hint on how to parse Blast output).
  • You can use "-b" to set the number of alignments to be shown, "-b 20000" is the maximum.
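
The "-m 8" table has twelve tab-separated columns: query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value and bit score. A minimal Python 3 sketch for reading it could look like this. The names BlastHit and parse_tabular are made up, and the handling of E-values without a leading 1 is a precaution, not documented Blast behaviour for this format.

```python
from collections import namedtuple

BlastHit = namedtuple("BlastHit", [
    "query", "subject", "identity", "length", "mismatches", "gap_opens",
    "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score"])


def parse_tabular(lines):
    """Parse Blast "-m 8" lines into BlastHit tuples; '#' lines are skipped."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        # Precaution: turn a bare exponent like "e-110" into "1e-110".
        evalue_text = ("1" + f[10]) if f[10].lower().startswith("e") else f[10]
        hits.append(BlastHit(
            f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(evalue_text), float(f[11])))
    return hits


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        for hit in parse_tabular(handle):
            print(hit.subject, hit.evalue)
```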


Modeller

Troubleshooting

A very common error from Modeller is the following: "Sequence difference between alignment and pdb". This usually means that the experimentally solved template structure in the PDB has missing residues, which can result from technical problems with the X-ray diffraction data. You therefore need to make sure that the target-template alignment uses the sequence implied by the ATOM records, not the SEQRES records. To locate the error, you could e.g. generate a FASTA sequence from the coordinates in the PDB file, align it with the FASTA sequence from the SEQRES records and check for missing residues (gaps within the alignment). If residues are missing, regenerate your target-template alignment based on the new FASTA sequence made from the coordinates.
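
Extracting both sequences from a PDB file takes only a few lines. The following Python 3 sketch relies on the fixed column layout of the PDB format. The function names are made up, non-standard residues are written as X, and only the first model of a multi-model file is read; treat it as a starting point, not as a complete PDB parser.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def atom_sequence(pdb_lines, chain_id):
    """Sequence of the residues that have coordinates (ATOM records)."""
    sequence = []
    last_residue = None
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # stop after the first model
        if not line.startswith("ATOM") or line[21:22] != chain_id:
            continue
        residue_key = (line[22:26], line[26:27])  # number + insertion code
        if residue_key != last_residue:
            last_residue = residue_key
            sequence.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(sequence)


def seqres_sequence(pdb_lines, chain_id):
    """Sequence of the crystallized construct (SEQRES records)."""
    sequence = []
    for line in pdb_lines:
        if line.startswith("SEQRES") and line[11:12] == chain_id:
            sequence.extend(THREE_TO_ONE.get(name, "X")
                            for name in line[19:].split())
    return "".join(sequence)


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        lines = handle.readlines()
    print(">atom\n" + atom_sequence(lines, sys.argv[2]))
    print(">seqres\n" + seqres_sequence(lines, sys.argv[2]))
```

Aligning the two printed sequences shows the missing residues as gaps in the ATOM sequence.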

SNAP

There is a very brief explanation about SNAP available here --> Media:SNAP.pdf.

Energy Minimization

There is a script to automatically run energy minimizations with Gromacs here --> MutEn.pl.


R

Error in hist.default(a$V2, main = "evals") : 'x' must be numeric

Blast uses non-standard scientific notation and omits the leading 1 for E-values like 'e-190'. Change it to '1e-190' and R will stop complaining.
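
If you prefer to repair the E-values before loading the table into R, the fix can be scripted. Here is a minimal Python sketch; the function name normalize_evalue is made up for illustration.

```python
import re

# A bare exponent such as "e-190" or "E-05", without a leading mantissa.
_BARE_EXPONENT = re.compile(r"^[eE][+-]?\d+$")


def normalize_evalue(text):
    """Return a Blast E-value string as a float, adding a missing leading 1."""
    text = text.strip()
    if _BARE_EXPONENT.match(text):
        text = "1" + text
    return float(text)


if __name__ == "__main__":
    for raw in ["e-190", "3e-05", "0.002"]:
        print(raw, "->", normalize_evalue(raw))
```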