Difference between revisions of "Parse output.pl"
Kalemanovm (talk | contribs) |
Kalemanovm (talk | contribs) |
||
Line 1: | Line 1: | ||
You can find the script <code>parse_output.pl</code> here on biocluster: <code>/mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1</code>. The usage of the script till now is '''parsing of (Psi-)BLAST and HHblits hhr output files''': |
You can find the script <code>parse_output.pl</code> here on biocluster: <code>/mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1</code>. The usage of the script till now is '''parsing of (Psi-)BLAST and HHblits hhr output files''': |
||
− | Usage: perl parse_output.pl --out_h <hhblits hhr file> [--out_p <(psi-)blast output file>] |
+ | Usage: perl parse_output.pl --out_h <hhblits hhr file> [--out_p <(psi-)blast output file>] |
− | Optional parameters: |
+ | Optional parameters: |
− | --pdb70, if HHblits was done against pdb70 |
+ | --pdb70, if HHblits was done against pdb70 |
− | --pdb_full, if HHblits was done against pdb_full |
+ | --pdb_full, if HHblits was done against pdb_full |
'''Note:''' |
'''Note:''' |
||
Line 10: | Line 10: | ||
* For (Psi-)BLAST, outputs of searches in '''big''' and '''big80''' can be parsed (no flags required). |
* For (Psi-)BLAST, outputs of searches in '''big''' and '''big80''' can be parsed (no flags required). |
||
* As Psi-BLAST output file contains results for each iteration, it must be divided according to iterations beforehand, which can easily be done with the script '''<code>devide_psiblast_out.pl</code>''' (found in the same directory). For example, if the original output file <code>psiblast-big-2iter</code> has results from two iterations, it will be split into two files <code>psiblast-big-2iter_1</code> (first iteration) and <code>psiblast-big-2iter_2</code> (second iteration). The files will be written into the same directory. Then you can parse the iteration you need with <code>parse_output.pl</code>. <br> |
* As Psi-BLAST output file contains results for each iteration, it must be divided according to iterations beforehand, which can easily be done with the script '''<code>devide_psiblast_out.pl</code>''' (found in the same directory). For example, if the original output file <code>psiblast-big-2iter</code> has results from two iterations, it will be split into two files <code>psiblast-big-2iter_1</code> (first iteration) and <code>psiblast-big-2iter_2</code> (second iteration). The files will be written into the same directory. Then you can parse the iteration you need with <code>parse_output.pl</code>. <br> |
||
− | Usage: perl devide_psiblast_out.pl <psiblast output file |
+ | Usage: perl devide_psiblast_out.pl <psiblast output file> |
The output of <code>parse_output.pl</code> is a tab-separated file with the columns: |
The output of <code>parse_output.pl</code> is a tab-separated file with the columns: |
||
Line 24: | Line 24: | ||
'''Note:''' The script "filters the duplicates": if more than one HSP with the same ID is found in one output file, only the HSP with the lowest E-value is taken (for both the calculations and the output). <br> |
'''Note:''' The script "filters the duplicates": if more than one HSP with the same ID is found in one output file, only the HSP with the lowest E-value is taken (for both the calculations and the output). <br> |
||
− | + | For '''evaluation of PDB hits against COPS''' and creation of files for plotting these additional parameters should be given: |
|
+ | Mandatory: |
||
+ | --query <query PDB chain> |
||
+ | --sot <standard of truth COPs group, e.g. L30> |
||
+ | Optional: |
||
+ | --e <evalue cutoff for inclusion in the evaluation> |
||
+ | |||
+ | Output: |
||
+ | *stdout: |
||
+ | Number of hits (L30, L40, L60) |
||
+ | True positives (= TP; same ".$sot.") |
||
+ | False positives (= FP; different ".$sot.") |
||
+ | Predicted positives (= TP+FP) |
||
+ | Positives (= TP+FN) |
||
+ | precision TP/(TP+FP) |
||
+ | sensitivity(TPR) = TP/(TP+FN) |
||
+ | *files: |
||
+ | _L60, _L40, _L30, _NoL30 (#query_id hit_id evalue identity length) |
||
+ | _LXX_TP_FP (#evalue TP/FP query_id hit_id) |
||
+ | _LXX_positives (#query_id positives predicted_positives) |
Revision as of 16:20, 5 May 2013
You can find the script parse_output.pl on biocluster in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1. So far, the script is used for parsing (Psi-)BLAST and HHblits hhr output files:
Usage: perl parse_output.pl --out_h <hhblits hhr file> [--out_p <(psi-)blast output file>]
Optional parameters:
    --pdb70, if HHblits was done against pdb70
    --pdb_full, if HHblits was done against pdb_full
Note:
- The flag --pdb_full must be given if the HHblits run was performed against the pdb_full database, and --pdb70 must be given if the clustered pdb70 database was used. If uniprot20 was used, no extra flag is needed. The flags are required because the databases use different header formats for the cluster master sequences, where the IDs of the cluster members are listed (and pdb70 requires an extra mapping).
- For (Psi-)BLAST, outputs of searches in big and big80 can be parsed (no flags required).
- Because a Psi-BLAST output file contains the results of every iteration, it must first be divided by iteration, which can easily be done with the script devide_psiblast_out.pl (found in the same directory). For example, if the original output file psiblast-big-2iter has results from two iterations, it will be split into two files, psiblast-big-2iter_1 (first iteration) and psiblast-big-2iter_2 (second iteration). The files are written into the same directory. You can then parse the iteration you need with parse_output.pl.
Usage: perl devide_psiblast_out.pl <psiblast output file>
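The splitting step can be illustrated with a short Python sketch. This is not the code of devide_psiblast_out.pl; it only shows the idea under two assumptions: that each iteration in the Psi-BLAST output begins with a line starting with "Results from round N", and that everything before the first such line can be dropped. The function names split_by_iteration and write_iterations are hypothetical.

```python
import re
from pathlib import Path

# Assumption: each Psi-BLAST iteration starts with a line such as
# "Results from round 2". Adjust the pattern if your output differs.
ROUND_HEADER = re.compile(r"^Results from round (\d+)")


def split_by_iteration(lines):
    """Group lines by iteration number; lines before the first header are dropped."""
    chunks = {}
    current = None
    for line in lines:
        match = ROUND_HEADER.match(line)
        if match:
            current = int(match.group(1))
            chunks[current] = []
        if current is not None:
            chunks[current].append(line)
    return chunks


def write_iterations(path):
    """Write one file per iteration next to the input, named <input>_<iteration>."""
    path = Path(path)
    with path.open() as handle:
        chunks = split_by_iteration(handle)
    written = []
    for number, chunk in chunks.items():
        target = path.with_name(f"{path.name}_{number}")
        target.write_text("".join(chunk))
        written.append(target)
    return written
```

Calling write_iterations("psiblast-big-2iter") on a file with two iterations would then produce psiblast-big-2iter_1 and psiblast-big-2iter_2 in the same directory, matching the naming described above.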
The output of parse_output.pl is a tab-separated file with the columns:
- id
- evalue
- identity
- similarity
- length
- score
- probability (only for HHblits)
The number of hits found is printed to stdout. If both HHblits and (Psi-)BLAST files are given, the overlap of hits with the same ID is calculated.
Note: The script filters duplicates: if more than one HSP with the same ID is found in one output file, only the HSP with the lowest E-value is kept (for both the calculations and the output).
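The duplicate filter and the overlap calculation can be sketched in Python as follows. This is an illustration of the logic described above, not the Perl implementation; the function names and the example IDs and E-values are made up.

```python
def best_hsp_per_id(hsps):
    """Keep only the lowest E-value for each hit ID.

    hsps is an iterable of (hit_id, evalue) pairs, as they might be read
    from one output file (hypothetical input shape).
    """
    best = {}
    for hit_id, evalue in hsps:
        if hit_id not in best or evalue < best[hit_id]:
            best[hit_id] = evalue
    return best


def overlap(hits_a, hits_b):
    """Return the IDs that occur in both hit collections."""
    return set(hits_a) & set(hits_b)


# Made-up example data: P1 appears twice in the HHblits hits.
hhblits = best_hsp_per_id([("P1", 1e-5), ("P1", 1e-20), ("P2", 0.1)])
blast = best_hsp_per_id([("P2", 3.0), ("P3", 1e-3)])
shared = overlap(hhblits, blast)
print(len(hhblits), len(blast), len(shared))
```

Filtering before computing the overlap means each ID is counted once per file, so the reported hit counts and the overlap refer to unique IDs.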
For the evaluation of PDB hits against COPS and the creation of files for plotting, these additional parameters should be given:
Mandatory:
    --query <query PDB chain>
    --sot <standard of truth COPS group, e.g. L30>
Optional:
    --e <evalue cutoff for inclusion in the evaluation>
Output:
- stdout:
    Number of hits (L30, L40, L60)
    True positives (= TP; same <sot> group, where <sot> is the value given with --sot)
    False positives (= FP; different <sot> group)
    Predicted positives (= TP+FP)
    Positives (= TP+FN)
    Precision = TP/(TP+FP)
    Sensitivity (TPR) = TP/(TP+FN)
- files:
    _L60, _L40, _L30, _NoL30 (#query_id hit_id evalue identity length)
    _LXX_TP_FP (#evalue TP/FP query_id hit_id)
    _LXX_positives (#query_id positives predicted_positives)
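The evaluation numbers listed under stdout can be reproduced with a small Python sketch. It is not the script's code and makes several assumptions that the page does not state: the group assignment is available as a mapping from chain ID to group label (group_of, hypothetical), the query itself is excluded from both hits and positives, and hits without a group assignment count as false positives.

```python
def evaluate_hits(query, hits, group_of):
    """Compute TP, FP, FN, precision and sensitivity for one query.

    query    -- ID of the query chain
    hits     -- iterable of hit IDs that passed the E-value cutoff
    group_of -- dict mapping chain ID to its standard-of-truth group label
    """
    query_group = group_of[query]
    # Positives: all other members of the query's group (assumption: query excluded).
    positives = {cid for cid, grp in group_of.items()
                 if grp == query_group and cid != query}
    predicted = set(hits) - {query}
    tp = len(predicted & positives)
    fp = len(predicted - positives)  # includes hits with no group assignment
    fn = len(positives - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "precision": precision, "sensitivity": sensitivity}


# Made-up example: three chains share group g1, one chain is in g2.
groups = {"1abcA": "g1", "2defA": "g1", "3ghiA": "g1", "4jklA": "g2"}
result = evaluate_hits("1abcA", ["2defA", "4jklA"], groups)
print(result)
```

In this example one of the two group members is found and one hit belongs to a different group, so TP, FP and FN are each 1.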