Difference between revisions of "Parse output.pl"
Kalemanovm (talk | contribs) |
Kalemanovm (talk | contribs) (→Usage of the script for parsing of (Psi-)BLAST and HHblits hhr output files) |
||
(One intermediate revision by the same user not shown) | |||
Line 1: | Line 1: | ||
− | You can find the script <code>parse_output.pl</code> here on biocluster: <code>/mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1</code> or just copy the code from the bottom part of this page. |
+ | You can find the script <code>parse_output.pl</code> here on biocluster: <code>/mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1</code> or just copy the code from the bottom part of this page. |
− | + | ==Usage of the script for parsing of (Psi-)BLAST and HHblits hhr output files== |
|
+ | |||
+ | perl parse_output.pl --out_h <hhblits hhr file> [--out_p <(psi-)blast output file>] |
||
Optional parameters: |
Optional parameters: |
||
--pdb70, if HHblits was done against pdb70 |
--pdb70, if HHblits was done against pdb70 |
||
Line 9: | Line 11: | ||
* The flag '''<code>--pdb_full</code>''' must be given if HHblits run was performed against the '''pdb_full''' database and '''<code>--pdb70</code>''' must be given if the clustered '''pdb70''' database was used. If '''uniprot20''' was used, no extra flag has to be given. It is because the databases have different formats of headers of the cluster master sequences, where the IDs of cluster members are listed (and for pdb70 an extra mapping must be used). |
* The flag '''<code>--pdb_full</code>''' must be given if HHblits run was performed against the '''pdb_full''' database and '''<code>--pdb70</code>''' must be given if the clustered '''pdb70''' database was used. If '''uniprot20''' was used, no extra flag has to be given. It is because the databases have different formats of headers of the cluster master sequences, where the IDs of cluster members are listed (and for pdb70 an extra mapping must be used). |
||
* For (Psi-)BLAST, outputs of searches in '''big''' and '''big80''' can be parsed (no flags required). |
* For (Psi-)BLAST, outputs of searches in '''big''' and '''big80''' can be parsed (no flags required). |
||
− | * As Psi-BLAST output file contains results for each iteration, it must be devided according to iterations before, which can easily be done with the script |
+ | * As Psi-BLAST output file contains results for each iteration, it must be devided according to iterations before, which can easily be done with the script [[devide_psiblast_out.pl]]. For example, if the original output file <code>psiblast-big-2iter</code> has results from two iterations, if will be split into two files <code>psiblast-big-2iter_1</code> (first iteration) and <code>psiblast-big-2iter_2</code> (second iterations). The files will be written into the same directory. Then you can be parse the iteration you need with <code>parse_output.pl</code>. <br> |
Usage: perl devide_psiblast_out.pl <psiblast output file> |
Usage: perl devide_psiblast_out.pl <psiblast output file> |
||
+ | ===Output=== |
||
The output of <code>parse_output.pl</code> is a tab-separated file with the columns: |
The output of <code>parse_output.pl</code> is a tab-separated file with the columns: |
||
*id |
*id |
||
Line 24: | Line 27: | ||
'''Note:''' The script "filters the duplicates": if more than one HSPs with the same ID are found in one output file, only one HSP with the lowest E-value is taken (for both the calculations and the output). <br> |
'''Note:''' The script "filters the duplicates": if more than one HSPs with the same ID are found in one output file, only one HSP with the lowest E-value is taken (for both the calculations and the output). <br> |
||
− | + | ==Usage for evaluation of PDB hits against COPS== |
|
+ | For evaluation of PDB hits against COPS and creation of files for plotting these additional parameters should be given: |
||
Mandatory: |
Mandatory: |
||
--query <query (equivalent) PDB chain> for validation - results must contain PDB hits! |
--query <query (equivalent) PDB chain> for validation - results must contain PDB hits! |
||
Line 31: | Line 35: | ||
--e <evalue cutoff for inclusion in the evaluation> |
--e <evalue cutoff for inclusion in the evaluation> |
||
− | Output |
+ | ===Output=== |
+ | The output after using the evaluation options is: |
||
*stdout: |
*stdout: |
||
Number of hits (L30, L40, L60) |
Number of hits (L30, L40, L60) |
||
Line 46: | Line 51: | ||
− | Code |
+ | ==Code== |
<source lang="perl"> |
<source lang="perl"> |
||
#!/usr/bin/perl -w |
#!/usr/bin/perl -w |
Latest revision as of 12:10, 21 May 2013
You can find the script parse_output.pl
here on biocluster: /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1
or just copy the code from the bottom part of this page.
Contents
Usage of the script for parsing of (Psi-)BLAST and HHblits hhr output files
perl parse_output.pl --out_h <hhblits hhr file> [--out_p <(psi-)blast output file>] Optional parameters: --pdb70, if HHblits was done against pdb70 --pdb_full, if HHblits was done against pdb_full
Note:
- The flag
--pdb_full
must be given if HHblits run was performed against the pdb_full database and--pdb70
must be given if the clustered pdb70 database was used. If uniprot20 was used, no extra flag has to be given. It is because the databases have different formats of headers of the cluster master sequences, where the IDs of cluster members are listed (and for pdb70 an extra mapping must be used). - For (Psi-)BLAST, outputs of searches in big and big80 can be parsed (no flags required).
- As Psi-BLAST output file contains results for each iteration, it must be devided according to iterations before, which can easily be done with the script devide_psiblast_out.pl. For example, if the original output file
psiblast-big-2iter
has results from two iterations, if will be split into two filespsiblast-big-2iter_1
(first iteration) andpsiblast-big-2iter_2
(second iterations). The files will be written into the same directory. Then you can be parse the iteration you need withparse_output.pl
.
Usage: perl devide_psiblast_out.pl <psiblast output file>
Output
The output of parse_output.pl
is a tab-separated file with the columns:
- id
- evalue
- identity
- similarity
- length
- score
- probabilty (only for HHblits)
The number of found hits is outputted onto stdout. If both HHblits and (Psi-)BLAST files are given, the overlap of hits with the same ID is calculated.
Note: The script "filters the duplicates": if more than one HSPs with the same ID are found in one output file, only one HSP with the lowest E-value is taken (for both the calculations and the output).
Usage for evaluation of PDB hits against COPS
For evaluation of PDB hits against COPS and creation of files for plotting these additional parameters should be given:
Mandatory: --query <query (equivalent) PDB chain> for validation - results must contain PDB hits! --sot <standard of truth COPs group: L30, L40 or L60> Optional: --e <evalue cutoff for inclusion in the evaluation>
Output
The output after using the evaluation options is:
- stdout:
Number of hits (L30, L40, L60) True positives (= TP; same ".$sot.") False positives (= FP; different ".$sot.") Predicted positives (= TP+FP) Positives (= TP+FN) precision TP/(TP+FP) sensitivity(TPR) = TP/(TP+FN)
- files:
_L60, _L40, _L30, _NoL30 (#query_id hit_id evalue identity length) _LXX_TP_FP (#evalue TP/FP query_id hit_id) _LXX_positives (#query_id positives predicted_positives)
Code
<source lang="perl">
- !/usr/bin/perl -w
- parse_output.pl
- Parses the standard output files of ; BLAST, PSI-BLAST and HHblits (hhr).
- Generates optionally diverse evaluation files, using a given COPS structural group (e.g. L30) as gold standard. Uses the COPS mapping COPS-ChainHierarchy.txt
use strict; use Bio::SearchIO; use Getopt::Long; use feature qw/switch/;
- Parses the command line parameters
our($out_p, $out_h, $pdb70, $pdb_full, $query, $sot, $e);
my $args_ok = GetOptions( 'out_p=s' => \$out_p, #path of the output file of Psi-BLAST (or BLAST) to be parsed 'out_h=s' => \$out_h, #path of the output file of HHblits (or HHsearch) to be parsed 'pdb70' => \$pdb70, #if HHblits was done against pdb70 'pdb_full' => \$pdb_full, #if HHblits was done against pdb_full 'query=s' => \$query, #query (equivalent) PDB chain for validation 'sot=s' => \$sot, #standard of truth COPS group: L30, L40 or L60 'e=s' => \$e, #evalue cutoff for inclusion in the evaluation files and calculations );
- Example: perl ~/scripts/parse_output.pl --out_h 1tim_A_hhblits_2iter --pdb70 --query 1tim_A --sot L30
- COPS-ChainHierarchy.txt file used for validation of PDB hits:
my $cops = "/mnt/project/pracstrucfunc13/data/COPS/COPS-ChainHierarchy.txt";
- -----------------------------------------------------------------------------------------------------------------------------------
=head 0 Subroutine print_err Prints error message. output: stdout =cut sub print_err{
print "Failed to execute parse_output.pl\n" .
"Usage: perl parse_output.pl --out_h <output hhblits file> [--out_p <output psiblast file>]\n". "Optional parameters:\n" . "--pdb70, if HHblits was done against pdb70\n" . "--pdb_full, if HHblits was done against pdb_full\n" . "--query <query (equivalent) PDB chain> for validation - results must contain PDB hits!\n" . "--sot <standard of truth COPS group: L30, L40 or L60>\n" . "--e <evalue cutoff for inclusion in the evaluation>\n";
exit(0);
}
- -----------------------------------------------------------------------------------------------------------------------------------
if (!$out_p && !$out_h){ print "At least one of the parameters --out_p and --out_h must be given.\n"; print_err; exit(0); }
- -----------------------------------------------------------------------------------------------------------------------------------
=head 1 Subroutine parse_psiblast Parses the Psi-BLAST output file and creates data file for plotting the distribution of evalue, identity, similarity, length and score. input: ($out_p) output: \@hsps_p: {id} -> (evalue, identity, similarity, length, score) file "$out_p."_results" in format: "#id\tevalue\tidentity\tsimilarity\tlength\tscore\n" =cut sub parse_psiblast{
my ($out_p) = @_; my %hsps_p; #format: {id} -> (evalue, identity, similarity, length, score)
print "*** Parsing Psi-BLAST output file $out_p and creating file for plotting the distribution of evalue, identity, similarity, length and score ***\n"; open (WRITE, ">" . $out_p . "_results") or die "could not open $out_p" . "_results for writing"; print WRITE "#id\tevalue\tidentity\tsimilarity\tlength\tscore\n"; #header #print "id\tevalue\tidentity\tsimilarity\tlength\tscore\n"; #header
my $blast_in = new Bio::SearchIO(-format => 'blast',
-file => $out_p);
while (my $result = $blast_in->next_result) { #print "Query=", $result->query_name, " | ", $result->query_description, "; length=", $result->query_length,"\n"; my $id;
my $short_id;
# hits: while (my $hit = $result->next_hit) { $id = $hit->name; if($id =~ /[sp|tr]\|(.*)\|/) { #e.g. tr|E7ENQ1|E7ENQ1_HUMAN, but we want only trembl-id: E7ENQ1, same for sp|... (swissprot)
if ($query) { #if --query given => for validation => consider only PDB hits for (Psi-)BLAST next; }else{ $short_id = $1; }
}elsif ($id =~ /pdb\|pdb\|(.*)/) { #e.g. pdb|pdb|2wkl_B, but we want only PDB ID with chain: 2wkl_B
$short_id = $1; }
# HSPs: while (my $hsp = $hit->next_hsp) { my $evalue = $hsp->evalue; my $identity = $hsp->frac_identical; my $similarity = $hsp->frac_conserved; my $length = $hsp->length('total'); my $score = $hsp->bits; my @hsp_p_values = ($evalue, $identity, $similarity, $length, $score); if (exists $hsps_p{$short_id}) { # if a HSP with the same ID exits - take that with the lower E-value if ($evalue < ${$hsps_p{$short_id}}[0]) { $hsps_p{$short_id} = \@hsp_p_values;
#print "$id\t$evalue\t$identity\t$similarity\t$length\t$score\n"; print WRITE "$id\t$evalue\t$identity\t$similarity\t$length\t$score\n";
} } else { $hsps_p{$short_id} = \@hsp_p_values;
#print "$id\t$evalue\t$identity\t$similarity\t$length\t$score\n"; print WRITE "$id\t$evalue\t$identity\t$similarity\t$length\t$score\n";
}
}
} } close WRITE; return \%hsps_p;
}
- -----------------------------------------------------------------------------------------------------------------------------------
- Parsing Psi-BLAST with the subroutine parse_psiblast and saving the returned hash:
my %hsps_p; #format: {id} -> (evalue, identity, similarity, length, score) if ($out_p) {
%hsps_p = %{parse_psiblast($out_p)};
}
- -----------------------------------------------------------------------------------------------------------------------------------
=head 2 Subroutine parse_hhblits (uses the subs find_clusters (6) and pdb_clusters (8)). Parses the HHblits output file and creates data file for plotting the distribution of evalue, identity, similarity, length, score and probability. input: $out_h output: \@hsps_h: {id} -> (evalue, identity, similarity, length, score, probability) file "$out_h."_results" in format: "#id\tevalue\tidentity\tsimilarity\tlength\tscore\tprobabilty\n =cut sub parse_hhblits{
my ($out_h, $cops) = @_; my %hsps_h; #format: {id} ->(evalue, identity, similarity, length, score, probability) print "*** Parsing HHblits output file $out_h and creating file for plotting the distribution of evalue, identity, similarity, length, score and probability ***\n"; open (WRITE, ">" . $out_h . "_results") or die "could not open $out_h" . "_results for writing"; print WRITE "#id\tevalue\tidentity\tsimilarity\tlength\tscore\tprobability\n"; #header #print "id\tevalue\tidentity\tsimilarity\tlength\tscore\tprobability\n"; #header
open (READ, "$out_h") or die "could not open $out_h"; my $align_lines = 0; my $cluster_id = ""; my $id = ""; my @id_list; my $evalue; my $identity; my $similarity; my $length; my $score; my $probability; for my $line (<READ>) { if ($line =~ /No 1 /) { $align_lines = 1; } if ($align_lines && $line =~ /^>(\S+)\s+(.+)$/) { #parsing 1st line of alignments $cluster_id = $1; my $sp = $2; #specifier containing all other ids belonging to the cluster @id_list = ();
if ($pdb_full){ #include all equivalent PDB IDs in the header line (after "PDB:") and the master ID itself push(@id_list, $cluster_id); if ($sp =~ /PDB: (.+)$/){ $sp = $1; while ($sp =~ /(\w{4}_\w)/g) { my $pdb_id = $1; push(@id_list, $pdb_id); } }
}elsif($pdb70) { #include all PDB IDs from pdb70 cluster my $pdb70_hash_ref = pdb_clusters(); my %pdb70 = %{$pdb70_hash_ref}; if (defined $pdb70{$cluster_id}){ @id_list = @{$pdb70{$cluster_id}}; }else{ # cluster_id is not found in pdb70clusters.txt! # Executing of the subroutine find_clusters for cluster_id: my @LXXgroups = find_clusters($cops, $cluster_id); my $L30_cluster_hit = $LXXgroups[0]; if ($L30_cluster_hit eq "") { # cluster_id is also not found in COPs file -> don't count it print $cluster_id." not found in pdb70clusters.txt and in COPs file -> doesn't count.\n"; } else { # cluster_id exists in COPs push(@id_list, $cluster_id); # if cluster_id does not exist in the cluster file but exists in COPs file -> use only the cluster hit print $cluster_id." not found in pdb70clusters.txt.\n"; } }
} else { # include all IDs of the uniprot20 cluster in the header line while ($sp =~ /\|(\w{6})/g) { my $uniprot_id = $1; push(@id_list, $uniprot_id); } } }
if ($align_lines && $line =~ /^Probab/) { #parsing 2nd line of alignment
my @p = split(/\s+/, $line);
$probability = $p[0];
$evalue = $p[1]; $score = $p[2]; $length = $p[3];
$identity = $p[4];
$similarity = $p[5];
$evalue =~ /=(.+)/; $evalue = $1; $identity =~ /=(.+)\%/; $identity = $1;
$identity = ($identity/100.0);
$similarity =~ /=(.+)/; $similarity = $1; $length =~ /=(.+)/; $length = $1; $score =~ /=(.+)/; $score = $1;
$probability =~ /=(.+)/; $probability = $1;
my @id_list_loc = @id_list; # list of all the cluster ids
my @hsp_h_values = ($evalue, $identity, $similarity, $length, $score, $probability); # the values are same for all the cluster hits
for my $id (@id_list_loc){ # save all the cluster hits in the hash if (exists $hsps_h{$id}) { if ($evalue < ${$hsps_h{$id}}[0]) { # save only the HSP with the lowest evalue $hsps_h{$id} = \@hsp_h_values; #print "$id\t$evalue\t$identity\t$similarity\t$length\t$score\t$probability\n"; print WRITE "$id\t$evalue\t$identity\t$similarity\t$length\t$score\t$probability\n"; } } else { $hsps_h{$id} = \@hsp_h_values; #print "$id\t$evalue\t$identity\t$similarity\t$length\t$score\t$probability\n"; print WRITE "$id\t$evalue\t$identity\t$similarity\t$length\t$score\t$probability\n"; } } } }
close READ;
close WRITE;
return (\%hsps_h);
}
- -----------------------------------------------------------------------------------------------------------------------------------
- Parsing HHblits with the subroutine parse_hhblits and saving the returned hash:
my %hsps_h; #format: {id} -> (evalue, identity, similarity, length, score, probability) if ($out_h) {
%hsps_h = %{parse_hhblits($out_h, $cops)};
}
- -----------------------------------------------------------------------------------------------------------------------------------
=head 3 Subroutine number_hits_psiblast Gives number of hits in Psi-BLAST output (numer hits because only one HSPs per hit parsed - one with the smallest E-value). input: ($hsps_p_ref) output: stdout =cut sub number_hits_psiblast{
my ($hsps_p_ref) = @_; my %hsps_p = %{$hsps_p_ref}; print "Psi-BLAST $out_p:\n"; print "Num hits= " . scalar(keys %hsps_p) . "\n\n";
}
- -----------------------------------------------------------------------------------------------------------------------------------
=head 4 Subroutine number_hits_hhblits Gives number of hits in HHblits output (number hits because only one HSPs per hit parsed - one with the smallest E-value). input: ($hsps_h_ref) output: stdout =cut sub number_hits_hhblits{
my ($hsps_h_ref) = @_; my %hsps_h = %{$hsps_h_ref}; print "HHblits $out_h:\n"; print "Num hits= " . scalar(keys %hsps_h) . "\n\n";
}
- -----------------------------------------------------------------------------------------------------------------------------------
=head 5 Subroutine find_hit_overlap Gives number of shared hits in (Psi-)BLAST and HHblits output. input: ($hsps_p_ref, $hsps_h_ref) output: stdout =cut sub find_hit_overlap{
my ($hsps_p_ref, $hsps_h_ref) = @_; my %hsps_p = %{$hsps_p_ref}; my %hsps_h = %{$hsps_h_ref}; my $hit_overlap = 0; foreach my $key_p (keys(%hsps_p)) { if (defined $hsps_h{$key_p}) { $hit_overlap++; } } print "Hit overlap between $out_p and $out_h: " . $hit_overlap . "\n\n";
}
- -----------------------------------------------------------------------------------------------------------------------------------
- Comparing number of hits in Psi-BLAST & HHblits output using Subroutines 3-5:
print "*** Comparing number of hits ***\n";
if ($out_p){
number_hits_psiblast(\%hsps_p);
}
if ($out_h){
number_hits_hhblits(\%hsps_h);
}
if ($out_p && $out_h){
find_hit_overlap(\%hsps_p, \%hsps_h);
}
- -----------------------------------------------------------------------------------------------------------------------------------
=head 6 Subroutine find_clusters Reads the COPS-ChainHierarchy.txt file and finds L30, L40 and L60 homologue-chain clusters for a given query chain. Example for a query chain: "1ea4_K". input: ($cops, $query) output: ($L30, $L40, $L60) =cut sub find_clusters { my ($cops, $query) = @_; # COPS-ChainHierarchy.txt format: Chain L99 L80 L60 L40 L30 S90 S30 # As in the COPS-ChainHierarchy.txt the chains are written like "1ea4,K", substitute "_" with ",": $query =~ tr/_/,/; my @params;
my $L60 = ""; my $L40 = ""; my $L30 = "";
open (READ, "$cops") or die "could not open $cops"; my $found = 0; for my $line (<READ>) { if ($line =~ /^($query)/) { # line where the chain matches with the query $found = 1; @params = split(/\s+/, $line); $L60 = $params[3]; $L40 = $params[4]; $L30 = $params[5]; # translate every chain back: "_" instead of ",": $query =~ tr/,/_/; $L60 =~ tr/,/_/; $L40 =~ tr/,/_/; $L30 =~ tr/,/_/; last; } } if ($found eq 0) { print $query." not found in ".$cops.".\n"; }else{ print "L30 of query: ".$L30."\n". "L40 of query: ".$L40."\n". "L60 of query: ".$L60."\n"; } my @LXXgroups = ($L30, $L40, $L60); return @LXXgroups; }
- -----------------------------------------------------------------------------------------------------------------------------------
=head 7 Subroutine cops_clusters_hashing Reads the COPS-ChainHierarchy.txt and creates LXX (L30, l40 and L60) hashes with all chains with the same LXX group as the query Input: COPS-file, query Output: \%L30, \%L40, \%L60: keys=all members of the same LXX group as the query =cut sub cops_clusters_hashing{ my ($cops, $query) = @_; # Executing of the subroutine find_clusters for the query: my @LXXgroups = find_clusters($cops, $query); my $L30_query = $LXXgroups[0]; if ($L30_query eq ""){ # query not in COPs file - exclude from evaluation print "Query $query not found in $cops and excluded from the evaluation.\n"; exit 0; } my $L40_query = $LXXgroups[1]; my $L60_query = $LXXgroups[2];
my @params = (); my $chain_curr; my $L30_curr; my $L40_curr; my $L60_curr; my %L30 = (); my %L40 = (); my %L60 = ();
open (READ, "$cops") or die "could not open $cops"; for my $line (<READ>) { chomp($line); if ($line !~ /^Chain/) { # not the header line @params = split(/\s+/, $line); $chain_curr = $params[0]; $L60_curr = $params[3]; $L40_curr = $params[4]; $L30_curr = $params[5]; # translate every chain: "_" instead of ",": $chain_curr =~ tr/,/_/; $L60_curr =~ tr/,/_/; $L40_curr =~ tr/,/_/; $L30_curr =~ tr/,/_/;
if($L30_curr eq $L30_query) { $L30{$chain_curr} = "1"; } if($L40_curr eq $L40_query) { $L40{$chain_curr} = "1"; } if($L60_curr eq $L60_query) { $L60{$chain_curr} = "1"; } } } my @LXXmembers = (\%L30, \%L40, \%L60); return @LXXmembers; }
- ----------------------------------------------------------------------------------------------------------------------------------
=head 8 Subroutine pdb_clusters Reads the file pdb70clusters.txt (created by Benjamin Wellman) and creates a hash of chains and all the cluster members. Input: - Output: \%all_cluster_chains{chain}=\@cluster_members -> chain id + list of all cluster members ids =cut sub pdb_clusters { my $pdb70clusters_file = "/mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1/pdb70clusters.txt"; my %all_cluster_chains = ();
open (READ, "$pdb70clusters_file") or die "could not open $pdb70clusters_file"; for my $line (<READ>) { chomp($line); if ($line !~ /^#/) { # not the header line my @cluster_members = split(/\s+/, $line); # translate every chain in @cluster_members: "1ea4_K" instead of "1ea4K": for my $chain(@cluster_members){ if ($chain =~ /^(\w{4})(\w)$/){ $chain = $1."_".$2; } } ## added (a BUG in BT, because then not all members with "_"!?) # add each chain in @cluster_members to the hash with \@cluster_members as a value: for my $chain(@cluster_members){ ## added if(defined $all_cluster_chains{$chain}) { print "$chain appeared already in pdb70clusters.txt! \n"; } else { $all_cluster_chains{$chain} = \@cluster_members; } } } } return \%all_cluster_chains; }
- ----------------------------------------------------------------------------------------------------------------------------------
=head 9 Subroutine categorize (uses the subs find_clusters (6) and cops_clusters_hashing (7)). - Checks for each pdb-hit if it is in the same L60, L40 or L30 (as given in --sot) homologue-chain cluster as the query or not. If yes, the hit is TP and number TPs ($TPs) is incremented, if no, the hit is FP and number FPs ($FPs) is incremented. - The E-value, sequence identity and length of the hit-query alignment are written in the responsible file of the four: _L60, _L40, _L30, _NoL30 in the following tab separated format: #query_id hit_id evalue identity length - The file _LXX_TP_FP in the following tab separated format is written: #evalue TP/FP query_id hit_id - Depending on the standard of truth(e.g. same L30 COPS groups) the number of positive hits that should have been found ($Ps) is calculated and written to a file: _LXX_positives in following tab separated format: #query_id positives predicted_positives
input: ($hsps_ref, $query, $out, $cops, $sot, $e), with $out= path for output files %hsps format: {id} -> (evalue, identity, similarity, length, score, [probability]) output: - stdout: Number of hits (L30, L40, L60) True positives (= TP; same ".$sot.") False positives (= FP; different ".$sot.") Predicted positives (= TP+FP) Positives (= TP+FN) precision TP/(TP+FP) sensitivity(TPR) = TP/(TP+FN) - files: _L60, _L40, _L30, _NoL30 (#query_id hit_id evalue identity length) _LXX_TP_FP (#evalue TP/FP query_id hit_id) _LXX_positives (#query_id positives predicted_positives) =cut sub categorize { (my $hsps_ref, my $query, my $out, my $cops, my $sot, my $e) = @_; my %hsps = %{$hsps_ref};
# hash all chains that belong to the same LXX groups as the query: my @LXXmembers = cops_clusters_hashing($cops, $query); my %L30_members = %{$LXXmembers[0]}; my %L40_members = %{$LXXmembers[1]}; my %L60_members = %{$LXXmembers[2]}; print "Number of chains with the same COPS structural groups as the query:\n" . "L30: " . scalar(keys %L30_members) . "\n" . "L40: " . scalar(keys %L40_members) . "\n" . "L60: " . scalar(keys %L60_members) . "\n";
if ($e){ #include evalue cutoff for evaluation into the file names: print "Only hits with E-values under the cutoff $e are considered\n"; open (L60, ">".$out."_".$e."_L60") or die "could not open ".$out."_L60 for writing"; open (L40, ">".$out."_".$e."_L40") or die "could not open ".$out."_L40 for writing"; open (L30, ">".$out."_".$e."_L30") or die "could not open ".$out."_L30 for writing"; open (NoL30, ">".$out."_".$e."_NoL30") or die "could not open ".$out."_NoL30 for writing"; open (ROC, ">".$out."_".$e."_".$sot."_TP_FP") or die "could not open ".$out."_".$sot."_TP_FP for writing"; }else{ open (L60, ">".$out."_L60") or die "could not open ".$out."_L60 for writing"; open (L40, ">".$out."_L40") or die "could not open ".$out."_L40 for writing"; open (L30, ">".$out."_L30") or die "could not open ".$out."_L30 for writing"; open (NoL30, ">".$out."_NoL30") or die "could not open ".$out."_NoL30 for writing"; open (ROC, ">".$out."_".$sot."_TP_FP") or die "could not open ".$out."_".$sot."_TP_FP for writing"; }
my $hit = ""; my $evalue = ""; my $identity = ""; my $length = ""; my $probability = "";
my $L60s = 0; my $L40s = 0; my $L30s = 0; my $noL30s = 0;
my $TPs = 0; #true positives my $FPs = 0; #false positives my $Ps = 0; #positives (= true positives + false negatives) my $pred_Ps = 0; #predicted positives (= TPs + FPs) my @cluster_hits; # for all pdb70 cluster members of current hit (HHblits) or only the current hit (PSI-BLAST)
# For each hit:
while ((my $key, my $value) = each %hsps) {
$hit = $key; $evalue = ${$value}[0];
if($e && $evalue > $e){ # only hits with e-values under the cutoff --e are considered
next; } $identity = ${$value}[1]; $length = ${$value}[3]; if (${$value}[5]){ $probability = ${$value}[5]; } # Printing the output evaluation files and counting TPs and FPs: if (defined $L60_members{$hit}) { $L60s++; print L60 $query."\t".$hit."\t".$evalue."\t".$identity."\t".$length."\t".$probability."\n"; if ($sot eq "L60"){ $TPs++; print ROC $evalue."\tTP\t".$query."\t".$hit."\t".$probability."\n"; } } else { if ($sot eq "L60"){ $FPs++; print ROC $evalue."\tFP\t".$query."\t".$hit."\t".$probability."\n"; } }
if (defined $L40_members{$hit}) { $L40s++; print L40 $query."\t".$hit."\t".$evalue."\t".$identity."\t".$length."\t".$probability."\n"; if ($sot eq "L40"){ $TPs++; print ROC $evalue."\tTP\t".$query."\t".$hit."\t".$probability."\n"; } } else { if ($sot eq "L40"){ $FPs++; print ROC $evalue."\tFP\t".$query."\t".$hit."\t".$probability."\n"; } }
if (defined $L30_members{$hit}) { $L30s++; print L30 $query."\t".$hit."\t".$evalue."\t".$identity."\t".$length."\t".$probability."\n"; if ($sot eq "L30"){ $TPs++; print ROC $evalue."\tTP\t".$query."\t".$hit."\t".$probability."\n"; } } else { $noL30s++; print NoL30 $query."\t".$hit."\t".$evalue."\t".$identity."\t".$length."\t".$probability."\n"; if ($sot eq "L30"){ $FPs++; print ROC $evalue."\tFP\t".$query."\t".$hit."\t".$probability."\n"; } } } close L60; close L40; close L30; close NoL30; close ROC;
$pred_Ps = $TPs + $FPs;
given ($sot) { when ("L30") { $Ps = scalar(keys %L30_members); } when ("L40") { $Ps = scalar(keys %L40_members); } when ("L60") { $Ps= scalar(keys %L60_members); } }
open (POSITIVES, ">".$out."_".$sot."_positives") or die "could not open ".$out."_".$sot."_positives for writing"; print POSITIVES $query."\t".$Ps."\t".$pred_Ps."\n"; close POSITIVES;
print "Number of hits = ".scalar(keys %hsps)."\n"; print "\tL60 = ".$L60s."\n"; print "\tL40 = ".$L40s."\n"; print "\tL30 = ".$L30s."\n"; print "\tNo L30 = ".$noL30s."\n"; print "True positives (= TP; same ".$sot.") = ".$TPs."\n"; print "False positives (= FP; different ".$sot.") = ".$FPs."\n"; print "Predicted positives (= TP+FP) = ".$pred_Ps."\n"; print "Positives (= TP+FN) = ".$Ps."\n"; if($pred_Ps != 0){ print "precision TP/(TP+FP) = ".($TPs/$pred_Ps)."\n"; } print "sensitivity(TPR) = TP/(TP+FN) = ".($TPs/$Ps)."\n"; }
- -----------------------------------------------------------------------------------------------------------------------------------
if($query && $sot) { if($out_h) { print "*** Evaluating HHblits results ***\n"; categorize(\%hsps_h, $query, $out_h, $cops, $sot, $e); } if($out_p) { print "*** Evaluating Psi-BLAST results ***\n"; categorize(\%hsps_p, $query, $out_p, $cops, $sot, $e); } } </source>