Tpr precision.pl

From Bioinformatikpedia

tpr_precision.pl creates the evaluation plots of TPR and precison depending on E-value using R. You can find the script here on biocluster: /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1 or just copy the code from the bottom part of this page.

Usage

perl tpr_precision.pl 
--f <file>: name of the original output file. Two parsed files with the following endings must be in the given directory:
- a file with the ending "_L*_positives", * = 30 / 40 / 60 in tab-separated format:
#query_id positives [predicted_positives] ([...] is optional)
where positives( = P) = number hits in COPS file with the same L* group as the query
- a file with the ending "_L*_TP_FP", * = 30 / 40 / 60 in tab-separated format:
#E-value TP/FP query_id hit_id
--m <main>: program name and used parameters in the title of the evaluation plots (e.g. "2 HHblits iterations")>
--p : if few data points (plot character is used instead of line)
--h <print help>

How the method works

Each file is first sorted according to ascending E-value. Afterwards, the following rates are calculated for each line (from the first to the current line): TPR (sensitivity) and precision. Each rate is calculated specifically for each query seen so far, and then the average is taken (sum of all rates for each query divided by the number of the queries seen so far). The rates are defined as follows:

TPR = TP/(TP+FN)
precision = TP/(TP+FP)

where:

  • TP = the number of true positive hits, namely all found hits with the same COPS L30/L40/L60 group as the query
  • FP = the number of false positive hits, that is with different groups than the query
  • FN = the number of false negatives, which are not detected chains with the same group as the query
  • TP+FN = the number of all positives, i.e. the number of all known PDB chains with the same group as the query
  • TP+FP = the number of all found hits

Code

<source lang="perl">

  1. !/usr/bin/perl -w

use strict; use Getopt::Long; use File::Temp qw/tempfile/;

  1. -----------------------------------------------------------------------------------------------------------------------------------

=head 0 tpr_precision.pl Creates the evaluation plots of TPR and precison depending on E-value. input: -f <file>: name of the original output file. Two parsed files with the following endings must be in the given directory: - a file with the ending "_L*_positives", * = 30 / 40 / 60 in tab-separated format: #query_id positives [predicted_positives] ([...] is optional) where positives( = P) = number hits in COPS file with the same L* group as the query - a file with the ending "_L*_TP_FP", * = 30 / 40 / 60 in tab-separated format: #E-value TP/FP query_id hit_id -m <main>: program name and used parameters in the title of the evaluation plots (e.g. \"2 HHblits iterations\") -p : if few data points output: - file for the evaluation plots: #Evalue TPR( = TP/P = TP/(TP+FN) precision( = TP/(TP+FP) )

	- evaluation plot:

- evalue on x axis, TPR and precision on y axis (.._eval_tpr_precision.png) =cut

  1. -----------------------------------------------------------------------------------------------------------------------------------
  2. Parses the command line parameters

our($f, $m, $p, $h);

my $args_ok = GetOptions( 'f=s' => \$f, #name of the original output file 'm=s' => \$m, #program name and used parameters in the title of the evaluation plots 'p' => \$p, #if few data points 'h' => \$h #print help );

  1. -----------------------------------------------------------------------------------------------------------------------------------

=head 1 Subroutine print_err Prints error message. output: stdout =cut sub print_err{

 print "Failed to execute tpr_precision.pl\n" .

"Usage: perl tpr_precision.pl --f <file>: name of the original output file. Two parsed files with the following endings must be in the given directory: - a file with the ending \"_L*_positives\", * = 30 / 40 / 60 in tab-separated format: #query_id positives [predicted_positives] ([...] is optional) where positives( = P) = number hits in COPS file with the same L* group as the query - a file with the ending \"_L*_TP_FP\", * = 30 / 40 / 60 in tab-separated format: #E-value TP/FP query_id hit_id --m <main>: program name and used parameters in the title of the evaluation plots (e.g. \"2 HHblits iterations\")> --p : if few data points (plot character is used instead of line) --h <print help>\n";

 exit 0;

} if(!$f || !$m || $h){ print_err; }

  1. -----------------------------------------------------------------------------------------------------------------------------------

=head 2 Subroutine sort_evalue Sorts the file according to accending e-value (1st column) using bash sort command input: $file (full file name) output: sorted file ending with "_sorted" =cut sub sort_evalue{ my ($file) = @_; system ("cat $file | sort -g -s > $file"."_sorted"); #-s stabilize sort by disabling last-resort comparison }

  1. -----------------------------------------------------------------------------------------------------------------------------------

=head 3 Subroutine calc_precision Calculates precision so far using the formula: precision = TP/(TP+FP) input: (TP, pred_P) output: precision value =cut sub calc_precision{ my ($TP, $FP) = @_; return $TP/($TP + $FP); }

  1. -----------------------------------------------------------------------------------------------------------------------------------

=head 4 Subroutine calc_tpr Calculates TPR using the formula: TPR = TP/P [= TP(TP+FN)] input: (TP, P) output: TPR value =cut sub calc_tpr{ my ($TP, $P) = @_; return $TP/$P; }

  1. -----------------------------------------------------------------------------------------------------------------------------------

=head 5 Subroutine read_p_file Reads the --p file (with ending) and fills the hash %query_num and the array @positives input: ($p_file) (full file name) output: %query_num: query -> num @positives: index = num of the query, value = positives of the query =cut sub read_p_file{ my ($p_file) = @_; my %query_num = (); my $query = ""; my $num = 0; my @positives = (); my $pos = 0; my @params = (); open(READ, "$p_file") or die "could not open $p_file for reading"; for my $line (<READ>) { if ($line !~ /^#/) { chomp($line); @params = split(/\t/, $line); $query = $params[0]; $pos = $params[1]; $query_num{$query} = $num; push(@positives, $pos); $num++; } } close READ; return (\%query_num, \@positives); }

  1. -----------------------------------------------------------------------------------------------------------------------------------

=head 6 Subroutine eval_tpr_precision Calculates TPR and precision for each e-value in the acsending sorted files and plots the evaluation plot using R. input: ($file, $p_file) (= _L*_TP_FP and _L*_positives files, full file names) output: file for the evaluation plots (#E-value TPR precision) =cut sub eval_tpr_precision{ my ($file, $p_file) = @_; sort_evalue($file); my $sorted_file = $file."_sorted"; (my $query_num_ref, my $P_ref) = read_p_file($p_file); my %query_num = %{$query_num_ref}; my @P = @{$P_ref}; my $size = scalar @P; my @params; ## declare the arrays TP and FP of size $size, all will initialize with 1 (pseudocounts)-> test without psuedocounts my @TP = ((0) x $size); my @FP = ((0) x $size); my $FDR=0; my $TPR=0; my $precision=0; my $eval=0; my $query=""; my $num=0;# number of query so far my @seen = ((0) x $size); my $i=0;# number of query of the current hit my $sum_FDR=0; my $sum_TPR=0; my $sum_precision=0;

$file =~ /(^.*)_TP_FP$/; my $eval_tpr_precision_file = $1."_eval_tpr_precision";

open(READ, "$sorted_file") or die "could not open $sorted_file for reading"; open(WRITE, ">$eval_tpr_precision_file") or die "could not open $eval_tpr_precision_file for writing";

print WRITE "Evalue\tTPR\tprecision\n"; for my $line (<READ>) { if ($line !~ /^#/) { chomp($line); @params = split(/\t/, $line); $eval = $params[0]; $query = $params[2];

if (defined $query_num{$query}) { $i = $query_num{$query}; if($params[1] eq "TP"){ $TP[$i]++; }elsif($params[1] eq "FP"){ $FP[$i]++; }else{ print "error!\n"; }

if ($seen[$i] == 0){ $num++; $seen[$i] = 1; }

$sum_TPR=0; $sum_precision=0; for (my $j=0; $j<$size; $j++) { if (($FP[$j]>=1 || $TP[$j]>=1) && $P[$j]>=1) { $sum_TPR += calc_tpr($TP[$j], $P[$j]); $sum_precision += calc_precision($TP[$j], $FP[$j]); } } $TPR=$sum_TPR/$num; $precision=$sum_precision/$num; print WRITE "$eval\t$TPR\t$precision\n"; } } } close READ; close WRITE; } eval_tpr_precision($f."_L30_TP_FP", $f."_L30_positives"); eval_tpr_precision($f."_L40_TP_FP", $f."_L40_positives"); eval_tpr_precision($f."_L60_TP_FP", $f."_L60_positives");

  1. -----------------------------------------------------------------------------------------------------------------------------------

=head 7 Subroutine plot Created plot of TPR and precision as a function of E-value using R. input: ($file, $main, $p) (file path and name as given in --f and title informations give in --m) output: the plot in png format =cut sub plot{ my ($file, $main, $p) = @_; my $eval_tpr_precision_file_L30 = $file."_L30_eval_tpr_precision"; my $eval_tpr_precision_file_L40 = $file."_L40_eval_tpr_precision"; my $eval_tpr_precision_file_L60 = $file."_L60_eval_tpr_precision"; my $eval_tpr_precision_png = $file."_eval_tpr_precision.png";

my $col = ""; if ($file =~ /psiblast/) { $col = "blue"; }elsif ($file =~ /blast/) { $col = "turquoise3"; } elsif ($file =~ /hhblits/) { $col = "red"; }

(my $TEMPHANDLE, my $tempfilename) = tempfile();

if(!$p){ #plot lines print $TEMPHANDLE(" vL30 = read.table(\"$eval_tpr_precision_file_L30\", header=TRUE) vL40 = read.table(\"$eval_tpr_precision_file_L40\", header=TRUE) vL60 = read.table(\"$eval_tpr_precision_file_L60\", header=TRUE) log_evalL30 = -log10(as.numeric(vL30\$Evalue)) log_evalL40 = -log10(as.numeric(vL40\$Evalue)) log_evalL60 = -log10(as.numeric(vL60\$Evalue))

png(\"$eval_tpr_precision_png\")

plot(log_evalL30, vL30\$TPR * 100, ylim=c(0, 100), main=\"TPR and precision vs. E-value of $main \nusing the L30, L40 and L60 COPS groups as gold standard\", xlab=\"-log(E-value)\", ylab=\"%\", cex.lab=1, type=\"l\", col=\"$col\", lwd=2) lines(log_evalL30, vL30\$precision * 100, col=\"purple\", lwd=2)

lines(log_evalL40, vL40\$TPR * 100, col=\"$col\", lty=2, lwd=2) lines(log_evalL40, vL40\$precision * 100, col=\"purple\", lty=2, lwd=2)

lines(log_evalL60, vL60\$TPR * 100, col=\"$col\", lty=3, lwd=3) lines(log_evalL60, vL60\$precision * 100, col=\"purple\", lty=3, lwd=3)

abline(h=seq(0, 100, by=5), lty=1, col=\"lightgray\") abline(v=seq(-10, 330, by=10), lty=1, col=\"lightgray\")

legend(\"center\",c(\"sensitivity L30\",\"sensitivity L40\",\"sensitivity L60\",\"precision L30\",\"precision L40\",\"precision L60\"),cex=1,col=c(\"$col\",\"$col\",\"$col\",\"purple\",\"purple\",\"purple\"),lty=1:3, lwd=c(2,2,3,2,2,3)) dev.off() "); }else{ #--p given -> plot characters and do not convert E-value to -log print $TEMPHANDLE(" vL30 = read.table(\"$eval_tpr_precision_file_L30\", header=TRUE) vL40 = read.table(\"$eval_tpr_precision_file_L40\", header=TRUE) vL60 = read.table(\"$eval_tpr_precision_file_L60\", header=TRUE)

png(\"$eval_tpr_precision_png\")

plot(vL30\$Evalue, vL30\$TPR * 100, ylim=c(0, 100), main=\"TPR and precision vs. E-value of $main \nusing the L30, L40 and L60 COPS groups as gold standard\", xlab=\"E-value\", ylab=\"%\", cex.lab=1, type=\"l\", col=\"$col\", pch=21) points(vL30\$Evalue, vL30\$precision * 100, col=\"purple\", pch=21)

points(vL40\$Evalue, vL40\$TPR * 100, col=\"$col\", pch=22) points(vL40\$Evalue, vL40\$precision * 100, col=\"purple\", pch=22)

points(vL60\$Evalue, vL60\$TPR * 100, col=\"$col\", pch=23) points(vL60\$Evalue, vL60\$precision * 100, col=\"purple\", pch=23)

abline(h=seq(0, 100, by=5), lty=1, col=\"lightgray\")

legend(\"center\",c(\"sensitivity L30\",\"sensitivity L40\",\"sensitivity L60\",\"precision L30\",\"precision L40\",\"precision L60\"),cex=1,col=c(\"$col\",\"$col\",\"$col\",\"purple\",\"purple\",\"purple\"), pch=c(21,22,23,21,22,23)) dev.off() "); }

system("Rscript $tempfilename"); } plot($f, $m, $p); </source>