Phenylketonuria/Task6 Scripts

From Bioinformatikpedia
Revision as of 15:06, 17 August 2013 by Waldraffs (talk | contribs) (Path)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Path

The perl and R scripts described below can be found in /mnt/home/student/waldraffs/masterpractical/Task06.
Back to Lab Journal

contact_map.pl

contact_map.pl is a perl script that extracts all residue pairs with less than 5 Ångstrom minimum atom distance:
perl contact_map.pl -pdb <pdb-file> -out <output-file>

Source code: <source lang="perl">

  1. !/usr/bin/perl

my $pdb; my $out;

  1. Reads the command line parameters

for ($i=0; $i<=@ARGV;$i++){ if($ARGV[$i] eq "-pdb"){ $pdb=$ARGV[$i+1]; } if($ARGV[$i] eq "-out"){ $out=$ARGV[$i+1]; } }

if(!$pdb && !$out){ print "Usage: contact_map.pl -pdb <pdb-file> -out <output-file>." }

  1. reads pdb file and filters for the 'ATOM' lines and saves the residues with their 3D coordinates.

open (FILE,"$pdb") || die $!; my @res; my @x; my @y; my @z; my @res;

while (my $line = <FILE>){ $line =~ s/\s+/ /g; if($line=~ m/^ATOM/){ @splitLine = split(/ /,$line); $acres=$splitLine[5]; push(@{$$acres{aa}},$splitLine[3]); push(@{$$acres{x}}, $splitLine[6]); push(@{$$acres{y}},$splitLine[7]); push(@{$$acres{z}},$splitLine[8]); if((grep /^$acres/, @res)==0){ push(@res,$acres); } } } close FILE;

  1. writes output file

open(OUTFILE,">$out") || die $!;

  1. The residue pairs with a distance of 5 Angstrom are filtered.

my @res_res; $rlength=@res; for($res1=$res[0];$res1<=$res[$rlength-1];$res1++){ for($res2=$res1;$res2<=$res[$rlength-1];$res2++){ $stop=0; $aa1= @{$$res1{aa}}[0]; $aa2= @{$$res2{aa}}[0]; $res_res=$res1."_".$res2; $aa=$aa1."_".$aa2; for($c1=0;$c1<@{$$res1{x}};$c1++){ for($c2=0;$c2<@{$$res2{x}};$c2++){ #if the atoms of a residue pair already has a distance smaller than 5 the distances of the other atomes do not have to be calculated. if($stop==0){ #distance x-coordinate, y-coordinate and z-coordinate $xc = (@{$$res1{x}}[$c1]-@{$$res2{x}}[$c2])*(@{$$res1{x}}[$c1]-@{$$res2{x}}[$c2]); $yc = (@{$$res1{y}}[$c1]-@{$$res2{y}}[$c2])*(@{$$res1{y}}[$c1]-@{$$res2{y}}[$c2]); $zc = (@{$$res1{z}}[$c1]-@{$$res2{z}}[$c2])*(@{$$res1{z}}[$c1]-@{$$res2{z}}[$c2]); #complete residue distance: $dist = sqrt ($xc + $yc + $zc); if($dist<5){ #writes the first and the second residue as well as the combinated residue pair and their amino acids in the output-file print OUTFILE "$res1\t$res2\t$res_res\t$aa\n"; $stop=1; } } } } } } close OUTFILE; </source>

extract_pairs.pl

extract_pairs.pl is a perl script that extracts all residue pairs with distance >5. If such a pair also is included in the output of contact_map.pl it is marked with 'TP' (true positive) else with 'FP' (false positive):
perl extract_pairs.pl -inp <FILE>.evfold -map <contact_map.pl output-file> -out <output-file>

Source code: <source lang="perl">

  1. !/usr/bin/perl

my $inp; my $map; my $out;

  1. Reads the command line parameters

for ($i=0; $i<=@ARGV;$i++){ if($ARGV[$i] eq "-inp"){ $inp=$ARGV[$i+1]; }if($ARGV[$i] eq "-map"){ $map=$ARGV[$i+1]; } if($ARGV[$i] eq "-out"){ $out=$ARGV[$i+1]; } }

if(!$inp && !$out && !$map){ print "Usage: extract_pairs.pl -inp <input-file> -map <pdb-map> -out <output-file>" }

  1. reads the contact_map of the pdb structure created with contact_map.pl

open (MAP,"$map") || die $!; while(my $line = <MAP>){ @splitLine = split(/\t/,$line); push(@res_res, $splitLine[2]); } close MAP;

  1. reads input file (evfold) and filters the residue pairs with distances bigger than 5 residues.

open (FILE,"$inp") || die $!;

  1. writes output file

open(OUTFILE,">$out") || die $!;

while (my $line = <FILE>){ chomp($line); @splitLine = split(/ /,$line); $res1=$splitLine[0]; $res2=$splitLine[2]; $distance=$res2-$res1; if($distance > 5){ $res_res=$res1."_".$res2; #if the residue pair is not found in the pdb map it is called FP (false positive) else TP (true positive) if((grep /^$res_res/, @res_res)==0){ print OUTFILE $line." FP\n"; }else{ print OUTFILE $line." TP\n"; } } }

close FILE; close OUTFILE; </source>

CN_dist2.R

CN_dist2.R is a R script that makes histograms for the CN-Score distribution (for all and extracted pairs). Furthermore it calculates the top L-Score (L = protein length) for each residue i that belongs to the top L CN_dist2.R script-call:
R CMD BATCH --slave '--args infile1=<FILE1> infile2=<FILE2> png_file=<OUTFILE1> output=<OUTFILE>' contact_map.R /dev/tty

-infile1         The evfold file with path.
-infile2         The extracted evfold file with path.
-png_file        PNG-file (.png) and the path, where the image of the multiple histogram with the CN-score frequencies 
                 for all and extracted residues should be saved.
-output          File-name and path, where the L-Scores should be stored.

Source code: <source lang="php">

  1. reads the input parameters

args=(commandArgs(trailingOnly=TRUE))

  1. go through the parameters

if(length(args)==0){

   	cat("\nScript-call:\nR CMD BATCH --slave '--args infile1=<FILE1> infile2=<FILE2> png_file=<OUTFILE1> output=<OUTFILE>' contact_map.R /dev/tty

\n\ninfile1\t\tThe evfold file with path. \ninfile2\t\tThe extracted evfold file with path. \npng_file\t\tPNG-file (.png) and the path, where the image of the multiple histogram with the CN-score frequencies for all and extracted residues should be saved. \noutput\t\tFile-name and path, where the L-Scores should be stored.\n\n") }else{

   for(i in 1:length(args)){
        eval(parse(text=argsi))
   }	

evfold<-read.table(infile1, sep=" ") evfold_extract<-read.table(infile2, sep=" ") out<-output pg<-png_file

difference <- evfold[,3]-evfold[,1] cn<-evfold[,6]

difference2 <- evfold_extract[,3]-evfold_extract[,1] cn2<-evfold_extract[,6] png(pg) par(mfrow=c(1,2)) hist(cn,col=("forestgreen"),xlab="CN",breaks = seq(-1.5,6.5, by = 0.1), main ="All pairs") hist(cn2,col=("purple"),xlab="CN",breaks = seq(-1.5,6.5, by = 0.1), main ="Extracted pairs") dev.off() #multiples Histogram with both CN score for all and for extracted residues

  1. library(plotrix)
  2. par(mfrow=c(1,1))
  3. l<-list(cn,cn2)
  4. png(pg)
  5. multhist(l,col=c("forestgreen","purple"),breaks=35,xlab="CN",ylab="frequency", main = "CN score frequency for all and extracted residue pairs")
  6. legend("topright",legend=c("ras","extracted"), fill=c("forestgreen","purple"))
  7. dev.off()

#L-Score for extracted pairs res1<-evfold_extract[,1] res2<-evfold_extract[,3] plength <-166 res <-1:plength sum <- data.frame(residue=res,score=0) #For all top-couplings sum the scores. Both residues must be considered. for(i in 1:plength){ sum[res1[i],2] <- sum[res1[i],2] + cn2[i] sum[res2[i],2] <- sum[res2[i],2] + cn2[i] } sum$score<-round(sum$score/mean(cn2[1:plength]),digits=2) sum <- sum[order(sum$score, decreasing = T),] write.table(sum[1:plength,], out , quote = F, row.names = F) } </source>

contact_map.R

Source code: contact_map.R is a R script that creates a contact map with the output-files of the two perl scripts above (pdb = reference structure, extracted = predicted).
contact_map.R script-call:
R CMD BATCH '--args infile1=<FILE1> infile2=<FILE2> tophits=<#pairs> output=<OUTFILE>' contact_map.R

-infile1         The sorted and extracted evfold file with path.
-infile2         The pdb-contact file (output of contact_map.pl) with path.
-tophits         number, how many of the best residue pairs should be represented in the contact map.
-output          File-name of the map and the path, where it should be stored. File must be a PNG-File (.png).

<source lang="php">

  1. reads the input parameters

args=(commandArgs(trailingOnly=TRUE))

  1. go through the parameters

if(length(args)==0){

   	cat("\nScript-call:\nR CMD BATCH '--args infile1=<FILE1> infile2=<FILE2> tophits=<#pairs> output=<OUTFILE>' contact_map.R

\n\ninfile1\t\tThe sorted and extracted evfold file with path. \ninfile2\t\tThe pdb-contact file (output of contact_map.pl) with path. \ntophits\t\t number, how many of the best residue pairs should be represented in the contact map. \noutput\t\tFile-name of the map and the path, where it should be stored. File must be a PNG-File (.png).\n\n") }else{

   for(i in 1:length(args)){
        eval(parse(text=argsi))
   }	

in1<-read.table(infile1, sep=" ") pdb<-read.table(infile2, sep="\t") npairs<-tophits out<-output

cres1<-c() cres2<-c() #only take the first n entries, where n is the number of residue pairs for(i in 1:npairs){ cres1[i]<-c(in1[i,1]) cres2[i]<-c(in1[i,3]) } #residue pairs of the evfold prediction res1 <- c(cres1,cres2) res2 <- c(cres2,cres1)

#residue pairs of the pdb structure l1<-c(pdb[,1],pdb[,2]) l2<-c(pdb[,2],pdb[,1])

png(out) plot(l1,l2, col="lightgrey",pch =16,ylim=rev(range(l1)),xlab="residue",ylab="residue",main=paste("Contact Map for",npairs, "pairs",sep=" ")) lines(res1,res2,type="p",col="red2",pch =16,ylim=rev(range(res1))) dev.off() } </source>