Difference between revisions of "Sequence Alignments Protocol TSD"
m (→COPS Parsing) |
|||
Line 96: | Line 96: | ||
Also be aware that ignoring chains might lead to a merge of clusters. That is if given xxxx als protein of interest and xxxx,A as well was xxxx,B are contained in COPS, the final cluster which is used to calculate how many hits fall into it, will be a combination of the cluster from xxxx,A and the one from xxxx,B. That of course might be overestimating the number of hits that fall into a cluster. |
Also be aware that ignoring chains might lead to a merge of clusters. That is if given xxxx als protein of interest and xxxx,A as well was xxxx,B are contained in COPS, the final cluster which is used to calculate how many hits fall into it, will be a combination of the cluster from xxxx,A and the one from xxxx,B. That of course might be overestimating the number of hits that fall into a cluster. |
||
+ | Sample call: perl parseCOPSClusters.pl -c ~/Desktop/COPS-ChainHierarchy.txt -i ./interestIds -nc -hi ./union.finpdb |
||
− | Sample call: |
||
#!/usr/bin/env perl |
#!/usr/bin/env perl |
Revision as of 12:07, 6 May 2012
Contents
Blast
blastall -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i /mnt/home/student/meiera/1_SeqAli/1_SeqSearch/P06865.fasta -b 1200 -v 1200 > /mnt/home/student/meiera /1_SeqAli/1_SeqSearch/blastall_1200alis.out
PSI-Blast
Big80
time blastpgp -C blastpgp1200_pssmdefault -d /mnt/project/pracstrucfunc12/data/big/big_80 -i /mnt/home/student/meiera/1_SeqAli/1_SeqSearch/P06865.fasta -v 1200 -b 1200 > /mnt/home/student/meiera/1_SeqAli/1_SeqSearch/blastpgp1200_default.out time blastpgp -C blastpgp1200_pssmit2e002 -d /mnt/project/pracstrucfunc12/data/big/big_80 -i /mnt/home/student/meiera/1_SeqAli/1_SeqSearch/P06865.fasta -j 2 -h "0.002" -v 1200 -b 1200 > /mnt/home/student/meiera/1_SeqAli/1_SeqSearch/blastpgp1200_it2e002.out time blastpgp -C blastpgp1200_pssmit10e002 -d /mnt/project/pracstrucfunc12/data/big/big_80 -i /mnt/home/student/meiera/1_SeqAli/1_SeqSearch/P06865.fasta -j 10 -h "0.002" -v 1200 -b 1200 > /mnt/home/student/meiera/1_SeqAli/1_SeqSearch/blastpgp1200_it10e002.out time blastpgp -C blastpgp1200_pssmit10e-10 -d /mnt/project/pracstrucfunc12/data/big/big_80 -i /mnt/home/student/meiera/1_SeqAli/1_SeqSearch/P06865.fasta -j 10 -h "10E-10" -v 1200 -b 1200 > /mnt/home/student/meiera/1_SeqAli/1_SeqSearch/blastpgp1200_it10e-10.out time blastpgp -C blastpgp1200_pssmit2e-10 -d /mnt/project/pracstrucfunc12/data/big/big_80 -i /mnt/home/student/meiera/1_SeqAli/1_SeqSearch/P06865.fasta -j 2 -h "10E-10" -v 1200 -b 1200 > /mnt/home/student/meiera/1_SeqAli/1_SeqSearch/blastpgp1200_it2e-10.out
Big
PAT="/mnt/home/student/meiera/1_SeqAli/1_SeqSearch" time blastpgp -R $PAT/blastpgp1200_pssmit2e002 -d /mnt/project/pracstrucfunc12/data/big/big -i $PAT/P06865.fasta -v 3800 -b 3800 > $PAT/blastpgpBIG_3800_it2e002.out time blastpgp -R $PAT/blastpgp1200_pssmit10e002 -d /mnt/project/pracstrucfunc12/data/big/big -i $PAT/P06865.fasta -v 3800 -b 3800 > $PAT/blastpgpBIG_3800_it10e002.out time blastpgp -R $PAT/blastpgp1200_pssmit2e-10 -d /mnt/project/pracstrucfunc12/data/big/big -i $PAT/P06865.fasta -v 3800 -b 3800 > $PAT/blastpgpBIG_3800_it2e-10.out time blastpgp -R $PAT/blastpgp1200_pssmit10e-10 -d /mnt/project/pracstrucfunc12/data/big/big -i $PAT/P06865.fasta -v 3800 -b 3800 > $PAT/blastpgpBIG_3800_it10e-10.out
Find Unique matches: example
head -1 m8blastpgp1200_it2e002.out grep -n G1RUL9 m8blastpgp1200_it2e002.out tail -n +1233 m8blastpgp1200_it2e002.out | cut -f 2 | wc -l tail -n +1233 m8blastpgp1200_it2e002.out | uniq -w 44 | wc -l
HHblits
WD="/mnt/home/student/meiera/1_SeqAli/1_SeqSearch" time hhblits -i $WD/P06865.fasta -o $WD/hhblits_460_P06865.hhr -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -Z 460 -B 460 > $WD/hhblits_460_P06865_stdout.log time hhblits -i $WD/P06865.fasta -n 10 -o $WD/hhblits_460_P06865_10it.hhr -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -Z 460 -B 460 > $WD/hhblits_460_P06865_10_stdout.log time hhblits -i $WD/P06865.fasta -n 10 -o $WD/hhblits_2500_P06865_10it.hhr -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -Z 2500 -B 2500 > $WD/hhblits_2500_P06865_10_stdout.log time hhblits -i $WD/P06865.fasta -n 2 -o $WD/hhblits_460_P06865_2it.hhr -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -Z 460 -B 460 > $WD/hhblits_460_P06865_2_stdout.log time hhblits -i $WD/P06865.fasta -n 2 -o $WD/hhblits_2500_P06865_2it.hhr -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -Z 2500 -B 2500 > $WD/hhblits_2500_P06865_2_stdout.log time hhsearch -i $WD/P06865.fasta -o $WD/hhblits_460_P06865.hhr -d /mnt/project/pracstrucfunc12/data/hhblits/pdb70_current_hhm_db -Z 14 -B 14 > $WD/hhblits_pdb_1500_P06865_stdout.log
Preparation of outputs
Sequence identity and e-values
#for both Blast and HHblits output /mnt/home/student/meiera/bin/1_parseidentity.pl Usage: 1_parseidentity.pl [flags] file... flags: -i Infile reads the input from fileIN -o OutFile writes output to fileOUT -h Input is hhblits, default is blast
Uniprot sets for Venn diagrams
#for both Blast and HHblits output, also computation of the unique ids for HHblits /mnt/home/student/meiera/bin/1_getUniprotIDS.pl Usage: 1_getUniprotIDS.pl [flags] file... flags: -i Infile reads the input from fileIN -o OutFile writes uniprot identifiers to fileOUT -h = hhblits input default is set to blastoutput -m mappingfile uniprot mappingfile
Build PDB sets
Unique identifiers for Blast and HHblits
/mnt/home/student/meiera/bin/1_forPdbMapping.pl Usage: 1_forPdbMapping.pl [flags] file... flags: -i Infile reads the input from fileIN -o OutFile writes uniprot identifiers to fileOUT -h = hhblits input, default is set to blastoutput
#Use uniprot.org's online mapping service on the .pdbmapping files and save the results to .mapped files
cut -f 2 psiblastBIGit10e002.mapped | tr '[A-Z]' '[a-z]' | sort | uniq | grep -P "\w{4}" > psiblastBIGit10e002.finpdb1 grep -P "\w{4}_" psiblastBIGit10e002.pdbmapping --only-matching | sed "s/_//" > psiblastBIGit10e002.finpdb2 cat psiblastBIGit10e002.finpdb2 psiblastBIGit10e002.finpdb1 | sort | uniq > psiblastBIGit10e002.finpdb
cut -f 2 psiblastBIGit10e-10.mapped | tr '[A-Z]' '[a-z]' | sort | uniq | grep -P "\w{4}" > psiblastBIGit10e-10.finpdb1 grep -P "\w{4}_" psiblastBIGit10e-10.pdbmapping --only-matching | sed "s/_//" > psiblastBIGit10e-10.finpdb2 cat psiblastBIGit10e-10.finpdb1 psiblastBIGit10e-10.finpdb2 | sort | uniq > psiblastBIGit10e-10.finpdb
cut -f 2 psiblastBIGit2e002.mapped | tr '[A-Z]' '[a-z]' | sort | uniq | grep -P "\w{4}" > psiblastBIGit2e002.finpdb1 grep -P "\w{4}_" psiblastBIGit2e002.pdbmapping --only-matching | sed "s/_//" > psiblastBIGit2e002.finpdb2 cat psiblastBIGit2e002.finpdb1 psiblastBIGit2e002.finpdb2 | sort | uniq > psiblastBIGit2e002.finpdb
cut -f 2 psiblastBIGit2e-10.mapped | tr '[A-Z]' '[a-z]' | sort | uniq | grep -P "\w{4}" > psiblastBIGit2e-10.finpdb1 grep -P "\w{4}_" psiblastBIGit2e-10.pdbmapping --only-matching | sed "s/_//" > psiblastBIGit2e-10.finpdb2 cat psiblastBIGit2e-10.finpdb1 psiblastBIGit2e-10.finpdb2 | sort | uniq > psiblastBIGit2e-10.finpdb
cut -f 2 hhblits__460.mapped | tr '[A-Z]' '[a-z]' | sort | uniq | grep -P "\w{4}" > hhbtemp sed "s/_\w//g" hhblits_pdb_1500.pdbmapping > hhbtemp2 cat hhbtemp hhbtemp2 | sed "s/\s//g" | sort | uniq > hhblits.finpdb rm -f hhbtemp hhbtemp2 # dont' have to do the special stuff here, the normal one doesn't find any because it's only on uniprot
COPS Parsing
The following script will read the COPS clusters (-c) and print out for every classification the members of the cluster that one or several IDs of interest (-i) are contained in. Additionally, if given a list of hits (-hi), it will count the number of hits that fall in the according cluster and the ones that don't and print out that information to a file for plotting e.g. with R. Note that you can ignore chains with the '-nc' option.
Also be aware that ignoring chains might lead to a merge of clusters. That is if given xxxx als protein of interest and xxxx,A as well was xxxx,B are contained in COPS, the final cluster which is used to calculate how many hits fall into it, will be a combination of the cluster from xxxx,A and the one from xxxx,B. That of course might be overestimating the number of hits that fall into a cluster.
Sample call: perl parseCOPSClusters.pl -c ~/Desktop/COPS-ChainHierarchy.txt -i ./interestIds -nc -hi ./union.finpdb
#!/usr/bin/env perl use strict; use warnings; use sigtrap; use autodie; use diagnostics; use Pod::Usage; use Getopt::Long qw(:config require_order); use 5.010; #For smart match, which makes this script unbelievably slow, but I'm lazy :) ##Program variables my %clusters; my @clusterOrder; my @interestIDs = (); my %interestIDLocations; my @hitsIDs = (); ##Argument my $copsPath = '/mnt/project/pracstrucfunc12/data/COPS/COPS-ChainHierarchy.txt'; my $nochain = 0; my $interestPath = './interestIds'; my $hitsPath = './union.finpdb'; ######################## ###Argument Handling#### ######################## my $result = GetOptions( 'help|h|?' => sub { pod2usage( -verbose => 1 ); }, 'man|m' => sub { pod2usage( -verbose => 2 ); }, 'c|cops=s' => \$copsPath, 'nc|nocain' => \$nochain, 'i|interest=s' => \$interestPath, 'hi|hits=s' => \$hitsPath ) or pod2usage( -verbose => 1 ) && exit; #Read IDs to look out for my $fhi; open( $fhi, '<', $interestPath ); while ( my $line = <$fhi> ) { if ( $line =~ m/(\w{4},\w)/ || $line =~ m/(\w{4})/ ) { my $iid = lc($1); push( @interestIDs, $iid ); $interestIDLocations{$iid} = {}; } } close($fhi); #Read ids that were found by prediction method(s) my $fhh; open($fhh, '<', $hitsPath); while(my $line =<$fhh>) { if($line =~ m/^(\w{4})$/) { push(@hitsIDs, lc($1)); } } close($fhh); print "Found # hits: " . scalar(@hitsIDs) . "\n"; #Parse the COPS structure my $fh; open( $fh, '<', $copsPath ); while ( my $line = <$fh> ) { if ( $line =~ m/^Chain (.+)$/ ) { #Header line found. Inititialize the cluster hash my @matches = $1 =~ m/\w+/g; for my $clusterName (@matches) { $clusters{$clusterName} = {}; } @clusterOrder = @matches; } elsif ( $line =~ m/\w{4},/ ) { #Cluster info line my @matches = $line =~ m/(\w{4},\w)/g; my $currentId = shift(@matches); # print "current id is $currentId\n"; $currentId = sanitizePdbIDs($currentId); my $count = 0; for my $representative (@matches) { $representative = sanitizePdbIDs($representative); my $currentCluster = $clusterOrder[ $count++ ]; # print "currentCluster is $currentCluster\n"; if ( !exists $clusters{$currentCluster} || !defined $clusters{$currentCluster} ) { print "this should not have happened :)\n"; next; } if ( !exists $clusters{$currentCluster}{$representative} ) { #Inititialize the cluster with the representative $clusters{$currentCluster}{$representative} = [$representative]; } #Add the current pdb id if ( !( $currentId ~~ @{ $clusters{$currentCluster}{$representative} } ) ) { #This case can only occur when we do not use the chains push( @{ $clusters{$currentCluster}{$representative} }, $currentId ); } if ( $currentId ~~ @interestIDs ) { if(!defined $interestIDLocations{$currentId}{$currentCluster}) { $interestIDLocations{$currentId}{$currentCluster}= (); } push(@{$interestIDLocations{$currentId}{$currentCluster} }, $representative); } } } } close($fh); #Do some maybe interesting output for my $intId (@interestIDs) { printClustersForId($intId); } sub printClustersForId { my $id = shift; my @allIn = (); my @allOut = (); for my $cluster ( sort(keys( $interestIDLocations{$id} ) ) ) { my %clusterMembersMerged = (); print "MEMBERS: $id -- $cluster : "; my @currentRepresentatives = @{$interestIDLocations{$id}{$cluster}}; for my $currentRepresentative (@currentRepresentatives) { my @clusterMembers = @{ $clusters{$cluster}{$currentRepresentative} }; for my $member (@clusterMembers) { # print "$member "; $clusterMembersMerged{$member} = ; } # print "\n"; } my @clusterMembersMergedArr= keys(%clusterMembersMerged); for my $e (@clusterMembersMergedArr) { print $e." "; } print "\n"; my $hitsIn = 0; my $hitsOut = 0; print "OVERLAPS: $id -- $cluster : "; for my $hit (@hitsIDs) { if($hit ~~ @clusterMembersMergedArr) { $hitsIn++; } else { $hitsOut++; } } print "In: $hitsIn, Out: $hitsOut\n"; push(@allIn, $hitsIn); push(@allOut, $hitsOut); } #Write Out In/Out over Cluster stats for plotting in R my $fho; open($fho, '>', "stats$id.plotme"); for my $cluster (sort (keys( $interestIDLocations{$id}))) { print $fho "\t$cluster"; } print $fho "\n"; print $fho "IN"; for my $elem (@allIn) { print $fho "\t$elem"; } print $fho "\n"; print $fho "OUT"; for my $elem (@allOut) { print $fho "\t$elem"; } print $fho "\n"; close($fho); } sub sanitizePdbIDs { my $id = shift; if ($nochain) { $id = substr( $id, 0, 4, ); } return $id; }