Sequence-based predictions Protocol TSD


Secondary Structure

Get the sequences:
<source lang="bash">
#!/bin/bash

cd ../input

wget http://www.uniprot.org/uniprot/P10775.fasta
wget http://www.uniprot.org/uniprot/Q9X0E6.fasta
wget http://www.uniprot.org/uniprot/Q08209.fasta
wget http://www.uniprot.org/uniprot/P06865.fasta
</source>

For DSSP, first get the executable:
<source lang="bash">
wget ftp://ftp.cmbi.ru.nl/pub/software/dssp/dssp-2.0.4-linux-amd64
chmod +x dssp-2.0.4-linux-amd64
</source>

Get the PDB files for the corresponding UniProt entries:
<source lang="bash">
#!/bin/bash

cd ../input

wget http://www.pdb.org/pdb/files/2BNH.pdb
wget http://www.pdb.org/pdb/files/1KR4.pdb
wget http://www.pdb.org/pdb/files/1AUI.pdb
wget http://www.pdb.org/pdb/files/2GJX.pdb
</source>
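The pairing of UniProt accessions and PDB IDs is only implicit in the order of the two download lists. A minimal sketch that makes it explicit is given here; the function names pdb_for and pdb_urls are hypothetical, and the mapping is read off the lists in this protocol rather than taken from an official cross-reference.

```shell
#!/bin/bash

# UniProt accession -> PDB ID, paired by the order of the two download lists
# in this protocol (an inference, not an official cross-reference).
pdb_for() {
    case "$1" in
        P10775) echo 2BNH ;;
        Q9X0E6) echo 1KR4 ;;
        Q08209) echo 1AUI ;;
        P06865) echo 2GJX ;;
        *) return 1 ;;
    esac
}

# Print one download URL per accession (dry run, nothing is fetched).
pdb_urls() {
    local acc id
    for acc in "$@"; do
        id=$(pdb_for "$acc") || { echo "no PDB ID known for $acc" >&2; continue; }
        echo "http://www.pdb.org/pdb/files/${id}.pdb"
    done
}
```

Calling pdb_urls P10775 Q9X0E6 Q08209 P06865 prints the four URLs used above; the output can be piped into wget -i - to fetch the files.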

Start the predictions:
<source lang="bash">
#!/bin/bash

cd ../input/

# Secondary structure prediction from sequence with ReProf
for file in *.fasta ; do
       reprof -i "$file" -o ../prediction/
done

# Secondary structure assignment from the PDB structures with DSSP
for file in *.pdb ; do
       ./../bin/dssp-2.0.4-linux-amd64 -i "$file" -o "../prediction/$file.dssp"
done
</source>


For PSIPRED, use the web server.


Make the ReProf output more readable:
<source lang="bash">
#!/bin/bash

cd ../prediction/

for file in *.reprof ; do
       # Line 1: amino acid sequence (column 2), skipping comment and header lines
       grep -v -P "^(#|No)" "$file" | cut -f 2 | tr -d '\n' > "$file.parsed"
       echo "" >> "$file.parsed"
       # Line 2: predicted structure (column 3), with loop 'L' written as coil 'C'
       grep -v -P "^(#|No)" "$file" | cut -f 3 | tr -d '\n' | tr 'L' 'C' >> "$file.parsed"
       echo "" >> "$file.parsed"
done
</source>
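The same two-line output can also be produced in a single pass. The following sketch assumes, as the script above does, that the ReProf file is tab-separated with the residue in column 2 and the predicted state in column 3, and that the header line starts with "No"; parse_reprof is a hypothetical helper name.

```shell
#!/bin/bash

# Read a ReProf file on stdin and print two lines:
# the amino acid sequence and the predicted states (L rewritten as C).
parse_reprof() {
    awk -F'\t' '
        /^#/ || $1 == "No" { next }                       # skip comments and column header
        { seq = seq $2; ss = ss ($3 == "L" ? "C" : $3) }  # append residue and state
        END { print seq; print ss }
    '
}
```

Usage would be parse_reprof < P06865.reprof > P06865.reprof.parsed.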

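Make the DSSP output more readable:
<source lang="bash">
#!/bin/bash

cd ../prediction/

for file in *.dssp ; do
       # Line 1: amino acid (character column 14), skipping the 28 header lines
       tail -n+29 "$file" | cut -c14 | tr -d '\n' > "$file.parsed" #Thanks to Jonathan
       echo "" >> "$file.parsed"
       # Line 2: secondary structure (character column 17), reduced to the three states H, E and C
       tail -n+29 "$file" | cut -c17 | tr ' ' '-' | tr -d '\n' | tr 'HGIEBTS-' 'HHHEECCC' >> "$file.parsed"
       echo "" >> "$file.parsed"
done
</source>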
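The reduction from DSSP's eight states to three is the part of this script that is easiest to get wrong, so it is isolated here as a small sketch. The mapping is the one used in the script above (H, G, I to H; E, B to E; T, S and blank to C); other reduction schemes exist, and dssp_to_three_state is a hypothetical name.

```shell
#!/bin/bash

# Reduce a DSSP secondary structure string (stdin) to three states.
# Blank (no assignment) is first rewritten as '-' so that tr can map it.
dssp_to_three_state() {
    tr ' ' '-' | tr 'HGIEBTS-' 'HHHEECCC'
}
```

For example, printf 'HHGG EEBTS' | dssp_to_three_state gives HHHHCEEECC.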

Make the PSIPRED output more readable:
<source lang="bash">
#!/bin/bash

cd ../prediction/

for file in $(ls *.psipred | grep -v "pdf") ; do
       grep "AA:" $file | sed -r 's/\s+AA: //' | tr -d '\n' > $file.parsed
       echo "" >> $file.parsed
       grep "Pred:" $file | sed 's/Pred: //' | tr -d '\n' >> $file.parsed
       echo "" >> $file.parsed
done
</source>
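All three parsers write the same layout: the sequence on the first line and the structure string on the second. A quick consistency check before aligning anything could look like the sketch below, which only assumes that layout; check_parsed is a hypothetical helper.

```shell
#!/bin/bash

# Succeed if the file has a sequence line and a structure line of equal length.
check_parsed() {
    local file=$1 seq ss
    { IFS= read -r seq && IFS= read -r ss; } < "$file" || return 1
    [ "${#seq}" -eq "${#ss}" ]
}
```

A loop such as for f in *.parsed; do check_parsed "$f" || echo "length mismatch: $f"; done flags files where the parsing went wrong.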

Figures

Manually align the sequences by adding gaps ('-'). Create one file with the aligned sequences ('_seq.combined') and one with the structures that contains no whitespace or gaps ('_struct.combined'). Watch out for DSSP's lowercase-letter notation for disulfide bridges. Feed both files into cpssp, then run LaTeX to create the picture.
<source lang="bash">
#!/bin/bash

ids=( P06865 P10775 Q9X0E6 Q08209 )

for i in "${ids[@]}" ; do
       /home/jonas/texmf/tex/latex/cpssp/cpssp -s ${i}_seq.combined -u ${i}_struct.combined -i 2 -o $i -b .2
done

# Adjust the latex files

for i in "${ids[@]}" ; do
       pdflatex ${i}_plot.tex
       pdftocairo -png ${i}_plot.pdf
done
</source>
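DSSP writes cysteines that take part in a disulfide bridge as lowercase letters (one letter per bridge) instead of C, so a sequence line extracted from a DSSP file can contain characters that are not amino acid codes. A minimal sketch for restoring them before building the '_seq.combined' file follows; fix_disulfide_cys is a hypothetical name.

```shell
#!/bin/bash

# Replace DSSP's lowercase bridge labels in a sequence (stdin) with 'C'.
fix_disulfide_cys() {
    sed 's/[[:lower:]]/C/g'
}
```

Apply it only to the sequence line; the structure line has already been reduced to H, E and C at this point.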

Disorder

Get the required sequences:
<source lang="bash">
#!/bin/bash

cd ../input

wget http://www.uniprot.org/uniprot/P10775.fasta
wget http://www.uniprot.org/uniprot/Q9X0E6.fasta
wget http://www.uniprot.org/uniprot/Q08209.fasta
wget http://www.uniprot.org/uniprot/P06865.fasta
</source>

Start the predictions:
<source lang="bash">
#!/bin/bash

cd /opt/iupred/
END=.iupred

for file in $(ls /mnt/home/student/reeb/3_SeqBasedPred/2_DISO/input | grep .fasta) ; do
       # Split the file name at '.' to get the accession as array[0]
       IFS="."
       array=($file)
       unset IFS
       ./iupred /mnt/home/student/reeb/3_SeqBasedPred/2_DISO/input/$file long > /mnt/home/student/reeb/3_SeqBasedPred/2_DISO/predictions/${array[0]}$END
done
</source>
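To get a first impression of the IUPred results, the per-residue scores can be summarised on the command line. The sketch below rests on two assumptions that should be checked against the actual output files: that each data line has the columns position, residue and score with header lines starting with '#', and that 0.5 is the cut-off above which a residue counts as disordered. disordered_fraction is a hypothetical helper.

```shell
#!/bin/bash

# Read an IUPred output file on stdin and print "disordered/total" residues.
# Assumed layout: "# ..." header lines, then "position residue score".
# Assumed cut-off: score > 0.5 means disordered.
disordered_fraction() {
    awk '!/^#/ && NF >= 3 { n++; if ($3 > 0.5) d++ }
         END { if (n) printf "%d/%d\n", d, n }'
}
```

Usage would be disordered_fraction < P06865.iupred.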

Transmembrane helices

Get the required sequences and our reference sequence:
<source lang="bash">
cd ../input/

wget http://www.uniprot.org/uniprot/P35462.fasta
wget http://www.uniprot.org/uniprot/Q9YDF8.fasta
wget http://www.uniprot.org/uniprot/P47863.fasta
wget http://www.uniprot.org/uniprot/P06865.fasta

wget http://www.uniprot.org/uniprot/P02768.fasta
wget http://www.uniprot.org/uniprot/P47863.fasta
wget http://www.uniprot.org/uniprot/P11279.fasta
</source>

Script for running PolyPhobius and creating everything it needs in advance:
<source lang="bash">
#!/bin/bash
#$ -S /bin/sh

BLASTDB=$1 #/mnt/project/pracstrucfunc12/data/swissprot/uniprot_sprot
BLASTINDEX=$2 #/mnt/project/pracstrucfunc12/data/index_pp/uniprot_sprot.idx
WD=$3
OUT=$4
EXEC=/mnt/project/pracstrucfunc12/polyphobius/jphobius
EXECBG=/mnt/project/pracstrucfunc12/polyphobius/blastget
EXECKA=/mnt/opt/T-Coffee/bin/kalign
END=.pred
ENDBG=.bg
ENDKA=.msa
PARAMS=-poly
PARAMSKA="-f fasta"
PARAMSBG="-db $BLASTDB -ix $BLASTINDEX"

PATH=$PATH:/mnt/project/pracstrucfunc12/polyphobius/
export PATH

mkdir -p $OUT

cd $WD

pwd

rm $OUT/log &> /dev/null

for file in `ls | grep ".fasta"`; do
   echo "Processing $file" &>> $OUT/log
   IFS="."
   array=($file)
   unset IFS

   # Run blastget for the query against the BLAST database
   perl $EXECBG $PARAMSBG $file > $OUT/${array[0]}$ENDBG
   wait

   if [ `grep "^>" $OUT/${array[0]}$ENDBG | wc -l` -gt 1 ]; then
      # More than one sequence: align with kalign, then predict on the alignment
      $EXECKA $PARAMSKA -input $OUT/${array[0]}$ENDBG -output $OUT/${array[0]}$ENDKA
      wait
      perl $EXEC $PARAMS $OUT/${array[0]}$ENDKA &> $OUT/${array[0]}$END
      wait
   else
      # Only one sequence: predict directly on the blastget output
      perl $EXEC $PARAMS $OUT/${array[0]}$ENDBG &> $OUT/${array[0]}$END
   fi
done
</source>
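The branch in the script depends on how many sequences blastget returned: an alignment is only built when there is more than one FASTA record. That test can be sketched on its own as follows; count_fasta_records and needs_alignment are hypothetical helper names.

```shell
#!/bin/bash

# Number of FASTA records in a file (lines starting with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

# Succeed if the file holds more than one sequence, i.e. an alignment makes sense.
needs_alignment() {
    [ "$(count_fasta_records "$1")" -gt 1 ]
}
```

With these helpers the condition in the loop would read if needs_alignment "$OUT/${array[0]}$ENDBG"; then ... fi.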

Start the predictions

<source lang="bash"> ./callPolyPhobius.sh /mnt/project/pracstrucfunc12/data/swissprot/uniprot_sprot /mnt/project/pracstrucfunc12/data/index_pp/uniprot_sprot.idx ../input/ ../prediction/sp/ </source>

Signal peptides

<source lang="bash">
#!/bin/bash

for file in /mnt/home/student/reeb/3_SeqBasedPred/4_SIGP/input/*fasta; do
       prot=${file##*/}
       protein=${prot%.*}
       signalp -t euk -graphics gif -d /mnt/home/student/reeb/3_SeqBasedPred/4_SIGP/prediction_v3/gif_$protein -trunc 70 $file > /mnt/home/student/reeb/3_SeqBasedPred/4_SIGP/prediction_v3/$protein.out
done
</source>
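The two parameter expansions in this loop derive the protein accession from the full path. A small sketch of what they do is shown here; protein_id is a hypothetical wrapper name.

```shell
#!/bin/bash

# Turn a path such as /some/dir/P06865.fasta into P06865.
protein_id() {
    local prot=${1##*/}   # drop the longest prefix ending in '/', i.e. the directory
    echo "${prot%.*}"     # drop the shortest suffix starting with '.', i.e. the extension
}
```

Unlike splitting the name at every '.', this removes only the last extension, so a.b.fasta becomes a.b rather than a.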

GO terms

Start the predictions for the methods by going to their web servers. For GOPet, the most recent model, program version and database were used. We also increased the maximum number of reported GO terms to the limit of 100.

R Plots

GO.prob <- c(8.3,10.5,0.1,1.0,2.4,1.8,0.2,1.0,5.8,2.6,4.4,1.4,0.5,0.9)
func.prob <-c(16.1,33.2,80.4,11.0,43.2,11.3,1.9,51.9,1.8,7.3,4.0,68.5)
GO.names <- c("Signal_transducer","Receptor","Hormone","Structural protein","Transporter","Ion channel","Voltage-gated ion channel","Cation channel","Transcription","Transcription regulation","Stress response","Immune response","Growth factor","Metal ion transport")
func.names <- c("Amino acid biosynthesis","Biosynthesis of cofactors","Cell envelope","Cellular processes","Central intermediary metabolism","Energy metabolism","Fatty acid metabolism","Purines and pyrimidines","Regulatory functions","Replication and transcription","Translation","Transport and binding")
main1<-"ProtFun2.2 GO prediction"
main2<-"ProtFun2.2 functional category prediction"
col1<-c("darkblue")
col2<-c("darkblue","darkblue", "darkred","darkblue","darkblue","darkblue", "darkblue","darkblue","darkblue", "darkblue","darkblue","darkblue")
png("protfunGO.png")
par(mar = c(11, 4, 4, 2) + 0.1)
barplot(GO.prob,names=GO.names,las=2,main=main1,beside=T,ylab="Probability %",col=col1,cex.main=1.5)
legend('topright', c("Selected category", " Remaining categories"), fill=c("darkred","darkblue"), inset=c(0.1, 0.1))
dev.off()
png("protfunFuncCat.png")
par(mar = c(13.5, 4, 4, 2) + 0.1)
barplot(func.prob,names=func.names,las=2,main=main2,beside=T,ylab="Probability %",col=col2,cex.main=1.5, ylim=c(0,90))
legend('topright', c("Selected category", " Remaining categories"), fill=c("darkred","darkblue"), inset=c(0.1, 0.0))
dev.off()

Pfam

Clan statistics

<source lang="bash">
wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A.clans.tsv.gz
gunzip Pfam-A.clans.tsv.gz
cut -f 2 Pfam-A.clans.tsv | sort | uniq -c | sed "s/CL[0-9]\+//g" | tr -d ' ' | sed '$d' > temp

# Enter R

> a<- read.table("temp")
> summary(a)

      V1         
Min.   :  1.000  
1st Qu.:  2.000  
Median :  4.000  
Mean   :  8.503  
3rd Qu.:  8.000  
Max.   :194.000  

</source>
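The summary above is computed in R. If R is not at hand, a rough equivalent for the minimum, median, mean and maximum can be sketched in the shell; it expects one count per line, as in the temp file, and does not compute the quartiles. summarize_counts is a hypothetical helper.

```shell
#!/bin/bash

# Read one number per line on stdin and print min, median, mean and max.
summarize_counts() {
    sort -n | awk '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR == 0) exit 1
            # middle value for odd NR, mean of the two middle values for even NR
            median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "min=%s median=%s mean=%.3f max=%s\n", v[1], median, sum / NR, v[NR]
        }'
}
```

Usage would be summarize_counts < temp.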