Difference between revisions of "Lab Journal of Task 3 (MSUD)"

From Bioinformatikpedia

Revision as of 15:18, 16 May 2013

For task 3 we have used the reference sequence of BCKDHA and other given example proteins.

Secondary structure

Prediction and assignment

  • PSSMs were created with Psi-Blast:

blastpgp -d /mnt/project/pracstrucfunc13/data/big/big_80 -i P10775.fasta -j 2 -h 10e-10 -Q P10775_big80.blastPsiMat

blastpgp -d /mnt/project/pracstrucfunc13/data/swissprot/uniprot_sprot -i P10775.fasta -j 2 -h 10e-10 -Q P10775_SwissProt.blastPsiMat

  • ReProf was run for P10775 with a simple fasta file and with a PSSM (generated with big_80 and SwissProt, respectively) as input:

reprof -i P10775.fasta

reprof -i P10775_big80.blastPsiMat

reprof -i P10775_SwissProt.blastPsiMat

  • The PDB files used as input for DSSP are located at /mnt/home/student/schillerl/MasterPractical/task3/pdb_structures/. If more than one PDB structure was available, those with the highest coverage of the sequence and the highest resolution were preferred. These structures were used: 2BFD (P12694), 2BNH (P10775), 1AUI (Q08209) and 1KR4 (Q9X0E6).
  • To parse the output of ReProf, DSSP and PsiPred, we used Sonja's script.
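The preference for structures with high coverage and high resolution can be written as a sort key. The following is only a minimal sketch: the text does not say how the two criteria were weighed against each other, so ranking by coverage first and using resolution as a tie-breaker is an assumption, and the `Structure` type, the `pick_structure` function, the placeholder identifiers and all numbers are made up for illustration.

```python
from typing import NamedTuple


class Structure(NamedTuple):
    pdb_id: str        # placeholder identifier, not a real PDB entry
    coverage: float    # fraction of the sequence covered (0..1)
    resolution: float  # in Angstrom; a lower value means higher resolution


def pick_structure(candidates):
    """Prefer the highest coverage; break ties by the lowest resolution value.

    Assumption: coverage is ranked before resolution.
    """
    return min(candidates, key=lambda s: (-s.coverage, s.resolution))


# Hypothetical candidates with invented values.
candidates = [
    Structure("AAAA", 0.90, 2.5),
    Structure("BBBB", 0.98, 2.1),
    Structure("CCCC", 0.98, 1.8),
]
best = pick_structure(candidates)
print(best.pdb_id)
```

With these invented values the third candidate is chosen, because it ties with the second on coverage and has the better resolution.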

Evaluation of prediction approaches

  • The ReProf predictions were compared with the DSSP assignment with the following Python script (located at /mnt/home/student/schillerl/MasterPractical/task3/evaluate_secstr_reprof.py), which calculates the recall, precision and f-measure of the predictions. Positions that are assigned as "loop or irregular" by DSSP (translated into a '-' by Sonja's script) were considered as loop for the calculation.

Recall, precision and f-measure are defined as follows:

  • recall = TP / (TP + FN)
  • precision = TP / (TP + FP)
  • f-measure = 2 * recall * precision / (recall + precision)

where TP means true positive, FP false positive and FN false negative.


<source lang="python">
dssp_file = open("./dssp/P10775_secstr.txt")
dssp = dssp_file.readline()
dssp_file.close()

# interpret '-' as loop in dssp secondary structure
dssp = dssp.replace("-", "L")

for reprof_run in ["./reprof/P10775_secstr.txt", "./reprof/P10775_big80_secstr.txt", "./reprof/P10775_SwissProt_secstr.txt"]:
    reprof_file = open(reprof_run)
    reprof = reprof_file.readline()
    reprof_file.close()

    assert len(dssp) == len(reprof)

    sum1 = {'E': 0, 'H': 0, 'L': 0, 'all': 0}
    sum2 = {'E': 0, 'H': 0, 'L': 0, 'all': 0}
    found = {'E': 0, 'H': 0, 'L': 0, 'all': 0}
    right = {'E': 0, 'H': 0, 'L': 0, 'all': 0}

    for i in range(len(dssp)):
        for secstr in ['E', 'H', 'L']:
            if dssp[i] == secstr:
                sum1[secstr] += 1
                if reprof[i] == secstr:
                    found[secstr] += 1
            if reprof[i] == secstr:
                sum2[secstr] += 1
                if dssp[i] == secstr:
                    right[secstr] += 1

    for sum in [sum1, sum2, found, right]:
        sum['all'] = sum['E'] + sum['H'] + sum['L']

    recall = {'E': 0.0, 'H': 0.0, 'L': 0.0, 'all': 0}
    precision = {'E': 0.0, 'H': 0.0, 'L': 0.0, 'all': 0}

    print "-----------"
    print "%s:" % reprof_run
    print "-----------"

    for secstr in ['H', 'E', 'L', 'all']:
        recall[secstr] = (float(found[secstr]) / sum1[secstr])
        print "Recall for %s: %f" % (secstr, recall[secstr])
        precision[secstr] = (float(right[secstr]) / sum2[secstr])
        print "Precision for %s: %f" % (secstr, precision[secstr])
        print "F-measure for %s: %f" % (secstr, (2 * precision[secstr] * recall[secstr] / (precision[secstr] + recall[secstr])))
</source>
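The per-class measures can also be computed directly from TP, FN and FP counts. The following is a minimal Python 3 sketch, not the script used for the evaluation: the function name `per_class_scores` and the toy strings are hypothetical, and returning 0.0 when a class never occurs (instead of dividing by zero) is a design choice of this sketch.

```python
def per_class_scores(observed, predicted, classes=("H", "E", "L")):
    """Return {class: (recall, precision, f_measure)} for two equal-length strings."""
    if len(observed) != len(predicted):
        raise ValueError("sequences differ in length")
    scores = {}
    for c in classes:
        pairs = list(zip(observed, predicted))
        tp = sum(o == c and p == c for o, p in pairs)  # true positives
        fn = sum(o == c and p != c for o, p in pairs)  # false negatives
        fp = sum(o != c and p == c for o, p in pairs)  # false positives
        # Guard against classes that are never observed or never predicted.
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        if recall + precision:
            f_measure = 2 * recall * precision / (recall + precision)
        else:
            f_measure = 0.0
        scores[c] = (recall, precision, f_measure)
    return scores


# Toy strings, made up for illustration; '-' is read as loop, as in the text.
observed = "HHHHEE--LL".replace("-", "L")
predicted = "HHHLEEELLL"
scores = per_class_scores(observed, predicted)
for secstr in ("H", "E", "L"):
    print(secstr, "recall=%.3f precision=%.3f f-measure=%.3f" % scores[secstr])
```

In the toy example, helix has three true positives and one false negative, which gives a recall of 0.75 and a precision of 1.0.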

Disordered protein

IUPred

  • Predictions were performed through the web server of IUPred. Graphical profiles of the results were downloaded.
  • The output of IUPred is stored in the directory /mnt/home/student/weish/master-practical-2013/task03/02-disordered-protein/iupred
  • We have also run the prediction from the command line with the following bash script:

<source lang="bash">
#!/bin/sh -e

INPUT=$HOME/master-practical-2013/task03
OUTPUT=$HOME/master-practical-2013/task03/02-disordered-protein/iupred
PARAMS="long short glob"

if [ ! -d $OUTPUT ]; then
        mkdir $OUTPUT
fi

for seq in $INPUT/*.fasta
do
        filename=`basename $seq`
        for param in $PARAMS
        do
                iupred $seq $param > $OUTPUT/iupred_${filename}_$param.tsv
        done
done
</source>
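The per-residue output files written by the script can be read back for further analysis. The following Python sketch rests on several assumptions that are not stated in the text: that the `long` and `short` outputs contain comment lines starting with `#` followed by whitespace-separated columns (position, residue, score), and that a score above 0.5 is read as disordered. The function name `disordered_positions` and the sample lines are made up; the `glob` output has a different layout and is not handled here.

```python
def disordered_positions(lines, cutoff=0.5):
    """Return the positions whose score exceeds the cutoff.

    Assumed line layout: position, residue, score (whitespace-separated).
    Comment lines starting with '#' and blank lines are skipped.
    """
    positions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # not a per-residue line in the assumed layout
        position, score = int(fields[0]), float(fields[2])
        if score > cutoff:
            positions.append(position)
    return positions


# Invented sample in the assumed layout, not real IUPred output.
sample = """# toy output, made up
1 M 0.81
2 A 0.62
3 L 0.31
4 K 0.50
"""
result = disordered_positions(sample.splitlines())
print(result)
```

For a real file, the lines could be passed in with `open(path)` in place of `sample.splitlines()`.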

MetaDisorder (MD)

  • As the man page of metadisorder describes, the prediction of disordered regions is based on the results of other programs such as NORSnet and PROFbval. Rather than calling metadisorder directly, we used the wrapper program predictprotein, as described on the exercise page.
  • Comparison to DisProt database: TODO


The following script was called for the task:
<source lang="bash">
#!/bin/sh -e

INPUT=$HOME/master-practical-2013/task03
OUTPUT=$HOME/master-practical-2013/task03/metadisorder
EXE=predictprotein

# make output directory
if [ ! -d $OUTPUT ]; then
        mkdir $OUTPUT
fi

# call metadisorder for all query sequences
for seq in $INPUT/*.fasta
do
        filename=`basename $seq`
        $EXE --seqfile $seq --target metadisorder -p metadisorder_$filename \
        -o $OUTPUT
done
echo Done!
</source>

Transmembrane helices

Signal peptides

GO terms