Difference between revisions of "Sequence-based predictions (PKU)"

From Bioinformatikpedia

Revision as of 10:49, 15 May 2012

In this task we will find out how much information one can get from the plain sequence of our disease protein. We pretend to know nothing but the freshly sequenced primary structure. We will run predictions, available as web services as well as local programs, to determine the function and structure of our disease protein. Afterwards we will analyse the results using our prior knowledge about phenylketonuria. In this manner we hope to get an idea of how reliable the information derived from such tools is.

Short Task Description

Our task is to use the primary sequence of our protein (and some other example proteins) to predict comparatively simple features such as secondary structure, signal peptides and transmembrane regions, as well as more advanced ones such as GO terms and similar functional annotations, using different tools. The commands and programs used are listed at the appropriate places (if short and interesting enough) or linked on their own page. For more detailed instructions please read.. and while you're at it, read some more

Secondary Structure Prediction

In this section we compare several secondary structure prediction tools. As the gold standard for evaluating the predictions we use the secondary structure assigned by DSSP.<ref name=DSSP> Kabsch W, Sander C (1983) Biopolymers. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, 22, 2577-2637.</ref><ref name=DSSPDB> Joosten RP, Te Beek TAH, Krieger E, Hekkelman ML, Hooft RWW, Schneider R, Sander C, Vriend G (2010) NAR. A series of PDB related databases for everyday needs.</ref>

ReProfSeq

reprof -i <uniprotID>.fasta
egrep -v "^#|^No" <uniprotID>.reprof |awk '{print $3}'|tr -d '\n' > <uniprotID>_reprof.secstruc

PsiPred

Webserver: http://bioinf.cs.ucl.ac.uk/psipred/

grep Pred: <uniprotID>.psipred_out |cut -d " " -f2|tr -d '\n' > <uniprotID>_psipred.secstruc
Visual output of PsiPred for PheOH, including confidence for all positions (part 1)
Visual output of PsiPred for PheOH, including confidence for all positions (part 2)
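
The grep/cut pipeline above keeps only the predicted states. As a minimal sketch, assuming the PsiPred output is in the usual horizontal format with `Conf:`, `Pred:` and `AA:` lines repeated per block, the per-residue confidence can be collected alongside the prediction. The function name `parse_psipred_horiz` is made up for this illustration and is not part of PsiPred.

```python
def parse_psipred_horiz(lines):
    """Collect prediction string and confidence values from PsiPred horizontal output.

    Assumption: each block contains lines of the form 'Conf: 9875...' and
    'Pred: CCHH...'; all other lines (header, AA, ruler) are ignored.
    """
    pred_parts, conf_parts = [], []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        if fields[0] == "Pred:":
            pred_parts.append(fields[1])
        elif fields[0] == "Conf:":
            conf_parts.append(fields[1])
    prediction = "".join(pred_parts)
    confidence = [int(digit) for digit in "".join(conf_parts)]
    return prediction, confidence


if __name__ == "__main__":
    example = ["Conf: 9875", "Pred: CCHH", "  AA: MSTA"]
    print(parse_psipred_horiz(example))
```

With a real file, the lines could be read with `open("<uniprotID>.psipred_out")` and passed to the function unchanged.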

DSSP

Download dssp here and get PDB files matching the Uniprot entries

dssp -i <PDBID>.pdb > <PDBID>.dssp
tail -n+29 <PDBID>.dssp |cut -c17|tr ' ' '-'|tr -d '\n' > <PDBID>_dssp.secstruc

The PDB files contain only part of the structure or more than one chain, e.g. only residues 117-424 for PAH. The structure was therefore aligned to the sequence manually.
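
The manual alignment step can be partly automated. The following is only a sketch and not the procedure used here: it assumes the classic fixed-column DSSP format (residue number in columns 6-10, chain in column 12, amino acid in column 14, state in column 17) and, more importantly, that the PDB residue numbering matches the UniProt numbering, which has to be checked for every entry. The function names are hypothetical. Blank DSSP states are written as `L` so that they can be told apart from unresolved positions (`-`).

```python
def parse_dssp(lines, chain):
    """Return {residue_number: (amino_acid, state)} for one chain of a DSSP file."""
    residues = {}
    in_table = False
    for line in lines:
        if line.startswith("  #  RESIDUE"):
            in_table = True          # residue table starts after this header line
            continue
        if not in_table or len(line) < 17:
            continue
        if line[13] == "!":          # DSSP marks chain breaks with '!'
            continue
        if line[11] != chain:
            continue
        state = line[16] if line[16] != " " else "L"
        residues[int(line[5:10])] = (line[13], state)
    return residues


def place_on_sequence(residues, seq_length):
    """Write the states onto a full-length string; unresolved positions stay '-'."""
    states = ["-"] * seq_length
    for number, (_, state) in residues.items():
        if 1 <= number <= seq_length:
            states[number - 1] = state
    return "".join(states)
```

Looking for the table header instead of skipping a fixed number of lines avoids depending on the header length, which can differ between DSSP files.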

Script to calculate Q3 and SOV: ss_score.py
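
The script `ss_score.py` itself is not reproduced on this page. As an illustration only, the Q scores can be sketched as follows, assuming the usual definition: Q_E, Q_H and Q_L are the percentage of residues observed in a state that are also predicted in that state, and Q_3 is the percentage of correctly predicted residues over all three states. Both strings are assumed to be already reduced to the three states E, H and L and aligned to the same length; positions without a reference state (for example `-`) are skipped. SOV is not covered by this sketch.

```python
import math


def q_scores(observed, predicted, states="EHL"):
    """Per-state Q scores and Q_3 (in percent) for two aligned three-state strings."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    # Only positions with a known reference state are scored.
    pairs = [(o, p) for o, p in zip(observed, predicted) if o in states]
    scores = {}
    for state in states:
        total = sum(1 for o, _ in pairs if o == state)
        correct = sum(1 for o, p in pairs if o == state and p == state)
        scores["Q_" + state] = 100.0 * correct / total if total else math.nan
    correct_all = sum(1 for o, p in pairs if o == p)
    scores["Q_3"] = 100.0 * correct_all / len(pairs) if pairs else math.nan
    return scores


if __name__ == "__main__":
    print(q_scores("HHHHEELL--", "HHHLEEELHH"))
```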

Results

The following tables show Q (percentage of correctly predicted residues) and SOV (segment overlap) scores for the predictions of PsiPred and ReProfSeq compared to the structure assigned by DSSP. The output has been mapped to a simple three-state model of E, H and L, since DSSP uses a more complex eight-state model. PsiPred outperforms ReProfSeq for most scores and reaches a higher Q_3 for every protein; in particular, the Q_E of ReProfSeq is mostly very low (in this very small dataset). To confirm that this is not an error in the calculation, we compared the structure predictions manually: ReProfSeq indeed misses many sheet structures and predicts helices as sheets and vice versa, so there is not even a partial success to report. PsiPred performs well for all types of structure, but its worst results are also in the prediction of sheets. Interestingly, the protein with the best Q_E for PsiPred (P10775) is the one with the worst Q_E for ReProfSeq, and the protein with the best Q_3 for ReProfSeq (P00439) has the lowest Q_E for PsiPred, narrowly below Q08209.

Mapping between the different models:

E = E (dssp), E (reprof), E (psipred)
H = HGI (dssp), H (reprof), H (psipred)
L = BSTL (dssp), L (reprof), C (psipred)
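
The mapping listed above can be written down as a small lookup table. This is a sketch of the reduction, not the code actually used; the names `TO_THREE_STATE` and `to_three_state` are chosen for this illustration. Characters that are not part of a tool's alphabet (such as `-` for unresolved positions) are passed through unchanged.

```python
# Three-state reduction for each tool, as listed in the mapping above.
TO_THREE_STATE = {
    "dssp": {"E": "E", "H": "H", "G": "H", "I": "H",
             "B": "L", "S": "L", "T": "L", "L": "L"},
    "reprof": {"E": "E", "H": "H", "L": "L"},
    "psipred": {"E": "E", "H": "H", "C": "L"},
}


def to_three_state(secstruc, tool):
    """Reduce a secondary structure string to the states E, H and L."""
    table = TO_THREE_STATE[tool]
    # Unknown characters are kept as they are.
    return "".join(table.get(state, state) for state in secstruc)


if __name__ == "__main__":
    print(to_three_state("LLSBGGGGHHHEET-", "dssp"))
```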

ReProfSeq
UniprotID  Q_E    Q_H    Q_L    Q_3    SOV_E  SOV_H  SOV_L  SOV
P00439     45.5   75.7   74.8   71.2   47.5   86.7   71.8   76.4
P10775     23.1   71.9   62.0   61.8   21.9   87.2   66.9   70.6
Q9X0E6     26.8   91.9   39.1   53.5   36.1   91.1   41.0   57.6
Q08209     69.1   34.0   73.1   57.9   62.4   45.6   76.7   63.0

PsiPred
UniprotID  Q_E    Q_H    Q_L    Q_3    SOV_E  SOV_H  SOV_L  SOV
P00439     57.6   85.1   89.0   83.8   57.6   88.5   54.4   71.1
P10775     92.3   90.3   94.7   92.5   92.7   95.5   95.7   95.2
Q9X0E6     75.6   86.5   78.3   80.2   93.44  100.0  94.5   96.1
Q08209     58.2   77.3   90.7   81.0   64.1   85.8   64.8   72.5

Different predictions and reference from Uniprot for the structure of PheOH:

  UniProt: ------------------------------------------------------------
     DSSP: ------------------------------------------------------------
  PSIPRED: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEEEEEEEECCCCHHHHHHHHHHHHCCC
REPROFSEQ: LLLEEEELLLLLLEELLLLLLLHHHHLLLLLLLEEEEEELHHHHHHHHHHHHHHHLLLLL
       AA: MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
                   10        20        30        40        50        60

  UniProt: ------------------------------------------------------------
     DSSP: --------------------------------------------------------LLLL
  PSIPRED: CCCEEECCCCCCCCCCEEEEEEECCCCCHHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCC
REPROFSEQ: LEEEEELLLLLLLLLHHHHHLLLLLLLLHHHHHHHHHHHHHLLLLHHHHLLLLLLLLLLL
       AA: NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
                   70        80        90       100       110       120

  UniProt: ----HHHHHHHHH------HHH----TTTT-HHHHHHHHHHHHHHHH-------------
     DSSP: LLSBGGGGGGGGGGSBSSLGGGSTTSTTTTLHHHHHHHHHHHHHHHTLLTTSLLLLLLLL
  PSIPRED: CCCCHHHHHHHHHHHHCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHCCCCCCCCCCC
REPROFSEQ: LLHHHHHHHHHHHHHHHLLLLLLLLLLLLLLHHHHHHHHHHHHHHHHHLLLLLLLLEEEL
       AA: FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
                  130       140       150       160       170       180

  UniProt: HHHHHHHHHHHHHHH--HHHH--HHHHHHHHHHHHHH---------HHHHHHHHHHHH--
     DSSP: HHHHHHHHHHHHHHHHHHHHHBLHHHHHHHHHHHHHHLLBTTBLLLHHHHHHHHHHHHSL
  PSIPRED: HHHHHHHHHHHHHHHHHHHHCCCHHHHHHHHHHHHHCCCCCCCCCCHHHHHHHHHHHCCC
REPROFSEQ: LHLHLLHHHHHHHHHHHHHLLLLHHHHHHHHHHHHHLLLLLLLLLLHHHHHHHHHHLLLL
       AA: EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
                  190       200       210       220       230       240

  UniProt: EEEE------HHHHHHHHTTTEEEE-----------------HHHHHHH-HHHH--HHHH
     DSSP: EEEELLSBLLHHHHHHHHTTTEEEELLLLLLTTSTTLLSSLLHHHHHHHTHHHHTSHHHH
  PSIPRED: EEEECCCCCCHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCHHHHHHHCCCCCCCCHHHH
REPROFSEQ: LLLLLLLLLLLLLHHLLHHHEEEEEEEEELLLLLLLLLLLHHHHHHHHLLLLLLLLLLHH
       AA: RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
                  250       260       270       280       290       300

  UniProt: HHHHHHHHHH----HHHHHHHHHHHHHTTTT-EEEE--EEEE--HHHH--HHHHHHH-HH
     DSSP: HHHHHHHHHHTTLLHHHHHHHHHHHHHTTTTLEEEETTEEEELLHHHHTLHHHHHHTTSS
  PSIPRED: HHHHHHHHHHCCCCHHHHHHHHHHHHEEEEEEEEECCCCEEEECCCCCCCCCCCCCCCCC
REPROFSEQ: HHHHHLLLLLLLLLHHHHHHHHHHEEEEEEELELLLLLLHHHHHHHHHLHHHHHHHHHLL
       AA: QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
                  310       320       330       340       350       360

  UniProt: HHHHHH--HHHH------EEE--EEEEEEE-HHHHHHHHHHHHH----EEEEEEETTTTE
     DSSP: SSEEEELLHHHHTTLLLLSSSLLSEEEEESLHHHHHHHHHHHHHTSLLSSLEEEETTTTE
  PSIPRED: CCCCCCCCHHHHHCCCCCCCCCCCEEEEECCHHHHHHHHHHHHHCCCCCCCCCCCCCCCE
REPROFSEQ: LLLLLLLLLLLELLLLLLELLLLLEEEHHHLHHHHHHHHHHHHHLLLLLLEEEELLLLEE
       AA: KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
                  370       380       390       400       410       420

  UniProt: EEEEHHHHHHHHHHHHHHHHHHHHHHHHH---
     DSSP: EEEL----------------------------
  PSIPRED: EEECCCHHHHHHHHHHHHHHHHHHHHHHHHHC
REPROFSEQ: EEELLLLHHHHHHHHHLHHHHHHHHHHHHHLL
       AA: IEVLDNTQQLKILADSINSEIGILCSALQKIK
                  430       440       450       460

Note that there are even some minor differences between the UniProt annotation and the DSSP assignment. Most can be explained by the use of different terms or different definitions for terms like "Turn", "Bend" or "Sheet", but, for example, DSSP calculates a continuous helix from position 181 to 202, while UniProt shows a break at 195-196. We tried, but could not identify the source of the UniProt structure.

Helical regions often agree very well across all the given models and predictions, but the prediction of sheets poses a more difficult problem. The alignment above suggests that one reason might be that the sheets are shorter on average than the helix motifs.
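
The impression that sheets are shorter than helices can be checked by measuring the segment lengths in a three-state string. The following is a minimal sketch with hypothetical function names; it simply counts runs of identical characters and makes no claim about the values for PheOH.

```python
from itertools import groupby


def segment_lengths(secstruc):
    """Return {state: [length of each consecutive run of that state]}."""
    lengths = {}
    for state, run in groupby(secstruc):
        lengths.setdefault(state, []).append(len(list(run)))
    return lengths


def mean_segment_length(secstruc, state):
    """Average run length of one state; 0.0 if the state does not occur."""
    runs = segment_lengths(secstruc).get(state, [])
    return sum(runs) / len(runs) if runs else 0.0


if __name__ == "__main__":
    toy = "HHHHLLEEHH"  # toy string, not taken from the alignment
    print(mean_segment_length(toy, "H"), mean_segment_length(toy, "E"))
```

Applied to the reduced DSSP string, comparing `mean_segment_length(..., "E")` with `mean_segment_length(..., "H")` would give a number to go with the visual impression.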

Disordered Regions

This section focuses on the parts of the protein that have no fixed structure. What seems fairly easy to do, because it is a byproduct of predicting the ordered parts, is in fact a rather unwieldy topic. There are as many tools for predicting such regions as for any other kind of structure prediction. <ref name=disorderpred> Monastyrskyy, B., Fidelis, K., Moult, J., Tramontano, A. and Kryshtafovych, A. (2011), Evaluation of disorder predictions in CASP9. Proteins, 79: 107–118. doi: 10.1002/prot.23161</ref>
In order to take a closer look at the topic, we decided not only to use iupred, as specified in the task, but also to find other web servers that offer similar functionality. A nice summary of the paper cited above can be found here, without the performance analysis from the paper. But since we want to do this ourselves, it is quite sufficient.

iupred

Spine-D

Spine-DM

Transmembrane Helices

Signal Peptides

P47863 is from rat and the remaining proteins are human, so the organism type is eukaryotes. Command:

signalp -trunc 70 -t euk <ID>.fasta > <ID>.sigP_out
UniprotID  SignalP-HMM (v3)                           SignalP-NN (v3)  Summary (v3)                 Summary (v4)                 Signal Peptide Database (experimental)
P00439     0% signal peptide or anchor                5 times No       no signal peptide or anchor  no signal peptide or anchor  no evidence for peptide or anchor
P02768     100% signal peptide, 0% signal anchor      5 times Yes      signal peptide               signal peptide               confirmed signal peptide
P11279     100% signal peptide, 0% signal anchor      5 times Yes      signal peptide               signal peptide               confirmed signal peptide
P47863     52.6% signal peptide, 45.7% signal anchor  4 No, 1 Yes      signal peptide               no signal peptide or anchor  no evidence for peptide or anchor

P47863 reaches a high maximal signal peptide score (S-score) in SignalP-NN and probabilities above the cut-off in SignalP-HMM in version 3. In version 4 (web server) no signal peptide is predicted for this protein, and the probabilities/scores are well below the cut-off values. P47863 is a multi-pass membrane protein, and SignalP v3 most likely predicted a possible cleavage site for the N-terminal membrane region, wrongly classifying it as a signal sequence.

GO Terms

GOPet

GOPet predicted GO terms
GO ID       Aspect  Confidence  GO Term                                 True/False
GO:0003824  F       94%         catalytic activity                      true
GO:0016491  F       88%         oxidoreductase activity                 true
GO:0004497  F       87%         monooxygenase activity                  true
GO:0004505  F       84%         phenylalanine 4-monooxygenase activity  true
GO:0004510  F       80%         tryptophan 5-monooxygenase activity     false
GO:0004511  F       79%         tyrosine 3-monooxygenase activity       false
GO:0046872  F       78%         metal ion binding                       true
GO:0005506  F       78%         iron ion binding                        true
GO:0008199  F       72%         ferric iron binding                     false
GO:0008198  F       72%         ferrous iron binding                    false
GO:0016597  F       71%         amino acid binding                      true

ProtFun

# Functional category                  Prob     Odds
 Amino_acid_biosynthesis           => 0.210    9.530
 Biosynthesis_of_cofactors            0.229    3.180
 Cell_envelope                        0.034    0.563
 Cellular_processes                   0.063    0.867
 Central_intermediary_metabolism      0.061    0.970
 Energy_metabolism                    0.343    3.815
 Fatty_acid_metabolism                0.025    1.889
 Purines_and_pyrimidines              0.392    1.615
 Regulatory_functions                 0.020    0.125
 Replication_and_transcription        0.118    0.438
 Translation                          0.204    4.630
 Transport_and_binding                0.024    0.060

# Enzyme/nonenzyme                     Prob     Odds
 Enzyme                            => 0.724    2.527
 Nonenzyme                            0.276    0.387

# Enzyme class                         Prob     Odds
 Oxidoreductase (EC 1.-.-.-)          0.154    0.738
 Transferase    (EC 2.-.-.-)          0.271    0.785
 Hydrolase      (EC 3.-.-.-)          0.083    0.261
 Lyase          (EC 4.-.-.-)          0.047    1.002
 Isomerase      (EC 5.-.-.-)       => 0.100    3.138
 Ligase         (EC 6.-.-.-)          0.019    0.370

# Gene Ontology category               Prob     Odds
 Signal_transducer                    0.075    0.350
 Receptor                             0.003    0.016
 Hormone                              0.001    0.206
 Structural_protein                   0.005    0.166
 Transporter                          0.025    0.229
 Ion_channel                          0.010    0.168
 Voltage-gated_ion_channel            0.005    0.232
 Cation_channel                       0.010    0.215
 Transcription                        0.043    0.334
 Transcription_regulation             0.032    0.255
 Stress_response                      0.010    0.118
 Immune_response                      0.012    0.140
 Growth_factor                        0.006    0.407
 Metal_ion_transport                  0.009    0.020

Additional Predictions:

Feature        Output summary
SignalP 3.0    No signal peptide cleavage site predicted
ProP 1.0       1 propeptide cleavage site predicted at position: 74
TargetP 1.1    No high confidence targeting prediction
NetPhos 2.0    22 putative phosphorylation sites at positions 16 23 40 70 110 196 250 303 339 391 411 22 105 189 278 418 24 77 198 268 277 317
NetOGlyc 3.1   No O-glycosylated sites predicted
NetNGlyc 1.0   2 putative N-glycosylated sites at positions 61 376
TMHMM 2.0      No TM helices predicted

References

<references/>