Sequence-based predictions (PKU)
In this task we examine how much information can be obtained from the plain sequence of the protein behind our disease. We pretend to know nothing but the freshly sequenced primary structure. We run predictions, available both as web services and as local programs, to determine the function and structure of our protein. Afterwards we compare the results with our prior knowledge about phenylketonuria. In this way we hope to get an idea of how reliable the information derived from such tools is.
Short Task Description
Our task is to use the primary sequence of our protein (and of some other example proteins) to predict comparatively simple features such as secondary structure, signal peptides and transmembrane regions, as well as more advanced ones such as GO terms and similar functional annotations, using different tools. The commands and programs used are listed in the appropriate places (if short and interesting enough) or linked on their own page. For a more detailed instruction please read.. and while you're at it, read some more
Secondary Structure Prediction
In this section we compare several secondary structure prediction tools. As a gold standard we use the secondary structure assignments from DSSP.<ref name=DSSP> Kabsch W, Sander C (1983) Biopolymers. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, 22, 2577-2637.</ref><ref name=DSSPDB> Joosten RP, Te Beek TAH, Krieger E, Hekkelman ML, Hooft RWW, Schneider R, Sander C, Vriend G (2010) NAR. A series of PDB related databases for everyday needs.</ref>
ReProfSeq
reprof -i <uniprotID>.fasta
egrep -v "^#|^No" <uniprotID>.reprof |awk '{print $3}'|tr -d '\n' > <uniprotID>_reprof.secstruc
PsiPred
Webserver here
grep Pred: <uniprotID>.psipred_out |cut -d " " -f2|tr -d '\n' > <uniprotID>_psipred.secstruc
DSSP
Download dssp here and get PDB files matching the Uniprot entries
dssp -i <PDBID>.pdb > <PDBID>.dssp
tail -n+29 <PDBID>.dssp |cut -c17|tr ' ' '-'|tr -d '\n' > <PDBID>_dssp.secstruc
The PDB files contain only part of the structure or more than one chain, e.g. residues 117-424 for PAH.
We therefore aligned the structure to the sequence manually.
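For a PDB chain that covers one contiguous stretch of the UniProt sequence, the alignment step can be approximated in code. The following is a minimal Python sketch under that assumption, not the procedure actually used here. The function name `pad_to_full_length` and the toy strings are hypothetical, and chains with gaps or mutations would need a proper sequence alignment instead.

```python
def pad_to_full_length(full_seq, chain_seq, chain_ss, gap="-"):
    """Place a chain's secondary-structure string onto the full sequence.

    Assumes chain_seq is an exact, contiguous substring of full_seq and
    that chain_ss holds one state per residue of chain_seq.
    """
    if len(chain_seq) != len(chain_ss):
        raise ValueError("sequence and secondary structure differ in length")
    start = full_seq.find(chain_seq)
    if start < 0:
        raise ValueError("chain sequence not found in full sequence")
    end = start + len(chain_seq)
    # Residues not covered by the structure are marked with the gap symbol.
    return gap * start + chain_ss + gap * (len(full_seq) - end)


# Toy example (made-up strings, not real PAH data).
full = "MSTAVLENPG"
chain = "AVLEN"
ss = "LHHHL"
padded = pad_to_full_length(full, chain, ss)
print(padded)
```

The padded string has the same length as the full sequence, so it can be compared position by position with the predictor output.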
Script to calculate Q3 and SOV: ss_score.py
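To illustrate what the Q values measure, here is a minimal Python sketch. It is not the ss_score.py script named above, and SOV (segment overlap) is not covered. The sketch assumes one common 8-to-3 state reduction (H, G, I to H; E, B to E; everything else to L) and assumes that `-` in the observed string marks residues without a structure, which are skipped.

```python
# 8-state DSSP codes mapped to 3 states (assumed convention).
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_states(ss):
    """Map DSSP or predictor output to H/E/L; '-' (no data) is kept."""
    return "".join(c if c == "-" else DSSP_TO_3.get(c, "L") for c in ss)


def q_scores(observed, predicted):
    """Return Q3 and per-state Q_H, Q_E, Q_L as percentages."""
    if len(observed) != len(predicted):
        raise ValueError("strings must have equal length")
    obs = reduce_states(observed)
    pred = reduce_states(predicted)
    total = {"H": 0, "E": 0, "L": 0}
    correct = {"H": 0, "E": 0, "L": 0}
    for o, p in zip(obs, pred):
        if o == "-":  # residue missing from the structure
            continue
        total[o] += 1
        if o == p:
            correct[o] += 1
    n = sum(total.values())
    scores = {"Q3": 100.0 * sum(correct.values()) / n if n else 0.0}
    for state in "HEL":
        t = total[state]
        scores["Q_" + state] = 100.0 * correct[state] / t if t else 0.0
    return scores


# Toy example (made-up strings): one strand residue is predicted as coil.
scores = q_scores("--TLGGGHEEEE", "CCCCHHHHCEEE")
print(scores)
```

Per-state values are computed relative to the observed state, so Q_E is the share of observed strand residues that were predicted as strand.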
Results
Protein | ReprofSeq | |||||||
---|---|---|---|---|---|---|---|---|
UniprotID | Q_E | Q_H | Q_L | Q_3 | SOV_E | SOV_H | SOV_L | SOV |
P00439 | 45.5 | 75.7 | 74.8 | 71.2 | 47.5 | 86.7 | 71.8 | 76.4 |
P10775 | 23.1 | 71.9 | 62.0 | 61.8 | 21.9 | 87.2 | 66.9 | 70.6 |
Q9X0E6 | 26.8 | 91.9 | 39.1 | 53.5 | 36.1 | 91.1 | 41.0 | 57.6 |
Q08209 | 69.1 | 34.0 | 73.1 | 57.9 | 62.4 | 45.6 | 76.7 | 63.0 |
Protein | PsiPred | |||||||
---|---|---|---|---|---|---|---|---|
UniprotID | Q_E | Q_H | Q_L | Q_3 | SOV_E | SOV_H | SOV_L | SOV |
P00439 | 57.6 | 85.1 | 89.0 | 83.8 | 57.6 | 88.5 | 54.4 | 71.1 |
P10775 | 92.3 | 90.3 | 94.7 | 92.5 | 92.7 | 95.5 | 95.7 | 95.2 |
Q9X0E6 | 75.6 | 86.5 | 78.3 | 80.2 | 93.44 | 100.0 | 94.5 | 96.1 |
Q08209 | 58.2 | 77.3 | 90.7 | 81.0 | 64.1 | 85.8 | 64.8 | 72.5 |
UniProt:   ------------------------------------------------------------
DSSP:      ------------------------------------------------------------
PSIPRED:   CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEEEEEEEECCCCHHHHHHHHHHHHCCC
REPROFSEQ: LLLEEEELLLLLLEELLLLLLLHHHHLLLLLLLEEEEEELHHHHHHHHHHHHHHHLLLLL
AA:        MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
                   10        20        30        40        50        60

UniProt:   ------------------------------------------------------------
DSSP:      --------------------------------------------------------LLLL
PSIPRED:   CCCEEECCCCCCCCCCEEEEEEECCCCCHHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCC
REPROFSEQ: LEEEEELLLLLLLLLHHHHHLLLLLLLLHHHHHHHHHHHHHLLLLHHHHLLLLLLLLLLL
AA:        NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
                   70        80        90       100       110       120

UniProt:   ----HHHHHHHHH------HHH----TTTT-HHHHHHHHHHHHHHHH-------------
DSSP:      LLSBGGGGGGGGGGSBSSLGGGSTTSTTTTLHHHHHHHHHHHHHHHTLLTTSLLLLLLLL
PSIPRED:   CCCCHHHHHHHHHHHHCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHCCCCCCCCCCC
REPROFSEQ: LLHHHHHHHHHHHHHHHLLLLLLLLLLLLLLHHHHHHHHHHHHHHHHHLLLLLLLLEEEL
AA:        FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
                  130       140       150       160       170       180

UniProt:   HHHHHHHHHHHHHHH--HHHH--HHHHHHHHHHHHHH---------HHHHHHHHHHHH--
DSSP:      HHHHHHHHHHHHHHHHHHHHHBLHHHHHHHHHHHHHHLLBTTBLLLHHHHHHHHHHHHSL
PSIPRED:   HHHHHHHHHHHHHHHHHHHHCCCHHHHHHHHHHHHHCCCCCCCCCCHHHHHHHHHHHCCC
REPROFSEQ: LHLHLLHHHHHHHHHHHHHLLLLHHHHHHHHHHHHHLLLLLLLLLLHHHHHHHHHHLLLL
AA:        EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
                  190       200       210       220       230       240

UniProt:   EEEE------HHHHHHHHTTTEEEE-----------------HHHHHHH-HHHH--HHHH
DSSP:      EEEELLSBLLHHHHHHHHTTTEEEELLLLLLTTSTTLLSSLLHHHHHHHTHHHHTSHHHH
PSIPRED:   EEEECCCCCCHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCHHHHHHHCCCCCCCCHHHH
REPROFSEQ: LLLLLLLLLLLLLHHLLHHHEEEEEEEEELLLLLLLLLLLHHHHHHHHLLLLLLLLLLHH
AA:        RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
                  250       260       270       280       290       300

UniProt:   HHHHHHHHHH----HHHHHHHHHHHHHTTTT-EEEE--EEEE--HHHH--HHHHHHH-HH
DSSP:      HHHHHHHHHHTTLLHHHHHHHHHHHHHTTTTLEEEETTEEEELLHHHHTLHHHHHHTTSS
PSIPRED:   HHHHHHHHHHCCCCHHHHHHHHHHHHEEEEEEEEECCCCEEEECCCCCCCCCCCCCCCCC
REPROFSEQ: HHHHHLLLLLLLLLHHHHHHHHHHEEEEEEELELLLLLLHHHHHHHHHLHHHHHHHHHLL
AA:        QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
                  310       320       330       340       350       360

UniProt:   HHHHHH--HHHH------EEE--EEEEEEE-HHHHHHHHHHHHH----EEEEEEETTTTE
DSSP:      SSEEEELLHHHHTTLLLLSSSLLSEEEEESLHHHHHHHHHHHHHTSLLSSLEEEETTTTE
PSIPRED:   CCCCCCCCHHHHHCCCCCCCCCCCEEEEECCHHHHHHHHHHHHHCCCCCCCCCCCCCCCE
REPROFSEQ: LLLLLLLLLLLELLLLLLELLLLLEEEHHHLHHHHHHHHHHHHHLLLLLLEEEELLLLEE
AA:        KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
                  370       380       390       400       410       420

UniProt:   EEEEHHHHHHHHHHHHHHHHHHHHHHHHH---
DSSP:      EEEL----------------------------
PSIPRED:   EEECCCHHHHHHHHHHHHHHHHHHHHHHHHHC
REPROFSEQ: EEELLLLHHHHHHHHHLHHHHHHHHHHHHHLL
AA:        IEVLDNTQQLKILADSINSEIGILCSALQKIK
                  430       440       450       460
Disordered Regions
This section focuses on the parts of the protein that have no fixed structure. Predicting them seems fairly easy, because it looks like a byproduct of predicting the ordered parts, but it is in fact a difficult topic. There are as many tools for predicting such regions as for any other kind of structure prediction. <ref name=disorderpred> Monastyrskyy, B., Fidelis, K., Moult, J., Tramontano, A. and Kryshtafovych, A. (2011), Evaluation of disorder predictions in CASP9. Proteins, 79: 107–118. doi: 10.1002/prot.23161</ref>
To take a closer look at the topic, we decided not only to use iupred, as specified in the task, but also to find other web servers that offer similar functionality. A nice summary of the paper cited above can be found here, without the performance analysis from the paper. Since we want to do this analysis ourselves, the summary is quite sufficient.
iupred
Spine-D
Spine-DM
Transmembrane Helices
Signal Peptides
P47863 is from rat and the other proteins are human, so the organism type is set to eukaryotes. Command:
signalp -trunc 70 -t euk <ID>.fasta > <ID>.sigP_out
SignalP v3 | SignalP v4 | Signal Peptide Database | |||
---|---|---|---|---|---|
UniprotID | SignalP-HMM | SignalP_NN | summary prediction | summary prediction | experimental |
P00439 | 0% signal peptide or anchor | 5 times No | no signal peptide or anchor | no signal peptide or anchor | no evidence for peptide or anchor |
P02768 | 100% signal peptide, 0% signal anchor | 5 times Yes | signal peptide | signal peptide | confirmed signal peptide |
P11279 | 100% signal peptide, 0% signal anchor | 5 times Yes | signal peptide | signal peptide | confirmed signal peptide |
P47863 | 52.6% signal peptide, 45.7% signal anchor | 4 No, 1 Yes | signal peptide | no signal peptide or anchor | no evidence for peptide or anchor |
In version 3, P47863 reaches a high maximal signal peptide score (S-score) in SignalP-NN and probabilities above the cut-off in SignalP-HMM. Version 4 (web server) does not predict a signal peptide for this protein, and the probabilities/scores are well below the cut-off values. P47863 is a multipass membrane protein, and SignalP v3 most likely predicted a possible cleavage site for the N-terminal membrane region, wrongly classifying it as a signal sequence.
GO Terms
GOPet
GOPet predicted GO terms | ||||
---|---|---|---|---|
GO ID | Aspect | Confidence | GO Term | True/False |
GO:0003824 | F | 94% | catalytic activity | true |
GO:0016491 | F | 88% | oxidoreductase activity | true |
GO:0004497 | F | 87% | monooxygenase activity | true |
GO:0004505 | F | 84% | phenylalanine 4-monooxygenase activity | true |
GO:0004510 | F | 80% | tryptophan 5-monooxygenase activity | false |
GO:0004511 | F | 79% | tyrosine 3-monooxygenase activity | false |
GO:0046872 | F | 78% | metal ion binding | true |
GO:0005506 | F | 78% | iron ion binding | true |
GO:0008199 | F | 72% | ferric iron binding | false |
GO:0008198 | F | 72% | ferrous iron binding | false |
GO:0016597 | F | 71% | amino acid binding | true |
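Counting the True/False column gives a rough precision for the GOPet predictions. The following short Python sketch shows that arithmetic, with the labels copied from the table in row order. The variable names are illustrative.

```python
# True/False labels from the GOPet table, in row order.
labels = [True, True, True, True, False, False, True, True, False, False, True]

true_positives = sum(labels)
precision = 100.0 * true_positives / len(labels)
print(f"{true_positives} of {len(labels)} predicted terms are correct "
      f"({precision:.1f}% precision)")
```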
ProtFun
# Functional category              Prob   Odds
Amino_acid_biosynthesis         => 0.210  9.530
Biosynthesis_of_cofactors          0.229  3.180
Cell_envelope                      0.034  0.563
Cellular_processes                 0.063  0.867
Central_intermediary_metabolism    0.061  0.970
Energy_metabolism                  0.343  3.815
Fatty_acid_metabolism              0.025  1.889
Purines_and_pyrimidines            0.392  1.615
Regulatory_functions               0.020  0.125
Replication_and_transcription      0.118  0.438
Translation                        0.204  4.630
Transport_and_binding              0.024  0.060

# Enzyme/nonenzyme                 Prob   Odds
Enzyme                          => 0.724  2.527
Nonenzyme                          0.276  0.387

# Enzyme class                     Prob   Odds
Oxidoreductase (EC 1.-.-.-)        0.154  0.738
Transferase (EC 2.-.-.-)           0.271  0.785
Hydrolase (EC 3.-.-.-)             0.083  0.261
Lyase (EC 4.-.-.-)                 0.047  1.002
Isomerase (EC 5.-.-.-)          => 0.100  3.138
Ligase (EC 6.-.-.-)                0.019  0.370

# Gene Ontology category           Prob   Odds
Signal_transducer                  0.075  0.350
Receptor                           0.003  0.016
Hormone                            0.001  0.206
Structural_protein                 0.005  0.166
Transporter                        0.025  0.229
Ion_channel                        0.010  0.168
Voltage-gated_ion_channel          0.005  0.232
Cation_channel                     0.010  0.215
Transcription                      0.043  0.334
Transcription_regulation           0.032  0.255
Stress_response                    0.010  0.118
Immune_response                    0.012  0.140
Growth_factor                      0.006  0.407
Metal_ion_transport                0.009  0.020
Additional Predictions:
Feature | Output summary |
---|---|
SignalP 3.0 | No signal peptide cleavage site predicted |
ProP 1.0 | 1 propeptide cleavage site predicted at position: 74 |
TargetP 1.1 | No high confidence targeting predition |
NetPhos 2.0 | 22 putative phosphorylation sites at positions 16 23 40 70 110 196 250 303 339 391 411 22 105 189 278 418 24 77 198 268 277 317 |
NetOGlyc 3.1 | No O-glycosylated sites predicted |
NetNGlyc 1.0 | 2 putative N-glycosylated sites at positions 61 376 |
TMHMM 2.0 | No TM helices predicted |
References
<references/>