Difference between revisions of "Sequence-based predictions (PKU)"

From Bioinformatikpedia

Latest revision as of 11:47, 29 August 2012

Short Task Description

Our task is to take the primary sequence of our protein (and of some other example proteins), pretend to know nothing beyond that sequence, and use different tools to predict comparatively simple features such as secondary structure, signal peptides and transmembrane regions, as well as more advanced ones such as GO terms and similar functional annotations. The commands and programs used are listed in the appropriate places (if short and interesting enough) or linked on their own site. More detailed instructions can be found in the task description. We use tools available as web services or installed locally to determine the function and structure of our protein, and we analyze the results against our prior knowledge of Phenylketonuria.

Secondary Structure Prediction

In this section we compare several secondary structure prediction tools. As the evaluation standard we use the assignments from DSSP.<ref name=DSSP> Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637.</ref><ref name=DSSPDB> Joosten RP, Te Beek TAH, Krieger E, Hekkelman ML, Hooft RWW, Schneider R, Sander C, Vriend G (2010) A series of PDB related databases for everyday needs. NAR.</ref>

ReProfSeq

Reprof is already installed on the student machines. To get the structure prediction, we used the following commands:

reprof -i <uniprotID>.fasta
egrep -v "^#|^No" <uniprotID>.reprof |awk '{print $3}'|tr -d '\n' > <uniprotID>_reprof.secstruc

PsiPred

We used the webserver provided here, which also produces a visualization of the structure with added confidence scores, as seen in figures <xr id="fig:psipred_PheOH_1"/> and <xr id="fig:psipred_PheOH_2"/>. The prediction string was extracted from the text output with the following command:

grep Pred: <uniprotID>.psipred_out |cut -d " " -f2|tr -d '\n' > <uniprotID>_psipred.secstruc

<figure id="fig:psipred_PheOH_1">

Visual output of PsiPred for PheOH, including confidence for all positions (part 1)

</figure> <figure id="fig:psipred_PheOH_2">

Visual output of PsiPred for PheOH, including confidence for all positions (part 2)

</figure>

DSSP

We downloaded the executable of DSSP here and got the PDB files matching the Uniprot entries we want to predict as shown in table 1.

UniprotID PDBID
P00439 1PAH
Q9X0E6 1VHF
P10775 2BNH
Q08209 1AUI

Table 1: The structures in PDB used as reference structure for the Uniprot sequences.


The calculation of the structure is very simple:

dssp -i <PDBID>.pdb > <PDBID>.dssp

However, PDB files might contain only part of the structure, alternative conformations for some positions, or more than one chain; 1PAH, for example, contains only residues 117-424. This makes it necessary to align the structure to the sequence manually.
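The manual alignment step can be sketched in a few lines of Python. This is only an illustration, not the procedure we actually followed: the function name `pad_to_sequence` and the toy input are hypothetical, and the sketch assumes that the residue numbers in the PDB file match the UniProt numbering (as appears to be the case for 1PAH) and that residues for which DSSP leaves the state blank are treated as loop (L).

```python
def pad_to_sequence(seq_len, states_by_resnum, gap="-"):
    """Place per-residue DSSP states on a full-length line.

    seq_len           length of the full UniProt sequence
    states_by_resnum  dict {residue number (1-based): DSSP state}
    Positions without structural data are filled with `gap`.
    """
    line = [gap] * seq_len
    for resnum, state in states_by_resnum.items():
        if 1 <= resnum <= seq_len:
            # Assumption: a blank DSSP state is written as loop 'L'.
            line[resnum - 1] = state if state.strip() else "L"
    return "".join(line)


# Toy example: 10-residue sequence, only residues 4-7 resolved in the structure.
toy_states = {4: "H", 5: "H", 6: " ", 7: "E"}
aligned = pad_to_sequence(10, toy_states)
print(aligned)
```

If the PDB numbering does not match the UniProt numbering, the residue numbers would first have to be shifted or mapped by a sequence alignment, which this sketch does not attempt.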

Results

To evaluate the predicted structures against the reference from DSSP, we used the Q3 and SOV scores (script: ss_score.py). Tables 2 and 3 show the Q (percentage of true positives) and SOV (segment overlap) scores for the predictions of ReProfSeq and PsiPred compared to the structure given by DSSP. Since DSSP uses a more complex eight-state model, the output has been mapped to a simple three-state model of E, H and L. PsiPred clearly outperforms ReProfSeq in most categories; in particular, the Q_E of ReProfSeq is very low (in this very small dataset). To confirm that this is not an error in the calculation, we compared the structure predictions manually: ReProfSeq indeed misses many sheet structures and predicts helices as sheets and vice versa, so there is not even a partial success to declare. PsiPred performs well for all types of structure, although its worst results are also in the prediction of sheets. Interestingly, the protein with the best Q_E for PsiPred is the one with the worst Q_E for ReProfSeq, and the protein with the best Q_3 for ReProfSeq is the one with the (very narrowly) worst Q_E for PsiPred. However, even the worst scores of PsiPred are on par with the best scores of ReProfSeq, so from this sample we recommend using PsiPred for predictions.
PsiPred and ReProfSeq also provide a confidence for their predictions, as shown for PsiPred in <xr id="fig:psipred_PheOH_1"/> and <xr id="fig:psipred_PheOH_2"/>. As can be expected, the confidence often drops at the border between one segment and the next, and the alignment of the predictions with the reference structure shows that these regions are more error prone than others.

Mapping between the different models:

  • E = E (dssp), E (reprof), E (psipred)
  • H = HGI (dssp), H (reprof), H (psipred)
  • L = BSTL (dssp), L (reprof), C (psipred)
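To make the mapping and the Q scores concrete, here is a minimal Python sketch. It is not the ss_score.py script itself; the function names are hypothetical, and it assumes that Q_X is the percentage of residues observed in state X that are also predicted as X, that Q_3 is the percentage of all correctly predicted residues, and that positions without a DSSP reference (`-`) are skipped. SOV is not covered here.

```python
# Three-state reduction as listed above.
DSSP_TO_3 = {"E": "E", "H": "H", "G": "H", "I": "H",
             "B": "L", "S": "L", "T": "L", "L": "L"}
PSIPRED_TO_3 = {"E": "E", "H": "H", "C": "L"}


def reduce_states(states, mapping):
    """Map a state string to E/H/L; unknown symbols such as '-' are kept."""
    return "".join(mapping.get(s, s) for s in states)


def q_scores(pred, ref, gap="-"):
    """Per-state and overall Q scores in percent, ignoring gap positions."""
    pairs = [(p, r) for p, r in zip(pred, ref) if r != gap]
    if not pairs:
        return {}
    scores = {}
    for state in "EHL":
        predicted_for_state = [p for p, r in pairs if r == state]
        if predicted_for_state:
            hits = sum(p == state for p in predicted_for_state)
            scores["Q_" + state] = 100.0 * hits / len(predicted_for_state)
    scores["Q_3"] = 100.0 * sum(p == r for p, r in pairs) / len(pairs)
    return scores


# Toy example (not real data): 10 positions, the first two without reference.
ref = reduce_states("--HGGEETTL", DSSP_TO_3)      # "--HHHEELLL"
pred = reduce_states("CCHHHCEECC", PSIPRED_TO_3)  # "LLHHHLEELL"
scores = q_scores(pred, ref)
print(scores)
```

Reducing both strings before scoring keeps the comparison independent of the alphabet each tool uses (C versus L, or the eight DSSP states).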
Protein ReprofSeq
UniprotID Q_E Q_H Q_L Q_3 SOV_E SOV_H SOV_L SOV
P00439 45.5 75.7 74.8 71.2 47.5 86.7 71.8 76.4
P10775 23.1 71.9 62.0 61.8 21.9 87.2 66.9 70.6
Q9X0E6 26.8 91.9 39.1 53.5 36.1 91.1 41.0 57.6
Q08209 69.1 34.0 73.1 57.9 62.4 45.6 76.7 63.0

Table 2: Q- and SOV-values for ReProfSeq in comparison to the DSSP structure.

Protein PsiPred
UniprotID Q_E Q_H Q_L Q_3 SOV_E SOV_H SOV_L SOV
P00439 57.6 85.1 89.0 83.8 57.6 88.5 54.4 71.1
P10775 92.3 90.3 94.7 92.5 92.7 95.5 95.7 95.2
Q9X0E6 75.6 86.5 78.3 80.2 93.44 100.0 94.5 96.1
Q08209 58.2 77.3 90.7 81.0 64.1 85.8 64.8 72.5

Table 3: Q- and SOV-values for PsiPred in comparison to the DSSP structure.

Different predictions and reference from Uniprot for the structure of PheOH:

  UniProt: ------------------------------------------------------------
     DSSP: ------------------------------------------------------------
  PSIPRED: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEEEEEEEECCCCHHHHHHHHHHHHCCC
REPROFSEQ: LLLEEEELLLLLLEELLLLLLLHHHHLLLLLLLEEEEEELHHHHHHHHHHHHHHHLLLLL
       AA: MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
                   10        20        30        40        50        60

  UniProt: ------------------------------------------------------------
     DSSP: --------------------------------------------------------LLLL
  PSIPRED: CCCEEECCCCCCCCCCEEEEEEECCCCCHHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCC
REPROFSEQ: LEEEEELLLLLLLLLHHHHHLLLLLLLLHHHHHHHHHHHHHLLLLHHHHLLLLLLLLLLL
       AA: NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
                   70        80        90       100       110       120

  UniProt: ----HHHHHHHHH------HHH----TTTT-HHHHHHHHHHHHHHHH-------------
     DSSP: LLSBGGGGGGGGGGSBSSLGGGSTTSTTTTLHHHHHHHHHHHHHHHTLLTTSLLLLLLLL
  PSIPRED: CCCCHHHHHHHHHHHHCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHCCCCCCCCCCC
REPROFSEQ: LLHHHHHHHHHHHHHHHLLLLLLLLLLLLLLHHHHHHHHHHHHHHHHHLLLLLLLLEEEL
       AA: FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
                  130       140       150       160       170       180

  UniProt: HHHHHHHHHHHHHHH--HHHH--HHHHHHHHHHHHHH---------HHHHHHHHHHHH--
     DSSP: HHHHHHHHHHHHHHHHHHHHHBLHHHHHHHHHHHHHHLLBTTBLLLHHHHHHHHHHHHSL
  PSIPRED: HHHHHHHHHHHHHHHHHHHHCCCHHHHHHHHHHHHHCCCCCCCCCCHHHHHHHHHHHCCC
REPROFSEQ: LHLHLLHHHHHHHHHHHHHLLLLHHHHHHHHHHHHHLLLLLLLLLLHHHHHHHHHHLLLL
       AA: EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
                  190       200       210       220       230       240

  UniProt: EEEE------HHHHHHHHTTTEEEE-----------------HHHHHHH-HHHH--HHHH
     DSSP: EEEELLSBLLHHHHHHHHTTTEEEELLLLLLTTSTTLLSSLLHHHHHHHTHHHHTSHHHH
  PSIPRED: EEEECCCCCCHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCHHHHHHHCCCCCCCCHHHH
REPROFSEQ: LLLLLLLLLLLLLHHLLHHHEEEEEEEEELLLLLLLLLLLHHHHHHHHLLLLLLLLLLHH
       AA: RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
                  250       260       270       280       290       300

  UniProt: HHHHHHHHHH----HHHHHHHHHHHHHTTTT-EEEE--EEEE--HHHH--HHHHHHH-HH
     DSSP: HHHHHHHHHHTTLLHHHHHHHHHHHHHTTTTLEEEETTEEEELLHHHHTLHHHHHHTTSS
  PSIPRED: HHHHHHHHHHCCCCHHHHHHHHHHHHEEEEEEEEECCCCEEEECCCCCCCCCCCCCCCCC
REPROFSEQ: HHHHHLLLLLLLLLHHHHHHHHHHEEEEEEELELLLLLLHHHHHHHHHLHHHHHHHHHLL
       AA: QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
                  310       320       330       340       350       360

  UniProt: HHHHHH--HHHH------EEE--EEEEEEE-HHHHHHHHHHHHH----EEEEEEETTTTE
     DSSP: SSEEEELLHHHHTTLLLLSSSLLSEEEEESLHHHHHHHHHHHHHTSLLSSLEEEETTTTE
  PSIPRED: CCCCCCCCHHHHHCCCCCCCCCCCEEEEECCHHHHHHHHHHHHHCCCCCCCCCCCCCCCE
REPROFSEQ: LLLLLLLLLLLELLLLLLELLLLLEEEHHHLHHHHHHHHHHHHHLLLLLLEEEELLLLEE
       AA: KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
                  370       380       390       400       410       420

  UniProt: EEEEHHHHHHHHHHHHHHHHHHHHHHHHH---
     DSSP: EEEL----------------------------
  PSIPRED: EEECCCHHHHHHHHHHHHHHHHHHHHHHHHHC
REPROFSEQ: EEELLLLHHHHHHHHHLHHHHHHHHHHHHHLL
       AA: IEVLDNTQQLKILADSINSEIGILCSALQKIK
                  430       440       450       460

Note that there are even some minor differences between the Uniprot reference and the DSSP assignment. Most can be explained by the usage of different terms or different definitions for terms like "Turn", "Bend" or "Sheet", but, for example, DSSP calculates a continuous helix from position 181 to 202 while Uniprot shows a break at 195-196. We tried, but could not identify the source of the Uniprot structure. <br />Helical regions often agree very well in all the given models and predictions, but the prediction of sheets poses a more difficult problem. The alignment above suggests that one reason might be that the sheets are on average shorter than the helix motifs.

Disordered Regions

This section focuses on the parts of the protein that have no fixed structure. Predicting them seems fairly easy, because it looks like a byproduct of predicting the ordered parts, but it turns out to be quite an unmanageable topic. The tools provided to predict such regions are as diverse as those for any other structure prediction. <ref name=disorderpred> Monastyrskyy, B., Fidelis, K., Moult, J., Tramontano, A. and Kryshtafovych, A. (2011), Evaluation of disorder predictions in CASP9. Proteins, 79: 107–118. doi: 10.1002/prot.23161</ref>
To take a closer look at the topic, we decided not only to use IUPred, as specified in the task, but also to find other webservers that provide a similar service. A nice summary of the paper cited above can be found here, without the performance analysis from the paper. Since we want to do this analysis ourselves, that is quite sufficient.

IUPred

iupred <uniprot-id>.fasta short > iupred<uniprot-id>.short
iupred <uniprot-id>.fasta long  > iupred<uniprot-id>.long
iupred <uniprot-id>.fasta glob  > iupred<uniprot-id>.glob

As a first attempt to predict the unstructured regions we used IUPred, a simple but not well documented tool. To get a better understanding of its results, we compared the output of the 'short' option with that of the 'long' option (see <xr id="iupredcomp"/>).

<figtable id="iupredcomp">

iupred result for P00439
iupred result for Q9X0E6
iupred result for P10775
iupred result for Q08209
Comparison of long and short option for each protein with IUPred

</figtable>

As one can see above, all proteins except Q9X0E6 have a relatively small per-residue probability of being disordered in all regions and for all options. The exceptions are the beginning and the end (about 10 residues each), which appear disordered for every protein with the short option. Overall, the curves for the short option remain below the 0.5 threshold. For the long option, the proteins are likewise expected to be ordered except at the beginning and the end. Q9X0E6 is the only protein for which both curves rise significantly above the 50% probability threshold, for about the last 100 residues, so this end of the protein clearly tends to be disordered.
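The 0.5 threshold used in this reading of the plots can also be applied directly to the text output. The following Python sketch is only an illustration under stated assumptions: it assumes that the IUPred output consists of comment lines starting with `#` followed by whitespace-separated lines of position, residue and score, and the function names and the toy input are hypothetical.

```python
def parse_iupred(text):
    """Return the per-residue scores from IUPred-style text output."""
    scores = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        scores.append(float(fields[2]))  # assumed columns: position, residue, score
    return scores


def disordered_regions(scores, threshold=0.5):
    """Return 1-based (start, end) stretches whose score exceeds the threshold."""
    regions, start = [], None
    for pos, score in enumerate(scores, start=1):
        if score > threshold and start is None:
            start = pos
        elif score <= threshold and start is not None:
            regions.append((start, pos - 1))
            start = None
    if start is not None:
        regions.append((start, len(scores)))
    return regions


# Toy input, not real IUPred output.
toy_output = """# toy example
1 M 0.81
2 S 0.62
3 T 0.40
4 A 0.10
5 V 0.55
6 L 0.70
"""
toy_regions = disordered_regions(parse_iupred(toy_output))
print(toy_regions)
```

Running the same function on the short and long output files would give region lists that can be compared directly with the ranges reported by the other tools.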

Spine-D

After the prediction with IUPred we decided to use another approach, Spine-D, whose results are re-used in Spine-DM together with the results from IUPred.
The Spine-D approach consists of a two-layered neural network followed by a filter. The input features include residue-level and window-level information calculated from the amino acid sequence, seven representative physical parameters, the PSI-BLAST profile, predicted secondary structure, solvent accessibility and torsion-angle fluctuation.

<figtable id="spine-dcomp">

Spine-D result for P00439
Spine-D result for Q9X0E6
Spine-D result for P10775
Spine-D result for Q08209
results of Spine-D

</figtable>

Spine-DM

Spine-DM features a meta approach that employs a two-layer neural network with a filter. It combines input from six disorder predictors: VSL2, DISOPred2, DisPro1.0, IUPred and SPINE-D.
Results are currently pending due to maintenance of the server http://sparks.informatics.iupui.edu/SPINE-DM

OnDCRF

OnDCRF is a machine learning method based on Conditional Random Fields. It uses only sequence and predicted secondary structure as inputs.
In order to compare the results to the other services, we applied the method to all the proteins as well. To replicate or test it on your own, use this webserver.

<figtable id="ondcrfcomp">

OnDCRF result for P00439
OnDCRF result for Q9X0E6
OnDCRF result for P10775
OnDCRF result for Q08209
results of OnDCRF

</figtable>

To reach our actual goal, finding out information about our protein, we focused on the results for our unknown sequence. OnDCRF found five disordered regions of different lengths in our sequence, which are shown in red:

DisorderRegion:1-33 100-120 136-148 269-275 445-452

mstavlenpg lgrklsdfgq etsyiedncn qngaislifs lkeevgalak vlrlfeendv 
nlthiesrps rlkkdeyeff thldkrslpa ltniikilrh digatvhels rdkkkdtvpw
fprtiqeldr fanqilsyga eldadhpgfk dpvyrarrkq fadiaynyrh gqpiprveym 
eeekktwgtv fktlkslykt hacyeynhif pllekycgfh ednipqledv sqflqtctgf 
rlrpvaglls srdflgglaf rvfhctqyir hgskpmytpe pdichellgh vplfsdrsfa 
qfsqeiglas lgapdeyiek latiywftve fglckqgdsi kaygagllss fgelqyclse 
kpkllplele ktaiqnytvt efqplyyvae sfndakekvr nfaatiprpf svrydpytqr 
ievldntqql kiladsinse igilcsalqk ik

One can clearly see the disordered regions at the beginning and the end, and some rather short disordered regions in the rest of the protein.
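The region line above is easy to process further, for example to highlight the regions in plain text or to count disordered residues. The following Python sketch is a hedged illustration: the function names are hypothetical, and it assumes that the format is exactly as shown (a label, a colon, then space-separated 1-based `start-end` ranges). Disordered residues are written in upper case instead of red.

```python
def parse_regions(line):
    """Parse 'DisorderRegion:1-33 100-120 ...' into a list of (start, end)."""
    _, _, spans = line.partition(":")
    return [tuple(int(x) for x in span.split("-")) for span in spans.split()]


def mark_regions(sequence, regions):
    """Return the sequence in lower case with the given regions in upper case."""
    chars = list(sequence.lower())
    for start, end in regions:
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = chars[i].upper()
    return "".join(chars)


regions = parse_regions("DisorderRegion:1-33 100-120 136-148 269-275 445-452")
n_disordered = sum(end - start + 1 for start, end in regions)
print(regions)
print(n_disordered)  # 82 residues fall into the five reported regions

# Toy demonstration on the first ten residues with made-up regions.
demo = mark_regions("mstavlenpg", [(1, 3), (8, 12)])
print(demo)
```

Applied to the full sequence above (with the spaces and line breaks removed), `mark_regions` gives a plain-text replacement for the coloured display.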

DISOClust

<figure id="fig:disoclust">

Result of DisoClust for P00439: disordered regions at the beginning and the end

</figure>DISOclust can be accessed here. It is a combination of hand-annotated structures and meta programs, and it scored the best results in the paper mentioned above. There is also a very interesting article about this method and its database.<ref>McGuffin, L. J. (2008) Intrinsic disorder prediction from the analysis of multiple protein fold recognition models. Bioinformatics, 24, 586-587. PubMed</ref>

DisProt

To have at least some kind of gold standard, we compared our results with the DisProt database. We did not find exact matches for each protein, so we used the sequence search to find similar sequences in the database. This was only successful for three of our proteins. For Q9X0E6 we could not find a significant hit, and the best hit was about five times the length of the searched protein, so we have no way to evaluate our results here. <figtable id="disprot">

disprot result for P00439
disprot result for Q9X0E6
disprot result for P10775
disprot result for Q08209
results of DisProt legend

</figtable> In direct comparison with the database, one can see that the disordered region in PheOH that is predicted by all the methods exists in the database as well. Another similarity is the long disordered region at the end of Q08209, which is also predicted by all of the methods we used. Only the small disordered region in P10775 was not predicted, but one has to take into account that the compared sequence is three times as long as our real test sequence.

Comparison

Overall, the different methods lead to largely the same results, but the differences between the methods are visible as well. All of them state that Q08209 has a disordered end region, and all of them agree on some small disordered regions in PheOH, but the probabilities differ, as do the extents. DisoClust, for example, predicts rather many and long disordered regions (see <xr id="fig:disoclust"/>), whereas IUPred predicts just one very small disordered region, and only with the long option (see <xr id="iupredcomp"/>). We would therefore rather use a meta approach like OnDCRF or DisoClust, because of its more sophisticated methods, and add a crosscheck. Of the two, we prefer DisoClust because of its compensation for disordered regions at the beginning and the end.


Essentially, all models are wrong, but some are useful. - George E. P. Box


Transmembrane Helices

PolyPhobius

We used PolyPhobius to predict transmembrane helices in our protein PheOH (P00439) and three other proteins with the IDs P35462, Q9YDF8 and P47863. The necessary BLAST search and multiple alignment were done with the script provided by another group here; pictures were additionally created by the webserver here. We compared the results to the entries in OPM, PDBTM and Uniprot, and chose the following PDB entries corresponding to the Uniprot entries given above: P00439 to 1PAH, P35462 to 3PBL, Q9YDF8 to 1ORQ and 1ORS, and P47863 to 2D57. The selection criteria were wild type rather than mutant structures and sequence coverage.

There are no predicted transmembrane regions in PheOH, which is confirmed in all three databases.<figure id="fig:pkupolyphobiusP00439">

Visual output of Polyphobius for P00439: no transmembrane helix

</figure>

For the other three proteins, tables 4 to 6 list the predicted and observed transmembrane helices (TMs). For P35462 and P47863 the different databases mostly agree with each other, and we see a good agreement between prediction and observation. The exact borders of the transmembrane helices differ slightly (the predicted helices are often one or two residues longer), but the number and position of the helices on the protein agree perfectly.

Q9YDF8 seems to be a hard case. Firstly, there is no single entry in PDB we could use as reference, so we had to choose two, 1ORQ and 1ORS, in which different and only partly overlapping regions are labeled as transmembrane. Secondly, there are still regions predicted as TM that do not correspond to regions in these structures, but to regions labeled TM in the Uniprot entry. Thirdly, some predicted TM regions correspond to regions labeled as 'intramembrane' in Uniprot. Part of the problems in the prediction and in solving the structure might be connected to the fact that this protein comes from the hyperthermophilic archaeon Aeropyrum pernix, which might have a more complex membrane composition due to its extreme environment.

<figure id="fig:pkupolyphobiusP35462">

Visual output of Polyphobius for P35462

</figure>

P35462  Po.Ph.    OPM    PDBTM  Uniprot
TM1 30-55 34-52 35-52 33-55
TM2 66-88 67-91 68-93 66-88
TM3 105-126 101-126 109-123 105-126
TM4 150-170 150-170 152-166 150-170
TM5 188-212 187-209 191-206 188-212
TM6 329-352 330-351 334-347 330-351
TM7 367-386 363-386 368-382 367-388

Table 4: Transmembrane helices predicted by PolyPhobius compared to reports in OPM, PDBTM and Uniprot in P35462.
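The comparison between prediction and reference can be quantified with a small Python sketch. This is an illustration under stated assumptions, not the way the tables were produced: the function names are hypothetical, and a predicted helix is counted as matching a reference helix if they overlap by at least five residues (an arbitrary choice). The segment lists are copied from the Po.Ph. and Uniprot columns of table 4.

```python
def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def compare_helices(predicted, reference, min_overlap=5):
    """For each reference helix, report the best-overlapping predicted helix
    and the shift of its start and end (predicted minus reference)."""
    rows = []
    for ref in reference:
        best = max(predicted, key=lambda p: overlap(p, ref))
        if overlap(best, ref) >= min_overlap:
            rows.append((ref, best, best[0] - ref[0], best[1] - ref[1]))
        else:
            rows.append((ref, None, None, None))  # reference helix was missed
    return rows


# P35462, columns Po.Ph. and Uniprot of table 4.
polyphobius = [(30, 55), (66, 88), (105, 126), (150, 170),
               (188, 212), (329, 352), (367, 386)]
uniprot = [(33, 55), (66, 88), (105, 126), (150, 170),
           (188, 212), (330, 351), (367, 388)]

rows = compare_helices(polyphobius, uniprot)
matched = [r for r in rows if r[1] is not None]
max_shift = max(max(abs(r[2]), abs(r[3])) for r in matched)
for ref, pred, d_start, d_end in rows:
    print(ref, pred, d_start, d_end)
print(len(matched), max_shift)
```

For this pair of columns all seven helices are matched, and no border differs by more than three residues, which is consistent with the agreement described above. The same function can be applied to the OPM and PDBTM columns.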


<figure id="fig:pkupolyphobiusQ9YDF8">

Visual output of Polyphobius for Q9YDF8

</figure>

Q9YDF8  Po.Ph.   OPMa   OPMb  PDBTMa PDBTMb Uniprot
TM1 42-60 25-46 27-50 21-52 39-63
TM2 68-88 55-78 55-75 57-80 68-92
TM3 86-97 88-107 97-105*
TM4 108-129 100-107 109-125
TM5 137-157 117-148 118-142 129-145
TM6 163-184 153-172 151-171 160-184
TM7 196-213 183-195 184-200! 196-208*
TM9 224-244 207-225 209-236 222-253

Table 5: Transmembrane helices predicted by PolyPhobius compared to reports in OPM, PDBTM and Uniprot in Q9YDF8.
a: 1ORS, b: 1ORQ, *: intramembrane, !: helix loop


<figure id="fig:pkupolyphobiusP47863">

Visual output of PolyPhobius for P47863

</figure>

P47863  Po.Ph.    OPM    PDBTM  Uniprot
TM1 34-58 34-56 39-55 37-57
TM2 70-91 70-88 72-89 65-85
TM3 115-136 112-136 116-133 116-136
TM4 156-177 156-178 158-177 156-176
TM5 188-208 189-203 188-205 185-205
TM6 231-252 231-252 231-248 232-252

Table 6: Transmembrane helices predicted by PolyPhobius compared to reports in OPM, PDBTM and Uniprot in P47863.


TMHMM

Although PolyPhobius uses TMHMM among other programs, we also wanted to test TMHMM on its own and compare the results. The result is unspectacular (no transmembrane helix predicted, see <xr id="fig:TMHMM"/>), so the comparison is simple: the outcome is the same as with PolyPhobius.

<figure id="fig:TMHMM">

Result of TMHMM for P00439: no transmembrane helices predicted

</figure>

Discussion

Altogether, the accuracy of the prediction programs appears to be very high. At least for PolyPhobius, the error is in the same range as the differences between the reference annotations in the different databases we checked. Even in hard cases, transmembrane regions are recognized and positioned correctly overall. Some special cases, such as the difference between intramembrane and transmembrane segments, are not reported, but we argue in favor of PolyPhobius that this is not part of the current PolyPhobius training (whereas, e.g., signal peptides as a source of false positives are considered) and might be included in future versions.
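One way to put a number on "the error is in the same range" is to compare helix boundaries pairwise. The sketch below uses the values of Table 6 (P47863) and computes the mean absolute deviation of start and end positions for every pair of annotation sources. The metric and the function name are our own assumptions for illustration; this is not an established benchmark measure.

```python
from itertools import combinations

# Helix boundaries (start, end), copied from Table 6 for P47863.
annotations = {
    "PolyPhobius": [(34, 58), (70, 91), (115, 136), (156, 177), (188, 208), (231, 252)],
    "OPM":         [(34, 56), (70, 88), (112, 136), (156, 178), (189, 203), (231, 252)],
    "PDBTM":       [(39, 55), (72, 89), (116, 133), (158, 177), (188, 205), (231, 248)],
    "Uniprot":     [(37, 57), (65, 85), (116, 136), (156, 176), (185, 205), (232, 252)],
}


def mean_boundary_deviation(a, b):
    """Mean absolute difference over all start and end positions.

    Assumes both lists contain the same helices in the same order.
    """
    diffs = []
    for (s1, e1), (s2, e2) in zip(a, b):
        diffs.append(abs(s1 - s2))
        diffs.append(abs(e1 - e2))
    return sum(diffs) / len(diffs)


# Compare every pair of sources, predictor and databases alike.
deviations = {
    (x, y): mean_boundary_deviation(annotations[x], annotations[y])
    for x, y in combinations(annotations, 2)
}
for (x, y), d in deviations.items():
    print(f"{x:12s} vs {y:8s}: {d:.2f} residues")
```

If the PolyPhobius rows are no larger than the database-versus-database rows, the statement above holds for this protein.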

Signal peptides

We want to predict the presence of a signal peptide in our protein and some other example proteins. Since SignalP uses different training sets for proteins from gram-negative bacteria, gram-positive bacteria and eukaryotes, it is necessary to determine the organism of our proteins. P47863 is from Rattus norvegicus, the others are from Homo sapiens, so all are of the type euk. SignalP also suggests using only the first 50-70 N-terminal amino acids, so our command to execute the prediction reads:

signalp -trunc 70 -t euk <UniprotID>.fasta > <UniprotID>.sigP_out
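To run this command for the whole dataset, a small wrapper is convenient. The following sketch only assembles the command shown above for each UniProt ID of Table 7 and redirects the output to a file. The helper names `signalp_command` and `run_signalp` are hypothetical, and the sketch assumes that a local `signalp` executable and the FASTA files are present in the working directory.

```python
import subprocess
from pathlib import Path

# UniProt IDs of the dataset, as listed in Table 7.
UNIPROT_IDS = ["P00439", "P02768", "P11279", "P47863"]


def signalp_command(uniprot_id, trunc=70, organism="euk"):
    """Build the argument list for the signalp call used in the text."""
    return ["signalp", "-trunc", str(trunc), "-t", organism, f"{uniprot_id}.fasta"]


def run_signalp(uniprot_id, workdir="."):
    """Run SignalP and write its standard output to <UniprotID>.sigP_out."""
    out_path = Path(workdir) / f"{uniprot_id}.sigP_out"
    with open(out_path, "w") as out:
        subprocess.run(signalp_command(uniprot_id), cwd=workdir, stdout=out, check=True)
    return out_path


# Example call (requires a local SignalP installation):
# for uid in UNIPROT_IDS:
#     run_signalp(uid)
```

Building the command as a list avoids shell quoting issues, and `check=True` makes a failed run raise an error instead of silently leaving an empty output file.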

SignalP-NN reports five different scores covering the signal peptide, the cleavage site and combined scores, as well as cut-offs for these to decide whether a score hints at a signal peptide or not. SignalP-HMM reports overall probabilities for a signal peptide and a signal anchor and the most likely cleavage site. Table 7 summarizes the output and compares it to experimental confirmation from the Signal Peptide Database.

UniprotID | SignalP v3: SignalP-HMM | SignalP v3: SignalP-NN | SignalP v3: summary prediction | SignalP v4: summary prediction | Signal Peptide Database: experimental
P00439 | 0% signal peptide or anchor | 5 times No | no signal peptide or anchor | no signal peptide or anchor | no evidence for peptide or anchor
P02768 | 100% signal peptide, 0% signal anchor | 5 times Yes | signal peptide | signal peptide | confirmed signal peptide
P11279 | 100% signal peptide, 0% signal anchor | 5 times Yes | signal peptide | signal peptide | confirmed signal peptide
P47863 | 52.6% signal peptide, 45.7% signal anchor | 4 No, 1 Yes | signal peptide | no signal peptide or anchor | no evidence for peptide or anchor

Table 7: Summary of the SignalP output and evidence from Signal Peptide Database for our dataset.

For PheOH (P00439) SignalP correctly predicts no signal peptide with high confidence. For P02768 and P11279 the signal peptide is detected with high confidence and even the correct cleavage site is predicted.
In version 3, P47863 reaches a high maximal signal peptide score (S-score) in SignalP-NN and probabilities above the cut-off values in SignalP-HMM. In version 4 (available at their webserver) neither variant predicts a signal peptide for this protein and the probabilities/scores are well below the cut-off values. P47863 is a multipass membrane protein, and SignalP v3 most likely predicted a possible cleavage site for the N-terminal membrane region, wrongly classifying it as a signal sequence. Using the whole sequence of P47863 (i.e. omitting the -trunc parameter) raises one more of the SignalP-NN scores above the prediction threshold and raises the probability for a signal peptide to 72.3%, showing that truncating the sequence leads to improved results.

Table 8 shows the n-, h- and c-regions predicted by SignalP-HMM (v3). P47863 shows a notably smaller probability for a cleavage site and a smaller probability for a c-region, but looks similar enough to a sequence with a signal peptide to make a prediction difficult.

<figtable id="tbl:signalp">

PKU P00439 sigP hmm.gif
PKU P02768 SigP hmm.gif
PKU P11279 sigP hmm.gif
PKU P47863 sigP hmm.gif
Table 8: Signal peptide regions of our dataset predicted by SignalP-HMM (v3).

</figtable>

Functional Annotations

We used the webserver version of GOPET provided here to predict GO annotations and the webserver version of ProtFun provided here to predict cellular role, enzyme class and functional GO categories for PheOH. ProtFun incorporates other tools to predict signal peptides, cleavage sites, glycosylation, phosphorylation and transmembrane segments to achieve its main predictions and gives a summary output of these subtools. Some of these tools we already used and discussed above, the others are discussed in this section.

GOPET

For this task, we increased the maximum number of reported annotations to 50 and left the other settings at their default values. This includes a confidence threshold for predictions of 60%. GOPET only returns functional GO terms, so the 'Aspect' in Table 9 is always 'F'. As can be seen in Table 9, 11 annotations with a confidence of at least 71% were reported. 7 of them appear among the 8 (functional) annotations found at http://www.ebi.ac.uk/QuickGO, and one more, ferrous iron binding, could probably be counted as correct too, as PheOH does bind an Fe(II) ion, but we follow the official list for evaluation.
The top three results are fairly general annotations and are correctly reported with high confidence. The next three results, phenylalanine 4-monooxygenase activity, tryptophan 5-monooxygenase activity and tyrosine 3-monooxygenase activity, are quite similar. Only the first is correct, but of these three results it also has the highest confidence. If we interpret this as "It's most likely that the protein has phenylalanine 4-monooxygenase activity, but it could also have tryptophan 5- and tyrosine 3-monooxygenase activity", this statement appears quite accurate.
Results 7 to 10 concern ion binding. Here the more general terms metal ion binding and iron ion binding are predicted correctly and with higher confidence than the more specific ones. In other words, GOPET correctly predicted that PheOH binds an iron ion, but failed to predict the correct oxidation state. The last prediction, amino acid binding, is again correct but quite unspecific.
In conclusion, GOPET gives the more general annotations with high accuracy and confidence, but fails to distinguish between very similar terms. Yet none of the reported terms is grossly misleading, and if this were a truly unknown protein, the predictions would hint clearly in the direction of the protein's true function.

GOPET predictions
GO ID Aspect Confidence GO Term Confirmed by QuickGO
GO:0003824 F 94% catalytic activity true
GO:0016491 F 88% oxidoreductase activity true
GO:0004497 F 87% monooxygenase activity true
GO:0004505 F 84% phenylalanine 4-monooxygenase activity true
GO:0004510 F 80% tryptophan 5-monooxygenase activity false
GO:0004511 F 79% tyrosine 3-monooxygenase activity false
GO:0046872 F 78% metal ion binding true
GO:0005506 F 78% iron ion binding true
GO:0008199 F 72% ferric iron binding false
GO:0008198 F 72% ferrous iron binding false
GO:0016597 F 71% amino acid binding true

Table 9: GO term predictions for PheOH by GOPET.
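The evaluation above can be summarized as precision and recall. The sketch below is an illustration with the values of Table 9 and the count of 8 functional QuickGO annotations mentioned in the text. The variable and function names are our own, and `precision_at` is a hypothetical helper that shows how a stricter confidence threshold would change the picture.

```python
# (GO ID, confidence in %, confirmed by QuickGO), copied from Table 9.
predictions = [
    ("GO:0003824", 94, True),
    ("GO:0016491", 88, True),
    ("GO:0004497", 87, True),
    ("GO:0004505", 84, True),
    ("GO:0004510", 80, False),
    ("GO:0004511", 79, False),
    ("GO:0046872", 78, True),
    ("GO:0005506", 78, True),
    ("GO:0008199", 72, False),
    ("GO:0008198", 72, False),
    ("GO:0016597", 71, True),
]

# Number of functional annotations listed by QuickGO, as stated in the text.
N_QUICKGO_FUNCTIONAL = 8

confirmed = sum(1 for _, _, ok in predictions if ok)
precision = confirmed / len(predictions)   # confirmed predictions / all predictions
recall = confirmed / N_QUICKGO_FUNCTIONAL  # confirmed predictions / all reference terms


def precision_at(threshold):
    """Precision when only predictions with confidence >= threshold are kept."""
    kept = [ok for _, conf, ok in predictions if conf >= threshold]
    return sum(kept) / len(kept) if kept else float("nan")


print(f"precision: {precision:.2f}, recall: {recall:.2f}")
print(f"precision at >= 80% confidence: {precision_at(80):.2f}")
```

Counting ferrous iron binding as correct, as considered above, would only require flipping the flag of GO:0008198.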

ProtFun

ProtFun correctly reports that PheOH is involved in the biosynthesis of amino acids. This is also (correctly) the only functional category with significant odds. ProtFun also correctly reports that PheOH is an enzyme, but fails to deduce the correct class. PheOH is an oxidoreductase with the EC number 1.14.16.1, yet the correct class has only the fourth highest odds of the six possible classes.
In general, ProtFun refrains from marking GO categories if the score with the highest information content has odds lower than 1, as is the case here. Indeed none of the categories fit PheOH.

# Functional category                  Prob     Odds
 Amino_acid_biosynthesis           => 0.210    9.530
 Biosynthesis_of_cofactors            0.229    3.180
 Cell_envelope                        0.034    0.563
 Cellular_processes                   0.063    0.867
 Central_intermediary_metabolism      0.061    0.970
 Energy_metabolism                    0.343    3.815
 Fatty_acid_metabolism                0.025    1.889
 Purines_and_pyrimidines              0.392    1.615
 Regulatory_functions                 0.020    0.125
 Replication_and_transcription        0.118    0.438
 Translation                          0.204    4.630
 Transport_and_binding                0.024    0.060

# Enzyme/nonenzyme                     Prob     Odds
 Enzyme                            => 0.724    2.527
 Nonenzyme                            0.276    0.387

# Enzyme class                         Prob     Odds
 Oxidoreductase (EC 1.-.-.-)          0.154    0.738
 Transferase    (EC 2.-.-.-)          0.271    0.785
 Hydrolase      (EC 3.-.-.-)          0.083    0.261
 Lyase          (EC 4.-.-.-)          0.047    1.002
 Isomerase      (EC 5.-.-.-)       => 0.100    3.138
 Ligase         (EC 6.-.-.-)          0.019    0.370

# Gene Ontology category               Prob     Odds
 Signal_transducer                    0.075    0.350
 Receptor                             0.003    0.016
 Hormone                              0.001    0.206
 Structural_protein                   0.005    0.166
 Transporter                          0.025    0.229
 Ion_channel                          0.010    0.168
 Voltage-gated_ion_channel            0.005    0.232
 Cation_channel                       0.010    0.215
 Transcription                        0.043    0.334
 Transcription_regulation             0.032    0.255
 Stress_response                      0.010    0.118
 Immune_response                      0.012    0.140
 Growth_factor                        0.006    0.407
 Metal_ion_transport                  0.009    0.020
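The rank of the correct enzyme class can be recomputed from the output above. The following sketch parses the enzyme class block as printed (the parsing is our own assumption about the plain-text layout, not an official ProtFun parser) and sorts the classes by their odds.

```python
# Enzyme class block, copied from the ProtFun output above.
ENZYME_CLASS_OUTPUT = """\
 Oxidoreductase (EC 1.-.-.-)          0.154    0.738
 Transferase    (EC 2.-.-.-)          0.271    0.785
 Hydrolase      (EC 3.-.-.-)          0.083    0.261
 Lyase          (EC 4.-.-.-)          0.047    1.002
 Isomerase      (EC 5.-.-.-)       => 0.100    3.138
 Ligase         (EC 6.-.-.-)          0.019    0.370
"""


def parse_odds(block):
    """Map each class name (first field) to its odds (last field)."""
    odds = {}
    for line in block.splitlines():
        # The "=>" marker only flags the top-scoring class, so drop it.
        fields = line.replace("=>", " ").split()
        if len(fields) < 3:
            continue
        odds[fields[0]] = float(fields[-1])
    return odds


enzyme_class_odds = parse_odds(ENZYME_CLASS_OUTPUT)
ranked = sorted(enzyme_class_odds, key=enzyme_class_odds.get, reverse=True)
rank = ranked.index("Oxidoreductase") + 1

print("classes by odds:", ", ".join(ranked))
print("rank of Oxidoreductase:", rank)
```

The same function works for the other blocks of the output, since the class name is always the first field and the odds the last.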

Additional predictions of ProtFun are shown below. Of the posttranslational modifications, we could verify only one phosphorylation site at Ser16, which is thought to play a role in activity regulation. Three other phosphorylation sites reported at Phosphosite are not predicted. We could not find any positive information about glycosylation, which seems to agree mostly with the predictions. PheOH contains no propeptide cleavage site as it is the mature protein, so the site predicted by ProP at position 74 is presumably a false positive. As discussed above, PheOH contains no target or signal peptide and no TM helices.

Feature 	Output summary
   SignalP 3.0   No signal peptide cleavage site predicted 
   ProP 1.0      1 propeptide cleavage site predicted at position:   74 
   TargetP 1.1   No high confidence targeting predition 
   NetPhos 2.0   22 putative phosphorylation sites at positions 16 23 40 70 110 196 250 303 339 391 411 22 105 189 278 418 24 77 198 268 277 317
   NetOGlyc 3.1  No O-glycosylated sites predicted
   NetNGlyc 1.0  2 putative N-glycosylated sites at positions 61 376
   TMHMM 2.0     No TM helices predicted

Since ProtFun relies on these additional predictions to deduce functional annotations, but uses them as input to a multilayer neural network, it is hard to assess the importance of the individual results, especially if there is no confidence attached to them. Most of the predictions appear to be correct, but in the case of PheOH there is not much to predict. Accordingly, the main results of ProtFun mostly serve to exclude categories. The two most important results (PheOH might be an isomerase involved in amino acid biosynthesis) are only partially correct and somewhat misleading, because a better result, oxidoreductase activity, is excluded.

PFAM

Searching among the PFAM families with the sequence of PheOH returns two families, ACT at the N-terminus and Biopterin_H at the center of the protein.

The ACT domain is supposed to be a regulatory domain that often appears in metabolic enzymes that are regulated by amino acid concentration. According to the PFAM description, a pair of ACT domains form an eight-stranded antiparallel sheet with two molecules of allosteric inhibitor serine bound in the interface. Our reference structure unfortunately only contains the structure from residues 117-424 but the predictions by PsiPred and ReProfSeq contain three to four regions that could form these antiparallel sheets.

The Biopterin_H domain belongs to a family of tetrahydrobiopterin-dependent aromatic amino acid hydroxylases, all of which are rate-limiting catalysts for important metabolic pathways. The proteins are regulated by phosphorylation at serines in their N-termini, a fact that we checked earlier for PheOH, as the phosphorylation sites predicted by ProtFun include Ser16.

saHMM

A further tool provided by UCMP, which also offers OnDCRF, is FISH (Family Identification with Structure Anchored Hidden Markov Models). It scans a protein sequence against a database of structure-anchored HMMs created by its authors and returns information about the structural alignment.
For our sequence we obtained the following result:

NO. saHMM id Seq.from Seq.to Score E-value
1 d.178.1.1 118 424 784.9 5.2e-234

The matched family precisely describes the function of our protein: Aromatic aminoacid monoxygenases, catalytic and oligomerization domains

References

<references/>