Difference between revisions of "Sequence-based predictions (PKU)"
Revision as of 19:03, 15 May 2012
In this task we find out how much information one can get from the plain sequence of the protein behind our disease. We pretend to know nothing but the freshly sequenced primary structure. We perform predictions with tools that are available as web services as well as local programs to determine the function and structure of our protein. Afterwards we analyse the results with our prior knowledge about Phenylketonuria. In this manner we hope to get an idea of how reliable the information derived from such tools is.
Short Task Description
Our task is to use the primary sequence of our protein (and some other example proteins) to predict, with different tools, comparatively simple features like secondary structure, signal peptides and transmembrane regions as well as more advanced ones like GO terms and similar functional annotations. The commands and programs used are listed at the appropriate places (if short and interesting enough) or linked on their own page. For more detailed instructions please read..
Secondary Structure Prediction
In this section we compare several secondary structure prediction tools. As the gold standard we use the secondary structure assigned by DSSP.<ref name=DSSP> Kabsch W, Sander C (1983) Biopolymers. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, 22, 2577-2637.</ref><ref name=DSSPDB> Joosten RP, Te Beek TAH, Krieger E, Hekkelman ML, Hooft RWW, Schneider R, Sander C, Vriend G (2010) NAR. A series of PDB related databases for everyday needs.</ref>
ReProfSeq
Reprof is already installed on the student machines. To get the structure prediction, we used the following commands:
reprof -i <uniprotID>.fasta
egrep -v "^#|^No" <uniprotID>.reprof |awk '{print $3}'|tr -d '\n' > <uniprotID>_reprof.secstruc
PsiPred
We used the webserver provided here, that also produces a visualization of the structure with added confidence scores as seen in figures <xr id="fig:psipred_PheOH_1"/> and <xr id="fig:psipred_PheOH_2"/>.
grep Pred: <uniprotID>.psipred_out |cut -d " " -f2|tr -d '\n' > <uniprotID>_psipred.secstruc
<figure id="fig:psipred_PheOH_1">
</figure> <figure id="fig:psipred_PheOH_2">
</figure>
DSSP
We downloaded the executable of dssp here and got the PDB files matching the Uniprot entries we want to predict as shown in table 1.
UniprotID | PDBID |
---|---|
P00439 | 1PAH |
Q9X0E6 | 1VHF |
P10775 | 2BNH |
Q08209 | 1AUI |
Table 1: The structures in PDB used as reference structure for the Uniprot sequences.
The calculation of the structure is very simple:
dssp -i <PDBID>.pdb > <PDBID>.dssp
But PDB files might contain only part of the structure, alternatives for some positions or more than one chain, e.g. 1PAH contains only residues 117-424. This makes it necessary to align the structure manually to the sequence.
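The simplest case of this alignment, a structure that covers one contiguous stretch of the sequence, can be sketched as follows. This is only an illustration and not the procedure we used: the function name `pad_to_sequence` is hypothetical, and the sketch assumes that the structure has no internal gaps or alternative positions.

```python
def pad_to_sequence(partial_ss, first_residue, sequence_length, gap="-"):
    """Place a partial secondary structure string at its position in the full sequence.

    partial_ss      -- structure string for the resolved residues only
    first_residue   -- 1-based sequence position of the first resolved residue
    sequence_length -- length of the complete protein sequence
    """
    start = first_residue - 1
    end = start + len(partial_ss)
    if start < 0 or end > sequence_length:
        raise ValueError("partial structure does not fit into the sequence")
    # Unresolved positions before and after the structure are filled with gaps.
    return gap * start + partial_ss + gap * (sequence_length - end)


# Toy example: five resolved residues starting at position 4 of a 10-residue sequence.
print(pad_to_sequence("HHHEE", first_residue=4, sequence_length=10))  # ---HHHEE--
```

For 1PAH this would mean placing the 308 assigned states at position 117 of the 452-residue sequence, which leaves 116 gap positions at the N-terminus and 28 at the C-terminus.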
Results
To evaluate the predicted structures against the reference from dssp, we used the Q3 and SOV scores (script: ss_score.py). Tables 2 and 3 show the Q scores (percentage of residues whose state is predicted correctly, overall and per state) and the SOV (segment overlap) scores for the predictions of PsiPred and ReProfSeq compared to the structure given by dssp. The output has been mapped to a simple three-state model of E, H and L, since dssp uses a more complex eight-state model. PsiPred outperforms ReProfSeq in Q_3 for every protein and in most individual categories; the Q_E of ReProfSeq in particular is very low for three of the four proteins (in this very small dataset). To confirm that this is not an error in the calculation, we compared the structure predictions manually: ReProfSeq indeed misses many sheet structures and predicts helices as sheets and vice versa, so there is not even a partial success to declare. PsiPred performs well for all types of structure, but its worst results are also in the prediction of sheets. Interestingly, the protein with the best Q_E for PsiPred (P10775) is the one with the worst Q_E for ReProfSeq, and the protein with the best Q_E for ReProfSeq (Q08209) has the second-worst Q_E for PsiPred, very close to the worst.
PsiPred also provides a confidence for its predictions as shown in <xr id="fig:psipred_PheOH_1"/> and <xr id="fig:psipred_PheOH_2"/>.
Mapping between the different models:
E = E (dssp), E (reprof), E (psipred)
H = HGI (dssp), H (reprof), H (psipred)
L = BSTL (dssp), L (reprof), C (psipred)
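As an illustration of this mapping and of the Q scores, a minimal Python sketch is given here. It is not the ss_score.py script; the function names are hypothetical, and the per-state score is assumed to be the share of reference residues of a state that are predicted in that state. Positions without a reference state ('-') are ignored.

```python
# Reduction of the eight dssp states to the three-state model used in the tables.
DSSP_TO_THREE = {"E": "E", "H": "H", "G": "H", "I": "H",
                 "B": "L", "S": "L", "T": "L", "L": "L"}


def reduce_dssp(dssp):
    """Map a dssp string to E/H/L; anything else (e.g. '-') stays unresolved."""
    return "".join(DSSP_TO_THREE.get(state, "-") for state in dssp)


def reduce_psipred(prediction):
    """PsiPred writes coil as C, the three-state model calls it L."""
    return prediction.replace("C", "L")


def q_scores(reference, prediction):
    """Return Q_E, Q_H, Q_L and Q_3 in percent for two equally long strings."""
    if len(reference) != len(prediction):
        raise ValueError("reference and prediction differ in length")
    pairs = [(r, p) for r, p in zip(reference, prediction) if r in "EHL"]
    scores = {}
    for state in "EHL":
        in_state = [(r, p) for r, p in pairs if r == state]
        if in_state:
            correct = sum(r == p for r, p in in_state)
            scores["Q_" + state] = 100.0 * correct / len(in_state)
    if pairs:
        scores["Q_3"] = 100.0 * sum(r == p for r, p in pairs) / len(pairs)
    return scores


# Toy example with ten positions, the first two without reference structure.
reference = reduce_dssp("--HGGEEBTL")
prediction = reduce_psipred("CCHHHCEECC")
print(q_scores(reference, prediction))
```

In the toy example six of the eight evaluated positions agree, so Q_3 is 75%.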
Protein | ReprofSeq | |||||||
---|---|---|---|---|---|---|---|---|
UniprotID | Q_E | Q_H | Q_L | Q_3 | SOV_E | SOV_H | SOV_L | SOV |
P00439 | 45.5 | 75.7 | 74.8 | 71.2 | 47.5 | 86.7 | 71.8 | 76.4 |
P10775 | 23.1 | 71.9 | 62.0 | 61.8 | 21.9 | 87.2 | 66.9 | 70.6 |
Q9X0E6 | 26.8 | 91.9 | 39.1 | 53.5 | 36.1 | 91.1 | 41.0 | 57.6 |
Q08209 | 69.1 | 34.0 | 73.1 | 57.9 | 62.4 | 45.6 | 76.7 | 63.0 |
Table 2: Q- and SOV-values for ReProfSeq in comparison to the dssp structure.
Protein | PsiPred | |||||||
---|---|---|---|---|---|---|---|---|
UniprotID | Q_E | Q_H | Q_L | Q_3 | SOV_E | SOV_H | SOV_L | SOV |
P00439 | 57.6 | 85.1 | 89.0 | 83.8 | 57.6 | 88.5 | 54.4 | 71.1 |
P10775 | 92.3 | 90.3 | 94.7 | 92.5 | 92.7 | 95.5 | 95.7 | 95.2 |
Q9X0E6 | 75.6 | 86.5 | 78.3 | 80.2 | 93.44 | 100.0 | 94.5 | 96.1 |
Q08209 | 58.2 | 77.3 | 90.7 | 81.0 | 64.1 | 85.8 | 64.8 | 72.5 |
Table 3: Q- and SOV-values for PsiPred in comparison to the dssp structure.
Different predictions and reference from Uniprot for the structure of PheOH:
UniProt:   ------------------------------------------------------------
DSSP:      ------------------------------------------------------------
PSIPRED:   CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEEEEEEEECCCCHHHHHHHHHHHHCCC
REPROFSEQ: LLLEEEELLLLLLEELLLLLLLHHHHLLLLLLLEEEEEELHHHHHHHHHHHHHHHLLLLL
AA:        MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
                   10        20        30        40        50        60

UniProt:   ------------------------------------------------------------
DSSP:      --------------------------------------------------------LLLL
PSIPRED:   CCCEEECCCCCCCCCCEEEEEEECCCCCHHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCC
REPROFSEQ: LEEEEELLLLLLLLLHHHHHLLLLLLLLHHHHHHHHHHHHHLLLLHHHHLLLLLLLLLLL
AA:        NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
                   70        80        90       100       110       120

UniProt:   ----HHHHHHHHH------HHH----TTTT-HHHHHHHHHHHHHHHH-------------
DSSP:      LLSBGGGGGGGGGGSBSSLGGGSTTSTTTTLHHHHHHHHHHHHHHHTLLTTSLLLLLLLL
PSIPRED:   CCCCHHHHHHHHHHHHCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHCCCCCCCCCCC
REPROFSEQ: LLHHHHHHHHHHHHHHHLLLLLLLLLLLLLLHHHHHHHHHHHHHHHHHLLLLLLLLEEEL
AA:        FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
                  130       140       150       160       170       180

UniProt:   HHHHHHHHHHHHHHH--HHHH--HHHHHHHHHHHHHH---------HHHHHHHHHHHH--
DSSP:      HHHHHHHHHHHHHHHHHHHHHBLHHHHHHHHHHHHHHLLBTTBLLLHHHHHHHHHHHHSL
PSIPRED:   HHHHHHHHHHHHHHHHHHHHCCCHHHHHHHHHHHHHCCCCCCCCCCHHHHHHHHHHHCCC
REPROFSEQ: LHLHLLHHHHHHHHHHHHHLLLLHHHHHHHHHHHHHLLLLLLLLLLHHHHHHHHHHLLLL
AA:        EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
                  190       200       210       220       230       240

UniProt:   EEEE------HHHHHHHHTTTEEEE-----------------HHHHHHH-HHHH--HHHH
DSSP:      EEEELLSBLLHHHHHHHHTTTEEEELLLLLLTTSTTLLSSLLHHHHHHHTHHHHTSHHHH
PSIPRED:   EEEECCCCCCHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCHHHHHHHCCCCCCCCHHHH
REPROFSEQ: LLLLLLLLLLLLLHHLLHHHEEEEEEEEELLLLLLLLLLLHHHHHHHHLLLLLLLLLLHH
AA:        RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
                  250       260       270       280       290       300

UniProt:   HHHHHHHHHH----HHHHHHHHHHHHHTTTT-EEEE--EEEE--HHHH--HHHHHHH-HH
DSSP:      HHHHHHHHHHTTLLHHHHHHHHHHHHHTTTTLEEEETTEEEELLHHHHTLHHHHHHTTSS
PSIPRED:   HHHHHHHHHHCCCCHHHHHHHHHHHHEEEEEEEEECCCCEEEECCCCCCCCCCCCCCCCC
REPROFSEQ: HHHHHLLLLLLLLLHHHHHHHHHHEEEEEEELELLLLLLHHHHHHHHHLHHHHHHHHHLL
AA:        QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
                  310       320       330       340       350       360

UniProt:   HHHHHH--HHHH------EEE--EEEEEEE-HHHHHHHHHHHHH----EEEEEEETTTTE
DSSP:      SSEEEELLHHHHTTLLLLSSSLLSEEEEESLHHHHHHHHHHHHHTSLLSSLEEEETTTTE
PSIPRED:   CCCCCCCCHHHHHCCCCCCCCCCCEEEEECCHHHHHHHHHHHHHCCCCCCCCCCCCCCCE
REPROFSEQ: LLLLLLLLLLLELLLLLLELLLLLEEEHHHLHHHHHHHHHHHHHLLLLLLEEEELLLLEE
AA:        KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
                  370       380       390       400       410       420

UniProt:   EEEEHHHHHHHHHHHHHHHHHHHHHHHHH---
DSSP:      EEEL----------------------------
PSIPRED:   EEECCCHHHHHHHHHHHHHHHHHHHHHHHHHC
REPROFSEQ: EEELLLLHHHHHHHHHLHHHHHHHHHHHHHLL
AA:        IEVLDNTQQLKILADSINSEIGILCSALQKIK
                  430       440       450       460
Note that there are even some minor differences between the Uniprot annotation and the dssp assignment. Most can be explained by the usage of different terms or different definitions for terms like "Turn", "Bend" or "Sheet", but e.g. from position 181-202 dssp calculates a continuous helix while Uniprot shows a break at 195-196. We tried, but could not identify the source of the Uniprot structure.
Helical regions often agree very well across all the given models and predictions, but the prediction of sheets poses a more difficult problem. The alignment above suggests that one reason might be that the sheets are on average shorter than the helix motifs.
Disordered Regions
This section focuses on the parts of the protein that have no defined structure. What seems fairly easy, because it looks like a byproduct of predicting the ordered parts, is in fact quite an unmanageable topic. The tools provided to predict such regions are as diverse as those for any other structure prediction. <ref name=disorderpred> Monastyrskyy, B., Fidelis, K., Moult, J., Tramontano, A. and Kryshtafovych, A. (2011), Evaluation of disorder predictions in CASP9. Proteins, 79: 107–118. doi: 10.1002/prot.23161</ref>
In order to have a closer look at the topic we decided not only to use iupred as specified in the task, but also to find other webservers which provide a similar service. A nice summary of the paper cited above can be found here, without the performance analysis from the paper. But since we want to do this ourselves, that is quite sufficient.
iupred
iupred <uniprot-id>.fasta short > iupred<uniprot-id>.short
iupred <uniprot-id>.fasta long > iupred<uniprot-id>.long
iupred <uniprot-id>.fasta glob > iupred<uniprot-id>.glob
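The short and long outputs can then be turned into a list of disordered segments. The following Python sketch is only an assumption about how this could be done: it expects comment lines starting with '#' followed by three columns (position, residue, score), treats scores above 0.5 as disordered, and does not handle the differently structured glob output. Both function names are hypothetical.

```python
def parse_iupred(text):
    """Read (position, residue, score) triples, skipping comments and blank lines."""
    scores = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        scores.append((int(fields[0]), fields[1], float(fields[2])))
    return scores


def disordered_segments(scores, threshold=0.5):
    """Return (first, last) positions of runs with a score above the threshold."""
    segments = []
    start = None
    last = None
    for position, _residue, score in scores:
        if score > threshold:
            if start is None:
                start = position
            last = position
        elif start is not None:
            segments.append((start, last))
            start = None
    if start is not None:
        segments.append((start, last))
    return segments


# Made-up scores for the first four residues, for illustration only.
example = """# example output
1 M 0.80
2 S 0.70
3 T 0.30
4 A 0.60
"""
print(disordered_segments(parse_iupred(example)))  # [(1, 2), (4, 4)]
```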
Spine-D
Spine-DM
Transmembrane Helices
Signal Peptides
We want to predict the presence of a signal peptide in our protein and some other example proteins. Since the program SignalP uses different training sets for proteins from gram-negative bacteria, gram-positive bacteria and eukaryotes, it is necessary to determine the organism of our proteins. P47863 is from Rattus norvegicus, the others are from Homo sapiens, so all are of the type euk. SignalP also suggests using only the first 50-70 N-terminal amino acids, so our command to execute the prediction reads:
signalp -trunc 70 -t euk <UniprotID>.fasta > <UniprotID>.sigP_out
SignalP-NN reports five different values for a signal peptide, cleavage site and combined scores as well as cut-offs to decide if this score hints at a signal peptide or not. SignalP-HMM reports overall probabilities for a signal peptide and a signal anchor and the most likely cleavage site. Table 4 summarizes the output and compares it to experimental confirmation from the Signal Peptide Database.
UniprotID | SignalP v3: SignalP-HMM | SignalP v3: SignalP-NN | SignalP v3: summary prediction | SignalP v4: summary prediction | Signal Peptide Database: experimental |
---|---|---|---|---|---|
P00439 | 0% signal peptide or anchor | 5 times No | no signal peptide or anchor | no signal peptide or anchor | no evidence for peptide or anchor |
P02768 | 100% signal peptide, 0% signal anchor | 5 times Yes | signal peptide | signal peptide | confirmed signal peptide |
P11279 | 100% signal peptide, 0% signal anchor | 5 times Yes | signal peptide | signal peptide | confirmed signal peptide |
P47863 | 52.6% signal peptide, 45.7% signal anchor | 4 No, 1 Yes | signal peptide | no signal peptide or anchor | no evidence for peptide or anchor |
Table 4: Summary of the SignalP output and evidence from Signal Peptide Database for our dataset.
For PheOH (P00439) SignalP correctly predicts no signal peptide with high confidence. For P02768 and P11279 the signal peptide is detected with high confidence and even the correct cleavage site is predicted.
P47863 reaches a high maximal signal peptide score (S-score) in SignalP-NN and probabilities above the cut-off values in SignalP-HMM in version 3. In version 4 (available at their webserver) neither variant predicts a signal peptide for this protein and the probabilities/scores are well below the cut-off values. P47863 is a multipass membrane protein, and SignalP v3 most likely predicted a possible cleavage site for the N-terminal membrane region, wrongly classifying it as a signal sequence. Using the whole sequence of P47863 raises one more of the SignalP-NN scores above the prediction threshold and raises the probability for a signal peptide to 72.3%, showing that truncating the sequence leads to improved results.
Table 5 shows the n-,h- and c-regions predicted by SignalP-HMM (v3). P47863 shows a notably smaller probability for a cleavage site and smaller probability for a c-region, but looks similar enough to a sequence with signal peptide to make a prediction difficult.
<figtable id="tbl:signalp">
Table 5: Signal peptide regions of our dataset predicted by SignalP-HMM (v3). |
</figtable>
Functional Annotations
We used the webserver version of GOPET provided here to predict GO annotations and the webserver version of ProtFun provided here to predict cellular role, enzyme class and functional GO categories for PheOH. ProtFun incorporates other tools to predict signal peptides, cleavage sites, glycosylation, phosphorylation and transmembrane segments to achieve its main predictions and gives a summary output of the subtools. Some of these other tools we already used and discussed above, the others are discussed in this section.
GOPET
For this task, we increased the maximum number of reported annotations to 50 and left the other settings on the default values. This includes a confidence threshold for predictions set to 60%. GOPET only returns functional GO terms so the 'Aspect' in the result table 6 is always 'F'.
As can be seen in table 6, 11 annotations with a confidence of at least 71% were reported. The top three results are fairly general annotations and correctly reported with high confidence.
The next three results, phenylalanine 4-monooxygenase activity, tryptophan 5-monooxygenase activity and tyrosine 3-monooxygenase activity, are quite similar. Only the first is correct, but of these three results it also has the highest confidence. If we interpret this as "It's most likely that the protein has phenylalanine 4-monooxygenase activity, but could also have tryptophan 5- and tyrosine 3-monooxygenase activity", the statement appears quite accurate.
Results 7 to 10 concern ion binding, and here the more general terms metal ion binding and iron ion binding are predicted correctly and with higher confidence than the more specific ones. In other words, GOPET correctly predicted that PheOH binds an iron ion, but failed to predict the correct oxidation state. The last prediction, amino acid binding, is again correct but quite unspecific.
In conclusion, GOPET gives more general annotations with high accuracy and confidence, but fails to distinguish between very similar terms. Yet none of the reported terms is grossly misleading, and if this were a truly unknown protein, the predictions would point clearly toward the protein's true function.
GOPET predictions | ||||
---|---|---|---|---|
GO ID | Aspect | Confidence | GO Term | True/False |
GO:0003824 | F | 94% | catalytic activity | true |
GO:0016491 | F | 88% | oxidoreductase activity | true |
GO:0004497 | F | 87% | monooxygenase activity | true |
GO:0004505 | F | 84% | phenylalanine 4-monooxygenase activity | true |
GO:0004510 | F | 80% | tryptophan 5-monooxygenase activity | false |
GO:0004511 | F | 79% | tyrosine 3-monooxygenase activity | false |
GO:0046872 | F | 78% | metal ion binding | true |
GO:0005506 | F | 78% | iron ion binding | true |
GO:0008199 | F | 72% | ferric iron binding | false |
GO:0008198 | F | 72% | ferrous iron binding | false |
GO:0016597 | F | 71% | amino acid binding | true |
Table 6: GO term predictions for PheOH by GOPET.
ProtFun
ProtFun correctly reports that PheOH is involved in the biosynthesis of amino acids. This is also (correctly) the only functional category with significant odds. ProtFun also correctly reports that PheOH is an enzyme, but fails to deduce the correct class. PheOH is an oxidoreductase with the EC number 1.14.16.1. The correct class has only the fourth highest odds of six possible classes.
In general, ProtFun refrains from marking GO categories if the score with the highest information content has odds lower than 1, as is the case here. Indeed none of the categories fit PheOH.
# Functional category                  Prob     Odds
  Amino_acid_biosynthesis           => 0.210    9.530
  Biosynthesis_of_cofactors            0.229    3.180
  Cell_envelope                        0.034    0.563
  Cellular_processes                   0.063    0.867
  Central_intermediary_metabolism      0.061    0.970
  Energy_metabolism                    0.343    3.815
  Fatty_acid_metabolism                0.025    1.889
  Purines_and_pyrimidines              0.392    1.615
  Regulatory_functions                 0.020    0.125
  Replication_and_transcription        0.118    0.438
  Translation                          0.204    4.630
  Transport_and_binding                0.024    0.060

# Enzyme/nonenzyme                     Prob     Odds
  Enzyme                            => 0.724    2.527
  Nonenzyme                            0.276    0.387

# Enzyme class                         Prob     Odds
  Oxidoreductase (EC 1.-.-.-)          0.154    0.738
  Transferase    (EC 2.-.-.-)          0.271    0.785
  Hydrolase      (EC 3.-.-.-)          0.083    0.261
  Lyase          (EC 4.-.-.-)          0.047    1.002
  Isomerase      (EC 5.-.-.-)       => 0.100    3.138
  Ligase         (EC 6.-.-.-)          0.019    0.370

# Gene Ontology category               Prob     Odds
  Signal_transducer                    0.075    0.350
  Receptor                             0.003    0.016
  Hormone                              0.001    0.206
  Structural_protein                   0.005    0.166
  Transporter                          0.025    0.229
  Ion_channel                          0.010    0.168
  Voltage-gated_ion_channel            0.005    0.232
  Cation_channel                       0.010    0.215
  Transcription                        0.043    0.334
  Transcription_regulation             0.032    0.255
  Stress_response                      0.010    0.118
  Immune_response                      0.012    0.140
  Growth_factor                        0.006    0.407
  Metal_ion_transport                  0.009    0.020
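The rank of the correct enzyme class can be checked with a few lines of Python. This is only a worked example on the odds reported for the enzyme classes, typed in by hand; the variable names are our own.

```python
# Odds reported by ProtFun for the six enzyme classes of PheOH.
enzyme_class_odds = {
    "Oxidoreductase": 0.738,
    "Transferase": 0.785,
    "Hydrolase": 0.261,
    "Lyase": 1.002,
    "Isomerase": 3.138,
    "Ligase": 0.370,
}

# Sort the classes from highest to lowest odds.
ranked = sorted(enzyme_class_odds, key=enzyme_class_odds.get, reverse=True)
rank_of_true_class = ranked.index("Oxidoreductase") + 1

print(ranked)
print(rank_of_true_class)  # 4
```

Isomerase, Lyase and Transferase all receive higher odds than Oxidoreductase, which puts the correct class in fourth place.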
Additional Predictions of ProtFun are shown below. Of the posttranslational modifications we could verify only one phosphorylation site at Ser16, that is thought to play a role in activity regulation. Three other phosphorylation sites reported at Phosphosite are not predicted. We could not find any positive information about glycosylation which seems to agree mostly with the predictions. PheOH contains no propeptide cleavage site as it is the mature protein and as discussed above contains no target or signal peptide and no TM helices.
Feature        Output summary
SignalP 3.0    No signal peptide cleavage site predicted
ProP 1.0       1 propeptide cleavage site predicted at position: 74
TargetP 1.1    No high confidence targeting predition
NetPhos 2.0    22 putative phosphorylation sites at positions 16 23 40 70 110 196 250 303 339 391 411 22 105 189 278 418 24 77 198 268 277 317
NetOGlyc 3.1   No O-glycosylated sites predicted
NetNGlyc 1.0   2 putative N-glycosylated sites at positions 61 376
TMHMM 2.0      No TM helices predicted
Since ProtFun relies on these additional predictions to deduce functional annotations, but uses them as input to a multilayered neural network, it is hard to assess the importance of the individual results, especially as no confidence is attached to them. Most of the predictions appear to be correct, but in the case of PheOH there is not much to predict. Accordingly, the main results of ProtFun exclude categories. The two most important results, that PheOH might be an isomerase involved in amino acid biosynthesis, are only partially correct and somewhat misleading because a better result, oxidoreductase activity, is excluded.
PFAM
Searching among the PFAM families with the sequence of PheOH returns two families, ACT and Biopterin_H.
The ACT domain is thought to be a regulatory domain that often appears in metabolic enzymes regulated by amino acid concentration. According to the PFAM description, a pair of ACT domains forms an eight-stranded antiparallel sheet with two molecules of the allosteric inhibitor serine bound in the interface. Our reference structure unfortunately only covers residues 117-424, but the predictions by PsiPred and ReProfSeq contain three to four regions that could form these antiparallel sheets.
The Biopterin_H domain belongs to a family of tetrahydrobiopterin-dependent aromatic amino acid hydroxylases, all of which are rate-limiting catalysts for important metabolic pathways. The proteins are regulated by phosphorylation at serines in their N-termini, a fact that we checked earlier for PheOH as ProtFun predicted phosphorylation sites including a Ser16.
References
<references/>