Difference between revisions of "Sequence-based analyses of ARS A"

From Bioinformatikpedia
===== Summary =====
In most cases, the predictions of GOPET give a good hint about the function of the protein. The following numbers of predictions were truly related to the function of the respective protein:
* ''A4'': 8 out of 14
* ''ARSA'': 6 out of 13
* ''BACR'': 1 out of 3
* ''INSL5'': 1 out of 1
* ''LAMP1'': 0 out of 1
* ''RET4'': 5 out of 8
Because the method relies on sequence searches, and thus on homology information, it will perform better for well-characterized protein families. Predictions for new, previously unknown families will probably contain more false predictions.
   
 
==== Pfam ====
Pfam is a large database of protein families, each represented by multiple sequence alignments and hidden Markov models. We ran sequence searches with the proteins of interest and extracted the GO terms associated with the resulting families from the database.

* [http://pfam.sanger.ac.uk/ Pfam Server]
   
===== A4 =====
[[Image:A4_PFAM.png]]
[[Image:A4_PFAM_alignment.jpg]]
 
|-
 
|-
 
|}
 
|}
===== Summary =====
Pfam yields far fewer predictions than GOPET. The following numbers of predictions were truly related to the function of the respective protein:
* ''A4'': 1 out of 6
* ''ARSA'': 1 out of 1
* ''BACR'': 1 out of 1
* ''INSL5'': 1 out of 1
* ''LAMP1'': 1 out of 1
* ''RET4'': 1 out of 1
As this method also depends on homology information, predictions for new, previously unknown families will probably contain more false predictions.
   
   
==== Discussion ====
GOPET clearly outperformed the other methods for GO term prediction. Its results were detailed and mostly correct, whereas Pfam and ProtFun returned very general terms rather than detailed functions. Pfam nevertheless gave at least a general hint about what the protein does. In contrast, the ProtFun predictions were mostly wrong and so general that they are of little use for predicting the real function of the protein. Furthermore, we could not map the terms to real GO identifiers.
   
 
== References ==
<references />

[[Category : Metachromatic_Leukodystrophy 2011]]

Latest revision as of 14:01, 29 March 2012

1. Secondary Structure Prediction

Secondary structure prediction methods normally predict three different structural states:

  • H (helix): Helices are formed by short-range interactions. The NH group and the CO group of two nearby amino acids form an H-bond and stabilise the structure. Three different H-bond patterns can be observed for helical structures: interactions between amino acid i and i+3 (3-10 helix), i+4 (alpha helix) and i+5 (pi helix). As the stabilising interactions of this structural feature are close in sequence, helices are relatively easy to predict, even if only statistical properties are considered.
  • E (beta sheet): Beta sheets are stabilised by long-range interactions through the formation of H-bonds. As these interactions occur between amino acids that are not close in sequence, sheets are harder to predict than helices. Different patterns of H-bond formation can be observed here as well. Parallel beta sheets are formed by interactions of a residue i with the residues j-1 and j+1. Antiparallel sheets are formed by interactions of residue i with residue j.
  • C (coil): Coils are structural elements that generally do not follow a recurrent pattern, i.e. they are unstructured. Normally these parts of proteins are rather flexible.


In the following, two popular methods for secondary structure prediction are briefly introduced and then applied to Arylsulfatase A. The predictions are then compared to the assignments of DSSP.

PSI-PRED

PSI-PRED was developed by David T. Jones in 1998. It requires a protein sequence in FASTA format, performs a PSI-BLAST search and creates a sequence profile from the result. This profile captures evolutionary information about the set of proteins found and improves secondary structure prediction compared to first- and second-generation methods, which only use statistical properties of single amino acids or sliding windows.
The sequence profile is then fed to a feed-forward neural network trained by back-propagation. The network consists of an input layer, an output layer and a single hidden layer. The output of the first network serves as input to a second network, which filters the prediction of the first network and yields the final prediction.
Together with the 3-state secondary structure prediction, PSI-PRED calculates confidence scores for the predicted structural elements, so that the user can identify possible false predictions in regions with low confidence.
The average Q3 score reached by PSI-PRED is 80.3 %. <ref name="psipred">Jones, D. T.. "[Protein secondary structure prediction based on position-specific scoring matrices.]". J Mol Biol, 1999</ref>

Jpred

Jpred was developed by Cole, Barber and Barton in 1998. It also uses a neural network to predict secondary structure. The prediction relies on the Jnet algorithm, which takes either a multiple sequence alignment or a single sequence as input. If a single sequence is passed to the program, Jpred also uses sequence profiles derived from a PSI-BLAST search. The output is presented as coloured HTML, plain text, PostScript and PDF, and via the Jalview alignment editor. Jpred also calculates confidence scores for all predicted residues to allow the user to identify possible false predictions. It reaches an average Q3 score of up to 81.5 %. <ref name="jpred">Cole, C. and Barber, J. D. and Barton, G. J.. "[The Jpred 3 secondary structure prediction server.]". Nucleic Acid Res, 2008</ref>

DSSP

DSSP is a database of protein secondary structure assignments for all proteins in the PDB. It was developed by Kabsch and Sander in 1983. It takes the 3D coordinates of a protein and assigns a hierarchical definition of secondary structure elements to the protein, based on different H-bond patterns. An H-bond is assigned if its energy, calculated with a Coulomb hydrogen bond model, lies below a specified cutoff. <ref name="dssp">Kabsch W. and Sander C.. "[Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.]". Biopolymers, 1983</ref>
As stated above, DSSP assigns a hierarchical definition of secondary structure, and the assignment therefore contains more structural classes than the 3-state prediction (H=helix, E=sheet, C=coil) of PSI-PRED and Jpred. To compare the predictions with the DSSP assignment, the DSSP output must be converted to the 3-state classification. We did this with a Perl script. The following table lists the DSSP classes, their descriptions and the 3-state class each one was converted to.

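{|
! DSSP class !! Description !! 3-state class
|-
| H || Helix || H
|-
| G || 3-10 helix || H
|-
| I || Pi helix || H
|-
| B || Single bridge || E
|-
| E || Beta sheet || E
|-
| T || Turn || C
|-
| S || Bend || C
|-
| (blank) || Coil || C
|}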
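
The conversion in the table can be sketched in a few lines. The original Perl script is not shown on this page, so the following Python version is only an illustration of the same mapping, not the script that was actually used. The names `DSSP_TO_3STATE` and `reduce_dssp` are made up for this example, and keeping "m" for residues missing from the DSSP output is an assumption based on the textual representation used further below.

```python
# Hypothetical re-implementation of the DSSP -> 3-state reduction (not the original Perl script).
DSSP_TO_3STATE = {
    "H": "H",  # helix
    "G": "H",  # 3-10 helix
    "I": "H",  # pi helix
    "B": "E",  # single bridge
    "E": "E",  # beta sheet
    "T": "C",  # turn
    "S": "C",  # bend
    " ": "C",  # blank = coil
}


def reduce_dssp(assignment: str, missing: str = "m") -> str:
    """Convert a string of DSSP classes to the 3-state alphabet H/E/C.

    Residues marked with `missing` are passed through unchanged (assumption).
    Any other unknown symbol raises a KeyError instead of being guessed.
    """
    reduced = []
    for symbol in assignment:
        if symbol == missing:
            reduced.append(missing)
        else:
            reduced.append(DSSP_TO_3STATE[symbol])
    return "".join(reduced)
```

Calling `reduce_dssp` on a DSSP string returns a string of the same length, so it can be aligned position by position with the PSI-PRED and Jpred predictions.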

Results and Discussion

We predicted the secondary structure of Arylsulfatase A with PSI-PRED and Jpred3 using their web server interfaces. We then downloaded the DSSP secondary structure assignment and converted the hierarchical definitions to the 3-state classification as described above. The predictions, together with the DSSP assignments, are shown below.

The figure depicts the predictions of PSI-PRED and Jpred together with the DSSP output. Helices are shown as bars, beta sheets as ">" and coils as ".". True positive predictions are shown in green.


Furthermore a textual representation of the predictions and assignments can be seen below. Missing residues in the DSSP output are marked by an "m".


mmmmmmmmmmmmmmmmmmCCCEEEEEEECCCCCCCCHHHCCCCCCCHHHHHHHHCCEEECCEECCCCCHHHHHHHHHHCCCHHHHCC (DSSP)
CCHHHHHHHHHHHCCCCCCCCCEEEEEEECCCCCCCCCCCCCCCCCHHHHHHHHCCCEECCCCCCCCCCHHHHHHHHHCCCCCCCCC (JPRED)
CCHHHHHHHHHHHHCCCCCCCCEEEEEECCCCCCCCCCCCCCCCCCCHHHHHHHCCCCCCCCCCCCCCCHHHHHHHHHCCCCCCCCC (PSI-PRED)

CCCCCCCCECCECCCCCCCHHHHHHCCCCEEEEEECCCCECCHHHCCCHHHHCCCEEEECCCCCCCCECCCCEEECCCEECCCCECC (DSSP)
CCCCCCCCCCCCCCCCCCHHHHHHHCCCCEEEEEECCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC (JPRED)
CCCCCCCCCCCCCCCCCCCHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC (PSI-PRED)

CCCCCCEEECCEEEEECCCHHHHHHHHHHHHHHHHHHHHHCCCCEEEEEECCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHH (DSSP)
CCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHH (JPRED)
CCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHCCCCCCEEEEECCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHH (PSI-PRED)

HHHHHHCCCHHHEEEEEEECCCCCHHHHHHCCCCCCCCCCCCCCCHHHHECCCEEECCCCCCCEEECCCEEHHHHHHHHHHHHCCCC (DSSP)
HHHHHHCCCCCCEEEEEEECCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEEEEEECCCCCCCCCECCCCCCCCHHHHHHHHHCCCC (JPRED)
HHHHHHCCCCCCEEEEECCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEEEEEEECCCCCCCCEECCCHHHHHHHHHHHHHHCCCC (PSI-PRED)

CCCCCCCCCCHHHHHCCCCCCCCEEEECCCCCCCCCCCCEEEECCEEEEEEECCCHHHCCCCCHHHCCCCCCEEEEEEEEEECCCCC (DSSP)
CCCCCCCCCCCCCCCCCCCCCCCEEEEECCCCCCCCCCEEEEECCCEEEECCCCCCCCCCCCCCCCCCCCCCCCCCCCCEECCCCCC (JPRED)
CCCCCCCCCCHHHHCCCCCCCCCEEEECCCCCCCCCCEEEEEECCCEEEEECCCCCCCCCCCCCCCCCCCCCCCCCCCEEEECCCCC (PSI-PRED)

CCCCCCCCCmmmCCHHHHHHHHHHHHHHHHHHHHCCCCCCCHHHCECHHHCCCCCCCCCCCCCCCCECmmmm (DSSP)
CCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC (JPRED)
CCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC (PSI-PRED)


Both methods perform well on the main part of the protein, with an overall accuracy of 74 % for PSI-PRED and 71 % for Jpred3. All predicted sheets and helices overlap strongly with the assignments by DSSP. The errors of both predictions are mainly false negative predictions of beta sheets. Further details of the prediction accuracy are given in the table below:

{|
! Program !! #TP !! #FP !! Accuracy
|-
| PSI-PRED || 374 || 133 || 0.74
|-
| Jpred || 359 || 148 || 0.71
|}
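
The accuracy column follows directly from the counts: 374 / (374 + 133) ≈ 0.74 for PSI-PRED and 359 / (359 + 148) ≈ 0.71 for Jpred. The sketch below reproduces this calculation and also shows how a Q3 value could be computed from two aligned 3-state strings. The function names are hypothetical, and skipping residues that are missing in DSSP ("m") is an assumption; the page does not state how such positions were counted.

```python
def accuracy_from_counts(correct: int, incorrect: int) -> float:
    """Fraction of correctly predicted residues."""
    return correct / (correct + incorrect)


def q3(predicted: str, observed: str, missing: str = "m") -> float:
    """Q3 for two aligned 3-state strings (H/E/C).

    Positions where the observed state is `missing` are ignored (assumption).
    """
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    pairs = [(p, o) for p, o in zip(predicted, observed) if o != missing]
    if not pairs:
        raise ValueError("no comparable positions")
    matches = sum(1 for p, o in pairs if p == o)
    return matches / len(pairs)


psipred_accuracy = accuracy_from_counts(374, 133)
jpred_accuracy = accuracy_from_counts(359, 148)
```

Rounded to two decimals, `psipred_accuracy` and `jpred_accuracy` give the values listed in the table.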

The accuracy (Q3) of these predictions is about 6 (PSI-PRED) to 10 (Jpred) percentage points lower than the average Q3 scores reported in the original publications. Both methods predict the wrong secondary structure for the region around positions 110-200. Unfortunately, the confidence scores generated by PSI-PRED and Jpred are also rather high within this region, so it cannot be identified as a possibly false prediction from the prediction results alone.
DSSP assigns very short helices and beta sheets in this region. Perhaps these are too short to be predicted properly.

Prediction and confidence scores for PSI-PRED.
Confidence scores of the Jpred prediction.

2. Prediction of Disordered Regions

Disordered regions are regions in proteins that do not fold as expected and can be coil-like, globule-like, molten or something in between. In other words, they can be defined as regions of proteins that lack a fixed tertiary structure, i.e. are intrinsically unstructured. Often these disordered regions are important for binding and become ordered when the protein is associated with its cognate molecule.<ref>M. Rani, P. Romero, Z. Obradovic, A. Dunker. "Annotation of PDB with respect to "Disordered Regions" in Proteins"</ref>

Three different servers were challenged to predict disordered regions in ARSA, but no region was consistently predicted as disordered by all three methods.

DISOPRED

DISOPRED was developed by David Jones et al. in 2004. DISOPRED2 uses support vector machines (SVMs) for the prediction of disordered regions. It was trained on a set of around 750 non-redundant sequences with high-resolution X-ray structures. As disordered regions cannot easily be determined by experimental methods, disorder is simply assigned to regions that are missing in the resolved structure. Sequence profiles are used for training and classification of the SVM. <ref name="disopred">Ward, J. J. and McGuffin, L. J. and Bryson, K. and Buxton, B. F. and Jones, D. T. "[The DISOPRED server for the prediction of protein disorder.]". Bioinformatics, 2004</ref>
We used the DISOPRED server for the prediction:

Output of Disopred showing the probability of being disordered along the sequence
DISOPRED predictions for a false positive rate threshold of: 2%

conf: 930000000000012210000000000000000000000000000000000000000000
pred: *...........................................................
  AA: MGAPRSLLLALAAGLAVARPPNIVLIFADDLGYGDLGCYGHPSSTTPNLDQLAAGGLRFT
              10        20        30        40        50        60

conf: 000000000000000000000000000000000000000000000000000000000000
pred: ............................................................
  AA: DFYVPVSLCTPSRAALLTGRLPVRMGMYPGVLVPSSRGGLPLEEVTVAEVLAARGYLTGM
              70        80        90       100       110       120

conf: 000000000000000000000000000000000000000000000000000000000000
pred: ............................................................
  AA: AGKWHLGVGPEGAFLPPHQGFHRFLGIPYSHDQGPCQNLTCFPPATPCDGGCDQGLVPIP
             130       140       150       160       170       180

conf: 000000000000000000000000000000000000000000000000000000000000
pred: ............................................................
  AA: LLANLSVEAQPPWLPGLEARYMAFAHDLMADAQRQDRPFFLYYASHHTHYPQFSGQSFAE
             190       200       210       220       230       240

conf: 000000000000000000000000000000000000000000000000000000000000
pred: ............................................................
  AA: RSGRGPFGDSLMELDAAVGTLMTAIGDLGLLEETLVIFTADNGPETMRMSRGGCSGLLRC
             250       260       270       280       290       300

conf: 000000000000000000000000000000000000000000000000000000000000
pred: ............................................................
  AA: GKGTTYEGGVREPALAFWPGHIAPGVTHELASSLDLLPTLAALAGAPLPNVTLDGFDLSP
             310       320       330       340       350       360

conf: 000000000000000000000000000000000000000000000000000000000000
pred: ............................................................
  AA: LLLGTGKSPRQSLFFYPSYPDEVRGVFAVRTGKYKAHFFTQGSAHSDTTADPACHASSSL
             370       380       390       400       410       420

conf: 000000000000000000000000000000000000000000000000000000000000
pred: ............................................................
  AA: TAHEPPLLYDLSKDPGENYNLLGGVAGATPEVLQALKQLQLLKAQLDAAVTFGPSQVARG
             430       440       450       460       470       480

conf: 000000000000000002571699999
pred: ......................*****
  AA: EDPALQICCHPGCTPRPACCHCPDPHA
             490       500

Asterisks (*) represent disorder predictions and dots (.) 
prediction of order. The confidence estimates give a rough
indication of the probability that each residue is disordered.
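
To read disordered regions off a report like the one above without counting characters by hand, the `pred:` lines can be joined and scanned for runs of asterisks. The following sketch assumes the text layout shown here (lines starting with `pred:`); the function names are hypothetical and this is not part of DISOPRED itself.

```python
import re


def collect_pred(report: str) -> str:
    """Join the payload of every 'pred:' line of the report into one string."""
    chunks = []
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("pred:"):
            chunks.append(line[len("pred:"):].strip())
    return "".join(chunks)


def disordered_regions(pred: str):
    """Return 1-based, inclusive (start, end) ranges of consecutive '*' symbols."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"\*+", pred)]
```

Applied to the ARSA report, which marks the first residue and the last five of 507 residues with asterisks, this would give the regions 1-1 and 503-507.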

As the structure of ARSA is already known, we do not expect long disordered regions. At most, short loops might be identified as unstructured elements by DISOPRED. Indeed, only the first residue and the last five residues are predicted to be disordered with very high confidence.
This prediction makes sense if we consider the DSSP assignment for this region. DSSP assigns coils to the last residues of the protein, which by definition are secondary structure elements that do not follow specific structural patterns. The remaining coils within the protein are probably not predicted to be unstructured because they are located within the core and thus do not allow structural flexibility.

POODLE

plot of POODLE-output showing the probability of being disordered along the sequence


POODLE predicts many disordered residues. Depending on the threshold, one can identify six or more disordered regions. Like DISOPRED, it predicts disorder at the end of the protein; see the previous section for a discussion.
The disorder prediction for the beginning also makes sense, as DSSP assigns a loop region there. The disorder predictions at residues 150-200 and 400-450 are clearly false, as they overlap with helices and beta sheets. These regions are therefore structured.

IUPred

IUPred was developed by Dosztanyi, Csizmak, Tompa and Simon in 2005. <ref name="iupred">Zsuzsanna Dosztanyi, Veronika Csizmak, Peter Tompa and Istvan Simon. "[The Pairwise Energy Content Estimated from Amino Acid Composition Discriminates between Folded and Intrinsically Unstructured Proteins.]". J Mol Biol, 2005</ref>
It estimates the total pairwise interaction energy from the amino acid composition of the protein. This estimated energy discriminates well between disordered and structured regions.

All three prediction options were tried and are illustrated below. In general, IUPred did not predict any disordered region with a "disorder tendency" above 0.6, except for one very short region around residue 415 with the "long disorder" option.

long disorder

The main profile of our server is to predict context-independent global disorder that encompasses at least 30 consecutive residues of predicted disorder. For this application the sequential neighbourhood of 100 residues is considered. <ref name="IUPred"> http://iupred.enzim.hu/Help.html</ref>

IUPred-output showing the probability of being disordered along the sequence with "long disorder"-option


short disorder

It uses a parameter set suited for predicting short, probably context-dependent, disordered regions, such as missing residues in the X-ray structure of an otherwise globular protein. For this application the sequential neighbourhood of 25 residues is considered. As chain termini of globular proteins are often disordered in X-ray structures, this is taken into account by an end-adjustment parameter which favors disorder prediction at the ends. <ref name="IUPred"> http://iupred.enzim.hu/Help.html</ref>

IUPred-output showing the probability of being disordered along the sequence with "short disorder"-option
structured domains

The dependable identification of ordered regions is a crucial step in target selection for structural studies and structural genomics projects. Finding putative structured domains suitable for structure determination is another potential application of this server. In this case the algorithm takes the energy profile and finds continuous regions confidently predicted ordered. Neighbouring regions close to each other are merged, while regions shorter than the minimal domain size of at least 30 residues are ignored. When this prediction type is selected, the region(s) predicted to correspond to structured/globular domains are returned. <ref name="IUPred"> http://iupred.enzim.hu/Help.html</ref>

IUPred-output showing the probability of being disordered along the sequence with "structured domains"-option
Summary

No mode of IUPred predicts a substantial disordered region. This makes perfect sense, as ARSA does not contain disordered regions if we disregard loops.

Meta-Disorder

We used the PredictProtein Webserver for the prediction:

The result is shown below ("D"=disordered, "-"=not disordered)


---------------------------------------------------------------------------------------------------- 
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
-------

Meta-Disorder also did not predict any disordered regions for ARSA.

Discussion

The only server that predicted significant disordered regions was POODLE. However, the regions predicted by POODLE overlap strongly with secondary structure elements of the protein and are therefore wrong. The other methods agreed that no disordered regions can be found except at the ends of the sequence, which makes sense, as the end of the protein consists of an unstructured loop.

3. Prediction of transmembrane alpha-helices and signal peptides

The prediction of membrane proteins and their topology is very important, because the experimental determination of the structure of these proteins is quite challenging. Membrane-mimetic environments can induce non-native structures and thus lead to a wrong structural model of the protein. <ref>Cross, Timothy, Mukesh Sharma, Myunggi Yi, Huan-Xiang Zhou (2010). "Influence of Solubilizing Environments on Membrane Protein Structures"</ref>

Additional Proteins

The following proteins are additionally used for the prediction of transmembrane alpha-helices and signal peptides and for the prediction of GO terms:

BACR

BACR_HALSA (bacteriorhodopsin) is a membrane protein of the archaeon Halobacterium salinarum. It is involved in hydrogen ion transport and sensory transduction. The topology of the extracellular, cytoplasmic and transmembrane domains of the protein is shown below:

Type Position Description
Topological domain 14 – 23 Extracellular
Transmembrane 24 – 42 Helical; Name=Helix A
Topological domain 43 – 56 Cytoplasmic
Transmembrane 57 – 75 Helical; Name=Helix B
Topological domain 76 – 91 Extracellular
Transmembrane 92 – 109 Helical; Name=Helix C
Topological domain 110 – 120 Cytoplasmic
Transmembrane 121 – 140 Helical; Name=Helix D
Topological domain 141 – 147 Extracellular
Transmembrane 148 – 167 Helical; Name=Helix E
Topological domain 168 – 185 Cytoplasmic
Transmembrane 186 – 204 Helical; Name=Helix F
Topological domain 205 – 216 Extracellular
Transmembrane 217 – 236 Helical; Name=Helix G
Topological domain 237 – 262 Cytoplasmic
RET 4
  • RET4_HUMAN is a human retinol-binding protein. It delivers retinol from the liver stores to the peripheral tissues. Defects can cause night vision problems.
INSL 5
  • INSL5_HUMAN is a human insulin-like peptide. It consists of two chains and may have a role in gut contractility or in thymic development and regulation.
LAMP 1
  • LAMP1_HUMAN is a human membrane glycoprotein. It presents carbohydrate ligands to selectins and is located in the lysosomal membrane. Its topology is shown below:
Type Position Description
Topological Domain 29 - 382 Lumenal
Transmembrane 383 - 405 Helical
Topological Domain 406 - 417 Cytoplasmic
Region 29 - 194 First lumenal domain
Region 195 - 227 Hinge
Region 228 - 382 Second lumenal domain
A 4
  • A4_HUMAN is a human cell surface receptor involved in neurite growth, neuronal adhesion and axonogenesis. It is involved in Alzheimer disease and amyloidosis. Its topology and annotated features are shown below:
Type Position Description
Topological domain 18 - 699 Extracellular
Transmembrane 700 - 723 Helical
Topological domain 724 - 770 Cytoplasmic
Domain 291 - 341 BPTI / Kunitz inhibitor
Region 96 - 110 Heparin-binding
Region 181 - 188 Zinc-binding
Region 391 - 423 Heparin-binding
Region 491 - 522 Heparin-binding
Region 523 - 540 Collagen-binding
Region 732 - 751 Interaction with G(o)-alpha
Motif 724 - 734 Basolateral sorting signal
Motif 759 - 762 NPXY motif; contains endocytosis signal
Compositional bias 230 - 260 Asp/Glu-rich (acidic)
Compositional bias 274 - 280 Poly-Thr


SignalP

SignalP uses a neural network and a HMM to calculate three different scores <ref name="signalp">Bendtsen, J. D. and Nielsen, H. and von Heijne, G. and Brunak, S.. "[Improved prediction of signal peptides: SignalP 3.0.]". J Mol Biol, 2004</ref>:

  • S-score (signal peptide score): High values indicate the presence of a signal peptide in the sequence.
  • C-score (raw cleavage site score): This score is used to recognize the cleavage site.
  • Y-score (combined cleavage site score): This score optimizes the prediction of the cleavage site by considering the C-score and the S-score simultaneously. A cleavage site is predicted where the C-score is high and the S-score drops steeply from high to low values.
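
How the C-score and the S-score can be combined is sketched below. This is only an illustration based on our understanding of the SignalP publications, where the Y-score is described as the geometric mean of the C-score and the local drop of the S-score; it is not the SignalP implementation. The window size `d` and the function name are assumptions chosen for this example.

```python
import math


def y_scores(c, s, d=3):
    """Combine per-residue C- and S-scores into a Y-like score (illustrative sketch).

    For position i, the drop of the S-score is the mean S-score of the d
    positions before i minus the mean of the d positions starting at i.
    The score is sqrt(C_i * drop); negative drops are clipped to zero.
    The window size d is a hypothetical choice, not a SignalP parameter.
    """
    if len(c) != len(s):
        raise ValueError("score lists must have the same length")
    result = []
    for i in range(len(c)):
        before = s[max(0, i - d):i]
        after = s[i:i + d]
        if not before or not after:
            result.append(0.0)
            continue
        drop = sum(before) / len(before) - sum(after) / len(after)
        result.append(math.sqrt(max(0.0, c[i] * drop)))
    return result
```

A position with a high C-score inside a region of constantly high S-score therefore receives a score close to zero, while the same C-score at the point where the S-score falls receives a high score.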

We executed SignalP with the following commands:
sudo /apps/signalp-3.0/signalp -t gram- ../BACR.fasta > BACR.signalp
sudo /apps/signalp-3.0/signalp -t euk ../ARSA.fasta > ARSA.signalp
sudo /apps/signalp-3.0/signalp -t euk ../A4.fasta > A4.signalp
sudo /apps/signalp-3.0/signalp -t euk ../LAMP1.fasta > LAMP1.signalp
sudo /apps/signalp-3.0/signalp -t euk ../INSL5.fasta > INSL5.signalp
sudo /apps/signalp-3.0/signalp -t euk ../RET4.fasta > RET4.signalp

The graphical output of the method is shown below:

ARS A A4 RET4 INSL5 LAMP1 BACR
ARSA.jpeg A4.jpeg RET4.jpeg INSL5.jpeg LAMP1.jpeg BACR.jpeg


Discussion

The cleavage sites and signal peptides of all proteins are predicted correctly, compared to the UniProtKB annotation. In general, the output of the neural network gives a more distinct prediction of the different regions. According to UniProtKB, the membrane protein BACR does not contain a signal peptide, and SignalP does not predict one. However, the S-score is very high between positions 20 and 40, a region that is a transmembrane helix. This is due to the similar properties of signal peptides and transmembrane helices, which both exhibit a bias towards hydrophobic amino acids. As the region lacks the characteristics of a cleavage site, SignalP does not predict a signal peptide here. This shows that the program is able to distinguish properly between transmembrane helices and signal peptides.

TMHMM

TMHMM predicts transmembrane helices (TMHs) using a hidden Markov model (HMM). The protein described by the TMHMM model essentially consists of seven different states. Globular domains can occur on the cytoplasmic and the non-cytoplasmic side. On the cytoplasmic side, globular domains are linked to loops, which are in turn linked to cytoplasmic caps. These caps are followed by the helix core, and there is again a cap on the non-cytoplasmic side. The non-cytoplasmic caps are linked to globular domains by either short or long non-cytoplasmic loops.
TMHMM outputs the most likely structure of the protein according to the above model. It also includes the orientation (cytoplasmic or non-cytoplasmic side) of the N-terminus. The output consists of a plot, graphically showing the different states along the protein, and some additional statistics <ref> http://www.cbs.dtu.dk/services/TMHMM-2.0/TMHMM2.0.guide.html#output </ref>:

  • The number of predicted transmembrane helices.
  • The expected number of amino acids in transmembrane helices. If this number is larger than 18 it is very likely to be a transmembrane protein (OR have a signal peptide).
  • The expected number of amino acids in transmembrane helices in the first 60 amino acids of the protein. If this number is more than a few, you should be warned that a predicted transmembrane helix in the N-term could be a signal peptide.
  • The total probability that the N-term is on the cytoplasmic side of the membrane.


{|
! Protein !! # predicted TMHs !! # expected AAs in TMHs !! # expected AAs in TMHs in first 60 positions !! Total probability that the N-term is on the cytoplasmic side !! Graphical output
|-
| ARS A || 0 || 2.65106 || 2.63079 || 0.12149 || Sp P15289 ARSA HUMAN.gif
|-
| A4_HUMAN || 1 || 22.72525 || 0.0027 || 0.00015 || Sp P05067 A4 HUMAN.gif
|-
| INSL5_HUMAN || 0 || 0.50415 || 0.50415 || 0.03772 || Sp Q9Y5Q6 INSL5 HUMAN.gif
|-
| LAMP1_HUMAN || 2 || 44.89582 || 22.24286 || 0.99287 || Sp P11279 LAMP1 HUMAN.gif
|-
| RET4_HUMAN || 0 || 0.01196 || 0.01179 || 0.01909 || Sp P02753 RET4 HUMAN.gif
|-
| BACR || 6 || 140.4032 || 26.1196 || 0.01887 || Sp P02945 BACR HALSA.gif
|}
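
The rules of thumb listed above can be applied to the statistics in the table with a small helper. This is a sketch, not part of TMHMM: the class and function names are hypothetical, the cutoff of 18 expected residues comes from the list above, and the cutoff of 10 for the first 60 positions is an assumption standing in for "more than a few".

```python
from dataclasses import dataclass


@dataclass
class TmhmmStats:
    protein: str
    predicted_helices: int
    expected_aa: float          # expected number of AAs in TMHs
    expected_aa_first60: float  # expected number of AAs in TMHs in first 60 positions


def interpret(stats, tm_cutoff=18.0, first60_cutoff=10.0):
    """Return warnings derived from the TMHMM summary statistics.

    tm_cutoff follows the rule quoted above; first60_cutoff is an assumed
    value for "more than a few" and should be adjusted as needed.
    """
    notes = []
    if stats.expected_aa > tm_cutoff:
        notes.append("likely transmembrane protein or signal peptide")
    if stats.expected_aa_first60 > first60_cutoff:
        notes.append("N-terminal helix may be a signal peptide")
    return notes


lamp1 = TmhmmStats("LAMP1_HUMAN", 2, 44.89582, 22.24286)
a4 = TmhmmStats("A4_HUMAN", 1, 22.72525, 0.0027)
arsa = TmhmmStats("ARS A", 0, 2.65106, 2.63079)
```

With these cutoffs, `interpret(lamp1)` returns both notes, `interpret(a4)` only the first one, and `interpret(arsa)` an empty list.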


Discussion
  • ARS A: All amino acids are predicted to be "outside" the membrane, which is consistent with the UniProtKB annotation, as ARS A is not a membrane protein. The graphical output of TMHMM shows that the probability of a transmembrane helix is elevated at the start of the protein, which is due to the hydrophobicity of the signal peptide.
  • A4_HUMAN: This protein contains exactly one transmembrane helix, which is located at positions 700-723. TMHMM predicts the transmembrane helix at 701-723, which is quite satisfying. The predicted topology is given below:
Description Position
outside 1-700
TMhelix 701-723
inside 724-770
  • INSL5_HUMAN: This protein does not contain any transmembrane helices and none are predicted by TMHMM.
  • LAMP1_HUMAN: TMHMM predicts a possible N-terminal signal sequence and two potential transmembrane helices. LAMP1 indeed contains an N-terminal signal peptide, but the prediction of the first transmembrane helix is false. This false positive prediction overlaps to 50% with the signal peptide, which is located from position 1-22. However, the second predicted TM helix overlaps strongly with the transmembrane helix annotated in UniProtKB.
Description Position
inside 1-10
TMhelix 11-33
outside 34-383
TMhelix 384-406
inside 407-417
  • RET4_HUMAN: No TM-helices are predicted, which coincides with the annotation.
  • BACR: A possible N-terminal signal sequence is predicted, which is false. BACR contains 7 TM helices. TMHMM predicts only 6 transmembrane helices, which overlap strongly with the annotation. The prediction misses the last TM helix of the protein. Despite this false negative, the graphical output shows that the probability of a transmembrane helix is quite high in this region.
Description Position
outside 1-22
TMhelix 23-42
inside 43-54
TMhelix 55-77
outside 78-91
TMhelix 92-114
inside 115-120
TMhelix 121-143
outside 144-147
TMhelix 148-170
inside 171-189
TMhelix 190-212
outside 213-262
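
The statement that the predicted helices overlap strongly with the annotation can be checked numerically. The sketch below compares the annotated BACR helices from the topology table given earlier with the TMHMM helices listed above and reports, for each annotated helix, the fraction covered by the best-matching predicted helix. The function and variable names are hypothetical.

```python
def overlap(a, b):
    """Number of positions shared by two inclusive (start, end) ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def coverage(annotated, predicted):
    """Fraction of each annotated helix covered by its best-matching prediction."""
    result = {}
    for name, helix in annotated.items():
        best = max((overlap(helix, p) for p in predicted), default=0)
        result[name] = best / (helix[1] - helix[0] + 1)
    return result


# Annotated helices A-G of BACR (topology table above).
BACR_ANNOTATED = {
    "A": (24, 42), "B": (57, 75), "C": (92, 109), "D": (121, 140),
    "E": (148, 167), "F": (186, 204), "G": (217, 236),
}
# Helices predicted by TMHMM for BACR.
BACR_TMHMM = [(23, 42), (55, 77), (92, 114), (121, 143), (148, 170), (190, 212)]

bacr_coverage = coverage(BACR_ANNOTATED, BACR_TMHMM)
```

For these coordinates, helices A to E are fully covered, helix F is covered to about 79 % (15 of 19 residues), and helix G is not covered at all, which matches the missed seventh helix described above.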

Phobius and Polyphobius

Phobius combines SignalP-HMM and TMHMM for the prediction of transmembrane proteins and their topology. The hidden Markov models of both programs are simply connected via the last state of SignalP-HMM and the non-cytoplasmic loop state of TMHMM. This is justified because most signal peptides are located at the non-cytoplasmic side of the membrane, but it also limits the detection of proteins with the opposite orientation. <ref name="phobius">Kall, L. and Krogh, A. and Sonnhammer, E. L.. "[A combined transmembrane topology and signal peptide prediction method.]". J Mol Biol, 2004</ref>
Polyphobius extends the approach of Phobius by incorporating information from homologs using global alignments. <ref name="polyphobius">Kall, L. and Krogh, A. and Sonnhammer, E. L.. "[An HMM posterior decoder for sequence feature prediction that includes homology information.]". Bioinformatics, 2005</ref>


Protein Phobius - graphical Phobius - textual Polyphobius - graphical Polyphobius - textual
A4 Leuko A4 phobius.png FT SIGNAL 1 17

FT REGION 1 1 N-REGION.
FT REGION 2 12 H-REGION.
FT REGION 13 17 C-REGION.
FT TOPO_DOM 18 700 NON CYTOPLASMIC.
FT TRANSMEM 701 723
FT TOPO_DOM 724 770 CYTOPLASMIC.

Leuka A4 polyphobius.png FT SIGNAL 1 17

FT REGION 1 3 N-REGION.
FT REGION 4 12 H-REGION.
FT REGION 13 17 C-REGION.
FT TOPO_DOM 18 700 NON CYTOPLASMIC.
FT TRANSMEM 701 723
FT TOPO_DOM 724 770 CYTOPLASMIC.

ARS A Leuko arsa phobius.png FT SIGNAL 1 28

FT REGION 1 6 N-REGION.
FT REGION 7 18 H-REGION.
FT REGION 19 28 C-REGION.
FT TOPO_DOM 29 507 NON CYTOPLASMIC.

Leuka arsa polyphobius.png FT SIGNAL 1 16

FT REGION 1 4 N-REGION.
FT REGION 5 12 H-REGION.
FT REGION 13 16 C-REGION.
FT TOPO_DOM 17 507 NON CYTOPLASMIC.

BACR Leuko bacr phobius.png FT TOPO_DOM 1 22 NON CYTOPLASMIC.

FT TRANSMEM 23 42
FT TOPO_DOM 43 53 CYTOPLASMIC.
FT TRANSMEM 54 76
FT TOPO_DOM 77 95 NON CYTOPLASMIC.
FT TRANSMEM 96 114
FT TOPO_DOM 115 120 CYTOPLASMIC.
FT TRANSMEM 121 142
FT TOPO_DOM 143 147 NON CYTOPLASMIC.
FT TRANSMEM 148 169
FT TOPO_DOM 170 189 CYTOPLASMIC.
FT TRANSMEM 190 212
FT TOPO_DOM 213 217 NON CYTOPLASMIC.
FT TRANSMEM 218 237
FT TOPO_DOM 238 262 CYTOPLASMIC.

Leuko bacr polyphobius.png FT TOPO_DOM 1 21 NON CYTOPLASMIC.

FT TRANSMEM 22 43
FT TOPO_DOM 44 54 CYTOPLASMIC.
FT TRANSMEM 55 77
FT TOPO_DOM 78 94 NON CYTOPLASMIC.
FT TRANSMEM 95 114
FT TOPO_DOM 115 120 CYTOPLASMIC.
FT TRANSMEM 121 141
FT TOPO_DOM 142 147 NON CYTOPLASMIC.
FT TRANSMEM 148 166
FT TOPO_DOM 167 186 CYTOPLASMIC.
FT TRANSMEM 187 205
FT TOPO_DOM 206 215 NON CYTOPLASMIC.
FT TRANSMEM 216 237
FT TOPO_DOM 238 262 CYTOPLASMIC.

INSL5 Leuko insl5 phobius.png FT SIGNAL 1 22

FT REGION 1 5 N-REGION.
FT REGION 6 17 H-REGION.
FT REGION 18 22 C-REGION.
FT TOPO_DOM 23 135 NON CYTOPLASMIC.

Leuko insl5 polyphobius.png FT SIGNAL 1 22

FT REGION 1 4 N-REGION.
FT REGION 5 16 H-REGION.
FT REGION 17 22 C-REGION.
FT TOPO_DOM 23 135 NON CYTOPLASMIC.

LAMP1 Leuko lamp1 phobius.png FT SIGNAL 1 28

FT REGION 1 10 N-REGION.
FT REGION 11 22 H-REGION.
FT REGION 23 28 C-REGION.
FT TOPO_DOM 29 381 NON CYTOPLASMIC.
FT TRANSMEM 382 405
FT TOPO_DOM 406 417 CYTOPLASMIC.

Leuko lamp1 polyphobius.png FT SIGNAL 1 28

FT REGION 1 9 N-REGION.
FT REGION 10 22 H-REGION.
FT REGION 23 28 C-REGION.
FT TOPO_DOM 29 381 NON CYTOPLASMIC.
FT TRANSMEM 382 405
FT TOPO_DOM 406 417 CYTOPLASMIC.

RET4 Leuko ret4 phobius.png FT SIGNAL 1 18

FT REGION 1 2 N-REGION.
FT REGION 3 13 H-REGION.
FT REGION 14 18 C-REGION.
FT TOPO_DOM 19 201 NON CYTOPLASMIC.

Leuko ret4 polyphobius.png FT SIGNAL 1 18

FT REGION 1 3 N-REGION.
FT REGION 4 13 H-REGION.
FT REGION 14 18 C-REGION.
FT TOPO_DOM 19 201 NON CYTOPLASMIC.

Discussion
  • ARS A: The signal peptide predicted by Phobius is too long (1-28), whereas the one predicted by Polyphobius is slightly too short (1-16). According to the annotation, ARS A contains a signal peptide from position 1-18. Neither method predicts a transmembrane helix.
  • A4_HUMAN: Both methods correctly predict the location of the signal peptide. The prediction of the transmembrane helix only misses the first amino acid. This prediction is quite good.
  • INSL5_HUMAN: Phobius and Polyphobius correctly predict the signal peptide from position 1-22 and no TM-helix.
  • LAMP1_HUMAN: Both methods correctly predict the location of the signal peptide. The prediction of the transmembrane helix contains an additional amino acid at the start.
  • RET4_HUMAN: Phobius and Polyphobius correctly predict the signal peptide from position 1-18 and no TM-helix.
  • BACR: Both methods predict 7 transmembrane helices, which almost perfectly overlap with the annotation.


Phobius and Polyphobius yield very similar results. The only improvement for Polyphobius - which can be seen from our analysed proteins - is in the prediction of the location of the signal peptide for ARS A.
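The comparison for ARS A can be reproduced directly from the textual output. The following is a minimal sketch, assuming the feature lines have exactly the `FT KEY start end note` layout shown in the table above; the function names are chosen for illustration and are not part of Phobius.

```python
import re

# Feature lines for ARS A, copied from the table above.
PHOBIUS_ARSA = """\
FT SIGNAL 1 28
FT REGION 1 6 N-REGION.
FT REGION 7 18 H-REGION.
FT REGION 19 28 C-REGION.
FT TOPO_DOM 29 507 NON CYTOPLASMIC.
"""
POLYPHOBIUS_ARSA = """\
FT SIGNAL 1 16
FT REGION 1 4 N-REGION.
FT REGION 5 12 H-REGION.
FT REGION 13 16 C-REGION.
FT TOPO_DOM 17 507 NON CYTOPLASMIC.
"""

# Assumed layout: "FT", feature key, start, end, optional note.
FT_LINE = re.compile(r"^FT\s+(\S+)\s+(\d+)\s+(\d+)\s*(.*)$")


def parse_features(text):
    """Return a list of (key, start, end, note) tuples from FT lines."""
    features = []
    for line in text.splitlines():
        match = FT_LINE.match(line.strip())
        if match:
            key, start, end, note = match.groups()
            features.append((key, int(start), int(end), note.rstrip(".")))
    return features


def signal_peptide_end(features):
    """Return the last residue of the predicted signal peptide, or None."""
    for key, _start, end, _note in features:
        if key == "SIGNAL":
            return end
    return None


ANNOTATED_END = 18  # annotated signal peptide of ARS A spans positions 1-18

for name, text in [("Phobius", PHOBIUS_ARSA), ("Polyphobius", POLYPHOBIUS_ARSA)]:
    end = signal_peptide_end(parse_features(text))
    print(f"{name}: predicted end {end}, deviation {end - ANNOTATED_END:+d}")
```

With the values from the table, the deviation is +10 residues for Phobius and -2 residues for Polyphobius.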

OCTOPUS and SPOCTOPUS

OCTOPUS uses a combination of a Hidden Markov Model and a neural network to predict the topology of a transmembrane protein. It uses BLAST to create a sequence profile, which is then used by the neural network to predict the preference of each amino acid to be located within a transmembrane (M), interface (I), close loop (L) or globular loop (G) region, and on the inside (i) or outside (o). These scores are then passed to the HMM, which predicts the final states. <ref name="octopus">Viklund, H. and Elofsson, A.. "[OCTOPUS: improving topology prediction by two-track ANN-based preference scores and an extended topological grammar.]" Bioinformatics, 2008</ref>
SPOCTOPUS extends the OCTOPUS algorithm with a preprocessing step. OCTOPUS does not predict signal peptides. N-terminal targeting sequences mainly consist of hydrophobic residues, so their properties strongly resemble those of transmembrane helices. Ignoring signal peptides in the prediction therefore often leads to a false prediction of a transmembrane helix at the N-terminus. SPOCTOPUS addresses this by predicting signal peptide preference scores within the first 70 amino acids of the protein. The exact location of a potential signal peptide is then predicted by an HMM in OCTOPUS. <ref name="spoctopus">Viklund, H. and Bernsel, A. and Skwark, M. and Elofsson, A.. "[SPOCTOPUS: a combined predictor of signal peptides and membrane protein topology.]" Bioinformatics, 2008</ref>
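Why a signal peptide can be mistaken for a transmembrane helix can be illustrated with a simple hydropathy profile. The sketch below is not the OCTOPUS or SPOCTOPUS algorithm. It only computes a sliding-window mean of Kyte-Doolittle hydropathy values over a hypothetical toy sequence (invented for illustration, not one of the proteins analysed here) and shows that the hydrophobic N-terminal stretch produces the highest score.

```python
# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=11):
    """Mean hydropathy of every full window along the sequence."""
    scores = []
    for i in range(len(seq) - window + 1):
        segment = seq[i:i + window]
        scores.append(sum(KD[aa] for aa in segment) / window)
    return scores


# Hypothetical toy sequence: short charged start, hydrophobic core
# (signal-peptide-like), followed by a polar tail.
TOY = "MKR" + "LLALLAVLLALLAVAGA" + "SDEKTNQSGDEKRTNQSGDEK"

scores = window_hydropathy(TOY)
best_start = max(range(len(scores)), key=lambda i: scores[i])
print(f"most hydrophobic window starts at index {best_start}, "
      f"mean hydropathy {scores[best_start]:.2f}")
```

A predictor without a dedicated signal peptide state has little besides position to distinguish such an N-terminal stretch from a membrane-spanning helix, which is the situation SPOCTOPUS is designed to handle.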


Protein OCTOPUS SPOCTOPUS
ARS A Octopus arsa leuko.png Spoctopus arsa leuko.png
A4 Octopus a4 leuko.png Spoctopus a4 leuko.png
RET4 Octopus ret4 leuko.png Spoctopus ret4 leuko.png
INSL5 Octopus insl5 leuko.png Spoctopus insl5 leuko.png
LAMP1 Octopus lamp1 leuko.png Spoctopus lamp1 leuko.png
BACR Octopus bacr leuko.png Spoctopus bacr leuko.png
Discussion
  • ARS A: OCTOPUS does not predict any TM-regions or the signal peptide. SPOCTOPUS correctly predicts the signal peptide.
  • A4_HUMAN: Both methods correctly predict the location of the TM-helix. SPOCTOPUS also correctly predicts the signal peptide. OCTOPUS predicts a REENTRANT/DIP region at the end of the signal peptide.
  • RET4_HUMAN: OCTOPUS confounds the signal peptide with a TM-helix. In contrast, SPOCTOPUS correctly predicts a signal peptide.
  • INSL5_HUMAN: OCTOPUS confounds the signal peptide with a TM-helix. In contrast, SPOCTOPUS correctly predicts a signal peptide.
  • LAMP1_HUMAN: Both methods correctly predict the location of the TM-helix. Again, OCTOPUS confounds the signal peptide with a TM-helix, but SPOCTOPUS yields a correct prediction.
  • BACR: Both methods predict 7 transmembrane helices, which almost perfectly overlap with the annotation.


The difference between OCTOPUS and SPOCTOPUS can be clearly seen in the predictions. As mentioned above, OCTOPUS does not include the prediction of signal peptides and thus confounds signal peptides with TM-helices.

TargetP

TargetP is used to predict the cellular localization of a protein. It combines the two methods ChloroP and SignalP. The following targeting sequences can be identified:

  • chloroplast transit peptide (cTP)
  • mitochondrial targeting peptide (mTP)
  • secretory pathway signal peptide (SP)

TargetP uses a neural network to calculate a score for each of the above subcellular targets and finally predicts the location with the highest score. In our case all proteins are predicted to be targeted to the secretory pathway (S). Note that cTP is not included in our predictions, as none of the proteins we considered comes from a plant (they are human or bacterial). Also note that TargetP is trained on eukaryotic proteins, so the prediction for the bacterial protein "BACR" does not make sense, because the pathways of localization and secretion in eukaryotes and bacteria are completely different (e.g. bacteria do not have an endoplasmic reticulum, Golgi apparatus or lysosome). Nevertheless, we included it in our analysis to see whether TargetP finds any localization sequence in it or predicts "-" (= no localization signal found). <ref name="targetp">Emanuelsson, O. and Nielsen, H. and Brunak, S. and von Heijne, G.. "[Predicting subcellular localization of proteins based on their N-terminal amino acid sequence.]". J Mol Biol, 2000</ref>
Our prediction results are depicted in the following table:

Protein mTP SP other prediction
ARS A 0.070 0.926 0.054 S
A4_HUMAN 0.035 0.937 0.084 S
INSL5_HUMAN 0.074 0.899 0.037 S
LAMP1_HUMAN 0.043 0.953 0.017 S
RET4_HUMAN 0.242 0.928 0.020 S
BACR (bacterial) 0.019 0.897 0.562 S
Discussion

All proteins are assigned to the secretory pathway.

  • Arylsulfatase A is a lysosomal enzyme. Therefore, the prediction is correct, as lysosomal proteins are guided there by the secretory pathway, via the endoplasmic reticulum and the Golgi apparatus.
  • A4 is a membrane protein. Thus the prediction is correct.
  • INSL5 is also secreted (according to UniProtKB), so the prediction is correct.
  • LAMP1 is targeted to the cellular membrane. Thus the prediction is correct.
  • RET4 delivers retinol from the liver stores to the peripheral tissues. It is a secretory protein, thus the prediction is correct.
  • As described above, BACR is a bacterial protein. TargetP also assigns this protein to the secretory pathway, which makes no sense, as bacterial protein secretion is different from eukaryotic secretion. Nevertheless, the prediction is much less clear-cut in this case than for the others. The "other" class, meaning that no targeting sequence is found in the protein, gets a considerably high score in this prediction, hence the assignment to S is more questionable here.
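How much weaker the BACR assignment is can be quantified by the margin between the highest and the second-highest score. The following sketch applies the "highest score wins" rule to the values from the table above. The margin is a simple illustration of prediction confidence and is not claimed to be TargetP's own reliability measure.

```python
# Scores copied from the TargetP result table above.
SCORES = {
    "ARS A": {"mTP": 0.070, "SP": 0.926, "other": 0.054},
    "A4_HUMAN": {"mTP": 0.035, "SP": 0.937, "other": 0.084},
    "INSL5_HUMAN": {"mTP": 0.074, "SP": 0.899, "other": 0.037},
    "LAMP1_HUMAN": {"mTP": 0.043, "SP": 0.953, "other": 0.017},
    "RET4_HUMAN": {"mTP": 0.242, "SP": 0.928, "other": 0.020},
    "BACR (bacterial)": {"mTP": 0.019, "SP": 0.897, "other": 0.562},
}


def call_location(scores):
    """Return the top-scoring class and its margin over the runner-up."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_second, second_score) = ranked[0], ranked[1]
    return best, round(best_score - second_score, 3)


for protein, scores in SCORES.items():
    label, margin = call_location(scores)
    print(f"{protein:18s} {label:5s} margin {margin:.3f}")
```

All six proteins are assigned to SP, but the margin for BACR is by far the smallest of the six, and its runner-up is the "other" class rather than mTP.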

4. Prediction of GO Terms

The aim of the gene ontology is to standardize the representation of genes and gene products. The gene ontology is divided into three major parts: cellular component, molecular function and biological process. For us, the molecular function is interesting as it describes activities of a protein. Each GO-Term stands for a specific function, so predicting GO-Terms means prediction of protein-function. <ref>http://www.geneontology.org</ref>

GOPET

GOPET uses homology searches and an SVM to predict GO-Terms.

GO-Terms for 6 different proteins were predicted. The results are shown below. Bold entries are GO-Terms which are really connected to the protein. <ref>http://www.ebi.ac.uk/QuickGO/</ref>

A4
GOid Confidence GO term
GO:0004866 87% endopeptidase inhibitor activity
GO:0004867 86% serine-type endopeptidase inhibitor activity
GO:0030568 83% plasmin inhibitor activity
GO:0030304 83% trypsin inhibitor activity
GO:0030414 82% peptidase inhibitor activity
GO:0005488 79% binding
GO:0005515 74% protein binding
GO:0046872 73% metal ion binding
GO:0003677 71% DNA binding
GO:0008201 70% heparin binding
GO:0008270 69% zinc ion binding
GO:0005507 69% copper ion binding
GO:0005506 67% iron ion binding


ARS A
GOid Confidence GO term
GO:0003824 97% catalytic activity
GO:0016787 96% hydrolase activity
GO:0008484 95% sulfuric ester hydrolase activity
GO:0004065 92% arylsulfatase activity
GO:0004098 89% cerebroside-sulfatase activity
GO:0003943 83% N-acetylgalactosamine-4-sulfatase activity
GO:0004773 82% steryl-sulfatase activity
GO:0004423 82% iduronate-2-sulfatase activity
GO:0008449 82% N-acetylglucosamine-6-sulfatase activity
GO:0047753 82% choline-sulfatase activity
GO:0018741 81% alkyl sulfatase activity
GO:0046872 63% metal ion binding
GO:0016250 61% N-sulfoglucosamine sulfohydrolase activity


BACR_HALSA
GOid Confidence GO term
GO:0005216 77% ion channel activity
GO:0008020 75% G-protein coupled photoreceptor activity
GO:0015078 60% hydrogen ion transmembrane transporter activity


INSL 5
GOid Confidence GO term
GO:0005179 80% hormone activity


LAMP 1
GOid Confidence GO term
GO:0004812 60% aminoacyl-tRNA ligase activity
GO:0005524 60% ATP binding


RET 4
GOid Confidence GO term
GO:0005488 90% binding
GO:0005501 81% retinoid binding
GO:0008289 80% lipid binding
GO:0019841 78% retinol binding
GO:0005215 78% transporter activity
GO:0016918 78% retinal binding
GO:0005319 69% lipid transporter activity
GO:0008035 60% high-density lipoprotein particle binding


Summary

In most cases, the predictions of GOPET give a good hint about the function of the protein. For

  • A4 8 out of 14
  • ARSA 6 out of 13
  • BACR 1 out of 3
  • INSL5 1 out of 1
  • LAMP1 0 out of 1
  • RET4 5 out of 8

predictions were truly related to the function of the proteins. Because the method relies on sequence searches and thus on homology information, it will perform better for well-characterized protein families. Predictions for new, previously unknown families will probably yield more false predictions.

Pfam

Pfam is a large database of protein families, each represented by multiple sequence alignments and hidden Markov models. We searched the database with the sequences of the proteins of interest and extracted the GO Terms associated with the resulting families.

GO-Terms for 6 different proteins were predicted. The results are shown below. The Pfam families were mapped to GO-Terms using the pfam2go list. <ref>http://www.geneontology.org/external2go/pfam2go</ref> Bold entries are GO-Terms which are really connected to the protein. <ref>http://www.ebi.ac.uk/QuickGO/</ref>
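The mapping step can be sketched as follows. The parser assumes the usual external2go line layout (`Pfam:<accession> <family> > GO:<term name> ; GO:<id>`, with comment lines starting with `!`). The two sample lines were written for illustration using the families and GO identifiers from the tables in this section; the Pfam accessions are filled in by hand and should be checked against the real pfam2go file.

```python
# Illustrative sample in the assumed pfam2go layout (not the real file).
SAMPLE_PFAM2GO = """\
! sample lines for illustration only
Pfam:PF00014 Kunitz_BPTI > GO:serine-type endopeptidase inhibitor activity ; GO:0004867
Pfam:PF00884 Sulfatase > GO:sulfuric ester hydrolase activity ; GO:0008484
"""


def parse_pfam2go(lines):
    """Map Pfam family name -> list of (GO id, GO term name)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        left, right = line.split(" > ", 1)
        _accession, family = left.split(" ", 1)
        term, go_id = right.rsplit(" ; ", 1)
        if term.startswith("GO:"):
            term = term[len("GO:"):]
        mapping.setdefault(family, []).append((go_id, term))
    return mapping


pfam2go = parse_pfam2go(SAMPLE_PFAM2GO.splitlines())

# Families found for A4 (from the table in this section).
a4_families = ["APP_N", "APP_Cu_bd", "Kunitz_BPTI", "APP_E2", "Beta-APP", "APP_amyloid"]
for family in a4_families:
    print(family, pfam2go.get(family, []))
```

Families without an entry in the mapping simply yield no GO term, which matches the empty rows for most A4 families in the table.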

A4

A4 PFAM.png A4 PFAM alignment.jpg

PFAM-Family PFAM-Description GOid GO term
APP_N Amyloid-A4 N-terminal heparin-binding
APP_Cu_bd Copper-binding of amyloid precursor, CuBD
Kunitz_BPTI Kunitz / Bovine pancreatic trypsin inhibitor domain GO:0004867 serine-type endopeptidase inhibitor activity
APP_E2 E2 domain of amyloid precursor protein
Beta-APP Beta-amyloid peptide (Beta-APP)
APP_amyloid beta-amyloid precursor protein C-terminus


ARS A

ARSA PFAM.png ARSA PFAM alignment.jpg


PFAM-Family PFAM-Description GOid GO term
Sulfatase Sulfatase GO:0008484 sulfuric ester hydrolase activity
BACR HALSA

BACR HALSA PFAM.png BACR HALSA PFAM alignment.jpg

PFAM-Family PFAM-Description GOid GO term
Bac_rhodopsin Bacteriorhodopsin-like protein GO:0005216, GO:0006811, GO:0016020 ion channel activity, ion transport, membrane
INSL 5

INSL5 PFAM.png INSL5 PFAM alignment.jpg

PFAM-Family PFAM-Description GOid GO term
Insulin Insulin / IGF / Relaxin family GO:0005179, GO:0005576 hormone activity, extracellular region
LAMP 1

LAMP1 PFAM.png LAMP1 PFAM alignment.jpg

PFAM-Family PFAM-Description GOid GO term
Lamp Lysosome-associated membrane glycoprotein (Lamp) GO:0016020 membrane
RET 4

RET4 PFAM.png RET4 PFAM alignment.jpg

PFAM-Family PFAM-Description GOid GO term
Lipocalin Lipocalin / cytosolic fatty-acid binding protein family GO:0005488 binding
Summary

The predictions using Pfam yield far fewer results than GOPET. For

  • A4 1 out of 6
  • ARSA 1 out of 1
  • BACR 1 out of 1
  • INSL5 1 out of 1
  • LAMP1 1 out of 1
  • RET4 1 out of 1

predictions were truly related to the function of the proteins. As this method also relies on homology information, predictions for new, previously unknown families will probably yield more false predictions.
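The counts from the GOPET and Pfam summaries can be put side by side. This is a small sketch that takes the "correct out of total" numbers exactly as stated in the two summary lists and computes the fraction per protein and pooled over all proteins.

```python
# (correct, total) counts as stated in the GOPET and Pfam summaries.
GOPET = {"A4": (8, 14), "ARSA": (6, 13), "BACR": (1, 3),
         "INSL5": (1, 1), "LAMP1": (0, 1), "RET4": (5, 8)}
PFAM = {"A4": (1, 6), "ARSA": (1, 1), "BACR": (1, 1),
        "INSL5": (1, 1), "LAMP1": (1, 1), "RET4": (1, 1)}


def fraction_correct(counts):
    """Fraction of correct predictions per protein."""
    return {protein: c / t for protein, (c, t) in counts.items()}


def pooled(counts):
    """Total correct and total predicted terms over all proteins."""
    correct = sum(c for c, _t in counts.values())
    total = sum(t for _c, t in counts.values())
    return correct, total


for name, counts in [("GOPET", GOPET), ("Pfam", PFAM)]:
    correct, total = pooled(counts)
    print(f"{name}: {correct}/{total} correct ({correct / total:.2f})")
    for protein, frac in fraction_correct(counts).items():
        print(f"  {protein:6s} {frac:.2f}")
```

Pooled over all proteins the fractions are similar (21/40 for GOPET, 6/11 for Pfam), but GOPET returns far more terms in absolute numbers, so the two methods differ mainly in how many correct terms they yield.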



ProtFun 2.2

ProtFun queries other feature prediction servers and integrates all the results in its final prediction.
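The ProtFun 2.2 text reports listed in this section share a fixed layout: section headers starting with `#`, one row per category with a probability and an odds value, and a `=>` marker on the category ProtFun selects. The following is a minimal sketch for extracting the marked row of each section. The regular expression is an assumption based on the layout as it appears on this page, not an official format specification, and the excerpt is copied from the A4 report.

```python
import re

# Excerpt of the A4 report from this section (two sections, shortened).
A4_EXCERPT = """\
# Functional category                  Prob     Odds
  Cell_envelope                     => 0.804   13.186
  Transport_and_binding                0.827    2.016

# Enzyme/nonenzyme                     Prob     Odds
  Enzyme                            => 0.392    1.368
  Nonenzyme                            0.608    0.852
"""

# name, optional "=>" marker, probability, odds
ROW = re.compile(r"^\s*(.+?)\s+(=>)?\s*(\d+\.\d+)\s+(\d+\.\d+)\s*$")


def parse_protfun(text):
    """Return {section: [(name, marked, prob, odds), ...]}."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("# "):
            # Header such as "# Enzyme/nonenzyme    Prob     Odds"
            current = re.split(r"\s{2,}", line[2:].strip())[0]
            sections[current] = []
        elif current is not None:
            match = ROW.match(line)
            if match:
                name, marker, prob, odds = match.groups()
                sections[current].append(
                    (name, marker is not None, float(prob), float(odds)))
    return sections


def marked_rows(sections):
    """Return {section: (name, prob, odds)} for rows carrying the marker."""
    result = {}
    for section, rows in sections.items():
        for name, marked, prob, odds in rows:
            if marked:
                result[section] = (name, prob, odds)
    return result


for section, (name, prob, odds) in marked_rows(parse_protfun(A4_EXCERPT)).items():
    print(f"{section}: {name} (prob {prob}, odds {odds})")
```

Note that the marked row is not necessarily the one with the highest probability: in the A4 report, "Enzyme" is marked although "Nonenzyme" has the higher probability, so both columns should be kept when interpreting the result.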


A4
############## ProtFun 2.2 predictions ##############

>sp_P05067_A

# Functional category                  Prob     Odds
  Amino_acid_biosynthesis              0.020    0.921
  Biosynthesis_of_cofactors            0.261    3.623
  Cell_envelope                     => 0.804   13.186
  Cellular_processes                   0.053    0.730
  Central_intermediary_metabolism      0.184    2.920
  Energy_metabolism                    0.023    0.259
  Fatty_acid_metabolism                0.016    1.265
  Purines_and_pyrimidines              0.417    1.716
  Regulatory_functions                 0.013    0.084
  Replication_and_transcription        0.029    0.109
  Translation                          0.027    0.613
  Transport_and_binding                0.827    2.016

# Enzyme/nonenzyme                     Prob     Odds
  Enzyme                            => 0.392    1.368
  Nonenzyme                            0.608    0.852

# Enzyme class                         Prob     Odds
  Oxidoreductase (EC 1.-.-.-)          0.024    0.114
  Transferase    (EC 2.-.-.-)          0.208    0.603
  Hydrolase      (EC 3.-.-.-)          0.190    0.600
  Lyase          (EC 4.-.-.-)          0.020    0.430
  Isomerase      (EC 5.-.-.-)          0.010    0.324
  Ligase         (EC 6.-.-.-)          0.048    0.946

# Gene Ontology category               Prob     Odds
  Signal_transducer                    0.126    0.586
  Receptor                             0.036    0.211
  Hormone                              0.001    0.206
  Structural_protein                => 0.034    1.205
  Transporter                          0.024    0.222
  Ion_channel                          0.009    0.162
  Voltage-gated_ion_channel            0.002    0.108
  Cation_channel                       0.010    0.215
  Transcription                        0.043    0.335
  Transcription_regulation             0.018    0.143
  Stress_response                      0.076    0.862
  Immune_response                      0.016    0.183
  Growth_factor                        0.005    0.372
  Metal_ion_transport                  0.009    0.020

//

It was not possible to map this ProtFun-result to a Gene Ontology category.


ARS A
############## ProtFun 2.2 predictions ##############

>sp_P15289_A

# Functional category                  Prob     Odds
  Amino_acid_biosynthesis              0.015    0.669
  Biosynthesis_of_cofactors            0.048    0.668
  Cell_envelope                     => 0.804   13.186
  Cellular_processes                   0.027    0.373
  Central_intermediary_metabolism      0.404    6.416
  Energy_metabolism                    0.050    0.555
  Fatty_acid_metabolism                0.028    2.138
  Purines_and_pyrimidines              0.404    1.662
  Regulatory_functions                 0.013    0.081
  Replication_and_transcription        0.021    0.080
  Translation                          0.032    0.717
  Transport_and_binding                0.821    2.002

# Enzyme/nonenzyme                     Prob     Odds
  Enzyme                            => 0.540    1.886
  Nonenzyme                            0.460    0.644

# Enzyme class                         Prob     Odds
  Oxidoreductase (EC 1.-.-.-)          0.063    0.304
  Transferase    (EC 2.-.-.-)          0.062    0.180
  Hydrolase      (EC 3.-.-.-)          0.313    0.987
  Lyase          (EC 4.-.-.-)          0.038    0.803
  Isomerase      (EC 5.-.-.-)          0.010    0.321
  Ligase         (EC 6.-.-.-)          0.017    0.326

# Gene Ontology category               Prob     Odds
  Signal_transducer                    0.206    0.965
  Receptor                             0.111    0.652
  Hormone                              0.002    0.323
  Structural_protein                   0.005    0.177
  Transporter                          0.025    0.229
  Ion_channel                          0.009    0.154
  Voltage-gated_ion_channel            0.003    0.139
  Cation_channel                       0.010    0.215
  Transcription                        0.037    0.287
  Transcription_regulation             0.018    0.142
  Stress_response                   => 0.102    1.158
  Immune_response                      0.022    0.259
  Growth_factor                        0.005    0.391
  Metal_ion_transport                  0.009    0.020

//


It was not possible to map this ProtFun-result to a Gene Ontology category.


BACR HALSA
############## ProtFun 2.2 predictions ##############

>sp_P02945_B

# Functional category                  Prob     Odds
  Amino_acid_biosynthesis              0.033    1.495
  Biosynthesis_of_cofactors            0.186    2.589
  Cell_envelope                        0.029    0.483
  Cellular_processes                   0.051    0.694
  Central_intermediary_metabolism      0.045    0.711
  Energy_metabolism                    0.138    1.537
  Fatty_acid_metabolism                0.016    1.265
  Purines_and_pyrimidines              0.302    1.244
  Regulatory_functions                 0.013    0.080
  Replication_and_transcription        0.019    0.073
  Translation                          0.059    1.339
  Transport_and_binding             => 0.791    1.929

# Enzyme/nonenzyme                     Prob     Odds
  Enzyme                               0.199    0.696
  Nonenzyme                         => 0.801    1.122

# Enzyme class                         Prob     Odds
  Oxidoreductase (EC 1.-.-.-)          0.114    0.549
  Transferase    (EC 2.-.-.-)          0.031    0.091
  Hydrolase      (EC 3.-.-.-)          0.057    0.180
  Lyase          (EC 4.-.-.-)          0.020    0.430
  Isomerase      (EC 5.-.-.-)          0.010    0.321
  Ligase         (EC 6.-.-.-)          0.017    0.326

# Gene Ontology category               Prob     Odds
  Signal_transducer                    0.258    1.205
  Receptor                             0.355    2.087
  Hormone                              0.001    0.206
  Structural_protein                   0.006    0.200
  Transporter                       => 0.440    4.036
  Ion_channel                          0.010    0.169
  Voltage-gated_ion_channel            0.004    0.172
  Cation_channel                       0.078    1.689
  Transcription                        0.026    0.205
  Transcription_regulation             0.028    0.226
  Stress_response                      0.012    0.139
  Immune_response                      0.011    0.128
  Growth_factor                        0.010    0.727
  Metal_ion_transport                  0.049    0.106

//

"Transporter" corresponds to GO:0005215, which is not annotated in http://www.ebi.ac.uk/QuickGO/GProtein?ac=P02945 . However, BACR_HALSA is an ion transporter, so the classification is correct.


INSL 5
############## ProtFun 2.2 predictions ##############

>sp_Q9Y5Q6_I

# Functional category                  Prob     Odds
  Amino_acid_biosynthesis              0.011    0.484
  Biosynthesis_of_cofactors            0.040    0.558
  Cell_envelope                     => 0.756   12.393
  Cellular_processes                   0.033    0.448
  Central_intermediary_metabolism      0.048    0.755
  Energy_metabolism                    0.036    0.397
  Fatty_acid_metabolism                0.016    1.265
  Purines_and_pyrimidines              0.144    0.592
  Regulatory_functions                 0.014    0.087
  Replication_and_transcription        0.020    0.075
  Translation                          0.032    0.735
  Transport_and_binding                0.834    2.033

# Enzyme/nonenzyme                     Prob     Odds
  Enzyme                               0.209    0.729
  Nonenzyme                         => 0.791    1.109

# Enzyme class                         Prob     Odds
  Oxidoreductase (EC 1.-.-.-)          0.056    0.268
  Transferase    (EC 2.-.-.-)          0.031    0.091
  Hydrolase      (EC 3.-.-.-)          0.062    0.195
  Lyase          (EC 4.-.-.-)          0.020    0.430
  Isomerase      (EC 5.-.-.-)          0.010    0.321
  Ligase         (EC 6.-.-.-)          0.017    0.327

# Gene Ontology category               Prob     Odds
  Signal_transducer                    0.374    1.746
  Receptor                             0.128    0.750
  Hormone                           => 0.247   37.936
  Structural_protein                   0.001    0.041
  Transporter                          0.025    0.228
  Ion_channel                          0.010    0.168
  Voltage-gated_ion_channel            0.003    0.131
  Cation_channel                       0.010    0.215
  Transcription                        0.054    0.425
  Transcription_regulation             0.091    0.724
  Stress_response                      0.099    1.128
  Immune_response                      0.178    2.090
  Growth_factor                        0.061    4.379
  Metal_ion_transport                  0.009    0.020

//

"Hormone" corresponds to GO:0005179, which is annotated in http://www.ebi.ac.uk/QuickGO/GProtein?ac=Q9Y5Q6 and is therefore considered correct.


LAMP 1
############## ProtFun 2.2 predictions ##############

>sp_P11279_L

# Functional category                  Prob     Odds
  Amino_acid_biosynthesis              0.011    0.484
  Biosynthesis_of_cofactors            0.053    0.735
  Cell_envelope                     => 0.804   13.186
  Cellular_processes                   0.027    0.373
  Central_intermediary_metabolism      0.138    2.188
  Energy_metabolism                    0.037    0.411
  Fatty_acid_metabolism                0.016    1.265
  Purines_and_pyrimidines              0.533    2.195
  Regulatory_functions                 0.015    0.090
  Replication_and_transcription        0.019    0.073
  Translation                          0.027    0.613
  Transport_and_binding                0.834    2.033

# Enzyme/nonenzyme                     Prob     Odds
  Enzyme                               0.276    0.965
  Nonenzyme                         => 0.724    1.014

# Enzyme class                         Prob     Odds
  Oxidoreductase (EC 1.-.-.-)          0.039    0.187
  Transferase    (EC 2.-.-.-)          0.046    0.134
  Hydrolase      (EC 3.-.-.-)          0.058    0.184
  Lyase          (EC 4.-.-.-)          0.020    0.430
  Isomerase      (EC 5.-.-.-)          0.010    0.321
  Ligase         (EC 6.-.-.-)          0.017    0.326

# Gene Ontology category               Prob     Odds
  Signal_transducer                    0.396    1.849
  Receptor                             0.282    1.659
  Hormone                              0.001    0.206
  Structural_protein                   0.011    0.408
  Transporter                          0.024    0.222
  Ion_channel                          0.008    0.147
  Voltage-gated_ion_channel            0.002    0.111
  Cation_channel                       0.010    0.215
  Transcription                        0.032    0.247
  Transcription_regulation             0.018    0.142
  Stress_response                      0.246    2.795
  Immune_response                   => 0.371    4.368
  Growth_factor                        0.013    0.956
  Metal_ion_transport                  0.009    0.020

//

"Immune_response" corresponds to GO:0006955, which is not annotated in http://www.ebi.ac.uk/QuickGO/GProtein?ac=P11279 .


RET 4
############## ProtFun 2.2 predictions ##############

>sp_P02753_R

# Functional category                  Prob     Odds
  Amino_acid_biosynthesis              0.017    0.751
  Biosynthesis_of_cofactors            0.044    0.610
  Cell_envelope                     => 0.804   13.186
  Cellular_processes                   0.075    1.021
  Central_intermediary_metabolism      0.197    3.128
  Energy_metabolism                    0.043    0.475
  Fatty_acid_metabolism                0.016    1.265
  Purines_and_pyrimidines              0.275    1.131
  Regulatory_functions                 0.013    0.080
  Replication_and_transcription        0.022    0.084
  Translation                          0.032    0.721
  Transport_and_binding                0.800    1.951

# Enzyme/nonenzyme                     Prob     Odds
  Enzyme                            => 0.544    1.900
  Nonenzyme                            0.456    0.639

# Enzyme class                         Prob     Odds
  Oxidoreductase (EC 1.-.-.-)          0.095    0.458
  Transferase    (EC 2.-.-.-)          0.038    0.109
  Hydrolase      (EC 3.-.-.-)          0.235    0.742
  Lyase          (EC 4.-.-.-)       => 0.059    1.264
  Isomerase      (EC 5.-.-.-)          0.010    0.321
  Ligase         (EC 6.-.-.-)          0.017    0.326

# Gene Ontology category               Prob     Odds
  Signal_transducer                    0.202    0.942
  Receptor                             0.147    0.862
  Hormone                              0.004    0.667
  Structural_protein                   0.002    0.058
  Transporter                          0.025    0.232
  Ion_channel                          0.016    0.288
  Voltage-gated_ion_channel            0.003    0.148
  Cation_channel                       0.010    0.215
  Transcription                        0.027    0.207
  Transcription_regulation             0.025    0.196
  Stress_response                      0.161    1.829
  Immune_response                   => 0.239    2.813
  Growth_factor                        0.023    1.617
  Metal_ion_transport                  0.009    0.020

//

"Immune_response" corresponds to GO:0006955, which is not annotated in http://www.ebi.ac.uk/QuickGO/GProtein?ac=P02753 .


Discussion

GOPET clearly outperformed the other methods for GO-Term prediction. Its results were detailed and mostly correct, whereas Pfam and ProtFun returned general terms rather than specific functions. Pfam nevertheless gave at least a general hint about what each protein does. In contrast, the ProtFun predictions were mostly wrong and so general that they are of little use for predicting the actual function of a protein. Furthermore, we could not map all of the ProtFun categories to actual GO identifiers.

References

<references />