by Benjamin Drexler and Fabian Grandke

Find additional information and graphics here.

Proteins

Protein	GLA	BACR_HALSA	RET4_HUMAN	INSL5_HUMAN	LAMP1_HUMAN	A4_HUMAN
Organism	Homo sapiens	Halobacterium salinarum	Homo sapiens	Homo sapiens	Homo sapiens	Homo sapiens
Size	429 AA	262 AA	201 AA	135 AA	417 AA	770 AA
Subcellular location	Lysosome	Cell membrane	Secreted	Secreted	Cell membrane	Membrane
Function	Glycosidase	Photoreceptor protein	Sensory transduction	Hormone	Carbohydrate presentation	Serine protease inhibitor
Transmembrane	No	7 transmembrane regions	No	No	1 transmembrane region	1 transmembrane region
UniProt entry	P06280	P02945	P02753	Q9Y5Q6	P11279	P05067

Secondary structure prediction

PSIPRED

http://bioinf.cs.ucl.ac.uk/psipred/

PSIPRED was developed by David T. Jones at the University of Warwick in 1998. The server is now hosted at University College London. <ref name=PSIPRED>History of PSIPRED</ref>

PSIPRED predicts secondary structure with feed-forward neural networks that have a single hidden layer and are trained by back-propagation.<ref name=Praesi>Talk_Task3</ref> The workflow can be split into three stages:

Sequence profile generation: The neural network receives a position-specific scoring matrix from PSI-BLAST as input
Initial secondary structure prediction: The output layer predicts one of the three secondary structure states
Predicted structure filtering: An additional network filters the raw predictions from the previous step
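The first two stages can be illustrated with a minimal sketch. It is not the PSIPRED implementation: the window of 15 residues, the hidden layer size, the zero padding and the random (untrained) weights are all assumptions chosen for illustration, and the function names are hypothetical. It only shows how a sliding window over a profile feeds a single-hidden-layer feed-forward network that outputs one of three states per residue.

```python
import math
import random

STATES = "HEC"  # helix, strand, coil


def window_features(profile, center, half_width):
    """Concatenate the profile rows around `center`; pad with zeros past the ends."""
    width = len(profile[0])
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(profile):
            features.extend(profile[pos])
        else:
            features.extend([0.0] * width)
    return features


def dense(inputs, weights, biases):
    """One fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(features, w_hidden, b_hidden, w_out, b_out):
    """Single hidden layer (sigmoid) followed by a softmax over the three states."""
    hidden = [1.0 / (1.0 + math.exp(-z)) for z in dense(features, w_hidden, b_hidden)]
    logits = dense(hidden, w_out, b_out)
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def predict(profile, half_width=7, hidden_units=8, seed=0):
    """Return one state per residue; weights are random stand-ins for trained ones."""
    rng = random.Random(seed)
    n_in = (2 * half_width + 1) * len(profile[0])
    w_hidden = random_matrix(hidden_units, n_in, rng)
    b_hidden = [0.0] * hidden_units
    w_out = random_matrix(len(STATES), hidden_units, rng)
    b_out = [0.0] * len(STATES)
    states = []
    for i in range(len(profile)):
        probs = forward(window_features(profile, i, half_width),
                        w_hidden, b_hidden, w_out, b_out)
        states.append(STATES[probs.index(max(probs))])
    return "".join(states)
```

In a real predictor the weights would come from back-propagation training on proteins of known structure, and the raw per-residue output would then be passed to the filtering network.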

Prediction

Figure 1: PSIPRED result for GLA

Figure 1 shows 10 alpha helices, 16 beta strands and 27 coils.
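Element counts like these can be derived from a per-residue state string by counting uninterrupted runs. The sketch below assumes the prediction is available as a string over the letters H (helix), E (strand) and C (coil); the function name is hypothetical.

```python
from collections import Counter
from itertools import groupby


def count_elements(states):
    """Count runs of identical states, e.g. 'CCHHHC' has two coils and one helix."""
    return Counter(state for state, _run in groupby(states))
```

For example, `count_elements("CCHHHHCCEEECCHHC")` counts four coils, two helices and one strand.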

Jpred3

http://www.compbio.dundee.ac.uk/www-jpred/index.html

Jpred3 was developed by C. Cole at the University of Dundee. Like PSIPRED, it uses a neural network to predict the secondary structure. For a single input sequence, the program likewise uses PSI-BLAST sequence profiles. Jpred3 can also take a multiple sequence alignment as input. Both kinds of input are further processed using the Jpred algorithm. <ref name=jpred>Cole et al., "The Jpred 3 secondary structure prediction server.", Nucleic acids research. 2008 Jul 1, PubMed</ref>

The table below shows the PDB entries found by Jpred3 concerning our input sequence:

PDB ID	Chain	Description	E-value
3hg5	B	Alpha-galactosidase A	0.0
3hg5	A	Alpha-galactosidase A	0.0
3hg4	B	Alpha-galactosidase A	0.0
3hg4	A	Alpha-galactosidase A	0.0
3hg2	B	Alpha-galactosidase A	0.0
3hg2	A	Alpha-galactosidase A	0.0
3gxt	B	Alpha-galactosidase A	0.0
3gxt	A	Alpha-galactosidase A	0.0
3gxp	B	Alpha-galactosidase A	0.0
3gxp	A	Alpha-galactosidase A	0.0
3gxn	B	Alpha-galactosidase A	0.0
3gxn	A	Alpha-galactosidase A	0.0
1r47	B	Alpha-galactosidase A	0.0
1r47	A	Alpha-galactosidase A	0.0
1r46	B	Alpha-galactosidase A	0.0
1r46	A	Alpha-galactosidase A	0.0
3hg3	B	Alpha-galactosidase A	0.0
3hg3	A	Alpha-galactosidase A	0.0
3lxc	B	Alpha-galactosidase A	0.0
3lxc	A	Alpha-galactosidase A	0.0
3lxb	B	Alpha-galactosidase A	0.0
3lxb	A	Alpha-galactosidase A	0.0
3lxa	B	Alpha-galactosidase A	0.0
3lxa	A	Alpha-galactosidase A	0.0
3lx9	B	Alpha-galactosidase A	0.0
3lx9	A	Alpha-galactosidase A	0.0
1ktc	A	alpha-N-acetylgalactosaminidase	e-113
1ktb	A	alpha-N-acetylgalactosaminidase	e-113
3igu	B	Alpha-N-acetylgalactosaminidase	e-100
3igu	A	Alpha-N-acetylgalactosaminidase	e-100
3h55	B	Alpha-N-acetylgalactosaminidase	e-100
3h55	A	Alpha-N-acetylgalactosaminidase	e-100
3h54	B	Alpha-N-acetylgalactosaminidase	e-100
3h54	A	Alpha-N-acetylgalactosaminidase	e-100
3h53	B	Alpha-N-acetylgalactosaminidase	e-100
3h53	A	Alpha-N-acetylgalactosaminidase	e-100

The protein highlighted in light blue is the one that was used as the query sequence.

Comparison with DSSP

http://swift.cmbi.ru.nl/servers/html/

DSSP was developed by W. Kabsch and C. Sander in 1983. It is a database of secondary structure assignments for the proteins in the PDB. Because DSSP is not a prediction program itself, it serves as the reference against which the results of a prediction are compared. The DSSP entries are calculated from the 3D coordinates of the PDB entries.<ref name="dssp">Kabsch et al., "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.", Biopolymers, 1983, PubMed</ref>
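A per-residue comparison between a prediction and a DSSP assignment could look like the sketch below. DSSP uses more than three state codes, so they have to be reduced first. The mapping used here (H, G, I to helix; E, B to strand; everything else to coil) is one common convention and an assumption, not necessarily the reduction used for the figure on this page. The function names are hypothetical.

```python
# Assumed reduction of DSSP codes to three states; other conventions exist.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp):
    """Map a DSSP state string to the three states H, E and C."""
    return "".join(DSSP_TO_3.get(code, "C") for code in dssp)


def q3(predicted, observed):
    """Fraction of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)
```

Such a score complements the element counts, since two assignments can contain the same number of helices and strands while placing them at different positions.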

Figure 2: Result of DSSP

A PDF version of Figure 2 is available here: File:GLA DSSP Comp.pdf

Results

Structural Element	PSIPRED	Jpred3	DSSP
Helices	10	10	12
Strands	16	15	16
Coils	27	26	29

Prediction of disordered regions

DISOPRED

http://bioinf.cs.ucl.ac.uk/disopred/

DISOPRED was developed by J.J. Ward, J.S. Sodhi, L.J. McGuffin, B.F. Buxton and D.T. Jones in 2004. It uses a neural network to predict disordered regions. As it is a knowledge-based method, the DISOPRED neural network is trained on X-ray structures from the PDB. The program takes a sequence as input and runs PSI-BLAST against a database. The trained neural network then processes the resulting profile residue by residue and classifies each residue as disordered or not disordered. <ref name=disopred>Ward et al., "Prediction and functional analysis of native disorder in proteins from the three kingdoms of life", Journal of Molecular Biology. 2004, PubMed</ref> <ref name=Praesi>Talk_Task3</ref>

POODLE

http://mbs.cbrc.jp/poodle/poodle.html

POODLE was developed by S. Hirose, K. Shimizu, S. Kanai, Y. Kuroda and T. Noguchi in 2007. It predicts disordered regions using a machine-learning approach.

Four variants are available:<ref name=poodlehp>POODLE Help Page</ref>

POODLE-L<ref name=poodle_l>Hirose et al., "POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions.", PharmaDesign. 2007, PubMed</ref>: Predicts long disordered regions (>40 consecutive residues).

POODLE-I<ref name=poodle_i>POODLE-I</ref>: Uses structural information predictors based on a work-flow approach.

POODLE-S<ref name=poodle_s>Shimizu et al., "POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix.", Bioinformatics. 2007, PubMed</ref>: Predicts short disordered regions. Has two subversions differing in the preparation of the databases:
- Missing residues: Missing regions in X-ray structures.
- High B-Factor residues: Regions with high B-Factors.

POODLE-W<ref name=poodle_w>Shimizu et al., "Predicting mostly disordered proteins by using structure-unknown protein data.", Bioinformatics. 2007, PubMed</ref>: Specialized for mostly disordered proteins.

Only POODLE-S was used for the analysis: since not even short disordered regions were found in our protein, long disordered regions are even less likely.
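Disorder predictors of this kind typically report one score per residue. The sketch below shows how such scores could be turned into regions and split into short and long ones. The cut-off of 0.5 is an assumption and differs between tools; the length limit follows the POODLE-L definition of a long region (more than 40 consecutive residues). The function names are hypothetical.

```python
def disordered_regions(scores, threshold=0.5):
    """Return 1-based (start, end) runs of residues scoring at or above threshold."""
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = position
        elif start is not None:
            regions.append((start, position - 1))
            start = None
    if start is not None:
        regions.append((start, len(scores)))
    return regions


def is_long(region, min_length=41):
    """A region counts as long if it spans more than 40 consecutive residues."""
    start, end = region
    return end - start + 1 >= min_length
```

With these helpers, a protein without any region returned by `disordered_regions` has neither short nor long disorder under the chosen cut-off.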

POODLE-S: Missing residues

POODLE-S: High B-Factor residues

IUPRED

http://iupred.enzim.hu/index.html

IUPRED was developed by Zsuzsanna Dosztányi, Veronika Csizmók, Péter Tompa and István Simon in 2005. It is a prediction method for ordered and disordered regions in protein sequences. IUPRED is based on an energy function and the assumption that there are fewer inter-residue interactions in disordered regions. <ref name=poodle_l>Dosztányi et al., "IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content.", Bioinformatics. 2005, PubMed</ref>

Short Disorder

Long Disorder

META-Disorder

http://www.predictprotein.org/

Note: Registration is required. It is free of charge, but you can submit at most 3 sequences within 12 months.

Metadisorder was developed by Avner Schlessinger and Burkhard Rost in 2005 at Columbia University. It combines different methods and uses various sources of information to predict disordered regions. Metadisorder makes use of the methods described below <ref name=MD>Schlessinger et al., "Improved Disorder Prediction by Combination of Orthogonal Approaches.", PLoS ONE. 2009, PLoS ONE</ref> <ref name=rostlab_MD>Rostlab - Metadisorder</ref>:

PROFbval

PROFbval is a residue mobility prediction method based on the amino-acid sequence. <ref name=rostlab_prof>Rostlab - PROFbval</ref>

NORSnet

NORSnet is a method that identifies unstructured loops, based on neural networks. <ref name=rostlab_nors>Rostlab - NORSnet</ref>

UCON

UCON predicts natively unstructured regions through contacts. <ref name=rostlab_ucon>Rostlab - UCON</ref>

Prediction of transmembrane alpha-helices and signal peptides

Programs

Description of the programs:

TMHMM

TMHMM is a program that predicts transmembrane helices in proteins. It is based on a hidden Markov model and was developed by Sonnhammer et al. in 1998<ref name=sonnhammer>Sonnhammer et al., "A hidden Markov model for predicting transmembrane helices in protein sequences.", Proc Int Conf Intell Syst Mol Biol. 1998, PubMed</ref>.

We used the TMHMM web server with the FASTA sequence of the protein as input. The output contains some statistics (e.g. number of TM helices, expected number of AAs in TM helices, probability that the N-terminus is on the cytoplasmic side of the membrane), a listing of the labeled sequence regions (e.g. inside, outside, TM helix) and a plot of the per-residue probabilities. Additionally, if a large number of the first 60 AAs are predicted to be part of a TM helix, TMHMM indicates that there could be a signal peptide in the N-terminal region.
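The idea of labelling residues with a hidden Markov model can be illustrated with a toy example. The sketch below is not the TMHMM model: it uses only two states (membrane "M" and other "o"), reduces each residue to hydrophobic or not, and all probabilities as well as the hydrophobic residue set are invented for illustration. It decodes the most probable state path with the Viterbi algorithm.

```python
import math

HYDROPHOBIC = set("AILMFVW")  # illustrative choice, not taken from TMHMM

STATES = ("M", "o")  # membrane helix, everything else
START = {"M": 0.1, "o": 0.9}
TRANS = {"M": {"M": 0.9, "o": 0.1}, "o": {"M": 0.05, "o": 0.95}}
# Emission probabilities for "residue is hydrophobic" (True) or not (False).
EMIT = {"M": {True: 0.8, False: 0.2}, "o": {True: 0.3, False: 0.7}}


def viterbi(sequence):
    """Return the most probable state path for the sequence, one label per residue."""
    observations = [residue in HYDROPHOBIC for residue in sequence]
    if not observations:
        return ""
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][obs]))
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

A real topology predictor uses many more states (for example separate inside and outside loop states and helix caps) and parameters estimated from proteins with known topology.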

Phobius/Polyphobius

Phobius was developed by Käll, Krogh and Sonnhammer in 2004 at the Stockholm Bioinformatics Center. It is based on HMMs and predicts transmembrane helices and N-terminal signal peptides. Phobius takes a protein sequence (FASTA format) as input and outputs diagrams and text files containing the predictions. <ref name=phobius>Käll et al., "A combined transmembrane topology and signal peptide prediction method.", J Mol Biol. 2004, PubMed</ref>

Polyphobius was developed by L. Käll, A. Krogh and E.L. Sonnhammer in 2005 at the Stockholm Bioinformatics Center. It is very similar to Phobius and also uses HMMs. In addition, Polyphobius uses information from homologous sequences to improve the prediction accuracy. <ref name=polyphobius>Käll et al., "An HMM posterior decoder for sequence feature prediction that includes homology information.", Bioinformatics. 2005, PubMed</ref>

Octopus/Spoctopus

Octopus was developed by H. Viklund and A. Elofsson in 2008 at the Stockholm Bioinformatics Center. It combines HMMs and artificial neural networks (ANNs) to predict the topology of transmembrane proteins. First, several ANNs make predictions for every single residue; afterwards, HMMs smooth the results and combine them into a consistent prediction.<ref name=octopus>Viklund et al., "OCTOPUS: improving topology prediction by two-track ANN-based preference scores and an extended topological grammar.", Bioinformatics. 2008, PubMed</ref>

Spoctopus was developed by H. Viklund, A. Bernsel, M. Skwark and A. Elofsson in 2008 at the Stockholm Bioinformatics Center. It is almost identical to Octopus, but additionally predicts the signal peptide.<ref name=spoctopus>Viklund et al., "SPOCTOPUS: a combined predictor of signal peptides and membrane protein topology.", Bioinformatics. 2008, PubMed</ref>

TargetP

TargetP was developed by H. Nielsen, J. Engelbrecht, S. Brunak and G. von Heijne in 1997 at the Stockholm Bioinformatics Center. It uses neural networks to predict the subcellular location of eukaryotic proteins. The method is based on the predicted presence of any of the N-terminal presequences (i.e. chloroplast transit peptide, mitochondrial targeting peptide or secretory pathway signal peptide).<ref name=targetp>Emanuelsson et al., "Predicting subcellular localization of proteins based on their N-terminal amino acid sequence.", J Mol Biol. 2000, PubMed</ref>

SignalP

SignalP was developed by H. Nielsen, J. Engelbrecht, S. Brunak and G. von Heijne in 1997 at the Stockholm Bioinformatics Center. It predicts the presence of a signal peptide and provides information about the exact location of the signal peptide cleavage site. The most recent version of the method uses both ANNs and HMMs. <ref name=signalp>Nielsen et al., "Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.", Protein engineering. 1997, PubMed</ref>

Proteins

In the following sections we present the results of the programs for the proteins, i.e. the predictions of transmembrane alpha-helices and signal peptides. The sequence annotations of the UniProt entries (see protein overview) were used as the reference. The tables are coloured according to the following scheme:

green (completely correct): the predicted start and end positions may each differ from the reference by at most one position, e.g. a prediction of the positions 19 - 31 is marked green if the reference is 20 - 30.
- exception: TMHMM cannot predict signal peptides and hence almost always includes the N-terminus of the protein in the first region, even when it is a signal peptide. Therefore we also count the prediction of the first region as correct when it starts at the first residue and the end residue of the first region is correct.
yellow (partially correct): a vast majority of the residues is assigned to the correct region.
red (wrong): the prediction is completely wrong, e.g. a region is predicted which does not exist.
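The colouring scheme can be expressed as a small function. This is a sketch under assumptions: the scheme does not quantify "vast majority", so the value of 0.8 (overlap divided by the combined span of both regions) is an arbitrary stand-in, and the TMHMM exception is left out. The function name and signature are hypothetical.

```python
def classify(reference, predicted, majority=0.8):
    """Return 'green', 'yellow' or 'red' for one region.

    reference and predicted are (start, end) tuples, or None if the region
    is absent from the annotation or the prediction, respectively.
    """
    if reference is None and predicted is None:
        return "green"
    if reference is None or predicted is None:
        return "red"
    ref_start, ref_end = reference
    pred_start, pred_end = predicted
    # Green: both boundaries are off by at most one position.
    if abs(ref_start - pred_start) <= 1 and abs(ref_end - pred_end) <= 1:
        return "green"
    overlap = max(0, min(ref_end, pred_end) - max(ref_start, pred_start) + 1)
    span = max(ref_end, pred_end) - min(ref_start, pred_start) + 1
    # Yellow: assumed cut-off for "a vast majority of the residues".
    if overlap / span >= majority:
        return "yellow"
    return "red"
```

For example, `classify((20, 30), (19, 31))` returns `'green'`, matching the example in the scheme, while a predicted region without a counterpart in the reference, `classify(None, (10, 30))`, returns `'red'`.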

General evaluation

TMHMM does not predict signal peptides and therefore often mistakenly includes the beginning of the protein in the first region. It also missed one transmembrane alpha-helix, and Phobius/Polyphobius often made more accurate predictions.

Octopus has a similar problem to TMHMM: it does not predict signal peptides either, and hence it predicts a transmembrane alpha-helix for the signal peptide. Spoctopus, on the other hand, predicts the transmembrane regions very well, but always misses the beginning of the signal peptide.

Phobius and Polyphobius made very good predictions of the transmembrane alpha-helices, and their signal peptide predictions were completely correct. Since they deliver the whole package (i.e. transmembrane alpha-helix prediction, signal peptide prediction, highest accuracy), we prefer to work with these programs.

GLA

Transmembrane alpha-helices

Region	Reference	TMHMM	Phobius	Polyphobius	Octopus	Spoctopus
Cytoplasmic	-	-	-	-	1 - 9	-
Transmembrane	-	-	-	-	10 - 30	-
Non-cytoplasmic	32 - 429	1 - 429	32 - 429	32 - 429	31 - 429	32 - 429

No program predicted a transmembrane alpha-helix except for Octopus which predicts the signal peptide as a cytoplasmic and transmembrane region. We also observe this problem of Octopus for all of the other five proteins.

Signal peptides

Region	Reference	Phobius	Polyphobius	Spoctopus	SignalP
Signal peptide	1 - 31	1 - 31	1 - 31	11 - 31	1 - 31
N-Region	-	1 - 9	1 - 12	-	-
H-Region	-	10 - 22	13 - 26	-	-
C-Region	-	23 - 31	27 - 31	-	-

All programs recognize the signal peptide, but Spoctopus is the only one that misses the beginning of the sequence.

BACR_HALSA

Transmembrane alpha-helices

Region	Reference	TMHMM	Phobius	Polyphobius	Octopus	Spoctopus
Non-cytoplasmic	14 - 23	1 - 22	1 - 22	1 - 21	1 - 22	1 - 22
Transmembrane	24 - 42	23 - 42	23 - 42	22 - 43	23 - 43	23 - 43
Cytoplasmic	43 - 56	43 - 54	43 - 53	44 - 54	44 - 54	44 - 54
Transmembrane	57 - 75	55 - 77	54 - 76	55 - 77	55 - 75	55 - 75
Non-cytoplasmic	76 - 91	78 - 91	77 - 95	78 - 94	76 - 95	76 - 95
Transmembrane	92 - 109	92 - 114	96 - 114	95 - 114	96 - 116	96 - 116
Cytoplasmic	110 - 120	115 - 120	115 - 120	115 - 120	117 - 121	117 - 120
Transmembrane	121 - 140	121 - 143	121 - 142	121 - 141	122 - 142	121 - 141
Non-cytoplasmic	141 - 147	144 - 147	143 - 147	142 - 147	143 - 147	142 - 147
Transmembrane	148 - 167	148 - 170	148 - 169	148 - 166	148 - 168	148 - 168
Cytoplasmic	168 - 185	171 - 189	170 - 189	167 - 186	169 - 185	169 - 185
Transmembrane	186 - 204	190 - 212	190 - 212	187 - 205	186 - 206	186 - 206
Non-cytoplasmic	205 - 216	213 - 262	213 - 217	206 - 215	207 - 216	207 - 216
Transmembrane	217 - 236	-	218 - 237	216 - 237	217 - 237	217 - 237
Cytoplasmic	237 - 262	-	238 - 262	238 - 262	238 - 262	238 - 262

All programs predict the correct number of transmembrane alpha-helices apart from TMHMM, which misses the last one. Polyphobius has the best result, since it has the highest number of completely correct regions.

Signal peptides

No program predicted a signal peptide, which is correct.

RET4_HUMAN

Transmembrane alpha-helices

Region	Reference	TMHMM	Phobius	Polyphobius	Octopus	Spoctopus
Cytoplasmic	-	-	-	-	1 - 1	-
Transmembrane	-	-	-	-	2 - 23	-
Non-cytoplasmic	19 - 201	1 - 201	19 - 201	19 - 201	24 - 201	20 - 201

All programs predict correctly that this protein does not have a transmembrane alpha-helix except for Octopus.

Signal peptides

Region	Reference	Phobius	Polyphobius	Spoctopus	SignalP
Signal peptide	1 - 18	1 - 18	1 - 18	6 - 19	1 - 18
N-Region	-	1 - 2	1 - 3	-	-
H-Region	-	3 - 13	4 - 13	-	-
C-Region	-	14 - 18	14 - 18	-	-

All programs recognize the signal peptide, but Spoctopus is the only one that misses the beginning of the sequence.

INSL5_HUMAN

Transmembrane alpha-helices

Region	Reference	TMHMM	Phobius	Polyphobius	Octopus	Spoctopus
Cytoplasmic	-	-	-	-	1 - 1	-
Transmembrane	-	-	-	-	2 - 32	-
Non-cytoplasmic	23 - 135	1 - 135	23 - 135	23 - 135	33 - 135	24 - 135

All programs predict correctly that this protein does not have a transmembrane alpha-helix except for Octopus.

Signal peptides

Region	Reference	Phobius	Polyphobius	Spoctopus	SignalP
Signal peptide	1 - 22	1 - 22	1 - 22	6 - 23	1 - 22
N-Region	-	1 - 5	1 - 4	-	-
H-Region	-	6 - 17	5 - 16	-	-
C-Region	-	18 - 22	17 - 22	-	-

All programs recognize the signal peptide, but Spoctopus is the only one that misses the beginning of the sequence.

LAMP1_HUMAN

Transmembrane alpha-helices

Region	Reference	TMHMM	Phobius	Polyphobius	Octopus	Spoctopus
Cytoplasmic	-	1 - 10	-	-	1 - 10	-
Transmembrane	-	11 - 33	-	-	11 - 31	-
Non-cytoplasmic	29 - 382	34 - 383	29 - 381	29 - 381	32 - 383	30 - 383
Transmembrane	383 - 405	384 - 406	382 - 405	382 - 405	384 - 404	384 - 404
Cytoplasmic	406 - 417	407 - 417	406 - 417	406 - 417	405 - 417	405 - 417

Phobius, Polyphobius and Spoctopus predict the one transmembrane alpha-helix correctly. Octopus has the common error that it assigns a cytoplasmic and transmembrane region to the signal peptide of the protein. TMHMM also predicts a transmembrane region for the signal peptide. The developers of TMHMM indicate this problem on the instruction page of TMHMM.

Signal peptides

Region	Reference	Phobius	Polyphobius	Spoctopus	SignalP
Signal peptide	1 - 28	1 - 28	1 - 28	12 - 29	1 - 28
N-Region	-	1 - 10	1 - 9	-	-
H-Region	-	11 - 22	10 - 22	-	-
C-Region	-	23 - 28	23 - 28	-	-

All programs recognize the signal peptide, but Spoctopus is the only one that misses the beginning of the sequence.

A4_HUMAN

Transmembrane alpha-helices

Region	Reference	TMHMM	Phobius	Polyphobius	Octopus	Spoctopus
Cytoplasmic	-	-	-	-	1 - 5	-
Transmembrane	-	-	-	-	6 - 11	-
Non-cytoplasmic	18 - 699	1 - 700	18 - 700	18 - 700	12 - 701	19 - 701
Transmembrane	700 - 723	701 - 723	701 - 723	701 - 723	702 - 722	702 - 722
Cytoplasmic	724 - 770	724 - 770	724 - 770	724 - 770	723 - 770	723 - 770

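Every program predicts the single transmembrane alpha-helix of the protein. Octopus is the exception in that it additionally predicts a transmembrane region (6 - 11) within the signal peptide.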

Signal peptides

Region	Reference	Phobius	Polyphobius	Spoctopus	SignalP
Signal peptide	1 - 17	1 - 17	1 - 17	5 - 18	1 - 17
N-Region	-	1 - 1	1 - 3	-	-
H-Region	-	2 - 12	4 - 12	-	-
C-Region	-	13 - 17	13 - 17	-	-

All programs recognize the signal peptide, but Spoctopus is the only one that misses the beginning of the sequence.

TargetP

http://www.cbs.dtu.dk/services/TargetP/

Name	Length	mTP	SP	other	Loc	RC
GLA	429	0.041	0.860	0.141	S	2
BACR_HALSA	262	0.019	0.897	0.562	S	4
RET4_HUMAN	201	0.242	0.928	0.020	S	2
INSL5_HUMA	135	0.074	0.899	0.037	S	1
LAMP1_HUMA	417	0.043	0.953	0.017	S	1
A4_HUMAN	770	0.035	0.937	0.084	S	1
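The Loc and RC columns can be derived from the three scores. The sketch below assumes that Loc is the class with the highest score and that the reliability class RC is determined by the difference between the highest and the second-highest score, in steps of 0.2 (RC 1 for a difference above 0.8, RC 5 for a difference below 0.2). This is our reading of the TargetP output description; it reproduces every row of the table above. The function name and the label "_" for "other" are assumptions.

```python
def targetp_call(mtp, sp, other):
    """Return (Loc, RC) from the mTP, SP and other scores of one sequence."""
    scores = {"M": mtp, "S": sp, "_": other}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (loc, best), (_second_loc, second) = ranked[0], ranked[1]
    diff = best - second
    if diff > 0.8:
        rc = 1
    elif diff > 0.6:
        rc = 2
    elif diff > 0.4:
        rc = 3
    elif diff > 0.2:
        rc = 4
    else:
        rc = 5
    return loc, rc
```

For BACR_HALSA, `targetp_call(0.019, 0.897, 0.562)` gives `('S', 4)`: a secretory signal is called, but the high "other" score makes it the least reliable call in the table.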

http://www.cbs.dtu.dk/services/TargetP-1.1/output.php

Prediction of GO terms

GOPET

GOPET stands for Gene Ontology term Prediction and Evaluation Tool and was developed by Vinayagam et al. in 2006<ref name=vinayagam>Vinayagam et al., "GOPET: a tool for automated predictions of Gene Ontology terms.", BMC Bioinformatics. 2006 Mar 20, PubMed</ref>. It is based on homology searches on GO-mapped protein databases and uses support vector machines for the calculation of the confidence values.

We used the webserver of GOPET with the default settings (GO aspect: molecular function, maximum number of predictions: 20, confidence threshold: 60, GOPET model 2007 june, version 2.0, GOPET database 2007) and the FASTA-sequence of the protein as input. The results only contain GOids of the GO aspect "molecular function", since the other two GO aspects (cellular component and biological process) were not available.
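The effect of the two result-limiting settings can be sketched as a simple filter. This is an illustration only, not GOPET code: it assumes that the confidence threshold is inclusive and that the predictions are ranked by confidence before the maximum number is applied. The function name is hypothetical.

```python
def filter_predictions(predictions, threshold=60, max_predictions=20):
    """Keep (go_id, confidence) pairs at or above threshold, best first."""
    kept = [p for p in predictions if p[1] >= threshold]
    kept.sort(key=lambda p: p[1], reverse=True)
    return kept[:max_predictions]
```

With the default settings, a prediction with a confidence of 59% would be dropped, while one with exactly 60% would still be listed.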

Pfam

Pfam is a database of protein domain families that is built using profile hidden Markov models (HMMs) and was first described by Sonnhammer et al. in 1997<ref name=sonnhammer>Sonnhammer et al., "Pfam: a comprehensive database of protein domain families based on seed alignments.", Proteins. 1997 Jul, PubMed</ref>. Each protein domain family is represented by a multiple sequence alignment and an HMM. One can search a protein sequence against Pfam and obtain all the domains that the query sequence might contain.

The Pfam database consists of two parts, Pfam-A and Pfam-B, which contain protein domain families of different quality levels. In release 1.0 of Pfam, the protein entries in Pfam-A and Pfam-B came from Swissprot (a few initial members of the seed alignments in Pfam-A came from several sources: Swissprot, Prosite, ProDom etc.). In the current release of Pfam, the entries in Pfam-A and Pfam-B come from Pfamseq (UniProtKB) and ADDA, respectively.

Pfam-A contains the well-characterized, annotated entries. Each family starts with a seed alignment that is built from a few selected representative sequences under manual quality control. The HMM is then applied automatically to create the full alignment and to detect all possible members of each initial family. The families/domains in Pfam-A are of high quality and can be used as reliable annotation/classification evidence for the query sequence.

Pfam-B is created from sequence alignments of the ADDA entries by using HMMs. Entries that already exist in Pfam-A are excluded. The families in Pfam-B have no confirmed annotation and undergo no manual quality checking, so they may contain errors (e.g. the members of a family could be aligned merely by chance) and the overall quality is relatively low. However, Pfam-B can still be useful when no domain evidence for the query sequence is found in Pfam-A.

We used the "sequence search" feature of the Pfam website with the FASTA sequence of the protein to determine potential domains or domain families. Afterwards we checked the corresponding page of each domain (family) for a GO annotation. The search was performed with the default settings (cut-off: use E-value, threshold 1.0), except that we also included Pfam-B. Only one hit was found in Pfam-B, and it has no GO annotation, so including Pfam-B brought no gain. The significance of each hit was classified by the Pfam search algorithm.

ProtFun

ProtFun tries to assign a function to the query protein. For this purpose, it uses the prediction of several other features like post-translational modification sites or localization of the protein. The prediction of these features itself is based on other programs like SignalP, TargetP, NetOGlyc, TMHMM and some others. ProtFun was developed by Jensen et al. in 2002<ref name=jensen_1>Jensen et al., "Prediction of human protein function from post-translational modifications and localization features.", J Mol Biol. 2002 Jun 21, PubMed</ref> and the prediction of the Gene Ontology category was added in 2003<ref name=jensen_2>Jensen et al., "Prediction of human protein function according to Gene Ontology categories.", Bioinformatics. 2003 Mar 22, PubMed</ref>.

We used the webserver of ProtFun 2.2 with the default settings and the FASTA-sequence of the protein as the input. The output contains predictions about the functional category, enzyme/nonenzyme, enzyme class and the Gene Ontology category. In our case, only the result of the latter was relevant. The term 'Prob' represents the calculated probability by ProtFun that the query belongs to the category. This probability is dependent on the prior probability of the category. 'Odds' describes the odds that the query belongs to the certain category and is not influenced by the prior probability.<ref name=ProtFun>Explanation of the ProtFun 2.2 output.</ref> The class with the highest information content and with the highest probability is marked bold. Additionally we provide a table for each query that contains the categories with the highest information content or probability, respectively, and their associated GO id. For this purpose, we used the search feature of the Gene Ontology website.
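The relation between the two columns can be made concrete with a short sketch. It rests on two assumptions that are consistent with the ProtFun tables on this page but are not taken from the ProtFun documentation: that Odds equals Prob divided by the prior probability of the category, and that the category with the highest information content is the one with the largest Odds value. The function names are hypothetical; the numbers are taken from the INSL5_HUMAN result.

```python
def implied_prior(prob, odds):
    """Prior probability implied by a Prob/Odds pair, assuming odds = prob / prior."""
    return prob / odds


def best_category(rows, column):
    """Return the category with the largest value in 'prob' or 'odds'."""
    index = {"prob": 1, "odds": 2}[column]
    return max(rows, key=lambda row: row[index])[0]


# (category, Prob, Odds) rows copied from the INSL5_HUMAN ProtFun output.
INSL5_ROWS = [
    ("Signal_transducer", 0.374, 1.746),
    ("Hormone", 0.247, 37.936),
    ("Immune_response", 0.178, 2.090),
    ("Growth_factor", 0.061, 4.379),
]
```

Here `best_category(INSL5_ROWS, "prob")` selects Signal_transducer and `best_category(INSL5_ROWS, "odds")` selects Hormone. A rare category such as Hormone has a small implied prior, so even a moderate probability yields very large odds.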

Proteins

The results of the programs are listed for the six proteins in the following sections. If a prediction is correct, it is marked with a green background. For this purpose, we used the GO annotation of the corresponding protein at the EBI website (see list of annotated GO ids).

GLA

GOPET

GOid	Confidence	GO term
GO:0016798	98%	hydrolase activity acting on glycosyl bonds
GO:0004553	98%	hydrolase activity hydrolyzing O-glycosyl compounds
GO:0016787	97%	hydrolase activity
GO:0004557	96%	alpha-galactosidase activity
GO:0008456	89%	alpha-N-acetylgalactosaminidase activity

Pfam

Source	Description	Entry type	Significant	GO aspect	GO description	GO id
Pfam-A	Melibiase	Family	x	Molecular function	hydrolase activity, hydrolyzing O-glycosyl compounds	GO:0004553
Pfam-A	Melibiase	Family	x	Biological process	carbohydrate metabolic process	GO:0005975

ProtFun

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.090    0.419
 Receptor                             0.014    0.083
 Hormone                              0.002    0.318
 Structural_protein                   0.004    0.127
 Transporter                          0.024    0.222
 Ion_channel                          0.010    0.169
 Voltage-gated_ion_channel            0.003    0.127
 Cation_channel                       0.010    0.215
 Transcription                        0.047    0.367
 Transcription_regulation             0.026    0.204
 Stress_response                      0.049    0.552
 Immune_response                      0.012    0.136
 Growth_factor                        0.006    0.412
 Metal_ion_transport                  0.009    0.020

Type	GO category	GO aspect	GO id
Highest probability	Signal transducer	Molecular function	GO:0004871

BACR_HALSA

GOPET

GOid	Confidence	GO term
GO:0005216	77%	ion channel activity
GO:0008020	75%	G-protein coupled photoreceptor activity
GO:0015078	60%	hydrogen ion transmembrane transporter activity

Pfam

Source	Description	Entry type	Significant	GO aspect	GO description	GO id
Pfam-A	Bacteriorhodopsin-like protein	Domain	x	Cellular component	membrane	GO:0016020
Pfam-A	Bacteriorhodopsin-like protein	Domain	x	Molecular function	ion channel activity	GO:0005216
Pfam-A	Bacteriorhodopsin-like protein	Domain	x	Biological process	ion transport	GO:0006811
Pfam-A	Domain of unknown function DUF21	Family		-	-	-

ProtFun

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.258    1.205
 Receptor                             0.355    2.087
 Hormone                              0.001    0.206
 Structural_protein                   0.006    0.200
 Transporter                       => 0.440    4.036
 Ion_channel                          0.010    0.169
 Voltage-gated_ion_channel            0.004    0.172
 Cation_channel                       0.078    1.689
 Transcription                        0.026    0.205
 Transcription_regulation             0.028    0.226
 Stress_response                      0.012    0.139
 Immune_response                      0.011    0.128
 Growth_factor                        0.010    0.727
 Metal_ion_transport                  0.049    0.106

Type	GO category	GO aspect	GO id
Highest information content / highest probability	Transporter	Molecular function	GO:0005215

RET4_HUMAN

GOPET

GOid	Confidence	GO term
GO:0005488	90%	binding
GO:0005501	81%	retinoid binding
GO:0008289	80%	lipid binding
GO:0019841	78%	retinol binding
GO:0005215	78%	transporter activity
GO:0016918	78%	retinal binding
GO:0005319	69%	lipid transporter activity
GO:0008035	60%	high-density lipoprotein particle binding

Pfam

Source	Description	Entry type	Significant	GO aspect	GO description	GO id
Pfam-A	Lipocalin / cytosolic fatty-acid binding protein family	Domain	x	Molecular function	binding	GO:0005488
Pfam-A	DspF/AvrF protein	Family		-	-	-
Pfam-B	PB008544	-	-	-	-	-

ProtFun

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.202    0.942
 Receptor                             0.147    0.862
 Hormone                              0.004    0.667
 Structural_protein                   0.002    0.058
 Transporter                          0.025    0.232
 Ion_channel                          0.016    0.288
 Voltage-gated_ion_channel            0.003    0.148
 Cation_channel                       0.010    0.215
 Transcription                        0.027    0.207
 Transcription_regulation             0.025    0.196
 Stress_response                      0.161    1.829
 Immune_response                   => 0.239    2.813
 Growth_factor                        0.023    1.617
 Metal_ion_transport                  0.009    0.020

Type	GO category	GO aspect	GO id
Highest information content / highest probability	Immune response	Biological process	GO:0006955

INSL5_HUMAN

GOPET

GOid	Confidence	GO term
GO:0005179	80%	hormone activity

Pfam

Source	Description	Entry type	Significant	GO aspect	GO description	GO id
Pfam-A	Insulin/IGF/Relaxin family	Domain	x	Cellular component	extracellular region	GO:0005576
Pfam-A	Insulin/IGF/Relaxin family	Domain	x	Molecular function	hormone activity	GO:0005179

ProtFun

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.374    1.746
 Receptor                             0.128    0.750
 Hormone                           => 0.247   37.936
 Structural_protein                   0.001    0.041
 Transporter                          0.025    0.228
 Ion_channel                          0.010    0.168
 Voltage-gated_ion_channel            0.003    0.131
 Cation_channel                       0.010    0.215
 Transcription                        0.054    0.425
 Transcription_regulation             0.091    0.724
 Stress_response                      0.099    1.128
 Immune_response                      0.178    2.090
 Growth_factor                        0.061    4.379
 Metal_ion_transport                  0.009    0.020

Type	GO category	GO aspect	GO id
Highest information content	Hormone	Molecular function	GO:0005179
Highest probability	Signal transducer	Molecular function	GO:0004871

LAMP1_HUMAN

GOPET

GOid	Confidence	GO term
GO:0004812	60%	aminoacyl-tRNA ligase activity
GO:0005524	60%	ATP binding

Pfam

Source	Description	Entry type	Significant	GO aspect	GO description	GO id
Pfam-A	Lysosome-associated membrane glycoprotein	Family	x	Cellular component	membrane	GO:0016020
Pfam-A	Protein of unknown function DUF1180	Family		-	-	-

ProtFun

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.396    1.849
 Receptor                             0.282    1.659
 Hormone                              0.001    0.206
 Structural_protein                   0.011    0.408
 Transporter                          0.024    0.222
 Ion_channel                          0.008    0.147
 Voltage-gated_ion_channel            0.002    0.111
 Cation_channel                       0.010    0.215
 Transcription                        0.032    0.247
 Transcription_regulation             0.018    0.142
 Stress_response                      0.246    2.795
 Immune_response                   => 0.371    4.368
 Growth_factor                        0.013    0.956
 Metal_ion_transport                  0.009    0.020

Type	GO category	GO aspect	GO id
Highest information content	Immune response	Biological process	GO:0006955
Highest probability	Signal transducer	Molecular function	GO:0004871

A4_HUMAN

GOPET

GOid	Confidence	GO term
GO:0004866	87%	endopeptidase inhibitor activity
GO:0004867	86%	serine-type endopeptidase inhibitor activity
GO:0030568	83%	plasmin inhibitor activity
GO:0030304	83%	trypsin inhibitor activity
GO:0030414	82%	peptidase inhibitor activity
GO:0005488	79%	binding
GO:0005515	74%	protein binding
GO:0046872	73%	metal ion binding
GO:0003677	71%	DNA binding
GO:0008201	70%	heparin binding
GO:0008270	69%	zinc ion binding
GO:0005507	69%	copper ion binding
GO:0005506	67%	iron ion binding

Pfam

Source	Description	Entry type	Significant	GO aspect	GO description	GO id
Pfam-A	Amyloid A4 N-terminal heparin-binding	Domain	x	Cellular component	integral to membrane	GO:0016021
Pfam-A	Amyloid A4 N-terminal heparin-binding	Domain	x	Molecular function	binding	GO:0005488
Pfam-A	Copper-binding of amyloid precursor, CuBD	Domain	x	-	-	-
Pfam-A	Kunitz/Bovine pancreatic trypsin inhibitor domain	Domain	x	Molecular function	serine-type endopeptidase inhibitor activity	GO:0004867
Pfam-A	E2 domain of amyloid precursor protein	Domain	x	-	-	-
Pfam-A	Beta-amyloid peptide	Family	x	Cellular component	integral to membrane	GO:0016021
Pfam-A	Beta-amyloid peptide	Family	x	Molecular function	binding	GO:0005488
Pfam-A	beta-amyloid precursor protein C-terminus	Family	x	-	-	-
Pfam-A	Exonuclease VII, large subunit	Family		-	-	-
Pfam-A	Transcriptional activator TraM	Family		-	-	-

ProtFun

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.126    0.586
 Receptor                             0.036    0.211
 Hormone                              0.001    0.206
 Structural_protein                => 0.034    1.205
 Transporter                          0.024    0.222
 Ion_channel                          0.009    0.162
 Voltage-gated_ion_channel            0.002    0.108
 Cation_channel                       0.010    0.215
 Transcription                        0.043    0.335
 Transcription_regulation             0.018    0.143
 Stress_response                      0.076    0.862
 Immune_response                      0.016    0.183
 Growth_factor                        0.005    0.372
 Metal_ion_transport                  0.009    0.020

Type	GO category	GO aspect	GO id
Highest information content	Structural protein	Molecular function	GO:0005198
Highest probability	Signal transducer	Molecular function	GO:0004871

Evaluation of the Results

We used the Gene Ontology annotation of each protein from the EBI website as the reference for the evaluation of the programs (see list of annotated GO ids). We then determined the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) for each program and calculated the sensitivity, TP / (TP + FN), and the specificity, TN / (TN + FP). For this, we created Venn diagrams, which we provide on this page. A true negative of a program was defined as a false positive of one of the other two programs, i.e. a wrong GO term that the program itself did not predict. "Overall" is the evaluation of all six proteins together. The highest value of sensitivity and of specificity for each protein is marked with a green background.
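
The counting scheme described above can be written with plain set operations. This is a minimal Python sketch under stated assumptions: the function confusion_counts is hypothetical, and the GO ids "GO:A" to "GO:D" and "GO:X" are placeholders, not real annotations. The toy sets are only chosen so that the resulting counts have the same shape as the INSL5_HUMAN rows of the table.

```python
def confusion_counts(reference, predictions, program):
    """Return (TP, FP, TN, FN) for one program.

    reference   -- set of annotated GO ids (the gold standard)
    predictions -- dict mapping program name to its set of predicted GO ids
    program     -- name of the program to evaluate
    """
    predicted = predictions[program]
    tp = predicted & reference   # predicted and annotated
    fp = predicted - reference   # predicted but not annotated
    fn = reference - predicted   # annotated but not predicted

    # True negatives: false positives of the other programs
    # that this program did not predict itself.
    other_fp = set()
    for name, terms in predictions.items():
        if name != program:
            other_fp |= terms - reference
    tn = other_fp - predicted

    return len(tp), len(fp), len(tn), len(fn)


# Placeholder data (hypothetical ids, not the real annotation).
reference = {"GO:A", "GO:B", "GO:C", "GO:D"}
predictions = {
    "GOPET": {"GO:A"},
    "Pfam": {"GO:A", "GO:B"},
    "ProtFun": {"GO:A", "GO:X"},
}

for program in predictions:
    print(program, confusion_counts(reference, predictions, program))
```

With these placeholder sets the sketch yields (1, 0, 1, 3) for GOPET, (2, 0, 1, 2) for Pfam and (1, 1, 0, 3) for ProtFun.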

Protein	Program	True positives	False positives	True negatives	False negatives	Sensitivity	Specificity
GLA	GOPET	4	1	1	18	0.18	0.5
	Pfam	2	0	2	20	0.09	1
	ProtFun	0	1	1	22	0	0.5
BACR_HALSA	GOPET	1	2	1	11	0.08	0.33
	Pfam	3	0	3	9	0.25	1
	ProtFun	0	1	2	12	0	0.67
RET4_HUMAN	GOPET	5	3	1	36	0.12	0.25
	Pfam	1	0	4	40	0.02	1
	ProtFun	0	1	3	41	0	0.75
INSL5_HUMAN	GOPET	1	0	1	3	0.25	1
	Pfam	2	0	1	2	0.5	1
	ProtFun	1	1	0	3	0.25	0
LAMP1_HUMAN	GOPET	0	2	2	17	0	0.5
	Pfam	1	0	4	16	0.06	1
	ProtFun	0	2	2	17	0	0.5
A4_HUMAN	GOPET	7	6	2	71	0.09	0.25
	Pfam	3	0	8	75	0.04	1
	ProtFun	0	2	6	78	0	0.75
Overall	GOPET	18	14	8	156	0.10	0.36
	Pfam	12	0	22	162	0.07	1
	ProtFun	1	8	14	173	0.01	0.64
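
As a cross-check of the table, the "Overall" row of a program can be recomputed by pooling its per-protein counts. The following Python sketch does this for GOPET; the function and variable names are illustrative, and it assumes that "Overall" is obtained by summing the counts over all six proteins rather than by averaging the per-protein ratios.

```python
def sensitivity(tp, fn):
    """Fraction of annotated GO terms that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of negatives that were not predicted: TN / (TN + FP)."""
    return tn / (tn + fp)


# GOPET counts per protein as (TP, FP, TN, FN), copied from the table above.
GOPET_ROWS = {
    "GLA": (4, 1, 1, 18),
    "BACR_HALSA": (1, 2, 1, 11),
    "RET4_HUMAN": (5, 3, 1, 36),
    "INSL5_HUMAN": (1, 0, 1, 3),
    "LAMP1_HUMAN": (0, 2, 2, 17),
    "A4_HUMAN": (7, 6, 2, 71),
}

# Pool the counts column by column to obtain the "Overall" row.
pooled = tuple(sum(column) for column in zip(*GOPET_ROWS.values()))
tp, fp, tn, fn = pooled

print("Pooled counts (TP, FP, TN, FN):", pooled)
print("Sensitivity:", round(sensitivity(tp, fn), 2))
print("Specificity:", round(specificity(tn, fp), 2))
```

The pooled GOPET counts are 18, 14, 8 and 156, which gives a sensitivity of 0.10 and a specificity of 0.36, in agreement with the table.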

Two things are remarkable. First, Pfam does not produce a single false positive prediction, and hence its specificity is 1. This leads us to the conclusion that the Pfam search itself is highly specific and that the annotation of the domains and families in Pfam is very accurate with respect to Gene Ontology. Second, ProtFun achieved only one true positive, and thus its sensitivity is close to 0. This can be explained by the fact that ProtFun only predicts very general Gene Ontology categories (e.g. immune response, receptor) and therefore misses a lot of more specific sub-annotations. However, even five out of six of its main-category predictions were wrong, and therefore we would not recommend using ProtFun for the prediction of GO terms.

With respect to sensitivity, GOPET achieves the highest overall value with 0.10, which is still very low. Since the overall sensitivity of Pfam is only slightly lower (0.07) and the specificity of Pfam is 1, we would prefer Pfam for the prediction of GO terms. Overall, the sensitivity of all three programs is very low, so keep in mind that you will miss a lot of GO terms when using any one of them.

List of annotated GO ids

References
