Sequence-based predictions GLA

From Bioinformatikpedia

by Benjamin Drexler and Fabian Grandke

Find additional information and graphics here.

Proteins

Protein               GLA           BACR_HALSA                RET4_HUMAN            INSL5_HUMAN   LAMP1_HUMAN                A4_HUMAN
Organism              Homo sapiens  Halobacterium salinarium  Homo sapiens          Homo sapiens  Homo sapiens               Homo sapiens
Size                  429 AA        262 AA                    201 AA                135 AA        417 AA                     770 AA
Subcellular location  Lysosome      Cell membrane             Secreted              Secreted      Cell membrane              Membrane
Function              Glycosidase   Photoreceptor protein     Sensory transduction  Hormone       Carbohydrate presentation  Serine protease inhibitor
Transmembrane         No            7 transmembrane regions   No                    No            1 transmembrane region     1 transmembrane region
UniProt entry         P06280        P02945                    P02753                Q9Y5Q6        P11279                     P05067

Secondary structure prediction

PSIPRED

http://bioinf.cs.ucl.ac.uk/psipred/

PSIPRED was developed by David T. Jones at the University of Warwick in 1998. The server is now hosted at University College London. <ref name=PSIPRED>History of PSIPRED</ref>

PSIPRED predicts secondary structure with feed-forward neural networks that have a single hidden layer and are trained by back-propagation.<ref name=Praesi>Talk_Task3</ref> The workflow can be split into three stages:

  1. Sequence profile generation: the neural network receives a position-specific scoring matrix from PSI-BLAST as input
  2. Initial secondary structure prediction: the output layer predicts one of the three secondary structure states
  3. Filtering of the predicted structure: an additional network filters the raw predictions from the previous step

PSIPRED takes an amino acid sequence as input. The output is the predicted secondary structure, as shown in Figure 1.

Prediction

Figure 1:PSIPRED result for GLA

Figure 1 shows 10 alpha helices, 16 beta strands and 27 coils.
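Element counts like these can be derived from the three-state prediction string by counting contiguous runs of the same state. The following is a minimal sketch; the function name and the toy string are illustrative assumptions and not the actual PSIPRED output for GLA.

```python
from itertools import groupby


def count_segments(ss: str) -> dict:
    """Count contiguous runs of each state in a three-state string (H, E, C)."""
    counts = {"H": 0, "E": 0, "C": 0}
    for state, _run in groupby(ss):
        # Every run of identical letters is one structural element.
        counts[state] = counts.get(state, 0) + 1
    return counts


# Toy string for illustration only (not the GLA prediction).
example = "CCCHHHHHCCEEEECCCHHHC"
print(count_segments(example))
```

Applied to the full prediction string of a protein, the same function yields the number of helices, strands and coils reported for a figure like Figure 1.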

Jpred3

http://www.compbio.dundee.ac.uk/www-jpred/index.html

Jpred3 was developed by C. Cole at the University of Dundee. Like PSIPRED, it uses a neural network to predict the secondary structure. For a single input sequence, the program also uses PSI-BLAST sequence profiles. Jpred3 can additionally take a multiple sequence alignment as input. Both kinds of input are further processed using the Jpred algorithm. <ref name=jpred>Cole et al., "The Jpred 3 secondary structure prediction server.", Nucleic acids research. 2008 Jul 1, PubMed</ref>


The table below shows the PDB entries found by Jpred3 concerning our input sequence:

EBI Chain Description E-value
3hg5 B Alpha-galactosidase A 0.0
3hg5 A Alpha-galactosidase A 0.0
3hg4 B Alpha-galactosidase A 0.0
3hg4 A Alpha-galactosidase A 0.0
3hg2 B Alpha-galactosidase A 0.0
3hg2 A Alpha-galactosidase A 0.0
3gxt B Alpha-galactosidase A 0.0
3gxt A Alpha-galactosidase A 0.0
3gxp B Alpha-galactosidase A 0.0
3gxp A Alpha-galactosidase A 0.0
3gxn B Alpha-galactosidase A 0.0
3gxn A Alpha-galactosidase A 0.0
1r47 B Alpha-galactosidase A 0.0
1r47 A Alpha-galactosidase A 0.0
1r46 B Alpha-galactosidase A 0.0
1r46 A Alpha-galactosidase A 0.0
3hg3 B Alpha-galactosidase A 0.0
3hg3 A Alpha-galactosidase A 0.0
3lxc B Alpha-galactosidase A 0.0
3lxc A Alpha-galactosidase A 0.0
3lxb B Alpha-galactosidase A 0.0
3lxb A Alpha-galactosidase A 0.0
3lxa B Alpha-galactosidase A 0.0
3lxa A Alpha-galactosidase A 0.0
3lx9 B Alpha-galactosidase A 0.0
3lx9 A Alpha-galactosidase A 0.0
1ktc A alpha-N-acetylgalactosaminidase e-113
1ktb A alpha-N-acetylgalactosaminidase e-113
3igu B Alpha-N-acetylgalactosaminidase e-100
3igu A Alpha-N-acetylgalactosaminidase e-100
3h55 B Alpha-N-acetylgalactosaminidase e-100
3h55 A Alpha-N-acetylgalactosaminidase e-100
3h54 B Alpha-N-acetylgalactosaminidase e-100
3h54 A Alpha-N-acetylgalactosaminidase e-100
3h53 B Alpha-N-acetylgalactosaminidase e-100
3h53 A Alpha-N-acetylgalactosaminidase e-100

The lightblue colored protein is the protein that was used as query sequence.

We ignored those results and forced Jpred3 to make a prediction anyway. Among other things, the program's output contained a prediction of the secondary structure.

Comparison with DSSP

http://swift.cmbi.ru.nl/servers/html/

DSSP was developed by W. Kabsch and C. Sander in 1983. It is a database of secondary structure assignments for each protein in the PDB. DSSP is not a prediction program: its entries are calculated from the 3D coordinates of the PDB entries, so they can serve as a reference against which the results of a prediction are compared.<ref name="dssp">Kabsch et al., "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.", Biopolymers, 1983, PubMed</ref>

Figure 2 is a color-coded picture of the assignment by DSSP. The letter code used by DSSP has been translated to the three-state code H = helix, E = strand and C = coil.
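The translation from the DSSP letter code to three states can be sketched as a simple lookup. The page does not state which reduction was used, so the mapping below (G and I counted as helix, B counted as strand, everything else as coil) is an assumption; it is one common convention among several.

```python
# Assumed 8-to-3 state reduction; the mapping actually used for Figure 2 is not documented here.
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",   # alpha, 3-10 and pi helix -> helix
    "E": "E", "B": "E",             # extended strand and isolated bridge -> strand
    "T": "C", "S": "C", " ": "C", "-": "C",  # turn, bend, blank -> coil
}


def reduce_dssp(dssp: str) -> str:
    """Translate a DSSP state string to the three-state code H/E/C."""
    # Unknown symbols are treated as coil.
    return "".join(DSSP_TO_3STATE.get(code, "C") for code in dssp)


print(reduce_dssp("HHGG EEB TS-"))
```

A different convention (for example counting only H as helix) would change the number of helices and strands, which matters when comparing element counts between tools.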

Figure 2: Result of DSSP

Find a pdf version of Figure 2 here: File:GLA DSSP Comp.pdf


Results

Structural Element PSIPRED Jpred3 DSSP
Helices 10 10 12
Strands 16 15 16
Coils 27 26 29

The results of Jpred3 and PSIPRED are very similar to the reference assignment by DSSP. Although the numbers of elements are not identical, the predictions are very close and can thus be considered successful.
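Element counts do not show whether the elements lie at the same positions. A per-residue agreement score (often called Q3) is a common complement to such a comparison. The sketch below is not part of the analysis described on this page; the function name and the toy strings are illustrative assumptions.

```python
def q3(predicted: str, reference: str) -> float:
    """Fraction of residues whose three-state label matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("prediction and reference must have the same length")
    if not reference:
        raise ValueError("empty sequences cannot be compared")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy strings for illustration only.
print(q3("HHHCCEEE", "HHCCCEEC"))
```

In practice the reference string would be the three-state DSSP assignment restricted to the residues that are resolved in the structure.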

Prediction of disordered regions

As GLA was not found in the DisProt database we tried to predict disordered regions with several tools.

DISOPRED

An online version of DISOPRED is available at http://bioinf.cs.ucl.ac.uk/disopred/, but we also ran it locally. To do so, we first had to adapt some paths and then ran it with the command sudo ./runpsipred.

DISOPRED was developed by JJ. Ward, JS. Sodhi, LJ. McGuffin, BF. Buxton and DT. Jones in 2004. A neural network is used to predict disordered regions. As it is a knowledge-based method, the DISOPRED neural network is trained on X-ray structures from the PDB. The program takes a sequence as input and runs PSI-BLAST against a database. The trained neural network then classifies each residue, based on its profile, as disordered or not disordered. <ref name=disopred>Ward et al., "Prediction and functional analysis of native disorder in proteins from the three kingdoms of life", Journal of Molecular Biology. 2004, PubMed</ref> <ref name=Praesi>Talk_Task3</ref>

Figure 3: Prediction of disordered regions by DISOPRED

Figure 3 shows that DISOPRED predicts two disordered regions: one at the very beginning of the protein sequence and one at the end. The light grey dotted line shows that another region around position ~35 was predicted as well, but because DISOPRED filters its results, this peak is smoothed out.

POODLE

http://mbs.cbrc.jp/poodle/poodle.html

POODLE was developed by S. Hirose, K. Shimizu, S. Kanai, Y. Kuroda and T. Noguchi in 2007. It predicts disordered regions using a machine-learning approach.

There are four different variants available:<ref name=poodlehp>POODLE Help Page</ref>

  • POODLE-L<ref name=poodle_l>Hirose et al., "POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions.", PharmaDesign. 2007, PubMed</ref>: Predicts long disordered regions (>40 consecutive residues).
  • POODLE-I<ref name=poodle_i>POODLE-I</ref>: Uses structural information predictors based on a work-flow approach.
  • POODLE-S<ref name=poodle_s>Shimizu et al., "POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix.", Bioinformatics. 2007, PubMed</ref>: Predicts short disordered regions. Has two subversions differing in the preparation of the databases:
    • Missing residues: Missing regions in X-ray structures.
    • High B-Factor residues: Regions with high B-Factors.
  • POODLE-W<ref name=poodle_w>Shimizu et al., "Predicting mostly disordered proteins by using structure-unknown protein data.", Bioinformatics. 2007, PubMed</ref>: Specialized on mostly disordered proteins.

For the analysis we used only POODLE-S: not even short disordered regions were found within our protein, which makes long disordered regions even less likely. All variants need only an amino acid sequence as input and provide both a plot and a textual output of the predicted regions.

POODLE-S: Missing residues

Figure 4: Prediction of disordered regions by POODLE-S, with "Missing Residues" parameter

Figure 4 shows two disordered regions at both ends of the amino acid sequence, predicted by POODLE-S using the "Missing Residues" option. Additionally, there are two peaks, at positions ~35 and ~400, that come close to the disorder cutoff value.

POODLE-S: High B-Factor residues

Figure 5: Prediction of disordered regions by POODLE-S, with "High B-Factor Residues" parameter

Figure 5 shows the prediction by POODLE-S using the "High B-Factor residues" option. Only one disordered region is predicted, at the very beginning of the protein sequence. The curve contains many small peaks, so the result is very noisy.

IUPRED

http://iupred.enzim.hu/index.html

IUPRED was developed by Zsuzsanna Dosztányi, Veronika Csizmók, Péter Tompa and István Simon in 2005. It is a prediction method for ordered and disordered regions in protein sequences. IUPRED is based on an energy function and the assumption that there are fewer inter-residue interactions in disordered regions. <ref name=iupred>Dosztányi et al., "IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content.", Bioinformatics. 2005, PubMed</ref> The input of IUPRED is just an amino acid sequence. The output is a file containing a predicted value for each position in the sequence. We used R scripts to create the figures.
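Per-position values of this kind can be turned into region calls by thresholding. The sketch below is written in Python rather than R and is not the script used for the figures; the cutoff of 0.5, the optional minimum length and the toy scores are assumptions for illustration.

```python
def disordered_regions(scores, cutoff=0.5, min_length=1):
    """Return 1-based inclusive (start, end) ranges where the score exceeds the cutoff."""
    scores = list(scores)
    regions = []
    start = None
    for pos, score in enumerate(scores, start=1):
        if score > cutoff:
            if start is None:
                start = pos  # a new region opens here
        elif start is not None:
            if pos - start >= min_length:
                regions.append((start, pos - 1))
            start = None
    # Close a region that runs to the end of the sequence.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


# Toy scores for illustration only.
toy = [0.9, 0.8, 0.2, 0.1, 0.6, 0.7, 0.3, 0.9]
print(disordered_regions(toy))
print(disordered_regions(toy, min_length=2))
```

Raising `min_length` suppresses isolated peaks, which is one simple way to deal with noisy curves.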

Short Disorder

Figure 6: Prediction of disordered regions by IUPRED with "Short Disorder" parameter

Figure 6 shows three disordered regions, predicted by IUPRED using the "Short Disorder" option. The first one is at the beginning of the sequence, the next one at position ~107 and the last one at the end of the sequence. Additionally, there are two significant peaks at positions ~275 and ~320, but the predicted values are smaller than the cutoff, so they are not predicted as disordered regions.

Long Disorder

Figure 7: Prediction of disordered regions by IUPRED with "Long Disorder" parameter

Figure 7 shows the noisy prediction of IUPRED in the "Long Disorder" mode. It predicts only one disordered region at position ~107.

META-Disorder

http://www.predictprotein.org/

Hint: You will have to register. Registration is free of charge, but you can submit at most 3 sequences within the next 12 months!

Metadisorder was developed by Avner Schlessinger and Burkhard Rost in 2005 at Columbia University. It combines different methods and uses various sources of information to predict disordered regions. The program takes an amino acid sequence as input. The resulting file contains the predictions of the individual programs; we used an R script to create the figure. Metadisorder makes use of the methods described below <ref name=MD>Schlessinger et al., "Improved Disorder Prediction by Combination of Orthogonal Approaches.", PLoS ONE. 2009, PLoS ONE</ref> <ref name=rostlab_MD>Rostlab - Metadisorder</ref>:

PROFbval

PROFbval is a residue mobility prediction method based on the amino-acid sequence. <ref name=rostlab_prof>Rostlab - PROFbval</ref>

NORSnet

NORSnet is a method that identifies unstructured loops, based on neural networks. <ref name=rostlab_nors>Rostlab - NORSnet</ref>

UCON

UCON predicts natively unstructured regions through contacts. <ref name=rostlab_ucon>Rostlab - UCON</ref>

Figure 8 shows the results of META-Disorder (black line). Only one disordered region is predicted, at the very beginning of the sequence.

Figure 8: Prediction of disordered regions by META-Disorder

Evaluation of the results

Summary of the disordered-region predictions:

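Position  Disopred  POODLE-S(MR)  POODLE-S(HBF)  IUPRED(SHORT)  IUPRED(LONG)  META-Disorder
~3        yes       yes           yes            yes            no            yes
~107      no        no            no             yes            yes           no
~429      yes       yes           no             yes            no            no

(yes = a disordered region is predicted at this position, as described for Figures 3 to 8)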

The disordered region at the beginning of the sequence is predicted by all but one method, and the one at the end by 50% of the programs used. Unfortunately, predictions of disorder at the termini of protein sequences are often false positives because of the character of electron density maps, in which terminal residues are frequently not resolved. The disordered region at position ~107 is predicted only by the two IUPRED methods and cannot be confirmed by the other programs, so a true disordered region at that position is very unlikely as well. In summary, we conclude that there is no disordered region in the protein.
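The support for each region can be tallied directly from the calls described for Figures 3 to 8. The following is a minimal sketch; the region labels and the function name are illustrative assumptions, and the calls are transcribed from the figure descriptions above.

```python
# Disorder calls per method, transcribed from the descriptions of Figures 3 to 8.
calls = {
    "DISOPRED": {"N-terminus", "C-terminus"},
    "POODLE-S (MR)": {"N-terminus", "C-terminus"},
    "POODLE-S (HBF)": {"N-terminus"},
    "IUPRED (short)": {"N-terminus", "~107", "C-terminus"},
    "IUPRED (long)": {"~107"},
    "META-Disorder": {"N-terminus"},
}


def support(region: str, calls: dict) -> float:
    """Fraction of methods that predict the given region as disordered."""
    return sum(region in regions for regions in calls.values()) / len(calls)


for region in ("N-terminus", "~107", "C-terminus"):
    print(region, round(support(region, calls), 2))
```

The fractions reproduce the statements in the text: five of six methods support the N-terminal region, three of six the C-terminal region and two of six the region at ~107.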

Prediction of transmembrane alpha-helices and signal peptides

Programs

All programs but TMHMM were run on web servers and only take an amino acid sequence as input.

TMHMM

TMHMM is a program that predicts transmembrane helices in proteins. It is based on a hidden Markov model and was developed by Sonnhammer et al. in 1998<ref name=sonnhammer>Sonnhammer et al., "A hidden Markov model for predicting transmembrane helices in protein sequences.", Proc Int Conf Intell Syst Mol Biol. 1998, PubMed</ref>.

We used the webserver of TMHMM with the FASTA sequence of the protein as input. The output contains some statistics (e.g. number of TM helices, expected number of AAs in TM helices, probability that the N-terminus is on the cytoplasmic side of the membrane), a listing of the labeled sequence areas (e.g. inside, outside, TM helix) and a plot of the probabilities for the residues. Additionally, if a large number of the first 60 AAs are predicted to be part of a TM helix, TMHMM indicates that there could be a signal peptide in the N-terminal region.

Phobius/Polyphobius

Phobius was developed by Käll, Krogh and Sonnhammer in 2004 at the Stockholm Bioinformatics Center. It is based on HMMs and predicts transmembrane helices and N-terminal signal peptides. Phobius takes a protein sequence (FASTA format) as input and outputs diagrams and text files containing the predictions. <ref name=phobius>Käll et al., "A combined transmembrane topology and signal peptide prediction method.", J Mol Biol. 2004, PubMed</ref>

Polyphobius was developed by L. Käll, A. Krogh and EL. Sonnhammer in 2005 at the Stockholm Bioinformatics Center. It is very similar to Phobius and also uses HMMs. Additionally, Polyphobius uses information from homologous sequences to improve the prediction accuracy. <ref name=polyphobius>Käll et al., "An HMM posterior decoder for sequence feature prediction that includes homology information.", Bioinformatics. 2005, PubMed</ref>

We used the web server for both versions of the program.

Octopus/Spoctopus

Octopus was developed by H. Viklund and A. Elofsson in 2008 at the Stockholm Bioinformatics Center. It combines HMMs and artificial neural networks (ANNs) to predict the topology of transmembrane proteins. First, several ANNs make predictions for every single residue; afterwards, HMMs smooth the results and combine them into a consistent topology prediction.<ref name=octopus>Viklund et al., "OCTOPUS: improving topology prediction by two-track ANN-based preference scores and an extended topological grammar.", Bioinformatics. 2008, PubMed</ref>

Spoctopus was developed by H. Viklund, A. Bernsel, M. Skwark and A. Elofsson in 2008 at the Stockholm Bioinformatics Center. It is almost identical to Octopus, but additionally predicts the signal peptide.<ref name=spoctopus>Viklund et al., "SPOCTOPUS: a combined predictor of signal peptides and membrane protein topology.", Bioinformatics. 2008, PubMed</ref>

We used the web server for both versions of the program.


TargetP

TargetP was developed by O. Emanuelsson, H. Nielsen, S. Brunak and G. von Heijne in 2000 at the Stockholm Bioinformatics Center. It uses neural networks to predict the subcellular location of eukaryotic proteins. The method is based on the predicted presence of any of the N-terminal presequences (i.e. chloroplast transit peptide, mitochondrial targeting peptide or secretory pathway signal peptide).<ref name=targetp>Emanuelsson et al., "Predicting subcellular localization of proteins based on their N-terminal amino acid sequence.", J Mol Biol. 2000, PubMed</ref>

We used the web server of the program.

SignalP

SignalP was developed by H. Nielsen, J. Engelbrecht, S. Brunak and G. von Heijne in 1997 at the Stockholm Bioinformatics Center. It predicts the presence of a signal peptide and provides information about the exact location of signal peptide cleavage sites. The most recent version of the method uses both artificial neural networks (ANNs) and HMMs. <ref name=signalp>Nielsen et al., "Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.", Protein engineering. 1997, PubMed</ref>

We used the web server of the program.

Proteins

In the following sections we present the results of the programs for the proteins, i.e. the predictions of transmembrane alpha-helices and signal peptides. The sequence annotation of the UniProt entries (see protein overview) was used as the reference. For the colouring of the tables, we used the following scheme:

  • green (completely correct): each predicted boundary may differ from the reference by at most one position, e.g. a prediction of the positions 19 - 31 is marked green if the reference is 20 - 30.
    • exception: TMHMM cannot predict signal peptides, so it almost always includes the N-terminus of the protein in the first region even when it is a signal peptide. We therefore also count the first region as correct when it starts at the first residue and its end residue is correct.
  • yellow (partially correct): the vast majority of the residues are assigned to the correct region.
  • red (wrong): the prediction is completely wrong, e.g. a region is predicted that does not exist.
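The colouring scheme above can be sketched as a small classification function. The one-position tolerance and the TMHMM exception follow the scheme; the threshold for "vast majority" (here 80% of the reference residues), the function name and the parameter names are hypothetical choices, since the scheme does not quantify them.

```python
def classify(pred, ref, majority=0.8, tmhmm_first_region=False):
    """Classify a predicted region against the reference as green, yellow or red.

    pred and ref are inclusive (start, end) tuples, or None if the region is absent.
    """
    if pred is None and ref is None:
        return "green"
    if pred is None or ref is None:
        return "red"  # region missed, or predicted although it does not exist
    start_ok = abs(pred[0] - ref[0]) <= 1
    end_ok = abs(pred[1] - ref[1]) <= 1
    # Exception for TMHMM: a first region starting at residue 1 counts as a correct start.
    if tmhmm_first_region and pred[0] == 1:
        start_ok = True
    if start_ok and end_ok:
        return "green"
    # Assumed reading of "vast majority": share of reference residues covered by the prediction.
    overlap = max(0, min(pred[1], ref[1]) - max(pred[0], ref[0]) + 1)
    if overlap / (ref[1] - ref[0] + 1) >= majority:
        return "yellow"
    return "red"


print(classify((19, 31), (20, 30)))
print(classify((55, 77), (57, 75)))
print(classify((10, 30), None))
```

With a different `majority` value, borderline regions move between yellow and red, so the threshold should be fixed before tables are coloured.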

General evaluation

TMHMM does not predict signal peptides and therefore often mistakenly includes the beginning of the protein in the first region. It also missed one transmembrane alpha-helix, and Phobius/Polyphobius often made more accurate predictions.

Octopus has a similar problem to TMHMM: it does not predict signal peptides either and hence predicts a transmembrane alpha-helix in place of the signal peptide. Spoctopus, on the other hand, predicts the transmembrane regions very well, but always misses the beginning of the signal peptide.

Phobius and Polyphobius made very good predictions of the transmembrane alpha-helices, and their predictions of the signal peptides were completely correct. Since they deliver the whole package (i.e. transmembrane alpha-helix prediction, signal peptide prediction, highest accuracy), we prefer to work with these programs.

GLA

The graphical output of the programs is provided on this page.

Transmembrane alpha-helices

Region           Reference  TMHMM    Phobius  Polyphobius  Octopus  Spoctopus
Cytoplasmic      -          -        -        -            1-9      -
Transmembrane    -          -        -        -            10-30    -
Non-cytoplasmic  32-429     1-429    32-429   32-429       31-429   32-429

No program predicted a transmembrane alpha-helix except for Octopus, which assigns a cytoplasmic and a transmembrane region to the signal peptide. We observe the same problem with Octopus for the other four proteins that have a signal peptide.

Signal peptides

Region          Reference  Phobius  Polyphobius  Spoctopus  SignalP
Signal peptide  1-31       1-31     1-31         11-31      1-31
N-Region        -          1-9      1-12         -          -
H-Region        -          10-22    13-26        -          -
C-Region        -          23-31    27-31        -          -

All programs recognize the signal peptide; only Spoctopus misses its beginning.

BACR_HALSA

The graphical output of the programs is provided on this page.

Transmembrane alpha-helices

Region           Reference  TMHMM    Phobius  Polyphobius  Octopus  Spoctopus
Non-cytoplasmic  14-23      1-22     1-22     1-21         1-22     1-22
Transmembrane    24-42      23-42    23-42    22-43        23-43    23-43
Cytoplasmic      43-56      43-54    43-53    44-54        44-54    44-54
Transmembrane    57-75      55-77    54-76    55-77        55-75    55-75
Non-cytoplasmic  76-91      78-91    77-95    78-94        76-95    76-95
Transmembrane    92-109     92-114   96-114   95-114       96-116   96-116
Cytoplasmic      110-120    115-120  115-120  115-120      117-121  117-120
Transmembrane    121-140    121-143  121-142  121-141      122-142  121-141
Non-cytoplasmic  141-147    144-147  143-147  142-147      143-147  142-147
Transmembrane    148-167    148-170  148-169  148-166      148-168  148-168
Cytoplasmic      168-185    171-189  170-189  167-186      169-185  169-185
Transmembrane    186-204    190-212  190-212  187-205      186-206  186-206
Non-cytoplasmic  205-216    213-262  213-217  206-215      207-216  207-216
Transmembrane    217-236    -        218-237  216-237      217-237  217-237
Cytoplasmic      237-262    -        238-262  238-262      238-262  238-262

All programs predict the correct number of transmembrane alpha-helices apart from TMHMM, which misses the last one. Polyphobius has the best result, since it has the highest number of completely correctly predicted regions.

Signal peptides

No program predicted a signal peptide, which is correct.

RET4_HUMAN

The graphical output of the programs is provided on this page.

Transmembrane alpha-helices

Region           Reference  TMHMM    Phobius  Polyphobius  Octopus  Spoctopus
Cytoplasmic      -          -        -        -            1-1      -
Transmembrane    -          -        -        -            2-23     -
Non-cytoplasmic  19-201     1-201    19-201   19-201       24-201   20-201

All programs except Octopus correctly predict that this protein does not have a transmembrane alpha-helix.

Signal peptides

Region          Reference  Phobius  Polyphobius  Spoctopus  SignalP
Signal peptide  1-18       1-18     1-18         6-19       1-18
N-Region        -          1-2      1-3          -          -
H-Region        -          3-13     4-13         -          -
C-Region        -          14-18    14-18        -          -

All programs recognize the signal peptide; only Spoctopus misses its beginning.

INSL5_HUMAN

The graphical output of the programs is provided on this page.

Transmembrane alpha-helices

Region           Reference  TMHMM    Phobius  Polyphobius  Octopus  Spoctopus
Cytoplasmic      -          -        -        -            1-1      -
Transmembrane    -          -        -        -            2-32     -
Non-cytoplasmic  23-135     1-135    23-135   23-135       33-135   24-135

All programs except Octopus correctly predict that this protein does not have a transmembrane alpha-helix.

Signal peptides

Region          Reference  Phobius  Polyphobius  Spoctopus  SignalP
Signal peptide  1-22       1-22     1-22         6-23       1-22
N-Region        -          1-5      1-4          -          -
H-Region        -          6-17     5-16         -          -
C-Region        -          18-22    17-22        -          -

All programs recognize the signal peptide; only Spoctopus misses its beginning.

LAMP1_HUMAN

The graphical output of the programs is provided on this page.

Transmembrane alpha-helices

Region           Reference  TMHMM    Phobius  Polyphobius  Octopus  Spoctopus
Cytoplasmic      -          1-10     -        -            1-10     -
Transmembrane    -          11-33    -        -            11-31    -
Non-cytoplasmic  29-382     34-383   29-381   29-381       32-383   30-383
Transmembrane    383-405    384-406  382-405  382-405      384-404  384-404
Cytoplasmic      406-417    407-417  406-417  406-417      405-417  405-417

Phobius, Polyphobius and Spoctopus predict the one transmembrane alpha-helix correctly. Octopus has the common error that it assigns a cytoplasmic and transmembrane region to the signal peptide of the protein. TMHMM also predicts a transmembrane region for the signal peptide. The developers of TMHMM indicate this problem on the instruction page of TMHMM.

Signal peptides

Region          Reference  Phobius  Polyphobius  Spoctopus  SignalP
Signal peptide  1-28       1-28     1-28         12-29      1-28
N-Region        -          1-10     1-9          -          -
H-Region        -          11-22    10-22        -          -
C-Region        -          23-28    23-28        -          -

All programs recognize the signal peptide; only Spoctopus misses its beginning.

A4_HUMAN

The graphical output of the programs is provided on this page.

Transmembrane alpha-helices

Region           Reference  TMHMM    Phobius  Polyphobius  Octopus  Spoctopus
Cytoplasmic      -          -        -        -            1-5      -
Transmembrane    -          -        -        -            6-11     -
Non-cytoplasmic  18-699     1-700    18-700   18-700       12-701   19-701
Transmembrane    700-723    701-723  701-723  701-723      702-722  702-722
Cytoplasmic      724-770    724-770  724-770  724-770      723-770  723-770

Every program predicts the single transmembrane alpha-helix of the protein; Octopus additionally predicts a spurious transmembrane region within the signal peptide.

Signal peptides

Region          Reference  Phobius  Polyphobius  Spoctopus  SignalP
Signal peptide  1-17       1-17     1-17         5-18       1-17
N-Region        -          1-1      1-3          -          -
H-Region        -          2-12     4-12         -          -
C-Region        -          13-17    13-17        -          -

All programs recognize the signal peptide; only Spoctopus misses its beginning.

TargetP

The result of TargetP is shown in the following table. An explanation of the output is given on this page.

Name Length mTP SP other Loc RC
GLA 429 0.041 0.860 0.141 S 2
BACR_HALSA 262 0.019 0.897 0.562 S 4
RET4_HUMAN 201 0.242 0.928 0.020 S 2
INSL5_HUMA 135 0.074 0.899 0.037 S 1
LAMP1_HUMA 417 0.043 0.953 0.017 S 1
A4_HUMAN 770 0.035 0.937 0.084 S 1


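TargetP predicts a signal peptide for all six proteins and therefore assigns each of them to the secretory pathway. This agrees with the reference for the five human proteins. For BACR_HALSA the result should be treated with caution: TargetP is designed for eukaryotic proteins, none of the other programs found a signal peptide in this archaeal protein, and its prediction has the weakest reliability class of the six (RC 4).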
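The Loc and RC columns can be related to the three scores. As we understand the TargetP output description, Loc is the category with the highest score, and the reliability class RC is derived from the difference between the highest and the second-highest score in steps of 0.2 (RC 1 is the most reliable). The thresholds in the sketch below are stated under that assumption; they reproduce the RC column of the table above. The function names are illustrative.

```python
def predicted_location(scores: dict) -> str:
    """Return the label with the highest score (M = mitochondrion, S = secretory, _ = other)."""
    return max(scores, key=scores.get)


def reliability_class(scores: dict) -> int:
    """Derive RC from the gap between the two highest scores (assumed 0.2-wide bins)."""
    ordered = sorted(scores.values(), reverse=True)
    diff = ordered[0] - ordered[1]
    if diff > 0.8:
        return 1
    if diff > 0.6:
        return 2
    if diff > 0.4:
        return 3
    if diff > 0.2:
        return 4
    return 5


# Scores copied from the TargetP table on this page (columns mTP, SP, other).
rows = {
    "GLA":         {"M": 0.041, "S": 0.860, "_": 0.141},
    "BACR_HALSA":  {"M": 0.019, "S": 0.897, "_": 0.562},
    "RET4_HUMAN":  {"M": 0.242, "S": 0.928, "_": 0.020},
    "INSL5_HUMAN": {"M": 0.074, "S": 0.899, "_": 0.037},
    "LAMP1_HUMAN": {"M": 0.043, "S": 0.953, "_": 0.017},
    "A4_HUMAN":    {"M": 0.035, "S": 0.937, "_": 0.084},
}

for name, scores in rows.items():
    print(name, predicted_location(scores), reliability_class(scores))
```

A high RC number therefore signals that a competing category scored almost as high as the winning one, as is the case for BACR_HALSA.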

Prediction of GO terms

GOPET

GOPET stands for Gene Ontology term Prediction and Evaluation Tool and was developed by Vinayagam et al. in 2006<ref name=vinayagam>Vinayagam et al., "GOPET: a tool for automated predictions of Gene Ontology terms.", BMC Bioinformatics. 2006 Mar 20, PubMed</ref>. It is based on homology searches on GO-mapped protein databases and uses support vector machines for the calculation of the confidence values.

We used the webserver of GOPET with the default settings (GO aspect: molecular function, maximum number of predictions: 20, confidence threshold: 60, GOPET model 2007 june, version 2.0, GOPET database 2007) and the FASTA-sequence of the protein as input. The results only contain GOids of the GO aspect "molecular function", since the other two GO aspects (cellular component and biological process) were not available.

Pfam

Pfam is a database of protein domain families that is built using profile hidden Markov models (HMMs). It was first described by Sonnhammer et al. in 1997<ref name=sonnhammer_pfam>Sonnhammer et al., "Pfam: a comprehensive database of protein domain families based on seed alignments.", Proteins. 1997 Jul, PubMed</ref>. Each protein domain family is represented by a multiple sequence alignment and an HMM. A protein sequence can be searched against Pfam to obtain all domains that the query sequence might contain.

The Pfam database consists of two parts, A and B, which contain protein domain families of different quality levels. In release 1.0 of Pfam, the protein entries in Pfam-A and Pfam-B came from Swissprot (a few initial members of the seed alignments in Pfam-A came from several sources: Swissprot, Prosite, ProDom etc.). In the current release of Pfam, the entries in Pfam-A and Pfam-B come from Pfamseq (UniProtKB) and ADDA, respectively.

Pfam-A contains the well-characterized, annotated entries. Each family starts with a seed alignment built from a few selected representative sequences under manual quality control. An HMM is then applied automatically to create the full alignment and to detect all possible members of the family. The families/domains in Pfam-A are of high quality and can be used as reliable evidence for the annotation/classification of a query sequence.

Pfam-B is created with HMMs from sequence alignments of the ADDA entries. Entries that already exist in Pfam-A are excluded. The families in Pfam-B have no confirmed annotation and undergo no manual quality checking; they may therefore contain errors (e.g. the members of a family could be aligned merely by chance), and the overall quality is relatively low. However, Pfam-B can still be useful when no domain evidence for the query sequence is found in Pfam-A.

We used the "sequence search" feature of the Pfam website with the FASTA sequence of the protein to determine potential domains or domain families. Afterwards we checked the corresponding page of each domain (family) for a GO annotation. The search was performed with the default settings (cut-off: use E-Value, threshold 1.0), but we also included Pfam-B in the search. Only one hit was found in Pfam-B; it does not have any GO annotation, so there was no gain in including Pfam-B. The classification of a hit as significant or not was done by the Pfam search algorithm.

ProtFun

ProtFun tries to assign a function to the query protein. For this purpose, it uses the prediction of several other features like post-translational modification sites or localization of the protein. The prediction of these features itself is based on other programs like SignalP, TargetP, NetOGlyc, TMHMM and some others. ProtFun was developed by Jensen et al. in 2002<ref name=jensen_1>Jensen et al., "Prediction of human protein function from post-translational modifications and localization features.", J Mol Biol. 2002 Jun 21, PubMed</ref> and the prediction of the Gene Ontology category was added in 2003<ref name=jensen_2>Jensen et al., "Prediction of human protein function according to Gene Ontology categories.", Bioinformatics. 2003 Mar 22, PubMed</ref>.

We used the webserver of ProtFun 2.2 with the default settings and the FASTA sequence of the protein as the input. The output contains predictions about the functional category, enzyme/nonenzyme, enzyme class and the Gene Ontology category. In our case, only the result of the latter was relevant. The term 'Prob' is the probability calculated by ProtFun that the query belongs to the category. This probability depends on the prior probability of the category. 'Odds' describes the odds that the query belongs to the given category and is not influenced by the prior probability.<ref name=ProtFun>Explanation of the ProtFun 2.2 output.</ref> The class with the highest information content and the class with the highest probability are marked bold. Additionally, we provide a table for each query that contains the categories with the highest information content or probability, respectively, and their associated GO id. For this purpose, we used the search feature of the Gene Ontology website.
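The Gene Ontology part of the ProtFun output can be read programmatically. The sketch below assumes the plain-text layout of the ProtFun tables shown on this page (category name, optional "=>" marker, Prob, Odds); the function name is illustrative, and the sample lines are an excerpt of the BACR_HALSA table.

```python
def parse_protfun_go(block: str):
    """Parse ProtFun GO-category lines into (name, prob, odds, flagged) tuples."""
    rows = []
    for line in block.strip().splitlines():
        flagged = "=>" in line  # marker that ProtFun puts in front of one category
        fields = line.replace("=>", " ").split()
        if len(fields) != 3:
            continue  # header or malformed line
        name, prob, odds = fields
        try:
            rows.append((name, float(prob), float(odds), flagged))
        except ValueError:
            continue  # non-numeric columns
    return rows


# Excerpt of the BACR_HALSA table on this page.
sample = """
 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.258    1.205
 Receptor                             0.355    2.087
 Transporter                       => 0.440    4.036
 Ion_channel                          0.010    0.169
"""

rows = parse_protfun_go(sample)
best_by_prob = max(rows, key=lambda r: r[1])
flagged = [r[0] for r in rows if r[3]]
print(best_by_prob[0], flagged)
```

Sorting the parsed rows by the Odds column instead gives the ranking that is independent of the prior probability of each category.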

Proteins

The results of the programs are listed for each protein in the following sections. If a prediction is correct, it is marked with a green background. For this purpose, we used the GO annotation of the corresponding protein at the EBI website (see list of annotated GO ids).

GLA

GOPET

GOid Confidence GO term
GO:0016798 98% hydrolase activity acting on glycosyl bonds
GO:0004553 98% hydrolase activity hydrolyzing O-glycosyl compounds
GO:0016787 97% hydrolase activity
GO:0004557 96% alpha-galactosidase activity
GO:0008456 89% alpha-N-acetylgalactosaminidase activity

Pfam

Source Description Entry type Significant GO aspect GO description GO id
Pfam-A Melibiase Family x Molecular function hydrolase activity, hydrolyzing O-glycosyl compounds GO:0004553
Pfam-A Melibiase Family x Biological process carbohydrate metabolic process GO:0005975

ProtFun

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.090    0.419
 Receptor                             0.014    0.083
 Hormone                              0.002    0.318
 Structural_protein                   0.004    0.127
 Transporter                          0.024    0.222
 Ion_channel                          0.010    0.169
 Voltage-gated_ion_channel            0.003    0.127
 Cation_channel                       0.010    0.215
 Transcription                        0.047    0.367
 Transcription_regulation             0.026    0.204
 Stress_response                      0.049    0.552
 Immune_response                      0.012    0.136
 Growth_factor                        0.006    0.412
 Metal_ion_transport                  0.009    0.020

Type GO category GO aspect GO id
Highest probability Signal transducer Molecular function GO:0004871


BACR_HALSA

GOPET

GOid Confidence GO term
GO:0005216 77% ion channel activity
GO:0008020 75% G-protein coupled photoreceptor activity
GO:0015078 60% hydrogen ion transmembrane transporter activity

Pfam

Source Description Entry type Significant GO aspect GO description GO id
Pfam-A Bacteriorhodopsin-like protein Domain x Cellular component membrane GO:0016020
Pfam-A Bacteriorhodopsin-like protein Domain x Molecular function ion channel activity GO:0005216
Pfam-A Bacteriorhodopsin-like protein Domain x Biological process ion transport GO:0006811
Pfam-A Domain of unknown function DUF21 Family - - -

ProtFun

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.258    1.205
 Receptor                             0.355    2.087
 Hormone                              0.001    0.206
 Structural_protein                   0.006    0.200
 Transporter                       => 0.440    4.036
 Ion_channel                          0.010    0.169
 Voltage-gated_ion_channel            0.004    0.172
 Cation_channel                       0.078    1.689
 Transcription                        0.026    0.205
 Transcription_regulation             0.028    0.226
 Stress_response                      0.012    0.139
 Immune_response                      0.011    0.128
 Growth_factor                        0.010    0.727
 Metal_ion_transport                  0.049    0.106

Type GO category GO aspect GO id
Highest information content / highest probability Transporter Molecular function GO:0005215


RET4_HUMAN

GOPET

GOid Confidence GO term
GO:0005488 90% binding
GO:0005501 81% retinoid binding
GO:0008289 80% lipid binding
GO:0019841 78% retinol binding
GO:0005215 78% transporter activity
GO:0016918 78% retinal binding
GO:0005319 69% lipid transporter activity
GO:0008035 60% high-density lipoprotein particle binding

Pfam

Source Description Entry type Significant GO aspect GO description GO id
Pfam-A Lipocalin / cytosolic fatty-acid binding protein family Domain x Molecular function binding GO:0005488
Pfam-A DspF/AvrF protein Family - - -
Pfam-B PB008544 - - - - -

ProtFun

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.202    0.942
 Receptor                             0.147    0.862
 Hormone                              0.004    0.667
 Structural_protein                   0.002    0.058
 Transporter                          0.025    0.232
 Ion_channel                          0.016    0.288
 Voltage-gated_ion_channel            0.003    0.148
 Cation_channel                       0.010    0.215
 Transcription                        0.027    0.207
 Transcription_regulation             0.025    0.196
 Stress_response                      0.161    1.829
 Immune_response                   => 0.239    2.813
 Growth_factor                        0.023    1.617
 Metal_ion_transport                  0.009    0.020

Type GO category GO aspect GO id
Highest information content / highest probability Immune response Biological process GO:0006955


INSL5_HUMAN

GOPET

GOid Confidence GO term
GO:0005179 80% hormone activity

Pfam

Source Description Entry type Significant GO aspect GO description GO id
Pfam-A Insulin/IGF/Relaxin family Domain x Cellular component extracellular region GO:0005576
Pfam-A Insulin/IGF/Relaxin family Domain x Molecular function hormone activity GO:0005179

ProtFun

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.374    1.746
 Receptor                             0.128    0.750
 Hormone                           => 0.247   37.936
 Structural_protein                   0.001    0.041
 Transporter                          0.025    0.228
 Ion_channel                          0.010    0.168
 Voltage-gated_ion_channel            0.003    0.131
 Cation_channel                       0.010    0.215
 Transcription                        0.054    0.425
 Transcription_regulation             0.091    0.724
 Stress_response                      0.099    1.128
 Immune_response                      0.178    2.090
 Growth_factor                        0.061    4.379
 Metal_ion_transport                  0.009    0.020

Type GO category GO aspect GO id
Highest information content Hormone Molecular function GO:0005179
Highest probability Signal transducer Molecular function GO:0004871


LAMP1_HUMAN

GOPET

GOid Confidence GO term
GO:0004812 60% aminoacyl-tRNA ligase activity
GO:0005524 60% ATP binding

Pfam

Source Description Entry type Significant GO aspect GO description GO id
Pfam-A Lysosome-associated membrane glycoprotein Family x Cellular component membrane GO:0016020
Pfam-A Protein of unknown function DUF1180 Family - - -

ProtFun

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.396    1.849
 Receptor                             0.282    1.659
 Hormone                              0.001    0.206
 Structural_protein                   0.011    0.408
 Transporter                          0.024    0.222
 Ion_channel                          0.008    0.147
 Voltage-gated_ion_channel            0.002    0.111
 Cation_channel                       0.010    0.215
 Transcription                        0.032    0.247
 Transcription_regulation             0.018    0.142
 Stress_response                      0.246    2.795
 Immune_response                   => 0.371    4.368
 Growth_factor                        0.013    0.956
 Metal_ion_transport                  0.009    0.020

Type GO category GO aspect GO id
Highest information content Immune response Biological process GO:0006955
Highest probability Signal transducer Molecular function GO:0004871


A4_HUMAN

GOPET

GOid Confidence GO term
GO:0004866 87% endopeptidase inhibitor activity
GO:0004867 86% serine-type endopeptidase inhibitor activity
GO:0030568 83% plasmin inhibitor activity
GO:0030304 83% trypsin inhibitor activity
GO:0030414 82% peptidase inhibitor activity
GO:0005488 79% binding
GO:0005515 74% protein binding
GO:0046872 73% metal ion binding
GO:0003677 71% DNA binding
GO:0008201 70% heparin binding
GO:0008270 69% zinc ion binding
GO:0005507 69% copper ion binding
GO:0005506 67% iron ion binding

Pfam

Source Description Entry type Significant GO aspect GO description GO id
Pfam-A Amyloid A4 N-terminal heparin-binding Domain x Cellular component integral to membrane GO:0016021
Pfam-A Amyloid A4 N-terminal heparin-binding Domain x Molecular function binding GO:0005488
Pfam-A Copper-binding of amyloid precursor, CuBD Domain x - - -
Pfam-A Kunitz/Bovine pancreatic trypsin inhibitor domain Domain x Molecular function serine-type endopeptidase inhibitor activity GO:0004867
Pfam-A E2 domain of amyloid precursor protein Domain x - - -
Pfam-A Beta-amyloid peptide Family x Cellular component integral to membrane GO:0016021
Pfam-A Beta-amyloid peptide Family x Molecular function binding GO:0005488
Pfam-A beta-amyloid precursor protein C-terminus Family x - - -
Pfam-A Exonuclease VII, large subunit Family - - -
Pfam-A Transcriptional activator TraM Family - - -

ProtFun

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.126    0.586
 Receptor                             0.036    0.211
 Hormone                              0.001    0.206
 Structural_protein                => 0.034    1.205
 Transporter                          0.024    0.222
 Ion_channel                          0.009    0.162
 Voltage-gated_ion_channel            0.002    0.108
 Cation_channel                       0.010    0.215
 Transcription                        0.043    0.335
 Transcription_regulation             0.018    0.143
 Stress_response                      0.076    0.862
 Immune_response                      0.016    0.183
 Growth_factor                        0.005    0.372
 Metal_ion_transport                  0.009    0.020

Type GO category GO aspect GO id
Highest information content Structural protein Molecular function GO:0005198
Highest probability Signal transducer Molecular function GO:0004871

Evaluation of the Results

We used the Gene Ontology annotation of the corresponding protein at the EBI website as the reference for evaluating the programs (see list of annotated GO ids). We then determined the true positives, false positives, true negatives and false negatives and calculated the sensitivity and specificity. For this, we created Venn diagrams, which we provide on this page. A true negative was defined as a false positive of one of the other two programs. "Overall" is the evaluation across all six proteins. The highest sensitivity and specificity for each protein are marked with a green background.
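The counting scheme described above can be written down with set operations. This is a sketch of our reading of that scheme rather than the procedure we actually ran (we counted from Venn diagrams). The function name `confusion_counts` and the toy term labels are made up for illustration and are not real GO ids. We also assume that a false positive of another program counts as a true negative only if the program under evaluation did not predict it as well.

```python
def confusion_counts(predictions, annotated, program):
    """Count TP, FP, TN and FN for one program.

    predictions: dict mapping program name -> set of predicted terms
    annotated:   set of reference terms for the protein
    program:     name of the program to evaluate
    """
    predicted = predictions[program]
    true_pos = predicted & annotated
    false_pos = predicted - annotated
    false_neg = annotated - predicted

    # True negatives: wrong terms proposed by the other programs
    # that this program (correctly) did not propose.
    others_false_pos = set()
    for name, terms in predictions.items():
        if name != program:
            others_false_pos |= terms - annotated
    true_neg = others_false_pos - predicted

    return {
        "TP": len(true_pos),
        "FP": len(false_pos),
        "TN": len(true_neg),
        "FN": len(false_neg),
    }


if __name__ == "__main__":
    # Toy example with placeholder labels (NOT real GO ids).
    annotated = {"a", "b", "c", "d"}
    predictions = {
        "GOPET": {"a", "b", "x"},
        "Pfam": {"a"},
        "ProtFun": {"y"},
    }
    for program in predictions:
        print(program, confusion_counts(predictions, annotated, program))
```

With this definition, a program that makes no wrong predictions collects the false positives of the other two programs as its true negatives.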

Protein Program True positives False positives True negatives False negatives Sensitivity Specificity
GLA GOPET 4 1 1 18 0.18 0.5
GLA Pfam 2 0 2 20 0.09 1
GLA ProtFun 0 1 1 22 0 0.5
BACR_HALSA GOPET 1 2 1 11 0.08 0.33
BACR_HALSA Pfam 3 0 3 9 0.25 1
BACR_HALSA ProtFun 0 1 2 12 0 0.67
RET4_HUMAN GOPET 5 3 1 36 0.12 0.25
RET4_HUMAN Pfam 1 0 4 40 0.02 1
RET4_HUMAN ProtFun 0 1 3 41 0 0.75
INSL5_HUMAN GOPET 1 0 1 3 0.25 1
INSL5_HUMAN Pfam 2 0 1 2 0.5 1
INSL5_HUMAN ProtFun 1 1 0 3 0.25 0
LAMP1_HUMAN GOPET 0 2 2 17 0 0.5
LAMP1_HUMAN Pfam 1 0 4 16 0.06 1
LAMP1_HUMAN ProtFun 0 2 2 17 0 0.5
A4_HUMAN GOPET 7 6 2 71 0.09 0.25
A4_HUMAN Pfam 3 0 8 75 0.04 1
A4_HUMAN ProtFun 0 2 6 78 0 0.75
Overall GOPET 18 14 8 156 0.10 0.36
Overall Pfam 12 0 22 162 0.07 1
Overall ProtFun 1 8 14 173 0.01 0.64
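The sensitivity and specificity columns follow the usual definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below recomputes the "Overall" row for GOPET from the per-protein counts in the table; it assumes that "Overall" pools the counts over all proteins before taking the ratios, which matches the numbers shown. The helper names are our own.

```python
# (TP, FP, TN, FN) for GOPET, copied from the evaluation table on this page.
GOPET_COUNTS = {
    "GLA": (4, 1, 1, 18),
    "BACR_HALSA": (1, 2, 1, 11),
    "RET4_HUMAN": (5, 3, 1, 36),
    "INSL5_HUMAN": (1, 0, 1, 3),
    "LAMP1_HUMAN": (0, 2, 2, 17),
    "A4_HUMAN": (7, 6, 2, 71),
}


def sensitivity(tp, fn):
    """Fraction of annotated terms that were predicted."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn, fp):
    """Fraction of wrong candidate terms that were not predicted."""
    return tn / (tn + fp) if (tn + fp) else float("nan")


def pooled(counts):
    """Sum TP, FP, TN and FN column-wise over all proteins."""
    return tuple(sum(column) for column in zip(*counts.values()))


if __name__ == "__main__":
    tp, fp, tn, fn = pooled(GOPET_COUNTS)
    print(f"pooled counts: TP={tp} FP={fp} TN={tn} FN={fn}")
    print(f"sensitivity={sensitivity(tp, fn):.2f} "
          f"specificity={specificity(tn, fp):.2f}")
```

Pooling the counts weights each protein by its number of annotated GO terms, so A4_HUMAN and RET4_HUMAN dominate the overall sensitivity. Averaging the per-protein ratios instead would give different values.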


Two things are remarkable. First, Pfam does not make a single false positive prediction, and hence its specificity is 1. We conclude that the Pfam search itself is highly specific and that the Gene Ontology annotation of the Pfam domains and families is very accurate. Second, ProtFun achieved only one true positive, and thus its sensitivity is close to 0. This can be explained by the fact that ProtFun predicts only very general Gene Ontology categories (e.g. immune response, receptor) and therefore misses a lot of more specific annotations. Nevertheless, five out of six predictions of the main categories were also wrong, and therefore we would not recommend using ProtFun for the prediction of GO terms.

With respect to sensitivity, GOPET achieves the highest value with 0.10, which is still very low. Since the overall sensitivity of Pfam is only slightly lower (0.07) and its specificity is 1, we would prefer Pfam for the prediction of GO terms. Overall, the sensitivity is very low, so keep in mind that you will miss a lot of GO terms when using any of these programs.

List of annotated GO ids

References

<references />