Sequence-based predictions GLA
by Benjamin Drexler and Fabian Grandke
Find additional information and graphics here.
Proteins
Protein | GLA | BACR_HALSA | RET4_HUMAN | INSL5_HUMAN | LAMP1_HUMAN | A4_HUMAN |
---|---|---|---|---|---|---|
Organism | Homo sapiens | Halobacterium salinarium | Homo sapiens | Homo sapiens | Homo sapiens | Homo sapiens |
Size | 429 AA | 262 AA | 201 AA | 135 AA | 417 AA | 770 AA |
Subcellular location | Lysosome | Cell membrane | Secreted | Secreted | Cell membrane | Membrane |
Function | Glycosidase | Photoreceptor protein | Sensory transduction | Hormone | Carbohydrate presentation | Serine protease inhibitor |
Transmembrane | No | 7 transmembrane regions | No | No | 1 transmembrane region | 1 transmembrane region |
UniProt entry | P06280 | P02945 | P02753 | Q9Y5Q6 | P11279 | P05067 |
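All tools used on this page take a plain amino acid sequence (usually in FASTA format) as input. The following is a minimal sketch of how the six sequences could be prepared. The REST URL pattern is an assumption about the current UniProt website and is not taken from this page; the parser itself works on any FASTA text.

```python
# Accessions from the protein table above.
ACCESSIONS = {
    "GLA": "P06280",
    "BACR_HALSA": "P02945",
    "RET4_HUMAN": "P02753",
    "INSL5_HUMAN": "Q9Y5Q6",
    "LAMP1_HUMAN": "P11279",
    "A4_HUMAN": "P05067",
}


def uniprot_fasta_url(accession: str) -> str:
    """Build a FASTA download URL (assumed UniProt REST pattern)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text: str) -> dict:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Toy record (not a real sequence) to show the parser's behaviour.
toy = ">toy|TEST\nMKT\nAYI\n"
print(parse_fasta(toy))
print(uniprot_fasta_url(ACCESSIONS["GLA"]))
```

The downloaded text can be passed to `parse_fasta` and the resulting sequence pasted into the web forms of the prediction servers.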
Secondary structure prediction
PSIPRED
http://bioinf.cs.ucl.ac.uk/psipred/
PSIPRED was developed by David T. Jones at the University of Warwick in 1998. The server is now hosted at University College London. <ref name=PSIPRED>History of PSIPRED</ref>
PSIPRED predicts secondary structure with feed-forward neural networks that have a single hidden layer and are trained by back-propagation.<ref name=Praesi>Talk_Task3</ref> The workflow can be split into three stages:
- Sequence profile generation: The neural network gets a position-specific scoring matrix from PSI-Blast as input
- Initial secondary structure prediction: The output layer predicts one of the three secondary structure states
- Predicted structure filtering: An additional network filters the raw predictions from the previous step
PSIPRED takes an amino acid sequence as input. The output is the predicted secondary structure as shown in Figure 1.
Prediction
Figure 1 shows 10 alpha helices, 16 beta strands and 27 coils.
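Element counts like the ones above can be derived from a three-state prediction string by counting consecutive runs of the same state. A minimal sketch, assuming the prediction is available as a string of H, E and C characters (the example string is made up and is not the GLA prediction):

```python
from itertools import groupby


def count_segments(ss: str) -> dict:
    """Count runs of identical states in a secondary structure string."""
    counts = {}
    for state, _run in groupby(ss):
        counts[state] = counts.get(state, 0) + 1
    return counts


example = "CCHHHHCCEEEECCHHHCC"  # made-up three-state string
print(count_segments(example))   # two helices, one strand, four coils
```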
Jpred3
http://www.compbio.dundee.ac.uk/www-jpred/index.html
Jpred3 was developed by C. Cole at the University of Dundee. Like PSIPRED, it uses a neural network to predict the secondary structure. For a single input sequence the program also uses PSI-Blast sequence profiles. Jpred3 can take multiple sequence alignments as input as well. Both kinds of input are further processed using the Jpred algorithm. <ref name=jpred>Cole et al., "The Jpred 3 secondary structure prediction server.", Nucleic acids research. 2008 Jul 1, PubMed</ref>
The table below shows the PDB entries found by Jpred3 concerning our input sequence:
EBI | Chain | Description | E-value |
---|---|---|---|
3hg5 | B | Alpha-galactosidase A | 0.0 |
3hg5 | A | Alpha-galactosidase A | 0.0 |
3hg4 | B | Alpha-galactosidase A | 0.0 |
3hg4 | A | Alpha-galactosidase A | 0.0 |
3hg2 | B | Alpha-galactosidase A | 0.0 |
3hg2 | A | Alpha-galactosidase A | 0.0 |
3gxt | B | Alpha-galactosidase A | 0.0 |
3gxt | A | Alpha-galactosidase A | 0.0 |
3gxp | B | Alpha-galactosidase A | 0.0 |
3gxp | A | Alpha-galactosidase A | 0.0 |
3gxn | B | Alpha-galactosidase A | 0.0 |
3gxn | A | Alpha-galactosidase A | 0.0 |
1r47 | B | Alpha-galactosidase A | 0.0 |
1r47 | A | Alpha-galactosidase A | 0.0 |
1r46 | B | Alpha-galactosidase A | 0.0 |
1r46 | A | Alpha-galactosidase A | 0.0 |
3hg3 | B | Alpha-galactosidase A | 0.0 |
3hg3 | A | Alpha-galactosidase A | 0.0 |
3lxc | B | Alpha-galactosidase A | 0.0 |
3lxc | A | Alpha-galactosidase A | 0.0 |
3lxb | B | Alpha-galactosidase A | 0.0 |
3lxb | A | Alpha-galactosidase A | 0.0 |
3lxa | B | Alpha-galactosidase A | 0.0 |
3lxa | A | Alpha-galactosidase A | 0.0 |
3lx9 | B | Alpha-galactosidase A | 0.0 |
3lx9 | A | Alpha-galactosidase A | 0.0 |
1ktc | A | alpha-N-acetylgalactosaminidase | e-113 |
1ktb | A | alpha-N-acetylgalactosaminidase | e-113 |
3igu | B | Alpha-N-acetylgalactosaminidase | e-100 |
3igu | A | Alpha-N-acetylgalactosaminidase | e-100 |
3h55 | B | Alpha-N-acetylgalactosaminidase | e-100 |
3h55 | A | Alpha-N-acetylgalactosaminidase | e-100 |
3h54 | B | Alpha-N-acetylgalactosaminidase | e-100 |
3h54 | A | Alpha-N-acetylgalactosaminidase | e-100 |
3h53 | B | Alpha-N-acetylgalactosaminidase | e-100 |
3h53 | A | Alpha-N-acetylgalactosaminidase | e-100 |
The protein colored light blue is the one that was used as the query sequence.
We ignored those results and forced Jpred to make a prediction anyway. Among other things, the program's output included a prediction of the secondary structure.
Comparison with DSSP
http://swift.cmbi.ru.nl/servers/html/
DSSP was developed by W. Kabsch and C. Sander in 1983. It is a database containing secondary structure assignments for each protein in the PDB. Since it is not a prediction program itself, its entries serve as the reference against which the results of a prediction are compared. The assignments are calculated from the 3D coordinates of the PDB entries.<ref name="dssp">Kabsch et al., "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.", Biopolymers, 1983, PubMed</ref>
Figure 2 is a color-labeled picture of the DSSP assignment. The letter code used by DSSP has been translated to the three-state code H = helix, E = strand and C = coil.
Find a pdf version of Figure 2 here: File:GLA DSSP Comp.pdf
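The translation from the DSSP letter code to three states can be sketched as follows. The exact mapping used for Figure 2 is not stated on this page; the mapping below (H, G, I to helix; E, B to strand; everything else to coil) is one common convention and is an assumption here.

```python
# Assumed reduction of the DSSP alphabet to three states.
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",  # helix types
    "E": "E", "B": "E",            # strand and isolated bridge
}


def reduce_dssp(dssp: str) -> str:
    """Map every DSSP letter to H, E or C (coil is the default)."""
    return "".join(DSSP_TO_3STATE.get(letter, "C") for letter in dssp)


# Made-up DSSP string; "-" stands for residues without an assignment.
print(reduce_dssp("--HHHHTT-EEEE-GGG-S"))
```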
Results
Structural Element | PSIPRED | Jpred3 | DSSP |
---|---|---|---|
Helices | 10 | 10 | 12 |
Strands | 16 | 15 | 16 |
Coils | 27 | 26 | 29 |
The results of Jpred3 and PSIPRED are very similar to the reference assignment by DSSP. Although the numbers of elements are not identical, the predictions are close and can thus be considered successful.
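Counting elements is only a coarse comparison. A per-residue agreement (often called Q3) could complement the table. The sketch below assumes that prediction and reference are aligned three-state strings of equal length; the strings are made up and are not the GLA data.

```python
def q3(predicted: str, reference: str) -> float:
    """Fraction of residues whose predicted state equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("strings must have the same length")
    if not reference:
        raise ValueError("strings must not be empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Made-up example: 8 of 10 positions agree.
print(q3("CCHHHHCCEE", "CHHHHHCCEC"))
```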
Prediction of disordered regions
As GLA was not found in the DisProt database we tried to predict disordered regions with several tools.
DISOPRED
An online version of DISOPRED is available at http://bioinf.cs.ucl.ac.uk/disopred/, but we also ran it locally. To do so, we first had to adapt some paths and then ran it using the command sudo ./runpsipred.
DISOPRED was developed by J.J. Ward, J.S. Sodhi, L.J. McGuffin, B.F. Buxton and D.T. Jones in 2004. A neural network is used to predict disordered regions. As it is a knowledge-based method, the DISOPRED neural network is trained on X-ray structures from the PDB. The program takes a sequence as input and runs PSI-Blast against a database. The trained neural network then classifies each residue, based on its profile, as disordered or not disordered. <ref name=disopred>Ward et al., "Prediction and functional analysis of native disorder in proteins from the three kingdoms of life", Journal of Molecular Biology. 2004, PubMed</ref> <ref name=Praesi>Talk_Task3</ref>
Figure 3 shows that DISOPRED predicts two disordered regions: one at the very beginning of the protein sequence and one at the end. The light grey dotted line shows that another region around position ~35 was predicted as well, but because DISOPRED filters the results, this peak is smoothed out.
POODLE
http://mbs.cbrc.jp/poodle/poodle.html
POODLE was developed by S. Hirose, K. Shimizu, S. Kanai, Y. Kuroda and T. Noguchi in 2007. It predicts disordered regions using a machine-learning approach.
Four different variants are available:<ref name=poodlehp>POODLE Help Page</ref>
- POODLE-L<ref name=poodle_l>Hirose et al., "POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions.", PharmaDesign. 2007, PubMed</ref>: Predicts long disordered regions (>40 consecutive residues).
- POODLE-I<ref name=poodle_i>POODLE-I</ref>: Uses structural information predictors based on a work-flow approach.
- POODLE-S<ref name=poodle_s>Shimizu et al., "POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix.", Bioinformatics. 2007, PubMed</ref>: Predicts short disordered regions. Has two subversions differing in the preparation of the databases:
- Missing residues: Missing regions in X-ray structures.
- High B-Factor residues: Regions with high B-Factors.
- POODLE-W<ref name=poodle_w>Shimizu et al., "Predicting mostly disordered proteins by using structure-unknown protein data.", Bioinformatics. 2007, PubMed</ref>: Specialized on mostly disordered proteins.
For the analysis only POODLE-S was used: since hardly any short disordered regions were found in our protein, long disordered regions are even less likely. All variants need only an amino acid sequence as input and provide both a picture and a textual output of the predicted regions.
POODLE-S: Missing residues
Figure 4 shows two disordered regions, one at each end of the amino acid sequence, predicted by POODLE-S using the "Missing Residues" option. Additionally there are two peaks, at positions ~35 and ~400, that come close to the disorder cutoff value.
POODLE-S: High B-Factor residues
Figure 5 shows the prediction by POODLE-S using the "High B-Factor residues" option. Only one disordered region is predicted, at the very beginning of the protein sequence. There are many small peaks in the curve, so the result is very noisy.
IUPRED
http://iupred.enzim.hu/index.html
IUPRED was developed by Zsuzsanna Dosztányi, Veronika Csizmók, Péter Tompa and István Simon in 2005. It is a prediction method for ordered and disordered regions in protein sequences. IUPRED is based on an energy function and the assumption that there are fewer inter-residue interactions in disordered regions. <ref name=poodle_l>Dosztányi et al., "IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content.", Bioinformatics. 2005, PubMed</ref> The input of IUPRED is just an amino acid sequence. The output is a file containing a predicted value for each position in the sequence. We used R scripts to create the figures.
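Turning per-position values into disordered regions amounts to finding runs of scores above a cutoff. The following sketch illustrates the idea; the cutoff of 0.5 and the `min_length` parameter are assumptions for illustration and are not taken from this page, and the scores are made up.

```python
def disordered_regions(scores, cutoff=0.5, min_length=1):
    """Return 1-based (start, end) runs where the score exceeds the cutoff."""
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score > cutoff:
            if start is None:
                start = position          # a new run begins here
        elif start is not None:
            if position - start >= min_length:
                regions.append((start, position - 1))
            start = None                  # the run has ended
    # Close a run that extends to the last residue.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


scores = [0.9, 0.8, 0.3, 0.2, 0.6, 0.7, 0.1, 0.55]  # made-up values
print(disordered_regions(scores))
print(disordered_regions(scores, min_length=2))
```

Raising `min_length` suppresses isolated peaks, which is one simple way to deal with noisy curves.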
Short Disorder
Figure 6 shows three disordered regions, predicted by IUPRED using the "Short Disorder" option. The first one is at the beginning of the sequence, the next one at position ~107 and the last one at the end of the sequence. Additionally, there are two significant peaks at positions ~275 and ~320, but the predicted values are smaller than the cutoff, so they are not predicted as disordered regions.
Long Disorder
Figure 7 shows the noisy prediction of IUPRED in the "Long Disorder" mode. It predicts only one disordered region at position ~107.
META-Disorder
http://www.predictprotein.org/
Hint: You will have to register. Registration is free of charge, but you can submit at most 3 sequences within the following 12 months.
The program takes an amino acid sequence as input. The resulting file consists of the predictions of the programs described below. We used an R script to create the figure. Metadisorder was developed by Avner Schlessinger and Burkhard Rost in 2005 at Columbia University. It combines different methods and uses various sources of information to predict disordered regions. Metadisorder makes use of the following methods <ref name=MD>Schlessinger et al., "Improved Disorder Prediction by Combination of Orthogonal Approaches.", PLoS ONE. 2009, PLoS ONE</ref> <ref name=rostlab_MD>Rostlab - Metadisorder</ref>:
PROFbval
PROFbval is a residue mobility prediction method based on the amino-acid sequence. <ref name=rostlab_prof>Rostlab - PROFbval</ref>
NORSnet
NORSnet is a method that identifies unstructured loops, based on neural networks. <ref name=rostlab_nors>Rostlab - NORSnet</ref>
UCON
UCON predicts natively unstructured regions through contacts. <ref name=rostlab_ucon>Rostlab - UCON</ref>
Figure 8 shows the results of META-Disorder (black line). Only one disordered region is predicted, at the very beginning of the sequence.
Evaluation of the results
Summary of the disordered-region predictions (x = predicted as disordered, - = not predicted):
Position | Disopred | POODLE-S(MR) | POODLE-S(HBF) | IUPRED(SHORT) | IUPRED(LONG) | META-Disorder |
---|---|---|---|---|---|---|
~3 | x | x | x | x | - | x |
~107 | - | - | - | x | x | - |
~429 | x | x | - | x | - | - |
The disordered region at the beginning of the sequence is predicted by all methods but one. The one at the end is predicted by 50% of the programs. However, disorder predictions at the termini of a protein sequence are often false positives: terminal residues are frequently not resolved in the electron density maps of X-ray structures, so methods trained on such structures tend to label the termini as disordered. The disordered region at position ~107 is predicted only by the two IUPRED modes and cannot be confirmed by any other program, so a true disordered region at that position is very unlikely as well. In summary, we conclude that the protein most likely contains no disordered region.
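The voting behind this evaluation can be made explicit with a small sketch. The region sets below are transcribed from the figure descriptions in the sections above; the function name and data layout are illustrative only.

```python
from collections import Counter

# Regions reported as disordered by each method (from Figures 3 to 8).
PREDICTIONS = {
    "DISOPRED": {"~3", "~429"},
    "POODLE-S (MR)": {"~3", "~429"},
    "POODLE-S (HBF)": {"~3"},
    "IUPRED (short)": {"~3", "~107", "~429"},
    "IUPRED (long)": {"~107"},
    "META-Disorder": {"~3"},
}


def support(predictions: dict) -> dict:
    """Fraction of methods that predict each region as disordered."""
    votes = Counter()
    for regions in predictions.values():
        votes.update(regions)
    n_methods = len(predictions)
    return {region: count / n_methods for region, count in votes.items()}


for region, fraction in sorted(support(PREDICTIONS).items()):
    print(f"{region}: {fraction:.0%}")
```

A majority vote alone would accept the N-terminal region, which is why the caveat about terminal residues matters for the final conclusion.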
Prediction of transmembrane alpha-helices and signal peptides
Programs
All programs but TMHMM were run on web servers and only take an amino acid sequence as input.
TMHMM
TMHMM is a program that predicts transmembrane helices in proteins. It is based on a hidden Markov model and was developed by Sonnhammer et al. in 1998<ref name=sonnhammer>Sonnhammer et al., "A hidden Markov model for predicting transmembrane helices in protein sequences.", Proc Int Conf Intell Syst Mol Biol. 1998, PubMed</ref>.
We used the TMHMM web server with the FASTA sequence of the protein as input. The output contains some statistics (e.g. number of TM helices, expected number of AAs in TM helices, probability that the N-terminus is on the cytoplasmic side of the membrane), a listing of the labeled sequence regions (e.g. inside, outside, TM helix) and a plot of the per-residue probabilities. Additionally, if a large number of the first 60 AAs are predicted to be part of a TM helix, TMHMM warns that there could be a signal peptide in the N-terminal region.
Phobius/Polyphobius
Phobius was developed by Käll, Krogh and Sonnhammer in 2004 at the Stockholm Bioinformatics Center. It is based on HMMs and predicts transmembrane helices and signal peptides at the N-terminus. Phobius takes a protein sequence (FASTA format) as input and outputs diagrams and text files containing the predictions. <ref name=phobius>Käll et al., "A combined transmembrane topology and signal peptide prediction method.", J Mol Biol. 2004, PubMed</ref>
Polyphobius was developed by L. Käll, A. Krogh, EL. Sonnhammer in 2005 at the Stockholm Bioinformatics Center. It is very similar to Phobius and uses HMMs, as well. Additionally Polyphobius uses information from homologous sequences to improve the prediction accuracy. <ref name=polyphobius>Käll et al., "An HMM posterior decoder for sequence feature prediction that includes homology information.", Bioinformatics. 2005, PubMed</ref>
We used the web server for both versions of the program.
Octopus/Spoctopus
Octopus was developed by H. Viklund and A. Elofsson in 2008 at the Stockholm Bioinformatics Center. It combines HMMs and artificial neural networks (ANNs) to predict the topology of transmembrane proteins. First, several ANNs make predictions for every single residue; afterwards, HMMs smooth these results and combine them into a consistent topology prediction.<ref name=octopus>Viklund et al., "OCTOPUS: improving topology prediction by two-track ANN-based preference scores and an extended topological grammar.", Bioinformatics. 2008, PubMed</ref>
Spoctopus was developed by H. Viklund, A. Bernsel, M. Skwark and A. Elofsson in 2008 at the Stockholm Bioinformatics Center. It is almost identical to Octopus, but additionally predicts the signal peptide.<ref name=spoctopus>Viklund et al., "SPOCTOPUS: a combined predictor of signal peptides and membrane protein topology.", Bioinformatics. 2008, PubMed</ref>
We used the web server for both versions of the program.
TargetP
TargetP was developed by O. Emanuelsson, H. Nielsen, S. Brunak and G. von Heijne in 2000 at the Stockholm Bioinformatics Center. It uses neural networks to predict the subcellular location of eukaryotic proteins. The method is based on the predicted presence of any of the N-terminal presequences (i.e. chloroplast transit peptide, mitochondrial targeting peptide or secretory pathway signal peptide).<ref name=targetp>Emanuelsson et al., "Predicting subcellular localization of proteins based on their N-terminal amino acid sequence.", J Mol Biol. 2000, PubMed</ref>
We used the web server of the program.
SignalP
SignalP was developed by H. Nielsen, J. Engelbrecht, S. Brunak and G. von Heijne in 1997 at the Stockholm Bioinformatics Center. It predicts the presence of a signal peptide and provides information about the exact location of signal peptide cleavage sites. The most recent version of the method uses both ANNs and HMMs. <ref name=signalp>Nielsen et al., "Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.", Protein engineering. 1997, PubMed</ref>
We used the web server of the program.
Proteins
In the following sections we present the results of the programs for the proteins, i.e. the prediction of transmembrane alpha-helices and signal peptides. The sequence annotation of the UniProt entries (see protein overview) was used as the reference. For the colouring of the tables, we used the following scheme:
- green (completely correct): the predicted start and end positions may each differ from the reference by at most one position, e.g. a prediction of the positions 19 - 31 will be marked green if the reference is 20 - 30.
- exception: TMHMM cannot predict signal peptides and will therefore almost always include the N-terminus of the protein in the first region, even if it is a signal peptide. Therefore we also consider the prediction of the first region correct when it starts at the first residue and the end residue of the first region is correct.
- yellow (partially correct): the vast majority of the residues are assigned to the correct region.
- red (wrong): the prediction is completely wrong, e.g. a region is predicted which does not exist.
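The colouring scheme above can be expressed as a small function. This is only a sketch: the page does not quantify "vast majority", so the threshold of 0.8 (`majority`) and the way overlap is measured against the longer of the two regions are assumptions.

```python
from typing import Optional, Tuple

Region = Optional[Tuple[int, int]]  # (start, end), or None if absent


def classify(pred: Region, ref: Region, majority: float = 0.8,
             tmhmm_first_region: bool = False) -> str:
    """Return 'green', 'yellow' or 'red' for one predicted region."""
    if pred is None and ref is None:
        return "green"                # correctly predicted as absent
    if pred is None or ref is None:
        return "red"                  # missed or non-existent region
    (p_start, p_end), (r_start, r_end) = pred, ref
    start_ok = abs(p_start - r_start) <= 1
    end_ok = abs(p_end - r_end) <= 1
    # TMHMM exception: the first region may start at residue 1.
    if tmhmm_first_region and p_start == 1:
        start_ok = True
    if start_ok and end_ok:
        return "green"
    overlap = max(0, min(p_end, r_end) - max(p_start, r_start) + 1)
    longest = max(p_end - p_start + 1, r_end - r_start + 1)
    return "yellow" if overlap / longest >= majority else "red"


print(classify((19, 31), (20, 30)))    # example from the scheme above
print(classify((1, 429), (32, 429), tmhmm_first_region=True))
print(classify((10, 30), None))
```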
General evaluation
TMHMM does not predict signal peptides and therefore often assigns the beginning of the protein to the first region by mistake. It also missed one transmembrane alpha-helix, and Phobius/Polyphobius were often more accurate.
Octopus has a similar problem: it does not predict signal peptides either and hence predicts a transmembrane alpha-helix in place of the signal peptide. Spoctopus, on the other hand, predicts the transmembrane regions very well, but always misses the beginning of the signal peptide.
Phobius and Polyphobius predicted the transmembrane alpha-helices very well, and their signal peptide predictions were completely correct. Since they deliver the whole package (i.e. transmembrane alpha-helix prediction, signal peptide prediction, highest accuracy), we prefer to work with these programs.
GLA
The graphical output of the programs is provided on this page.
Transmembrane alpha-helices
Region | Reference | TMHMM | Phobius | Polyphobius | Octopus | Spoctopus |
---|---|---|---|---|---|---|
Cytoplasmic | - | - | - | - | 1 - 9 | - |
Transmembrane | - | - | - | - | 10 - 30 | - |
Non-cytoplasmic | 32 - 429 | 1 - 429 | 32 - 429 | 32 - 429 | 31 - 429 | 32 - 429 |
No program predicted a transmembrane alpha-helix except for Octopus, which predicts the signal peptide as a cytoplasmic and a transmembrane region. We observe the same problem with Octopus for the other proteins that have a signal peptide.
Signal peptides
Region | Reference | Phobius | Polyphobius | Spoctopus | SignalP |
---|---|---|---|---|---|
Signal peptide | 1 - 31 | 1 - 31 | 1 - 31 | 11 - 31 | 1 - 31 |
N-Region | - | 1 - 9 | 1 - 12 | - | - |
H-Region | - | 10 - 22 | 13 - 26 | - | - |
C-Region | - | 23 - 31 | 27 - 31 | - | - |
All programs recognize the signal peptide; only Spoctopus misses the beginning of it.
BACR_HALSA
The graphical output of the programs is provided on this page.
Transmembrane alpha-helices
Region | Reference | TMHMM | Phobius | Polyphobius | Octopus | Spoctopus |
---|---|---|---|---|---|---|
Non-cytoplasmic | 14 - 23 | 1 - 22 | 1 - 22 | 1 - 21 | 1 - 22 | 1 - 22 |
Transmembrane | 24 - 42 | 23 - 42 | 23 - 42 | 22 - 43 | 23 - 43 | 23 - 43 |
Cytoplasmic | 43 - 56 | 43 - 54 | 43 - 53 | 44 - 54 | 44 - 54 | 44 - 54 |
Transmembrane | 57 - 75 | 55 - 77 | 54 - 76 | 55 - 77 | 55 - 75 | 55 - 75 |
Non-cytoplasmic | 76 - 91 | 78 - 91 | 77 - 95 | 78 - 94 | 76 - 95 | 76 - 95 |
Transmembrane | 92 - 109 | 92 - 114 | 96 - 114 | 95 - 114 | 96 - 116 | 96 - 116 |
Cytoplasmic | 110 - 120 | 115 - 120 | 115 - 120 | 115 - 120 | 117 - 121 | 117 - 120 |
Transmembrane | 121 - 140 | 121 - 143 | 121 - 142 | 121 - 141 | 122 - 142 | 121 - 141 |
Non-cytoplasmic | 141 - 147 | 144 - 147 | 143 - 147 | 142 - 147 | 143 - 147 | 142 - 147 |
Transmembrane | 148 - 167 | 148 - 170 | 148 - 169 | 148 - 166 | 148 - 168 | 148 - 168 |
Cytoplasmic | 168 - 185 | 171 - 189 | 170 - 189 | 167 - 186 | 169 - 185 | 169 - 185 |
Transmembrane | 186 - 204 | 190 - 212 | 190 - 212 | 187 - 205 | 186 - 206 | 186 - 206 |
Non-cytoplasmic | 205 - 216 | 213 - 262 | 213 - 217 | 206 - 215 | 207 - 216 | 207 - 216 |
Transmembrane | 217 - 236 | - | 218 - 237 | 216 - 237 | 217 - 237 | 217 - 237 |
Cytoplasmic | 237 - 262 | - | 238 - 262 | 238 - 262 | 238 - 262 | 238 - 262 |
All programs predict the correct number of transmembrane alpha-helices apart from TMHMM, which misses the last one. Polyphobius has the best result, since it has the highest number of completely correctly predicted regions.
Signal peptides
No program predicted a signal peptide, which is correct.
RET4_HUMAN
The graphical output of the programs is provided on this page.
Transmembrane alpha-helices
Region | Reference | TMHMM | Phobius | Polyphobius | Octopus | Spoctopus |
---|---|---|---|---|---|---|
Cytoplasmic | - | - | - | - | 1 - 1 | - |
Transmembrane | - | - | - | - | 2 - 23 | - |
Non-cytoplasmic | 19 - 201 | 1 - 201 | 19 - 201 | 19 - 201 | 24 - 201 | 20 - 201 |
All programs predict correctly that this protein does not have a transmembrane alpha-helix except for Octopus.
Signal peptides
Region | Reference | Phobius | Polyphobius | Spoctopus | SignalP |
---|---|---|---|---|---|
Signal peptide | 1 - 18 | 1 - 18 | 1 - 18 | 6 - 19 | 1 - 18 |
N-Region | - | 1 - 2 | 1 - 3 | - | - |
H-Region | - | 3 - 13 | 4 - 13 | - | - |
C-Region | - | 14 - 18 | 14 - 18 | - | - |
All programs recognize the signal peptide; only Spoctopus misses the beginning of it.
INSL5_HUMAN
The graphical output of the programs is provided on this page.
Transmembrane alpha-helices
Region | Reference | TMHMM | Phobius | Polyphobius | Octopus | Spoctopus |
---|---|---|---|---|---|---|
Cytoplasmic | - | - | - | - | 1 - 1 | - |
Transmembrane | - | - | - | - | 2 - 32 | - |
Non-cytoplasmic | 23 - 135 | 1 - 135 | 23 - 135 | 23 - 135 | 33 - 135 | 24 - 135 |
All programs predict correctly that this protein does not have a transmembrane alpha-helix except for Octopus.
Signal peptides
Region | Reference | Phobius | Polyphobius | Spoctopus | SignalP |
---|---|---|---|---|---|
Signal peptide | 1 - 22 | 1 - 22 | 1 - 22 | 6 - 23 | 1 - 22 |
N-Region | - | 1 - 5 | 1 - 4 | - | - |
H-Region | - | 6 - 17 | 5 - 16 | - | - |
C-Region | - | 18 - 22 | 17 - 22 | - | - |
All programs recognize the signal peptide; only Spoctopus misses the beginning of it.
LAMP1_HUMAN
The graphical output of the programs is provided on this page.
Transmembrane alpha-helices
Region | Reference | TMHMM | Phobius | Polyphobius | Octopus | Spoctopus |
---|---|---|---|---|---|---|
Cytoplasmic | - | 1 - 10 | - | - | 1 - 10 | - |
Transmembrane | - | 11 - 33 | - | - | 11 - 31 | - |
Non-cytoplasmic | 29 - 382 | 34 - 383 | 29 - 381 | 29 - 381 | 32 - 383 | 30 - 383 |
Transmembrane | 383 - 405 | 384 - 406 | 382 - 405 | 382 - 405 | 384 - 404 | 384 - 404 |
Cytoplasmic | 406 - 417 | 407 - 417 | 406 - 417 | 406 - 417 | 405 - 417 | 405 - 417 |
Phobius, Polyphobius and Spoctopus predict the one transmembrane alpha-helix correctly. Octopus has the common error that it assigns a cytoplasmic and transmembrane region to the signal peptide of the protein. TMHMM also predicts a transmembrane region for the signal peptide. The developers of TMHMM indicate this problem on the instruction page of TMHMM.
Signal peptides
Region | Reference | Phobius | Polyphobius | Spoctopus | SignalP |
---|---|---|---|---|---|
Signal peptide | 1 - 28 | 1 - 28 | 1 - 28 | 12 - 29 | 1 - 28 |
N-Region | - | 1 - 10 | 1 - 9 | - | - |
H-Region | - | 11 - 22 | 10 - 22 | - | - |
C-Region | - | 23 - 28 | 23 - 28 | - | - |
All programs recognize the signal peptide; only Spoctopus misses the beginning of it.
A4_HUMAN
The graphical output of the programs is provided on this page.
Transmembrane alpha-helices
Region | Reference | TMHMM | Phobius | Polyphobius | Octopus | Spoctopus |
---|---|---|---|---|---|---|
Cytoplasmic | - | - | - | - | 1 - 5 | - |
Transmembrane | - | - | - | - | 6 - 11 | - |
Non-cytoplasmic | 18 - 699 | 1 - 700 | 18 - 700 | 18 - 700 | 12 - 701 | 19 - 701 |
Transmembrane | 700 - 723 | 701 - 723 | 701 - 723 | 701 - 723 | 702 - 722 | 702 - 722 |
Cytoplasmic | 724 - 770 | 724 - 770 | 724 - 770 | 724 - 770 | 723 - 770 | 723 - 770 |
Every program predicts the single transmembrane alpha-helix of the protein. Octopus, however, additionally predicts a second transmembrane region within the signal peptide.
Signal peptides
Region | Reference | Phobius | Polyphobius | Spoctopus | SignalP |
---|---|---|---|---|---|
Signal peptide | 1 - 17 | 1 - 17 | 1 - 17 | 5 - 18 | 1 - 17 |
N-Region | - | 1 - 1 | 1 - 3 | - | - |
H-Region | - | 2 - 12 | 4 - 12 | - | - |
C-Region | - | 13 - 17 | 13 - 17 | - | - |
All programs recognize the signal peptide; only Spoctopus misses the beginning of it.
TargetP
The results of TargetP are shown in the following table. An explanation of the output is given on this page.
Name | Length | mTP | SP | other | Loc | RC |
---|---|---|---|---|---|---|
GLA | 429 | 0.041 | 0.860 | 0.141 | S | 2 |
BACR_HALSA | 262 | 0.019 | 0.897 | 0.562 | S | 4 |
RET4_HUMAN | 201 | 0.242 | 0.928 | 0.020 | S | 2 |
INSL5_HUMA | 135 | 0.074 | 0.899 | 0.037 | S | 1 |
LAMP1_HUMA | 417 | 0.043 | 0.953 | 0.017 | S | 1 |
A4_HUMAN | 770 | 0.035 | 0.937 | 0.084 | S | 1 |
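TargetP predicts a signal peptide for all six proteins and therefore assigns each of them to the secretory pathway. For the five human proteins this prediction is correct. The prediction for BACR_HALSA is the least reliable one (RC 4) and should be treated with caution: TargetP is designed for eukaryotic proteins, and none of the other programs predicted a signal peptide for this archaeal protein.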
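The reliability class (RC) in the table reflects how clearly the best score stands out from the second best. The sketch below uses the class boundaries as we recall them from the TargetP output description (difference > 0.8 gives RC 1, > 0.6 RC 2, > 0.4 RC 3, > 0.2 RC 4, otherwise RC 5); treat these thresholds as an assumption. They do reproduce every RC value in the table above.

```python
def reliability_class(scores) -> int:
    """Derive an RC (1 = most reliable) from the top two scores."""
    top, second = sorted(scores, reverse=True)[:2]
    diff = top - second
    if diff > 0.8:
        return 1
    if diff > 0.6:
        return 2
    if diff > 0.4:
        return 3
    if diff > 0.2:
        return 4
    return 5


# (mTP, SP, other) scores copied from the table above.
TABLE = {
    "GLA": (0.041, 0.860, 0.141),
    "BACR_HALSA": (0.019, 0.897, 0.562),
    "RET4_HUMAN": (0.242, 0.928, 0.020),
}

for name, scores in TABLE.items():
    print(name, reliability_class(scores))
```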
Prediction of GO terms
GOPET
GOPET stands for Gene Ontology term Prediction and Evaluation Tool and was developed by Vinayagam et al. in 2006<ref name=vinayagam>Vinayagam et al., "GOPET: a tool for automated predictions of Gene Ontology terms.", BMC Bioinformatics. 2006 Mar 20, PubMed</ref>. It is based on homology searches on GO-mapped protein databases and uses support vector machines for the calculation of the confidence values.
We used the webserver of GOPET with the default settings (GO aspect: molecular function, maximum number of predictions: 20, confidence threshold: 60, GOPET model 2007 june, version 2.0, GOPET database 2007) and the FASTA-sequence of the protein as input. The results only contain GOids of the GO aspect "molecular function", since the other two GO aspects (cellular component and biological process) were not available.
Pfam
Pfam is a database of protein domain families that is built using profile hidden Markov models (HMMs) and was first described by Sonnhammer et al. in 1997<ref name=sonnhammer>Sonnhammer et al., "Pfam: a comprehensive database of protein domain families based on seed alignments.", Proteins. 1997 Jul, PubMed</ref>. Each protein domain family is represented by a multiple sequence alignment and an HMM. One can search a protein sequence against Pfam and obtain all the domains that the query sequence might contain.
The Pfam database consists of two parts, A and B, which contain protein domain families of different quality levels. In the 1.0 release of Pfam, the protein entries in Pfam-A and Pfam-B were from Swissprot (a few initial members of seed alignments in Pfam-A were from several sources: Swissprot, Prosite, ProDom etc.). In the current release of Pfam, the entries in Pfam-A and Pfam-B are from Pfamseq (UniProtKB) and ADDA respectively.
Pfam-A contains the well-characterized, annotated entries. Each family starts with a seed alignment built from a few selected representative sequences under manual quality control. An HMM derived from the seed is then applied automatically to build the full alignment and to detect all possible members of the family. The families/domains in Pfam-A are of high quality and can be used as reliable annotation/classification evidence for the query sequence.
Pfam-B is created from sequence alignments of the ADDA entries by using HMMs. Entries that already exist in Pfam-A are excluded. The families in Pfam-B have neither confirmed annotation nor manual quality control, so there may be errors (e.g. the members of one family could be aligned merely by chance) and the overall quality is relatively low. However, Pfam-B can still be useful when no domain evidence for the query sequence is found in Pfam-A.
We used the "sequence search" feature of Pfam website with the FASTA-sequence of the protein to determine potential domains or domain families. Afterwards we checked out the corresponding page of the domain (family) for a GO annotation. The search was performed with the default settings (cut-off: use E-Value, threshold 1.0), but we also included Pfam-B in the search. Only one hit in Pfam-B was found which does not have any GO annotation and hence there was no gain in including Pfam-B. The classification in respect to the significance of a hit was done by the Pfam search algorithm.
ProtFun
ProtFun tries to assign a function to the query protein. For this purpose, it uses the prediction of several other features like post-translational modification sites or localization of the protein. The prediction of these features itself is based on other programs like SignalP, TargetP, NetOGlyc, TMHMM and some others. ProtFun was developed by Jensen et al. in 2002<ref name=jensen_1>Jensen et al., "Prediction of human protein function from post-translational modifications and localization features.", J Mol Biol. 2002 Jun 21, PubMed</ref> and the prediction of the Gene Ontology category was added in 2003<ref name=jensen_2>Jensen et al., "Prediction of human protein function according to Gene Ontology categories.", Bioinformatics. 2003 Mar 22, PubMed</ref>.
We used the webserver of ProtFun 2.2 with the default settings and the FASTA-sequence of the protein as the input. The output contains predictions about the functional category, enzyme/nonenzyme, enzyme class and the Gene Ontology category. In our case, only the result of the latter was relevant. The term 'Prob' represents the calculated probability by ProtFun that the query belongs to the category. This probability is dependent on the prior probability of the category. 'Odds' describes the odds that the query belongs to the certain category and is not influenced by the prior probability.<ref name=ProtFun>Explanation of the ProtFun 2.2 output.</ref> The class with the highest information content and with the highest probability is marked bold. Additionally we provide a table for each query that contains the categories with the highest information content or probability, respectively, and their associated GO id. For this purpose, we used the search feature of the Gene Ontology website.
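Picking out the categories with the highest probability and the highest odds can be sketched as follows. The parser assumes whitespace-separated lines of the form "category Prob Odds", optionally containing the "=>" marker, as in the listings further below; the three sample lines are taken from the GLA result.

```python
def parse_protfun_go(text: str) -> list:
    """Parse 'category prob odds' lines into (name, prob, odds) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.replace("=>", " ").split()
        if len(parts) != 3:
            continue                  # header or malformed line
        name, prob, odds = parts
        try:
            rows.append((name, float(prob), float(odds)))
        except ValueError:
            continue                  # non-numeric columns
    return rows


SAMPLE = """Gene Ontology category Prob Odds
Signal_transducer 0.090 0.419
Transporter 0.024 0.222
Stress_response 0.049 0.552
"""

rows = parse_protfun_go(SAMPLE)
best_prob = max(rows, key=lambda row: row[1])
best_odds = max(rows, key=lambda row: row[2])
print("highest probability:", best_prob[0])
print("highest odds:", best_odds[0])
```

The two criteria need not agree, which is why both columns are reported.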
Proteins
The results of the programs are listed for each protein in the following sections. If a prediction is correct, it will be marked with a green background. For this purpose, we used the GO annotation of the corresponding protein at the EBI website (see list of annotated GO ids).
GLA
GOPET
GOid | Confidence | GO term |
---|---|---|
GO:0016798 | 98% | hydrolase activity acting on glycosyl bonds |
GO:0004553 | 98% | hydrolase activity hydrolyzing O-glycosyl compounds |
GO:0016787 | 97% | hydrolase activity |
GO:0004557 | 96% | alpha-galactosidase activity |
GO:0008456 | 89% | alpha-N-acetylgalactosaminidase activity |
Pfam
Source | Description | Entry type | Significant | GO aspect | GO description | GO id |
---|---|---|---|---|---|---|
Pfam-A | Melibiase | Family | x | Molecular function | hydrolase activity, hydrolyzing O-glycosyl compounds | GO:0004553 |
Pfam-A | Melibiase | Family | x | Biological process | carbohydrate metabolic process | GO:0005975 |
ProtFun
Gene Ontology category Prob Odds
Signal_transducer 0.090 0.419
Receptor 0.014 0.083
Hormone 0.002 0.318
Structural_protein 0.004 0.127
Transporter 0.024 0.222
Ion_channel 0.010 0.169
Voltage-gated_ion_channel 0.003 0.127
Cation_channel 0.010 0.215
Transcription 0.047 0.367
Transcription_regulation 0.026 0.204
Stress_response 0.049 0.552
Immune_response 0.012 0.136
Growth_factor 0.006 0.412
Metal_ion_transport 0.009 0.020
Type | GO category | GO aspect | GO id |
---|---|---|---|
Highest probability | Signal transducer | Molecular function | GO:0004871 |
BACR_HALSA
GOPET
GOid | Confidence | GO term |
---|---|---|
GO:0005216 | 77% | ion channel activity |
GO:0008020 | 75% | G-protein coupled photoreceptor activity |
GO:0015078 | 60% | hydrogen ion transmembrane transporter activity |
Pfam
Source | Description | Entry type | Significant | GO aspect | GO description | GO id |
---|---|---|---|---|---|---|
Pfam-A | Bacteriorhodopsin-like protein | Domain | x | Cellular component | membrane | GO:0016020 |
Pfam-A | Bacteriorhodopsin-like protein | Domain | x | Molecular function | ion channel activity | GO:0005216 |
Pfam-A | Bacteriorhodopsin-like protein | Domain | x | Biological process | ion transport | GO:0006811 |
Pfam-A | Domain of unknown function DUF21 | Family | - | - | - | - |
ProtFun
Gene Ontology category Prob Odds
Signal_transducer 0.258 1.205
Receptor 0.355 2.087
Hormone 0.001 0.206
Structural_protein 0.006 0.200
Transporter => 0.440 4.036
Ion_channel 0.010 0.169
Voltage-gated_ion_channel 0.004 0.172
Cation_channel 0.078 1.689
Transcription 0.026 0.205
Transcription_regulation 0.028 0.226
Stress_response 0.012 0.139
Immune_response 0.011 0.128
Growth_factor 0.010 0.727
Metal_ion_transport 0.049 0.106
Type | GO category | GO aspect | GO id |
---|---|---|---|
Highest information content / highest probability | Transporter | Molecular function | GO:0005215 |
RET4_HUMAN
GOPET
GOid | Confidence | GO term |
---|---|---|
GO:0005488 | 90% | binding |
GO:0005501 | 81% | retinoid binding |
GO:0008289 | 80% | lipid binding |
GO:0019841 | 78% | retinol binding |
GO:0005215 | 78% | transporter activity |
GO:0016918 | 78% | retinal binding |
GO:0005319 | 69% | lipid transporter activity |
GO:0008035 | 60% | high-density lipoprotein particle binding |
Pfam
Source | Description | Entry type | Significant | GO aspect | GO description | GO id |
---|---|---|---|---|---|---|
Pfam-A | Lipocalin / cytosolic fatty-acid binding protein family | Domain | x | Molecular function | binding | GO:0005488 |
Pfam-A | DspF/AvrF protein | Family | - | - | - | - |
Pfam-B | PB008544 | - | - | - | - | - |
ProtFun
Gene Ontology category Prob Odds
Signal_transducer 0.202 0.942
Receptor 0.147 0.862
Hormone 0.004 0.667
Structural_protein 0.002 0.058
Transporter 0.025 0.232
Ion_channel 0.016 0.288
Voltage-gated_ion_channel 0.003 0.148
Cation_channel 0.010 0.215
Transcription 0.027 0.207
Transcription_regulation 0.025 0.196
Stress_response 0.161 1.829
Immune_response => 0.239 2.813
Growth_factor 0.023 1.617
Metal_ion_transport 0.009 0.020
Type | GO category | GO aspect | GO id |
---|---|---|---|
Highest information content / highest probability | Immune response | Biological process | GO:0006955 |
INSL5_HUMAN
GOPET
GOid | Confidence | GO term |
---|---|---|
GO:0005179 | 80% | hormone activity |
Pfam
Source | Description | Entry type | Significant | GO aspect | GO description | GO id |
---|---|---|---|---|---|---|
Pfam-A | Insulin/IGF/Relaxin family | Domain | x | Cellular component | extracellular region | GO:0005576 |
Pfam-A | Insulin/IGF/Relaxin family | Domain | x | Molecular function | hormone activity | GO:0005179 |
ProtFun
Gene Ontology category Prob Odds
Signal_transducer 0.374 1.746
Receptor 0.128 0.750
Hormone => 0.247 37.936
Structural_protein 0.001 0.041
Transporter 0.025 0.228
Ion_channel 0.010 0.168
Voltage-gated_ion_channel 0.003 0.131
Cation_channel 0.010 0.215
Transcription 0.054 0.425
Transcription_regulation 0.091 0.724
Stress_response 0.099 1.128
Immune_response 0.178 2.090
Growth_factor 0.061 4.379
Metal_ion_transport 0.009 0.020
Type | GO category | GO aspect | GO id |
---|---|---|---|
Highest information content | Hormone | Molecular function | GO:0005179 |
Highest probability | Signal transducer | Molecular function | GO:0004871 |
LAMP1_HUMAN
GOPET
GOid | Confidence | GO term |
---|---|---|
GO:0004812 | 60% | aminoacyl-tRNA ligase activity |
GO:0005524 | 60% | ATP binding |
Pfam
Source | Description | Entry type | Significant | GO aspect | GO description | GO id |
---|---|---|---|---|---|---|
Pfam-A | Lysosome-associated membrane glycoprotein | Family | x | Cellular component | membrane | GO:0016020 |
Pfam-A | Protein of unknown function DUF1180 | Family | - | - | - | - |
ProtFun
Gene Ontology category Prob Odds
Signal_transducer 0.396 1.849
Receptor 0.282 1.659
Hormone 0.001 0.206
Structural_protein 0.011 0.408
Transporter 0.024 0.222
Ion_channel 0.008 0.147
Voltage-gated_ion_channel 0.002 0.111
Cation_channel 0.010 0.215
Transcription 0.032 0.247
Transcription_regulation 0.018 0.142
Stress_response 0.246 2.795
Immune_response => 0.371 4.368
Growth_factor 0.013 0.956
Metal_ion_transport 0.009 0.020
Type | GO category | GO aspect | GO id |
---|---|---|---|
Highest information content | Immune response | Biological process | GO:0006955 |
Highest probability | Signal transducer | Molecular function | GO:0004871 |
A4_HUMAN
GOPET
GOid | Confidence | GO term |
---|---|---|
GO:0004866 | 87% | endopeptidase inhibitor activity |
GO:0004867 | 86% | serine-type endopeptidase inhibitor activity |
GO:0030568 | 83% | plasmin inhibitor activity |
GO:0030304 | 83% | trypsin inhibitor activity |
GO:0030414 | 82% | peptidase inhibitor activity |
GO:0005488 | 79% | binding |
GO:0005515 | 74% | protein binding |
GO:0046872 | 73% | metal ion binding |
GO:0003677 | 71% | DNA binding |
GO:0008201 | 70% | heparin binding |
GO:0008270 | 69% | zinc ion binding |
GO:0005507 | 69% | copper ion binding |
GO:0005506 | 67% | iron ion binding |
Pfam
Source | Description | Entry type | Significant | GO aspect | GO description | GO id |
---|---|---|---|---|---|---|
Pfam-A | Amyloid A4 N-terminal heparin-binding | Domain | x | Cellular component | integral to membrane | GO:0016021 |
Pfam-A | Amyloid A4 N-terminal heparin-binding | Domain | x | Molecular function | binding | GO:0005488 |
Pfam-A | Copper-binding of amyloid precursor, CuBD | Domain | x | - | - | - |
Pfam-A | Kunitz/Bovine pancreatic trypsin inhibitor domain | Domain | x | Molecular function | serine-type endopeptidase inhibitor activity | GO:0004867 |
Pfam-A | E2 domain of amyloid precursor protein | Domain | x | - | - | - |
Pfam-A | Beta-amyloid peptide | Family | x | Cellular component | integral to membrane | GO:0016021 |
Pfam-A | Beta-amyloid peptide | Family | x | Molecular function | binding | GO:0005488 |
Pfam-A | beta-amyloid precursor protein C-terminus | Family | x | - | - | - |
Pfam-A | Exonuclease VII, large subunit | Family | - | - | - | - |
Pfam-A | Transcriptional activator TraM | Family | - | - | - | - |
ProtFun
Gene Ontology category Prob Odds
Signal_transducer 0.126 0.586
Receptor 0.036 0.211
Hormone 0.001 0.206
Structural_protein => 0.034 1.205
Transporter 0.024 0.222
Ion_channel 0.009 0.162
Voltage-gated_ion_channel 0.002 0.108
Cation_channel 0.010 0.215
Transcription 0.043 0.335
Transcription_regulation 0.018 0.143
Stress_response 0.076 0.862
Immune_response 0.016 0.183
Growth_factor 0.005 0.372
Metal_ion_transport 0.009 0.020
Type | GO category | GO aspect | GO id |
---|---|---|---|
Highest information content | Structural protein | Molecular function | GO:0005198 |
Highest probability | Signal transducer | Molecular function | GO:0004871 |
Evaluation of the Results
We used the Gene Ontology annotation of the corresponding protein from the EBI website as the reference for the evaluation of the programs (see list of annotated GO ids). We then determined the true positives, false positives, true negatives and false negatives and calculated the sensitivity and specificity. For this, we created Venn diagrams, which we provide on this page. A true negative was defined as a false positive of one of the other two programs. "Overall" is the evaluation of all six proteins. The highest value of sensitivity/specificity with respect to a certain protein is marked with a green background.
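The counting scheme can be sketched with set operations. This is a minimal illustration and not the procedure we actually used (we worked with Venn diagrams): the function names are hypothetical and the GO ids are placeholders, chosen so that the resulting counts match the INSL5_HUMAN rows of the table. Following the definition above, the negatives for one program are the false positives of the other two programs.

```python
def confusion_counts(annotated, predictions, program):
    """Return (TP, FP, TN, FN) for one program.

    annotated:   set of reference GO ids for the protein
    predictions: dict mapping program name -> set of predicted GO ids
    """
    predicted = predictions[program]
    tp = predicted & annotated
    fp = predicted - annotated
    fn = annotated - predicted
    # True negatives: wrong terms predicted by the other programs
    # that this program did not predict.
    other_fp = set()
    for name, terms in predictions.items():
        if name != program:
            other_fp |= terms - annotated
    tn = other_fp - predicted
    return len(tp), len(fp), len(tn), len(fn)


def sensitivity(tp, fn):
    return tp / (tp + fn) if tp + fn else float("nan")


def specificity(tn, fp):
    return tn / (tn + fp) if tn + fp else float("nan")


# Placeholder ids, not real annotations.
annotated = {"GO:A", "GO:B", "GO:C", "GO:D"}
predictions = {
    "GOPET": {"GO:A"},
    "Pfam": {"GO:A", "GO:B"},
    "ProtFun": {"GO:A", "GO:X"},
}

results = {}
for program in predictions:
    tp, fp, tn, fn = confusion_counts(annotated, predictions, program)
    results[program] = (tp, fp, tn, fn, sensitivity(tp, fn), specificity(tn, fp))
    print(program, results[program])
```

One consequence of this definition is that the specificity of a program depends on the errors of the other two: a program can only collect true negatives if another program makes a wrong prediction.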
Protein | Program | True positives | False positives | True negatives | False negatives | Sensitivity | Specificity |
---|---|---|---|---|---|---|---|
GLA | GOPET | 4 | 1 | 1 | 18 | 0.18 | 0.5 |
GLA | Pfam | 2 | 0 | 2 | 20 | 0.09 | 1 |
GLA | ProtFun | 0 | 1 | 1 | 22 | 0 | 0.5 |
BACR_HALSA | GOPET | 1 | 2 | 1 | 11 | 0.08 | 0.33 |
BACR_HALSA | Pfam | 3 | 0 | 3 | 9 | 0.25 | 1 |
BACR_HALSA | ProtFun | 0 | 1 | 2 | 12 | 0 | 0.67 |
RET4_HUMAN | GOPET | 5 | 3 | 1 | 36 | 0.12 | 0.25 |
RET4_HUMAN | Pfam | 1 | 0 | 4 | 40 | 0.02 | 1 |
RET4_HUMAN | ProtFun | 0 | 1 | 3 | 41 | 0 | 0.75 |
INSL5_HUMAN | GOPET | 1 | 0 | 1 | 3 | 0.25 | 1 |
INSL5_HUMAN | Pfam | 2 | 0 | 1 | 2 | 0.5 | 1 |
INSL5_HUMAN | ProtFun | 1 | 1 | 0 | 3 | 0.25 | 0 |
LAMP1_HUMAN | GOPET | 0 | 2 | 2 | 17 | 0 | 0.5 |
LAMP1_HUMAN | Pfam | 1 | 0 | 4 | 16 | 0.06 | 1 |
LAMP1_HUMAN | ProtFun | 0 | 2 | 2 | 17 | 0 | 0.5 |
A4_HUMAN | GOPET | 7 | 6 | 2 | 71 | 0.09 | 0.25 |
A4_HUMAN | Pfam | 3 | 0 | 8 | 75 | 0.04 | 1 |
A4_HUMAN | ProtFun | 0 | 2 | 6 | 78 | 0 | 0.75 |
Overall | GOPET | 18 | 14 | 8 | 156 | 0.10 | 0.36 |
Overall | Pfam | 12 | 0 | 22 | 162 | 0.07 | 1 |
Overall | ProtFun | 1 | 8 | 14 | 173 | 0.01 | 0.64 |
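The "Overall" rows pool the counts of all six proteins before the ratios are taken, rather than averaging the per-protein ratios. A minimal sketch for the GOPET rows, with the counts copied from the table above; the variable names are ours, and the per-protein average is shown only for contrast:

```python
# (TP, FP, TN, FN) for GOPET per protein, copied from the table.
gopet = {
    "GLA": (4, 1, 1, 18),
    "BACR_HALSA": (1, 2, 1, 11),
    "RET4_HUMAN": (5, 3, 1, 36),
    "INSL5_HUMAN": (1, 0, 1, 3),
    "LAMP1_HUMAN": (0, 2, 2, 17),
    "A4_HUMAN": (7, 6, 2, 71),
}

# Pool the counts column by column, then compute the ratios once.
tp, fp, tn, fn = (sum(column) for column in zip(*gopet.values()))
overall_sensitivity = tp / (tp + fn)
overall_specificity = tn / (tn + fp)

print(tp, fp, tn, fn)                # 18 14 8 156
print(f"{overall_sensitivity:.2f}")  # 0.10
print(f"{overall_specificity:.2f}")  # 0.36

# For contrast: the unweighted mean of the per-protein sensitivities,
# which gives small proteins the same weight as heavily annotated ones.
mean_sensitivity = sum(t / (t + f) for t, _, _, f in gopet.values()) / len(gopet)
print(f"{mean_sensitivity:.2f}")
```

Pooling means that proteins with many annotated GO terms, such as A4_HUMAN, dominate the overall values.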
Two things are remarkable. First, Pfam does not produce a single false positive prediction and hence its specificity is 1. This leads us to the conclusion that the search feature also has a very good specificity and that the annotation of the domains and families in Pfam is very accurate with respect to Gene Ontology. Second, ProtFun achieved only one true positive and thus its sensitivity is close to 0. This can be explained by the fact that ProtFun only predicts very general Gene Ontology categories (e.g. immune response, receptor, etc.) and therefore misses a lot of subannotations. Nevertheless, five out of six predictions of the main categories were also wrong, and therefore we would not recommend using ProtFun for the prediction of GO terms.
With respect to sensitivity, GOPET achieves the highest value with 0.10, which is still very low. Since the overall sensitivity of Pfam is only slightly lower (0.07) and its specificity is 1, we would prefer Pfam for the prediction of GO terms. Overall, the sensitivity is very low, so keep in mind that you will miss a lot of GO terms when using any of these programs.
List of annotated GO ids
References
<references />