Difference between revisions of "Glucocerebrosidase sequence based prediction"

From Bioinformatikpedia
Revision as of 15:47, 31 May 2011

In this section, several different sequence-based predictions are applied to the sequence of glucocerebrosidase.


Secondary structure prediction

General

The secondary structure of a protein describes the three-dimensional form of its local segments, in contrast to the tertiary structure, which describes the fold of the whole chain. Depending on weak chemical forces such as hydrogen bonds and on the values of the φ and ψ angles, these segments form different structures. The main types are α-helices and parallel and anti-parallel β-sheets. Rarer structures are π-helices and 3,10-helices. Coils, which are irregularly formed elements, are another possibility.
A protein consists of several secondary structure elements, which together build the tertiary structure. <ref>http://en.wikipedia.org/wiki/Biomolecular_structure#Secondary_structure</ref>

β-sheets and an α-helix as examples of secondary structure
Source: http://www.nature.com/horizon/proteinfolding/background/images/importance_f3.gif

PSIPRED

PSIPRED is a method by David T. Jones, published in 1999 in JMB as "Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices". PSIPRED uses a two-stage neural network to predict secondary structure. The network takes as input the position-specific scoring matrices generated by PSI-BLAST, which is run first.<ref>David T. Jones, Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices, JMB, 1999</ref>
Only the protein sequence is needed as input.

We ran both the online and the local version of PSIPRED and obtained different results. In the following, both predictions are compared to the secondary structure given in UniProt<ref>http://www.uniprot.org/uniprot/P04062</ref>.

Conf: 988898954488887622315999999999998641038968865325999649995388
online: CCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCEEEEECCC
Conf: 987898955489988742200466888998986410038977877777863169974474
local: CCCCCCCCCCCCCCCCCEEEEHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCEEEEEECCC
uniprot: ------------------------------------------------EEEE-EEEEEE-
AA: MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNAT
 
Conf: 558889998889992599996377885421237645688875108995378301079247
online: CCCCCCCCCCCCCCCEEEEEECCCCCCCCCCCCCCCCCCCCCCCEEEECCCCCEEEEEEE
Conf: 148899998788875431100345640022100111177897107840966454557422
local: CCCCCCCCCCCCCCCCCCCCCCCCCCCEEEEEEEEECCCCCCCCEEEECCCCCCCEEEEE
uniprot: -------------EEEEEEEE-----EEEEEEE-EEE----EEEEEEEEEEEEEE--EEE
AA: YCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGF
 
Conf: 300233899997249999999999860597882105999750588999986666899999
online: EECCCHHHHHHHHCCCHHHHHHHHHHCCCCCCCEEEEEEEEECCCCCCCCCCCCCCCCCC
Conf: 011335889987508927898998851396893001358621344677653324799999
local: CCCCCHHHHHHHHHCCHHHHHHHHHHHCCCCCCEEEEEEEECCCCCCCCCCCCCCCCCCC
uniprot: EE--HHHHHHH----HHHHHHHHHHHH-CCCC---EEEEEEE--EEEEE------EEE--
AA: GGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDD
 
Conf: 689999994100245289999999971999389971377785612147247999889999
online: CCCCCCCCCHHCCCCCHHHHHHHHHHCCCCCEEEECCCCCCCCCEECCCCCCCCCCCCCC
Conf: 721111368543220024799998733999689957899974220056347854325899
local: CCCCCCCCCCCCCCCHHHHHHHHHHHCCCCCEEEECCCCCCCCCCCCCCCCCCCCCCCCC
uniprot: --------HHHH--HHHHHHHHHHH-----EEEEEEE---HHH----EEEEE-EEEE---
AA: FQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQP
 
Conf: 922699999999999999975490786872012579899999999986349999999999
online: CCHHHHHHHHHHHHHHHHHHHCCEEEEEEECCCCCCCCCCCCCCCCCCCCCHHHHHHHHH
Conf: 971468799999999967663395143898112789787678873222114422121122
local: CCHHHHHHHHHHHHHHHHHHHCCCCEEEEEEECCCCCCCCCCCCCCCCCCCCCCCCHHHH
uniprot: -HHHHHHHHHHHHHHHHHHH-----EEEEE-----HHH------------HHHHHHHHHH
AA: GDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIA
 
Conf: 955799851689972999944888873334664149955640224689831699998033
online: HHHHHHHHCCCCCCEEEEEECCCCCCHHHHHHHHCCCHHHHCCCCEEEEEECCCCCCHHH
Conf: 111332310577410134212544556520222238976651151878702212236320
local: HHHHHHHHCCCCCCCEEEEECCCCCCCCCCHHHHCCCHHHHHCCEEEEEECCCCCCCCCC
uniprot: -HHHHHH--CCCCEEEEEEEEEHHH--HHHHHHH--HHHH----EEEEEEE------HHH
AA: RDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAK
 
Conf: 412688750999509994343699998866567831444255999999996402335772
online: HHHHHHHHCCCCCEEEEECCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHCCEEEEEE
Conf: 011111126898001101210389653344579861210143212566552001100000
local: CCCCCCCCCCCCCCEEHHHHCCCCCCCCCCCCCCCHHHHCCCCHHHHHHHHHHHHHHEEE
uniprot: HHHHHHHH---EEEEEEEEE--------------HHHHHHHHHHHHHHHH--EEEEEEEE
AA: ATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHSIITNLLYHVVGWTDW
 
Conf: 000169999986689878535895679769986202333102244469939999542389
online: EECCCCCCCCCCCCCCCCCCEEEECCCCEEEECCHHHHHHHHCCCCCCCCEEEEEEECCC
Conf: 023699999860001325228998208703226821232123444679927984435078
local: CCCCCCCCCCCCEECCCCCCEEEEECCCCEEECCCEEEECCCCCCCCCCCEEEEEEEECC
uniprot: E-----------------EEEEEHHH-EEEE-HHHHHHHHHH-------EEEEEEEEE--
AA: NLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQK
 
Conf: 99028999928998999999099993779999099304998518951899999309
online: CCCEEEEEECCCCCEEEEEECCCCCCEEEEEEECCCCEEEEECCCCEEEEEEEEEC
Conf: 99538995649997799999146898214741998642000389842568774139
local: CCCCEEEEECCCCCEEEEEEECCCCCEEEEECCCCCCCCCCCCCCCEEEEEEEECC
uniprot: EEEEEEEE-----EEEEEEE-EEE-EEEEEEECCCEEEEEEE---EEEEEEE----
AA: NDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ

The two results differ considerably, which suggests that the online version of PSIPRED uses different parameters than the local version. Compared to the secondary structure given in UniProt, many regions are predicted incorrectly by both versions.
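To put a number on how much two of the rows above agree, the per-residue agreement can be computed directly from the aligned strings. The following is a minimal sketch and not part of the original analysis. The function names are made up for illustration, and it assumes that every symbol other than H and E (including the "-" used in the UniProt rows) can be treated as coil. The strings at the bottom are toy fragments, not the glucocerebrosidase rows.

```python
def three_state(ss: str) -> str:
    """Reduce a secondary-structure string to the three states H, E and C.

    Assumption: everything that is not H or E (for example '-' or 'C')
    is counted as coil.
    """
    return "".join(ch if ch in "HE" else "C" for ch in ss.upper())


def agreement(predicted: str, reference: str) -> float:
    """Fraction of aligned positions that carry the same three-state label."""
    if len(predicted) != len(reference):
        raise ValueError("strings must have the same length")
    if not predicted:
        raise ValueError("strings must not be empty")
    pred, ref = three_state(predicted), three_state(reference)
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / len(pred)


# Toy fragments for illustration only.
online = "CCHHHHEEEC"
local = "CCEHHHEECC"
uniprot = "--HHHH-EE-"
print(f"online vs uniprot: {agreement(online, uniprot):.2f}")
print(f"local  vs uniprot: {agreement(local, uniprot):.2f}")
```

To apply this to the output above, the "online", "local" and "uniprot" rows of all blocks would have to be concatenated first, so that each string covers the full sequence.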

Jpred3

Jpred3 was published in 2008 by Christian Cole, Jonathan D. Barber and Geoffrey J. Barton as "The Jpred 3 secondary structure prediction server" in Nucl. Acids Res. Its Jnet algorithm predicts secondary structure and solvent accessibility from alignment profiles. For this it uses the position-specific scoring matrix (PSSM) from PSI-BLAST and a hidden Markov model. The prediction itself is made with a neural network.<ref>http://nar.oxfordjournals.org/content/36/suppl_2/W197.full</ref>
Only the protein sequence is needed as input. Alternatively, a multiple sequence alignment can be supplied.

OrigSeq MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANH
Jnet -----------------HHHHHHHHHHHHHHHHHHHH---------------EEEEE-----------------EEEEEEE------------------
jhmm ----------------HHHHHHHHHHHHHHHHHHHHH---------------EEEEE-----------------EEEEEEE------------------
jpssm ------------------HHHHHHHHHHHHHHHHHH---------------EEEEEE-----------------EEEEEEE------------------
uniprot ------------------------------------------------EEEE-EEEEEE--------------EEEEEEEE-----EEEEEEE-EEE--
 
OrigSeq TGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIP
Jnet ----EEEEEE----EEEEEEEEEEHHHHHHHHH----HHHHHHHHHHH-------EEEEEE----------------------------------HHHH
jhmm ----EEEEE-----EEEEEEEEEEEHHHHHHHH----HHHHHHHHHHH-------EEEEEE----------------------------------HHHH
jpssm ----EEEEEE----EEEEEEEE-HHHHHHHHHH----HHHHHHHHHHH------EEEEEEEE--------------------------------HHHHH
uniprot --EEEEEEEEEEEEEE--EEEEE--HHHHHHH----HHHHHHHHHHHH-CCCC---EEEEEEE--EEEEE------EEE----------HHHH--HHHH
 
OrigSeq LIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRD
Jnet HHHHHHHH----EEEEE--------EE-------------------HHHHHHHHHHHHHHHHH----EEEEE---------------------HHHHHH
jhmm HHHHHHHH----EEEEE-----------------------------HHHHHHHHHHHHHHHHH----EEEEE---------------------HHHHHH
jpssm HHHHHHHHH---EEEEE--------EEE-----------------HHHHHHHHHHHHHHHHHH----EEEEEE--------------------HHHHHH
uniprot HHHHHHH-----EEEEEEE---HHH----EEEEE-EEEE----HHHHHHHHHHHHHHHHHHH-----EEEEE-----HHH------------HHHHHHH
 
OrigSeq FIARDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSW
Jnet HHHHHHHHHHHH-----EEEEEE--------HHHHHHH--HHHHHHHHEEEE----------HHHHHHHHHH-----EEEEEEEE--------------
jhmm HHHHHHHHHHHH------EEEEE-------HHHHHHHH--H-HHHHHHEEEE----------HHHHHHHHHH-----EEEEEEEE--------------
jpssm HHHHHHHHHHHH-----EEEEEEE-------HHHHHH----HHHHHH--EE-----------HHHHHHHHHH-----EEEEEEE--------------H
uniprot HHH-HHHHHH--CCCCEEEEEEEEEHHH--HHHHHHH--HHHH----EEEEEEE------HHHHHHHHHHH---EEEEEEEEE--------------HH
 
OrigSeq DRGMQYSHSIITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAVALMHPDGSA
Jnet HHHHHHHHHHHHHHHHHHHHHHHHHHH----------------EEEEE----EEEE---HHHHHHHH-------EEEEE-------EEEEEEE-----E
jhmm -HHHHHHHHHHHHHHHHHHHHHHHHHH----------------EEEEE----EEEE---HHHHHHHH-------EEEE--------EEEEEEE-----E
jpssm HHHHHHHHHHHHHHHHHHHHHHHHHHE----------------EEEEE----EEEE--HHHHHHHH--------EEEEEE------EEEEEEEE----E
uniprot HHHHHHHHHHHHHH--EEEEEEEEE-----------------EEEEEHHH-EEEE-HHHHHHHHHH-------EEEEEEEEE--EEEEEEEE-----EE
 
OrigSeq VVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ
Jnet EEEEEE-----EEEEEEE---EEEEEEE----EEEEEEE--
jhmm EEEEEE-----EEEEEEE---EEEEEEE----EEEEEEE--
jpssm EEEEEE----EEEEEEEE---EEEEEEE---EEEEEEE---
uniprot EEEEE-EEE-EEEEEEECCCEEEEEEE---EEEEEEE----

Jpred3 predicts most of the protein correctly. In some places where UniProt annotates helices or beta sheets, nothing is predicted, and only once does it predict a sheet instead of a helix.
For the prediction it used many BLAST hits with an E-value of 0 and one hit with an E-value of 2e-52:
2wkl, 3keh, 3ke0, 3gxm, 3gxi, 3gxf, 3gxd, 2wcg, 2vt0, 2v3f, 2v3e, 2v3d, 2nt1, 2nt0, 2nsx, 2j25, 2f61, 1y7v, 1ogs, 2wnw.
All but 2wnw are glucosylceramidase structures from Homo sapiens, and 1ogs is exactly the structure we use. 2wnw is a hydrolase activated by a transcription factor from Salmonella typhimurium, which also seems to be a glucosylceramidase. Some of the structures used are shown in the following pictures.

Structure of 2wkl
Source: http://www.ebi.ac.uk/pdbsum/2wkl
Structure of 3keh
Source: http://www.ebi.ac.uk/pdbsum/3keh
Structure of 1ogs
Source: http://www.ebi.ac.uk/pdbsum/1ogs
Structure of 2wnw
Source: http://www.ebi.ac.uk/pdbsum/2wnw

Comparison with DSSP

DSSP, which stands for Define Secondary Structure of Proteins, was developed by Wolfgang Kabsch and Chris Sander, who published it in 1983 in Biopolymers under the title "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features".<ref>http://swift.cmbi.ru.nl/gv/dssp/</ref>
The DSSP algorithm assigns secondary structure from the atomic coordinates of a solved structure. To do so, it identifies hydrogen bonds using an electrostatic definition. Each residue is then assigned one of eight possible secondary structure types according to its pattern of hydrogen bonds. DSSP is therefore not a prediction tool.<ref>http://en.wikipedia.org/wiki/DSSP_%28protein%29</ref>
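The electrostatic hydrogen-bond definition mentioned above can be illustrated with a short sketch. It follows the formula from the Kabsch and Sander paper, with partial charges of 0.42e and 0.20e, a factor of 332 and a cutoff of -0.5 kcal/mol. The function names and the example coordinates are hypothetical and only show the idea; this is not the DSSP program itself.

```python
import math

Q1 = 0.42      # partial charge on the C=O group, in units of the electron charge
Q2 = 0.20      # partial charge on the N-H group, in units of the electron charge
F = 332.0      # dimensional factor; distances in Angstrom give kcal/mol
CUTOFF = -0.5  # kcal/mol; energies below this count as a hydrogen bond


def hbond_energy(c, o, n, h):
    """Electrostatic interaction energy between a C=O and an N-H group.

    Each argument is an (x, y, z) coordinate tuple in Angstrom.
    """
    r_on = math.dist(o, n)
    r_ch = math.dist(c, h)
    r_oh = math.dist(o, h)
    r_cn = math.dist(c, n)
    return Q1 * Q2 * F * (1 / r_on + 1 / r_ch - 1 / r_oh - 1 / r_cn)


def is_hbond(c, o, n, h):
    """True if the energy is below the cutoff."""
    return hbond_energy(c, o, n, h) < CUTOFF


# Hypothetical, idealised linear arrangement C=O ... H-N (not taken from 1ogs).
c, o = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0)
h, n = (3.13, 0.0, 0.0), (4.13, 0.0, 0.0)
print(round(hbond_energy(c, o, n, h), 2), is_hbond(c, o, n, h))
```

The secondary structure types are then derived from recurring patterns of such bonds, for example bonds between residue i and residue i+4 for an α-helix.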

                    10        20        30        40        50        60
                     |         |         |         |         |         |
   1 -   60 ARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANH
   1 -   60      SSS TTTTSSSSSSTT            TTSSSSSSSSTTT  TSSSSSS  TT
   1 -   60                     *                                      *
   1 -   60 AAA AAAAAAAA        A     AA  A  A    A    AA A  AA   A AAAA
                    70        80        90       100       110       120
                     |         |         |         |         |         |
  61 -  120 TGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIR
  61 -  120    TSSSSSSSSSSSSS  SSSSS  HHHHHHHTTT HHHHHHHHHHHHTTTTT   SSS
  61 -  120   *    * *
  61 -  120 A A  A   A AAAA A             A  A  AAA  A   A    AA
                   130       140       150       160       170       180
                     |         |         |         |         |         |
 121 -  180 VPMASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWT
 121 -  180 SSST  TTTTT   T  TTT TT TT    HHHHTTHHHHHHHHHHH TT  SSSSSST
 121 -  180          **       *           **
 121 -  180          AAA    AAAA AA AA    A  AA   A  AA AAA A A
                   190       200       210       220       230       240
                     |         |         |         |         |         |
 181 -  240 SPTWLKTNGAVNGKGSLKGQPGDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGL
 181 -  240   333 TT TTTTT   TT TTTHHHHHHHHHHHHHHHHHHHTT   TSSST TTTT333
 181 -  240        ***** *
 181 -  240       AAAAAA A   AAA  AAA A   A   A  A   AAA A             A
                   250       260       270       280       290       300
                     |         |         |         |         |         |
 241 -  300 LSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPE
 241 -  300 TTT  T      HHHHHHHHHHTHHHHHHTTTTTTTSSSSSSSS333TTHHHHHHHTTHH
 241 -  300  ** *                                        *
 241 -  300 AAAAA      A AA  A   A   A  AA A AA A      A A   A  A   A AA
                   310       320       330       340       350       360
                     |         |         |         |         |         |
 301 -  360 AAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRG
 301 -  360 HHTT  SSSSSSSTTT   HHHHHHHHHHH TTTSSSSSSSS    TTT T  TT HHHH
 301 -  360                **           *   *             **
 301 -  360   AA        A  AA A AA   A AA  AA            AAA A  A    A
                   370       380       390       400       410       420
                     |         |         |         |         |         |
 361 -  420 MQYSHSIITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHL
 361 -  420 HHHHHHHHHHHHTTSSSSSSSST   TTT   TT      TSSSS333TSSSS HHHHHH
 361 -  420                               *** ***        **
 361 -  420     A      AA            AAA    A A A        AAA
                   430       440       450       460       470       480
                     |         |         |         |         |         |
 421 -  480 GHFSKFIPEGSQRVGLVASQKNDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFL
 421 -  480 HHHHTT  TT SSSSSSSTT  TSSSSSSS TTT SSSSSSS TTT SSSSSSSTTTSSS
 421 -  480                 *****                               *     *
 421 -  480         A       AAAAAAA       AAA         A AAA A   A AA  A
                   490
                     |
 481 -  497 ETISPGYSIHTYLWHRQ
 481 -  497 SSSS TTSSSSSSS
 481 -  497                 *
 481 -  497 A A A          AA

            500       510       520       530       540       550
              |         |         |         |         |         |
 498 -  557 ARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANH
 498 -  557      SS  TTTT SSSS TT      T     TTSSSSSSSSTTT  TSSSSSS  T
 498 -  557   * *                           *  *             ****** *
 498 -  557 AAA AAAAAAAA        A    AAAAAA AA    A    AA A   A A A AAAA
            560       570       580       590       600       610
              |         |         |         |         |         |
 558 -  617 TGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIR
 558 -  617   TTSSSSSSSSSSSSS  SSSSS  HHHHHHHHTT HHHHHHHHHHHHTTTTT   SSS
 558 -  617                               *     *
 558 -  617 AAAA A   A AAAA A             A  AA AAA  A   A    A
            620       630       640       650       660       670
              |         |         |         |         |         |
 618 -  677 VPMASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWT
 618 -  677 SSST  TTTTT   T  TTT TT TT    HHHHTTHHHHHHHHHHH TT  SSSSSST
 618 -  677          ***    *             *
 618 -  677          AAA    AAAA  A AA A  A  AAA     AA AAA AAA
            680       690       700       710       720       730
              |         |         |         |         |         |
 678 -  737 SPTWLKTNGAVNGKGSLKGQPGDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGL
 678 -  737   333 TT TTTTT   TT TTTHHHHHHHHHHHHHHHHHHHTT   TSSST TT33333
 678 -  737   *
 678 -  737   A   AAA  A A   AAA      A   A   A  A   AAA A             A
            740       750       760       770       780       790
              |         |         |         |         |         |
 738 -  797 LSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPE
 738 -  797 TTT  T      HHHHHHHHHHTHHHHHHTTTTTTTSSSSSSSS333TTHHHHHHHTTHH
 738 -  797   ***                                        **
 738 -  797 AAA A      A A  AA   A   A  AA A AA A      A A   A  A   A AA
            800       810       820       830       840       850
              |         |         |         |         |         |
 798 -  857 AAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRG
 798 -  857 HHTT  SSSSSSSTTT   HHHHHHHHHHH TTTSSSSSSSST  TTTT T  TT HHHH
 798 -  857                **   *       ** *              **
 798 -  857   A         A  AA A AA   A AAA AA            AAA A  A    A
            860       870       880       890       900       910
              |         |         |         |         |         |
 858 -  917 MQYSHSIITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHL
 858 -  917 HHHHHHHHHHHHTTSSSSSSSST   TTT   TT      TSSSS333TSSSS HHHHHH
 858 -  917                               *   **         **
 858 -  917     A      AA             AA    AAAA         AAA
            920       930       940       950       960       970
              |         |         |         |         |         |
 918 -  977 GHFSKFIPEGSQRVGLVASQKNDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFL
 918 -  977 HHHHTT  TT SSSSSSSTT  TSSSSSSS TTT SSSSSSS TTT SSSSSSSTTTSSS
 918 -  977
 918 -  977         A       AAAAAAA        AA         A AAA A   A A   A
            980       990
              |         |
 978 -  994 ETISPGYSIHTYLWHRQ
 978 -  994 SSSS TTSSSSSSS
 978 -  994
 978 -  994 A A            AA

The DSSP result has four lines per block: the first line gives the amino acids, the second the secondary structure, the third the residues involved in symmetry contacts, which are marked with an asterisk (*), and the fourth the solvent accessible residues, which are marked with an A.
The secondary structure line contains the elements H, T, 3 and S, where H indicates an alpha helix, T a hydrogen bonded turn, 3 a residue in an isolated beta-bridge and S a bend, which is a region of high curvature.
The sequence differs from our GBA sequence because DSSP works on the PDB file, which lacks the signal peptide at the beginning and lists the protein twice (residues 1 to 497 and 498 to 994, called domains A and B here). In the following we compare the secondary structure of domain A from DSSP with the secondary structure in UniProt.

uniprot ----------EEEE-EEEEEE--------------EEEEEEEE-----EEEEEEE-EEE-
DSSP -----SSS-TTTTSSSSSSTT------------TTSSSSSSSSTTT--TSSSSSS--TT-
 
uniprot ---EEEEEEEEEEEEEE--EEEEE--HHHHHHH----HHHHHHHHHHHH-CCCC---EEE
DSSP TSSSSSSSSSSSSS--SSSSS--HHHHHHHTTT-HHHHHHHHHHHHTTTTT---SSS
 
uniprot EEEE--EEEEE------EEE----------HHHH--HHHHHHHHHHH-----EEEEEEE-
DSSP SSST--TTTTT---T--TTT-TT-TT----HHHHTTHHHHHHHHHHH-TT--SSSSSST-
 
uniprot --HHH----EEEEE-EEEE----HHHHHHHHHHHHHHHHHHH-----EEEEE-----HHH
DSSP --333-TT-TTTTT---TT-TTTHHHHHHHHHHHHHHHHHHHTT---TSSST-TTTT333
 
uniprot ------------HHHHHHHHHH-HHHHHH--CCCCEEEEEEEEEHHH--HHHHHHH--HH
DSSP TTT--T------HHHHHHHHHHTHHHHHHTTTTTTTSSSSSSSS333TTHHHHHHHTTHH
 
uniprot HH----EEEEEEE------HHHHHHHHHHH---EEEEEEEEE--------------HHHH
DSSP HHTT--SSSSSSSTTT---HHHHHHHHHHH-TTTSSSSSSSS----TTT-T--TT-HHHH
 
uniprot HHHHHHHHHHHH--EEEEEEEEE-----------------EEEEEHHH-EEEE-HHHHHH
DSSP HHHHHHHHHHHHTTSSSSSSSST---TTT---TT------TSSSS333TSSSS-HHHHHH
 
uniprot HHHH-------EEEEEEEEE--EEEEEEEE-----EEEEEEE-EEE-EEEEEEECCCEEE
DSSP HHHHTT--TT-SSSSSSSTT--TSSSSSSS-TTT-SSSSSSS-TTT-SSSSSSSTTTSSS
 
uniprot EEEE---EEEEEEE---
DSSP SSSS-TTSSSSSSS---


The DSSP assignment agrees well with UniProt. Most helices match, and the beta sheets are mostly marked as bends (S). In some places helices or beta sheets are replaced by turns, but overall the two annotations fit quite well.

Prediction of disordered regions

General

Disordered regions are regions with no fixed secondary and therefore no fixed tertiary structure. The Ramachandran angles of these regions are very flexible, which means that they can adopt more than one secondary structure, often depending on binding to a substrate.
There are two types of disordered regions: extended, random-coil-like regions and collapsed, molten-globule-like regions. They have many different functions. For example, they are responsible for DNA, RNA and protein recognition and for the specificity and affinity of binding, and they are often involved in regulatory functions.<ref>http://www.pondr.com/pondr-tut1.html</ref>

DISOPRED

DISOPRED was published by Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF and Jones DT in 2004 in the Journal of Molecular Biology with the title "Prediction and functional analysis of native disorder in proteins from the three kingdoms of life".<ref>http://bioinf.cs.ucl.ac.uk/disopred/</ref>
The method is based on a neural network which was trained with the SVMlight support vector machine package. DISOPRED first runs PSI-BLAST against a filtered sequence database and then uses the position-specific scoring matrix from the final iteration to generate its inputs.<ref>http://cms.cs.ucl.ac.uk/typo3/fileadmin/bioinf/Disopred/disopred_help.html</ref>

DISOPRED predictions for a false positive rate threshold of: 5%

conf 999999999988777630000000000000000000000000000000000000000000
pred ****************............................................
AA MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNAT
conf 000000000000000000000001456676677776543210000000000000000000
pred .........................************.......................
AA YCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGF
conf 000000000000000000000000000000000000000000000000000000000000
pred ............................................................
AA GGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDD
conf 000000000000000000000000000000000000000000000000000021000000
pred ............................................................
AA FQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQP
conf 000000000000000000000000000000000000000000000000000000000000
pred ............................................................
AA GDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIA
conf 000000000000000000000000000000000000000000000000000000000000
pred ............................................................
AA RDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAK
conf 000000000000000000000000000000000000000000000000000000000000
pred ............................................................
AA ATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHSIITNLLYHVVGWTDW
conf 000000000000000000000000000000000000000000000000000000022331
pred ............................................................
AA NLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQK
conf 00000000000000000000000000000000000000000000000000000004
pred ........................................................
AA NDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ

Asterisks (*) represent disorder predictions and dots (.) prediction of order. The confidence estimates give a rough indication of the probability that each residue is disordered.
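The asterisk and dot notation can be turned into residue ranges with a few lines of code. This is a minimal sketch with a hypothetical helper name. It assumes that the "pred" rows have been concatenated into one string covering the whole sequence, and it reports 1-based, inclusive positions.

```python
from itertools import groupby


def disordered_regions(pred: str, marker: str = "*"):
    """Return (start, end) tuples for every run of `marker` in `pred`.

    Positions are 1-based and inclusive, matching residue numbering.
    """
    regions = []
    position = 1
    for symbol, run in groupby(pred):
        length = len(list(run))
        if symbol == marker:
            regions.append((position, position + length - 1))
        position += length
    return regions


# Toy example, not the DISOPRED output above.
print(disordered_regions("***....**."))
```

Each returned range can then be compared with annotated features such as the signal peptide or a beta sheet.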

Interestingly, the signal peptide is marked as a disordered region with very high confidence. The second region predicted as disordered is part of a beta sheet, so this prediction may be wrong.

POODLE

POODLE stands for Prediction Of Order and Disorder by machine LEarning. It was developed by S. Hirose, K. Shimizu, N. Inoue, S. Kanai and T. Noguchi and was published in 2008 in the CASP8 Proceedings as "Disordered region prediction by integrating POODLE series".

It consists of three predictions:

  • prediction of short disordered regions (POODLE-S: short disordered regions, i.e. regions missing in the X-ray structure or regions with high B-factors)
  • prediction of long disordered regions (POODLE-L: mainly regions longer than 40 consecutive amino acids)
  • prediction of unfolded proteins

POODLE-I is based on a work-flow approach. POODLE uses machine learning and only needs the amino acid sequence for prediction.<ref>http://mbs.cbrc.jp/poodle/help.html</ref>

POODLE-I

result of POODLE-I (POODLE series only)

Predicted as disordered:

no. AA ORD/DIS Prob.
1 M D 0.851
2 E D 0.813
3 F D 0.781
4 S D 0.753
5 S D 0.757
6 P D 0.813
7 S D 0.777
8 R D 0.743
9 E D 0.699
10 E D 0.707
11 C D 0.711
12 P D 0.656
13 K D 0.633
14 P D 0.609
15 L D 0.581
16 S D 0.553
17 R D 0.535
18 V D 0.51
...
95 I D 0.58
96 Q D 0.812
97 A D 0.892
98 N D 0.744
99 H D 0.551

The first region predicted as disordered is part of the signal peptide. The second region predicted as disordered is part of a beta sheet.

POODLE-S

result of POODLE-S (missing residues)
result of POODLE-S ()

Predicted as disordered:

no. AA  ORD/DIS/xray  Prob./xray  ORD/DIS/B-factor  Prob./B-factor 
1 M D 0.764 D 1
2 E D 0.707 D 0.901
3 F D 0.731 D 0.882
4 S D 0.759 D 0.875
5 S D 0.764 D 0.847
6 P D 0.75 D 0.825
7 S D 0.782 D 0.811
8 R D 0.751 D 0.791
9 E D 0.707 D 0.711
10 E D 0.692 D 0.661
11 C D 0.679 D 0.568
12 P D 0.66 D 0.596
13 K D 0.633 D 0.557
14 P D 0.607 D 0.517
15 L D 0.588 O 0.493
16 S D 0.55 O 0.489
17 R D 0.523 O 0.424
18 V D 0.534 O 0.391
...
100 T O 0.368 D 0.571
101 G O 0.29 D 0.54
102 T O 0.243 D 0.555
...
360 K D 0.527 O 0.133
361 A D 0.534 O 0.138
362 T D 0.577 O 0.162
363 L D 0.594 O 0.162
364 G D 0.513 O 0.163

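POODLE-S also predicts part of the signal peptide as disordered, both when it looks for regions that are missing in the X-ray structure and when it looks for regions with high B-factors. Further along the sequence there are two more regions predicted as disordered: one near residue 100, based on high B-factors, and one near residue 360, based on missing X-ray residues. The first one has no secondary structure annotation in UniProt, the second one is part of a helix.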

POODLE-L

result of POODLE-L

POODLE-L predicted no disordered regions. According to this prediction, there are no disordered regions longer than 40 amino acids.

IUPred

IUPred is a method by Zsuzsanna Dosztányi, Veronika Csizmók, Péter Tompa and István Simon, which was published in 2005 in Bioinformatics under the title "IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content".

The idea of IUPred is to estimate the ability of polypeptides to form stabilizing contacts. It depends on the surrounding sequence and its chemical properties. Intrinsically unstructured regions (IUs) cannot form such contacts. With a 20 by 20 energy predictor matrix the energies can be calculated and with that the probability of IUs.<ref>http://iupred.enzim.hu/Theory.html</ref>

For IUPred you can choose one of these three options<ref>http://iupred.enzim.hu/Help.html</ref>:

  • long disorder: predicts context-independent global disorder with at least 30 consecutive residues of predicted disorder (neighbourhood of 100 residues is considered)
  • short disorder: predicts short, probably context-dependent, disordered regions, e.g. missing residues in the X-ray structure (neighbourhood of 25 residues is considered)
  • structured domains: predicts putative structured domains

IUPred: short disordered regions

Diagram of IUPred for short disorders

Disordered regions:

Position  Residue  Disorder Tendency
1 M 0.9753
2 E 0.9369
3 F 0.9280
4 S 0.9009
5 S 0.8857
6 P 0.7869
7 S 0.7418
8 R 0.7034
9 E 0.5992
10 E 0.5549
...
85 G 0.5126
86 R 0.4458
87 R 0.5412
88 M 0.5173
89 E 0.5173
90 L 0.5992
91 S 0.5846
92 M 0.5900
93 G 0.5900
94 P 0.5900
95 I 0.5374
...
103 G 0.5173
104 L 0.5084
...
533 W 0.5514
534 R 0.5992
535 R 0.6124
536 Q 0.6474

In short-disorder mode, IUPred predicts the N-terminal residues as disordered with high probability. The likely reason is the signal peptide located there. Whether the other regions are really disordered is hard to say, because their scores are only slightly above 0.5. A disordered region of length two is very short and therefore not very probable. For the C-terminal end, UniProt gives no secondary structure, so there may be a disordered region.
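The grouping of per-residue scores into regions, as discussed above, can be sketched as follows. The helper name is hypothetical, and the cutoff of 0.5 is an assumption based on the values listed in the table, where the residues reported as disordered have a tendency above 0.5. Positions that are not consecutive are never merged, so elided parts of a table do not produce artificial regions.

```python
def segments_above(scores, cutoff=0.5):
    """Group consecutive positions with score >= cutoff into (start, end) segments.

    `scores` is an iterable of (position, score) pairs with 1-based positions.
    """
    segments = []
    start = prev = None
    for pos, score in sorted(scores):
        is_disordered = score >= cutoff
        contiguous = prev is not None and pos == prev + 1
        # Close the open segment if the run is interrupted.
        if start is not None and not (is_disordered and contiguous):
            segments.append((start, prev))
            start = None
        if is_disordered and start is None:
            start = pos
        prev = pos
    if start is not None:
        segments.append((start, prev))
    return segments


# A few rows copied from the short-disorder table above.
rows = [(85, 0.5126), (86, 0.4458), (87, 0.5412), (88, 0.5173),
        (103, 0.5173), (104, 0.5084)]
for start, end in segments_above(rows):
    print(start, end, "length", end - start + 1)
```

Reporting the segment length alongside the range makes it easy to spot very short regions such as the one at positions 103 to 104.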

IUPred: long disordered regions

Diagram of IUPred for long disorders

Disordered regions:

Position  Residue  Disorder Tendency
1 M 0.4864
2 E 0.5017
3 F 0.5707
...
87 R 0.4979
88 M 0.4979
89 E 0.4979
90 L 0.6136
91 S 0.5901
92 M 0.5992
93 G 0.5017
...
229 A 0.5055
230 V 0.5055
231 N 0.5211
...
235 S 0.5055
236 L 0.5139

In long-disorder mode, IUPred predicts a few disordered regions, but only with scores barely above the threshold. The regions are also very short, so it is unlikely that they are really disordered.

IUPred: structured regions

Diagram of IUPred for structured regions

In structured-domain mode, IUPred predicts a globular domain from position 4 to 536. All residues except the first three are therefore predicted as structured.

META-Disorder

META-Disorder was published in 2009 by Avner Schlessinger, Marco Punta, Guy Yachdav, Laszlo Kajan and Burkhard Rost as "Improved Disorder Prediction by Combination of Orthogonal Approaches" in PLoS ONE.<ref>http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0004433</ref>
It is a method which combines NORSnet<ref>https://www.rostlab.org/owiki/index.php/Norsnet</ref>, PROFbval<ref>https://rostlab.org/owiki/index.php/Profbval</ref> and Ucon<ref>https://www.rostlab.org/owiki/index.php/UCON</ref> to predict disordered regions. As input only the amino acid sequence is needed.<ref>https://www.rostlab.org/owiki/index.php/Metadisorder</ref>

Result of PredictProtein for disordered regions.

META-Disorder predicts the first six residues as disordered:

position residue score
1 M 0.636
2 E 0.596
3 F 0.596
4 S 0.586
5 S 0.561
6 P 0.551

The region predicted to be disordered is part of the signal peptide. Whether it is really disordered is hard to say, because the beginning of a protein is often mispredicted as disordered.

Prediction of transmembrane alpha-helices and signal peptides

General

Transmembrane topology

The topology of a membrane protein is characterized by the number of membrane-spanning segments in the protein. The transmembrane regions of the protein are hydrophobic and have a length of approximately 15-30 residues, which is enough to cross the lipid bilayer of the membrane once. The transmembrane regions are connected by hydrophilic loops, which are located outside the membrane. These attributes can be used to predict the transmembrane topology of a protein.
Predictors: TMHMM, OCTOPUS
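How hydrophobicity and segment length can be used for prediction is easiest to see with a classic sliding-window hydropathy scan. The sketch below uses the Kyte-Doolittle scale with a window of 19 residues and a cutoff of 1.6. This is a much simpler technique than the models used by TMHMM or OCTOPUS and is shown only to illustrate the principle. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19):
    """Mean hydropathy of every window of `window` residues along `seq`."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(seq: str, window: int = 19, cutoff: float = 1.6):
    """1-based start positions of windows whose mean hydropathy is >= cutoff."""
    profile = hydropathy_profile(seq, window)
    return [i + 1 for i, mean in enumerate(profile) if mean >= cutoff]


# Toy sequence: a hydrophilic stretch, a leucine-rich stretch, a hydrophilic stretch.
toy = "D" * 10 + "L" * 19 + "K" * 10
print(candidate_tm_windows(toy))
```

A scan like this cannot tell a transmembrane helix from the hydrophobic core of a signal peptide, which is exactly the cross-prediction problem that combined predictors address.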

Signal peptides

Signal peptides are located at the N-terminus of a protein sequence and direct the transport of the protein to its correct location.
Predictors: SignalP, TargetP

TODO: Description of different signal peptides.

Combined transmembrane and signal peptide prediction

The hydrophobic region of a transmembrane helix is highly similar to that of a signal peptide, which leads to cross-predictions when conventional transmembrane topology and signal peptide predictors such as TMHMM and SignalP are used. Predictors that contain submodels for both make fewer cross-prediction errors and help to discriminate against false positives. Furthermore, a predicted signal peptide indicates that the N-terminus of the protein is non-cytoplasmic, which helps to assign the orientation of the protein. <ref>Käll L., Krogh A., & Sonnhammer E. L. (2007) Advantages of combined transmembrane topology and signal peptide prediction - the Phobius web server. Nucleic Acids Res., Vol. 35, Web server issue, pp. 429-32</ref>
Predictors: Phobius, PolyPhobius, SPOCTOPUS.

TMHMM

TMHMM is a method to predict the transmembrane topology of membrane-spanning proteins. It is based on a hidden Markov model with an architecture of 7 types of states (helix core, helix caps on both sides, one loop on the cytoplasmic side, two loops on the non-cytoplasmic side and a globular domain in the middle of each loop) which correspond to the biological system. The method was established by Sonnhammer et al. in 1998 <ref>Sonnhammer EL, von Heijne G, Krogh A. A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol. 1998;6:175–182.</ref>.

Results

TMHMM posterior probabilities for glucocerebrosidase

sp|P04062|GLCM_HUMAN Length: 536
sp|P04062|GLCM_HUMAN Number of predicted TMHs: 0
sp|P04062|GLCM_HUMAN Exp number of AAs in TMHs: 1.77867
sp|P04062|GLCM_HUMAN Exp number, first 60 AAs: 1.62607
sp|P04062|GLCM_HUMAN Total prob of N-in: 0.06840
sp|P04062|GLCM_HUMAN TMHMM2.0 outside 1 536

TMHMM predicts no transmembrane segments for the sequence of glucocerebrosidase, which is correct.
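For batch processing, the text output shown above can be read with a small parser. The following is a minimal sketch that assumes the line layout exactly as printed here; the function name `parse_tmhmm` is hypothetical, and a leading "#" (which such summary lines may carry in raw output) is tolerated.

```python
import re

# TMHMM output for glucocerebrosidase, copied from the listing above.
tmhmm_output = """\
sp|P04062|GLCM_HUMAN Length: 536
sp|P04062|GLCM_HUMAN Number of predicted TMHs: 0
sp|P04062|GLCM_HUMAN Exp number of AAs in TMHs: 1.77867
sp|P04062|GLCM_HUMAN Exp number, first 60 AAs: 1.62607
sp|P04062|GLCM_HUMAN Total prob of N-in: 0.06840
sp|P04062|GLCM_HUMAN TMHMM2.0 outside 1 536
"""


def parse_tmhmm(text):
    """Split the output into summary values and topology segments."""
    summary = {}
    segments = []
    for line in text.splitlines():
        line = line.lstrip("# ").strip()
        if not line:
            continue
        # Summary lines look like "<id> <label>: <number>".
        match = re.match(r"(\S+)\s+(.+?):\s+([\d.]+)$", line)
        if match:
            summary[match.group(2)] = float(match.group(3))
            continue
        # Segment lines look like "<id> TMHMM2.0 <state> <start> <end>".
        fields = line.split()
        if len(fields) == 5 and fields[1].startswith("TMHMM"):
            segments.append((fields[2], int(fields[3]), int(fields[4])))
    return summary, segments


summary, segments = parse_tmhmm(tmhmm_output)
helices = [s for s in segments if s[0] == "TMhelix"]
print(summary["Number of predicted TMHs"], segments, helices)
```

For glucocerebrosidase the only segment is labelled "outside" and spans the whole sequence, so the list of helices stays empty.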



Phobius and PolyPhobius

Phobius is based on a hidden Markov model which contains submodels for both transmembrane helices and signal peptides and therefore discriminates better between the two segment types than predictors for only one of them. This method was presented in 2004 by Käll et al. <ref>Käll L., et al. A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol. 2004;338:1027–1036.</ref> PolyPhobius is a prediction server that additionally uses an algorithm to include homology information. The performance of transmembrane topology and signal peptide prediction is increased by incorporating extra support from homologs. <ref>Käll L., et al. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics, 21 (Suppl 1):i251-i257, June 2005.</ref>

Results - Phobius

Phobius posterior probabilities for glucocerebrosidase

ID sp|P04062|GLCM_HUMAN
FT SIGNAL 1 39
FT REGION 1 19 N-REGION.
FT REGION 20 31 H-REGION.
FT REGION 32 39 C-REGION.
FT TOPO_DOM 40 536 NON CYTOPLASMIC.

Phobius predicts a signal peptide ranging from amino acid 1 to amino acid 39 of the sequence of glucocerebrosidase. This agrees with the information given on Uniprot <ref>http://www.uniprot.org/uniprot/P04062</ref> that the protein has a 39-residue signal sequence. The presence of a signal peptide explains the differences between the sequences of sp|P04062|GLCM_HUMAN and its corresponding PDB structure 1OGS: the sequence of 1OGS has 39 amino acids fewer than the sequence of sp|P04062|GLCM_HUMAN, as the signal peptide is missing in the mature structure.
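The length difference between the precursor and the mature protein follows directly from the feature table. The sketch below parses the Phobius output as printed above and derives the mature length; the function name `parse_phobius` is hypothetical.

```python
# Phobius output for glucocerebrosidase, copied from the listing above.
phobius_output = """\
ID sp|P04062|GLCM_HUMAN
FT SIGNAL 1 39
FT REGION 1 19 N-REGION.
FT REGION 20 31 H-REGION.
FT REGION 32 39 C-REGION.
FT TOPO_DOM 40 536 NON CYTOPLASMIC.
"""


def parse_phobius(text):
    """Return a list of (feature, start, end, note) tuples."""
    features = []
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) >= 4 and fields[0] == "FT":
            note = fields[4].strip() if len(fields) > 4 else ""
            features.append((fields[1], int(fields[2]), int(fields[3]), note))
    return features


features = parse_phobius(phobius_output)
signal_end = next(end for kind, _start, end, _note in features if kind == "SIGNAL")
total_length = max(end for _kind, _start, end, _note in features)
mature_length = total_length - signal_end
print(signal_end, total_length, mature_length)
```

With a 536-residue precursor and a signal peptide ending at position 39, the mature chain has 497 residues.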

Results - PolyPhobius

PolyPhobius posterior probabilities for glucocerebrosidase

ID sp|P04062|GLCM_HUMAN
FT SIGNAL 1 39
FT REGION 1 23 N-REGION.
FT REGION 24 34 H-REGION.
FT REGION 35 39 C-REGION.
FT TOPO_DOM 40 536 NON CYTOPLASMIC.

PolyPhobius returns the same prediction as Phobius: a 39-residue signal sequence, with slightly different region boundaries.

OCTOPUS and SPOCTOPUS

OCTOPUS (obtainer of correct topologies for uncharacterized sequences) was developed in 2007 by Viklund et al. The method combines hidden Markov models and artificial neural networks. Furthermore, OCTOPUS is the first method that integrates the modelling of reentrant, membrane dip and TM hairpin regions. <ref>Viklund, H., and A. Elofsson. 2008. OCTOPUS: improving topology prediction by two-track ANN-based preference scores and an extended topological grammar. Bioinformatics 24:1662-1668.</ref> SPOCTOPUS is an extension of the OCTOPUS algorithm which additionally predicts signal peptides in order to reduce the number of transmembrane regions predicted as signal peptides and vice versa. The method was first described by Viklund et al. in 2008. <ref>Viklund, H., et al. 2008. SPOCTOPUS: a combined predictor of signal peptides and membrane protein topology. Bioinformatics 24:2928-2929.</ref>

OCTOPUS - Results

Results of the topology prediction with Octopus

OCTOPUS predicts a transmembrane segment ranging from amino acid 16 to amino acid 36. The region from amino acid 1 to 15 is predicted to be cytoplasmic and the remaining amino acids (from amino acid 37 onwards) are predicted to be non-cytoplasmic. This is an example of a misclassification between transmembrane segments and signal peptides that occurs when a prediction tool for transmembrane segments only is used.

SPOCTOPUS - Results

Results of the topology prediction with Spoctopus

SPOCTOPUS correctly identifies the region from amino acid 1 to amino acid 39 as a signal peptide. Combining signal peptide and transmembrane segment prediction helps to eliminate the misclassification made by a pure transmembrane segment predictor (cf. results of OCTOPUS).

SignalP

SignalP uses a combination of several artificial neural networks and hidden Markov models to predict the presence and location of signal peptide cleavage sites in proteins from three different organism groups: eukaryotes, Gram-negative and Gram-positive bacteria. The current version 3.0 was published in 2004 by Bendtsen et al. <ref>Bendtsen JD, Nielsen H, von Heijne G, Brunak S. Improved prediction of signal peptides: SignalP 3.0. J Mol Biol. 2004;340:783–795.</ref>

Results

Prediction of signal peptide with SignalP 3.0 (image taken from the web server<ref>http://www.cbs.dtu.dk/services/SignalP/</ref>)

>sp|P04062|GLCM_HUMAN
Prediction: Signal peptide
Signal peptide probability: 0.516
Signal anchor probability: 0.001
Max cleavage site probability: 0.423 between pos. 39 and 40

SignalP predicts the signal peptide of glucocerebrosidase correctly. Even the cleavage site is located correctly between amino acid 39 and amino acid 40. The location of the different regions of the signal peptide can be seen in the illustration to the right.

TargetP

TargetP, a neural network-based tool, predicts the subcellular location of eukaryotic proteins based on the predicted presence of N-terminal presequences such as mitochondrial targeting peptides, chloroplast transit peptides and secretory pathway signal peptides. For the latter two, potential cleavage sites can be predicted, as TargetP uses ChloroP and SignalP. This method, which only uses N-terminal sequence information, was published in 2000 by Emanuelsson et al. <ref>Emanuelsson O., et al. Predicting Subcellular Localization of Proteins Based on their N-terminal Amino Acid Sequence. J.Mol.Biol. (2000) 300, 1005-1016.</ref>

Results

Name Len mTP SP other Loc RC TPlen
sp_P04062_GLCM_HUMAN 536 0.091 0.364 0.612 _ 4 -
cutoff 0.000 0.000 0.000

TargetP does not predict the signal peptide of glucocerebrosidase. Although the score for a signal peptide (0.364) is much higher than the one for a mitochondrial targeting peptide (0.091), the highest score is the one for "other" (0.612). TargetP therefore assigns the protein, with a very low reliability class (4), to a location other than chloroplast, mitochondrion or the secretory pathway.
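How the location and the reliability class (RC) relate to the three scores can be sketched as below. The RC boundaries used here (difference between the best and second-best score above 0.8, 0.6, 0.4 and 0.2 for classes 1 to 4, otherwise class 5) are stated as an assumption about the TargetP documentation and should be checked there; the function name `targetp_call` is hypothetical.

```python
def targetp_call(mtp, sp, other):
    """Pick the location with the highest score and derive a
    reliability class from the gap to the runner-up.

    The class boundaries are an assumption made for this sketch.
    """
    scores = {"M": mtp, "S": sp, "_": other}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (loc, best), (_second_loc, second) = ranked[0], ranked[1]
    diff = best - second
    if diff > 0.8:
        rc = 1
    elif diff > 0.6:
        rc = 2
    elif diff > 0.4:
        rc = 3
    elif diff > 0.2:
        rc = 4
    else:
        rc = 5
    return loc, rc


# Scores for glucocerebrosidase from the table above.
loc, rc = targetp_call(mtp=0.091, sp=0.364, other=0.612)
print(loc, rc)
```

Under these assumed boundaries the gap of 0.248 between "other" and the signal peptide score reproduces the location "_" and the reliability class 4 reported in the table.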


Further Examples of Use

As the protein glucocerebrosidase does not contain any transmembrane segments, the application of the different tools mentioned above will further be demonstrated with some other proteins: BACR_HALSA<ref>http://www.uniprot.org/uniprot/P02945</ref>, RET4_HUMAN<ref>http://www.uniprot.org/uniprot/P02753</ref>, INSL5_HUMAN<ref>http://www.uniprot.org/uniprot/Q9Y5Q6</ref>, LAMP1_HUMAN<ref>http://www.uniprot.org/uniprot/P11279</ref> and A4_HUMAN<ref>http://www.uniprot.org/uniprot/P05067</ref>.

Topology

The table below indicates the correct number of transmembrane segments and presence or absence of a signal peptide in the five proteins mentioned above. A more detailed description with the exact positions of the membrane segments and the cleavage sites can be seen in the corresponding Uniprot entries.

Protein # Transmembrane Segments Signal Peptide
BACR_HALSA 7 0
RET4_HUMAN 0 1
INSL5_HUMAN 0 1
LAMP1_HUMAN 1 1
A4_HUMAN 1 1


TMHMM

The topology predictions of TMHMM for the different proteins are illustrated in the graphic below. TMHMM predicts 6 transmembrane segments for BACR_HALSA, but indicates that a signal peptide is possible. According to Uniprot, BACR_HALSA consists of 7 transmembrane segments and no signal peptide. The last transmembrane helix, ranging from amino acid 217 to 236, was not predicted by TMHMM. The prediction that RET4_HUMAN and INSL5_HUMAN do not contain any transmembrane segments agrees with the corresponding Uniprot entries. TMHMM finds 2 transmembrane segments in LAMP1_HUMAN. The first of these is in reality a signal peptide, as indicated in the Uniprot entry, and is therefore another example of the confusion of transmembrane segments and signal peptides. The single transmembrane helix of A4_HUMAN was predicted correctly.

Membrane topologies predicted with TMHMM for different proteins
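The comparison between predicted and annotated segment counts can be tallied in a few lines. The reference counts come from the topology table above and the TMHMM counts from the description of the results; the variable names are hypothetical.

```python
# Number of transmembrane segments according to Uniprot (table above).
reference_tm = {
    "BACR_HALSA": 7,
    "RET4_HUMAN": 0,
    "INSL5_HUMAN": 0,
    "LAMP1_HUMAN": 1,
    "A4_HUMAN": 1,
}

# Number of transmembrane segments predicted by TMHMM (text above).
tmhmm_tm = {
    "BACR_HALSA": 6,
    "RET4_HUMAN": 0,
    "INSL5_HUMAN": 0,
    "LAMP1_HUMAN": 2,
    "A4_HUMAN": 1,
}

# Proteins for which the prediction deviates: name -> (predicted, reference).
mismatches = {
    protein: (tmhmm_tm[protein], reference_tm[protein])
    for protein in reference_tm
    if tmhmm_tm[protein] != reference_tm[protein]
}
correct = len(reference_tm) - len(mismatches)
print(mismatches, correct)
```

A count-based comparison like this only checks the number of segments; it does not verify their positions or the orientation of the protein.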

Phobius and PolyPhobius

The results of Phobius and PolyPhobius are identical. There are only minor differences in the length of the different regions of a signal peptide. The topology predictions of both methods for the different proteins are illustrated in the graphic below. Phobius and PolyPhobius predicted all of the proteins correctly. Sometimes the transmembrane segments are shifted a few amino acid positions to the right or to the left. The orientation of the proteins, the presence of signal peptides and the overall topology of the proteins agree with the corresponding Uniprot entries.

Membrane topologies predicted with Phobius and PolyPhobius for different proteins

OCTOPUS

OCTOPUS makes many false predictions resulting from a confusion between signal peptides and transmembrane regions. Each signal peptide was predicted as either a transmembrane segment or a reentrant/dip region. BACR_HALSA, which does not have a signal peptide, was predicted correctly.

Membrane topologies predicted with OCTOPUS

SPOCTOPUS

SPOCTOPUS, which extends the OCTOPUS algorithm with a signal peptide prediction, predicts the overall topology, the orientation of the protein and the presence of signal peptides correctly in each case. This is a very good example showing that a combined signal peptide and transmembrane prediction is more reliable and makes fewer errors than a transmembrane prediction alone.

Membrane topologies predicted with SPOCTOPUS

SignalP

If a signal peptide is present in the protein, SignalP predicts it with very high confidence, both with the neural networks and with the hidden Markov models. The presence of a signal peptide is predicted correctly for the proteins RET4_HUMAN, INSL5_HUMAN, LAMP1_HUMAN and A4_HUMAN. The results of the neural networks and the hidden Markov models differed for the protein BACR_HALSA, which does not have a signal peptide. The S-score of the neural networks indicates that there is a signal peptide and that the corresponding cleavage site is between positions 38 and 39. In contrast, the hidden Markov models predict a signal anchor.


Protein Signal Anchor Prob. Signal Peptide Prob. Cleavage Site Prob. Cleavage Site
BACR_HALSA 0.86 0.02 0.00 15-16
RET4_HUMAN 0.00 1.00 0.98 18-19
INSL5_HUMAN 0.00 1.00 0.91 22-23
LAMP1_HUMAN 0.00 1.00 0.85 28-29
A4_HUMAN 0.00 1.00 0.99 17-18


TargetP

TargetP predicts the signal peptides that are present (in the proteins RET4_HUMAN, INSL5_HUMAN, LAMP1_HUMAN and A4_HUMAN) correctly and with a very high reliability class (1-2) in each case. The method predicts a signal peptide for BACR_HALSA as well, but the reliability class is very low in this case (4), which indicates that the prediction is not very reliable.

Name Len mTP SP other Loc RC TPlen
sp_P02945_BACR_HALSA 262 0.019 0.897 0.562 S 4 116
sp_P02753_RET4_HUMAN 201 0.242 0.928 0.020 S 2 18
sp_Q9Y5Q6_INSL5_HUMA 135 0.074 0.899 0.037 S 1 22
sp_P11279_LAMP1_HUMA 417 0.043 0.953 0.017 S 1 28
sp_P05067_A4_HUMAN 770 0.035 0.937 0.084 S 1 17
cutoff 0.000 0.000 0.000

Discussion

The application of the different tools for transmembrane region and signal peptide prediction to a variety of proteins shows that predictors which combine both elements are more reliable and make fewer misclassification errors than single-purpose predictors. Therefore, if possible, a predictor for both transmembrane segments and signal peptides, like SPOCTOPUS or Phobius, should be used.

Prediction of GO terms

General

The Gene Ontology Consortium tries to unify the terminology of gene and gene product attributes across all species to decrease non-consistent descriptions in different databases. It therefore has developed three different ontologies: cellular component, molecular function and biological process. <ref>http://www.geneontology.org/GO.doc.shtml</ref>

In this section the results of several prediction and annotation servers for GO-terms are compared. For each method, precision and recall are calculated by comparing the paths in the GO-tree from the predicted terms back to the root with the paths from the correct GO-terms back to the root.
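One plausible reading of this path-based evaluation is sketched below: both term sets are expanded with all their ancestors, and precision and recall are computed on the expanded sets. The toy graph with the term names "root", "A", "B", "C" and "D" is purely hypothetical and does not represent real GO terms; the function names are likewise assumptions.

```python
def with_ancestors(terms, parents):
    """Return the terms plus every term on a path back to the root."""
    closure = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closure:
            closure.add(term)
            stack.extend(parents.get(term, ()))
    return closure


def precision_recall(predicted, reference, parents):
    """Precision and recall on the ancestor-expanded term sets."""
    pred = with_ancestors(predicted, parents)
    ref = with_ancestors(reference, parents)
    if not pred or not ref:
        raise ValueError("both term sets must be non-empty")
    overlap = len(pred & ref)
    return overlap / len(pred), overlap / len(ref)


# Hypothetical toy ontology: child -> list of parents.
parents = {
    "root": [],
    "A": ["root"],
    "B": ["A"],
    "C": ["A"],
    "D": ["root"],
}

precision, recall = precision_recall({"B"}, {"B", "C", "D"}, parents)
print(precision, recall)
```

In the toy case the single predicted term and its ancestors are all contained in the expanded reference set, while two reference terms are missed. A prediction of a parent term therefore counts as correct but lowers recall, which matches the pattern of high precision and lower recall reported for the tools in this section.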


Annotation of Glucocerebrosidase

The following GO-term annotations are taken from Uniprot<ref>http://www.ebi.ac.uk/QuickGO/GProtein?ac=P04062</ref> and are used as reference for a comparison with the different prediction tools.


Accession Term Ontology
GO:0005975 carbohydrate metabolic process biological process
GO:0008219 cell death biological process
GO:0006629 lipid metabolic process biological process
GO:0007040 lysosome organization biological process
GO:0006665 sphingolipid metabolic process biological process
GO:0008152 metabolic process biological process
GO:0005765 lysosomal membrane cellular component
GO:0016020 membrane cellular component
GO:0005764 lysosome cellular component
GO:0043169 cation binding molecular function
GO:0003824 catalytic activity molecular function
GO:0004348 glucosylceramidase activity molecular function
GO:0016798 hydrolase activity, acting on glycosyl bonds molecular function
GO:0016787 hydrolase activity molecular function
GO:0005515 protein binding molecular function

GOPET

GOPET, a Gene Ontology term Prediction and Evaluation Tool, uses homology searches and Support Vector Machines to predict the molecular function GO-terms for sequences of any organism. It was made public in 2006 by Vinayagam et al. <ref>Vinayagam A., et al. GOPET: A tool for automated predictions of Gene Ontology terms. BMC Bioinformatics. 2006; 7: 161.</ref>

Results

GOPET predicts 3 different GO-terms for glucocerebrosidase. Comparison with the reference GO-terms of glucocerebrosidase shows that the annotations made by GOPET are correct: GO-ID 0004348 and GO-ID 0016798 are listed on the website, and GO-ID 0016787, which is not explicitly named on AmiGO, is a parent of GO-ID 0016798 in the molecular function tree and therefore correct as well. GOPET misses two molecular function GO-terms: protein binding and cation binding. Taking into account only the GO-terms of the ontology "molecular function", the results of GOPET have a precision of 1 and a recall of 0.69.

GOid Aspect Confidence GO term Comparison to AmiGO
GO-ID:0016787 F 98% hydrolase activity parent of GO:0016798
GO-ID:0004348 F 97% glucosylceramidase activity listed
GO-ID:0016798 F 97% hydrolase activity acting on glycosyl bonds listed

Pfam

The database Pfam contains protein families and domains and is based on hidden Markov models. It consists of two parts: Pfam-A is curated and therefore contains high quality data whereas Pfam-B is generated automatically. Pfam was presented by Sonnhammer et al. in 1997. <ref>Sonnhammer E., et al., Pfam: A Comprehensive Database of Protein Domain Families Based on Seed Alignments. PROTEINS: Structure, Function, and Genetics 28:405-420(1997)</ref>

Results
Pfam assigns glucocerebrosidase to the "O-Glycosyl hydrolase family 30". To retrieve the GO-annotations for this family, the pfam2go file <ref>http://www.geneontology.org/external2go/pfam2go</ref> of the Gene Ontology website had to be used. The first three GO-terms listed in Pfam are listed in AmiGO as well. The last term, lysosome, is a parent of the GO-term "lysosomal membrane" and is therefore correct as well. The GO-terms in Pfam yield a precision of 1.0 and a recall of 0.74.

Accession Term Ontology Comparison to AmiGO
GO:0004348 glucosylceramidase activity molecular function listed
GO:0006665 sphingolipid metabolic process biological process listed
GO:0007040 lysosome organization biological process listed
GO:0005764 lysosome cellular component parent of GO:0005765
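The lookup in the pfam2go file can be automated with a small parser. The sketch below assumes the usual external2go line layout ("Pfam:<accession> <name> > GO:<term> ; <GO id>", with comment lines starting with "!"). The sample lines are illustrative rather than copied from the real file: the GO terms are those of the table above, while the accession PF02055 and the short name Glyco_hydro_30 are assumptions that should be checked against Pfam. The function name `parse_pfam2go` is hypothetical.

```python
# Illustrative sample in pfam2go layout (not copied from the real file).
pfam2go_sample = """\
!illustrative sample for this sketch
Pfam:PF02055 Glyco_hydro_30 > GO:glucosylceramidase activity ; GO:0004348
Pfam:PF02055 Glyco_hydro_30 > GO:sphingolipid metabolic process ; GO:0006665
Pfam:PF02055 Glyco_hydro_30 > GO:lysosome organization ; GO:0007040
Pfam:PF02055 Glyco_hydro_30 > GO:lysosome ; GO:0005764
"""


def parse_pfam2go(text):
    """Map each Pfam accession to a list of (GO id, GO term) pairs."""
    mapping = {}
    for line in text.splitlines():
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        left, right = line.split(" > ", 1)
        accession = left.split()[0].split(":", 1)[1]
        name, go_id = right.rsplit(" ; ", 1)
        if name.startswith("GO:"):
            name = name[len("GO:"):]
        mapping.setdefault(accession, []).append((go_id.strip(), name))
    return mapping


mapping = parse_pfam2go(pfam2go_sample)
for go_id, term in mapping["PF02055"]:
    print(go_id, term)
```

The resulting list of GO identifiers can then be compared with the reference annotation in the same way as for the other tools.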

ProtFun 2.2

TODO

 Functional category                  Prob     Odds
 Amino_acid_biosynthesis              0.035    1.593
 Biosynthesis_of_cofactors            0.182    2.528
 Cell_envelope                     => 0.504    8.262
 Cellular_processes                   0.032    0.438
 Central_intermediary_metabolism      0.382    6.063
 Energy_metabolism                    0.067    0.740
 Fatty_acid_metabolism                0.027    2.088
 Purines_and_pyrimidines              0.538    2.213
 Regulatory_functions                 0.031    0.191
 Replication_and_transcription        0.126    0.471
 Translation                          0.082    1.863
 Transport_and_binding                0.560    1.365
 Enzyme/nonenzyme                     Prob     Odds
 Enzyme                            => 0.773    2.698
 Nonenzyme                            0.227    0.318
 Enzyme class                         Prob     Odds
 Oxidoreductase (EC 1.-.-.-)          0.083    0.399
 Transferase    (EC 2.-.-.-)          0.228    0.660
 Hydrolase      (EC 3.-.-.-)          0.272    0.859
 Lyase          (EC 4.-.-.-)          0.045    0.961
 Isomerase      (EC 5.-.-.-)          0.011    0.345
 Ligase         (EC 6.-.-.-)          0.017    0.332
 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.054    0.251
 Receptor                             0.027    0.158
 Hormone                              0.001    0.206
 Structural_protein                   0.002    0.087
 Transporter                          0.024    0.222
 Ion_channel                          0.018    0.307
 Voltage-gated_ion_channel            0.004    0.195
 Cation_channel                       0.012    0.268
 Transcription                        0.070    0.550
 Transcription_regulation             0.030    0.237
 Stress_response                      0.085    0.962
 Immune_response                   => 0.153    1.804
 Growth_factor                        0.005    0.376
 Metal_ion_transport                  0.009    0.020


Further Examples of Use

BACR_HALSA RET4_HUMAN INSL5_HUMAN LAMP1_HUMAN A4_HUMAN
GO:0007602
GO:0006810
GO:0006811
GO:0018298
GO:0015992
GO:0050896
GO:0016021
GO:0005886
GO:0016020
GO:0005886
GO:0016020
GO:0016021

References

<references />