Glucocerebrosidase sequence based prediction
In this section, several sequence-based prediction methods are applied to the sequence of glucocerebrosidase. For each prediction tool, the webserver and the required input are indicated. Not all results presented here were obtained from the webservers; some tools were run locally (e.g. PSIPRED, DISOPRED, TMHMM and SignalP), so the results may differ when one tries to reproduce them online.
Contents
- 1 Secondary structure prediction
- 2 Prediction of disordered regions
- 3 Prediction of coiled coils
- 4 Prediction of transmembrane alpha-helices and signal peptides
- 5 Prediction of GO terms
- 6 References
Secondary structure prediction
General
The secondary structure of a protein describes the local conformation of its polypeptide chain, which is constrained by the peptide bond and by hydrogen bonding. Two types of secondary structure dominate the local conformations of a polypeptide chain: alpha (α) helices and beta (β) sheets (cf. Figure 1). Less frequent helix types are, for example, π-helices and 3₁₀-helices. Helices and sheets are stabilized by hydrogen bonds between the backbone atoms of the corresponding residues and are the only regular secondary structure elements present in proteins. Loops and coils are examples of irregular structural elements. <ref>http://en.wikipedia.org/wiki/Biomolecular_structure#Secondary_structure</ref>
PSIPRED
PSIPRED is a method by David T. Jones, published in 1999 in JMB as "Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices". PSIPRED uses a two-stage neural network that analyses the output of PSI-BLAST to predict the secondary structure of a protein. <ref>David T. Jones, Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices, JMB, 1999</ref>
Usage
- Webserver: http://bioinf.cs.ucl.ac.uk/psipred/
- Input: protein sequence in fasta format
Results
Both the online and the local version of PSIPRED were applied to the sequence of glucocerebrosidase. The two runs gave different results, which are compared to the secondary structure given in Uniprot.<ref>http://www.uniprot.org/uniprot/P04062</ref> As the table below shows, the two predictions differ considerably, which may be because the two versions used different parameters. The comparison with the structure listed in Uniprot shows, however, that both predictions deviate considerably from the reference in many regions.
Conf: | 988898954488887622315999999999998641038968865325999649995388
|
online: | CCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCEEEEECCC
|
Conf: | 987898955489988742200466888998986410038977877777863169974474
|
local: | CCCCCCCCCCCCCCCCCEEEEHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCEEEEEECCC
|
uniprot: | ------------------------------------------------EEEE-EEEEEE-
|
AA: | MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNAT
|
Conf: | 558889998889992599996377885421237645688875108995378301079247
|
online: | CCCCCCCCCCCCCCCEEEEEECCCCCCCCCCCCCCCCCCCCCCCEEEECCCCCEEEEEEE
|
Conf: | 148899998788875431100345640022100111177897107840966454557422
|
local: | CCCCCCCCCCCCCCCCCCCCCCCCCCCEEEEEEEEECCCCCCCCEEEECCCCCCCEEEEE
|
uniprot: | -------------EEEEEEEE-----EEEEEEE-EEE----EEEEEEEEEEEEEE--EEE
|
AA: | YCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGF
|
Conf: | 300233899997249999999999860597882105999750588999986666899999
|
online: | EECCCHHHHHHHHCCCHHHHHHHHHHCCCCCCCEEEEEEEEECCCCCCCCCCCCCCCCCC
|
Conf: | 011335889987508927898998851396893001358621344677653324799999
|
local: | CCCCCHHHHHHHHHCCHHHHHHHHHHHCCCCCCEEEEEEEECCCCCCCCCCCCCCCCCCC
|
uniprot: | EE--HHHHHHH----HHHHHHHHHHHH-CCCC---EEEEEEE--EEEEE------EEE--
|
AA: | GGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDD
|
Conf: | 689999994100245289999999971999389971377785612147247999889999
|
online: | CCCCCCCCCHHCCCCCHHHHHHHHHHCCCCCEEEECCCCCCCCCEECCCCCCCCCCCCCC
|
Conf: | 721111368543220024799998733999689957899974220056347854325899
|
local: | CCCCCCCCCCCCCCCHHHHHHHHHHHCCCCCEEEECCCCCCCCCCCCCCCCCCCCCCCCC
|
uniprot: | --------HHHH--HHHHHHHHHHH-----EEEEEEE---HHH----EEEEE-EEEE---
|
AA: | FQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQP
|
Conf: | 922699999999999999975490786872012579899999999986349999999999
|
online: | CCHHHHHHHHHHHHHHHHHHHCCEEEEEEECCCCCCCCCCCCCCCCCCCCCHHHHHHHHH
|
Conf: | 971468799999999967663395143898112789787678873222114422121122
|
local: | CCHHHHHHHHHHHHHHHHHHHCCCCEEEEEEECCCCCCCCCCCCCCCCCCCCCCCCHHHH
|
uniprot: | -HHHHHHHHHHHHHHHHHHH-----EEEEE-----HHH------------HHHHHHHHHH
|
AA: | GDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIA
|
Conf: | 955799851689972999944888873334664149955640224689831699998033
|
online: | HHHHHHHHCCCCCCEEEEEECCCCCCHHHHHHHHCCCHHHHCCCCEEEEEECCCCCCHHH
|
Conf: | 111332310577410134212544556520222238976651151878702212236320
|
local: | HHHHHHHHCCCCCCCEEEEECCCCCCCCCCHHHHCCCHHHHHCCEEEEEECCCCCCCCCC
|
uniprot: | -HHHHHH--CCCCEEEEEEEEEHHH--HHHHHHH--HHHH----EEEEEEE------HHH
|
AA: | RDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAK
|
Conf: | 412688750999509994343699998866567831444255999999996402335772
|
online: | HHHHHHHHCCCCCEEEEECCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHCCEEEEEE
|
Conf: | 011111126898001101210389653344579861210143212566552001100000
|
local: | CCCCCCCCCCCCCCEEHHHHCCCCCCCCCCCCCCCHHHHCCCCHHHHHHHHHHHHHHEEE
|
uniprot: | HHHHHHHH---EEEEEEEEE--------------HHHHHHHHHHHHHHHH--EEEEEEEE
|
AA: | ATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHSIITNLLYHVVGWTDW
|
Conf: | 000169999986689878535895679769986202333102244469939999542389
|
online: | EECCCCCCCCCCCCCCCCCCEEEECCCCEEEECCHHHHHHHHCCCCCCCCEEEEEEECCC
|
Conf: | 023699999860001325228998208703226821232123444679927984435078
|
local: | CCCCCCCCCCCCEECCCCCCEEEEECCCCEEECCCEEEECCCCCCCCCCCEEEEEEEECC
|
uniprot: | E-----------------EEEEEHHH-EEEE-HHHHHHHHHH-------EEEEEEEEE--
|
AA: | NLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQK
|
Conf: | 99028999928998999999099993779999099304998518951899999309
|
online: | CCCEEEEEECCCCCEEEEEECCCCCCEEEEEEECCCCEEEEECCCCEEEEEEEEEC
|
Conf: | 99538995649997799999146898214741998642000389842568774139
|
local: | CCCCEEEEECCCCCEEEEEEECCCCCEEEEECCCCCCCCCCCCCCCEEEEEEEECC
|
uniprot: | EEEEEEEE-----EEEEEEE-EEE-EEEEEEECCCEEEEEEE---EEEEEEE----
|
AA: | NDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ
|
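To put a number on how much two rows of the table above differ, one can count the fraction of positions that carry the same state. The following Python sketch does this for two equally long strings. It treats every symbol other than H and E (including the "-" used for positions without annotation in Uniprot) as coil, which is an assumption made here for illustration and not part of PSIPRED or Uniprot. The function names and the short example strings are made up.

```python
def to_three_state(ss: str) -> str:
    """Map a secondary-structure string to H (helix), E (strand) and C (other)."""
    # Assumption: anything that is not H or E, e.g. '-' or 'C', counts as coil.
    return "".join(ch if ch in "HE" else "C" for ch in ss.upper())


def agreement(predicted: str, reference: str) -> float:
    """Return the fraction of positions where both strings share the same state."""
    if len(predicted) != len(reference):
        raise ValueError("both strings must have the same length")
    if not predicted:
        raise ValueError("strings must not be empty")
    pred = to_three_state(predicted)
    ref = to_three_state(reference)
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / len(pred)


# Hypothetical 12-residue example, not taken from the table above.
example_prediction = "CCHHHHCCEEEC"
example_reference = "--HHHH--EEEE"
print(f"agreement: {agreement(example_prediction, example_reference):.3f}")
```

Applied to the full rows of the table, such a score would give one value per method, but it depends on the coil assumption above: regions that simply lack an annotation in Uniprot count as coil.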
Jpred3
Jpred3 was published in 2008 by Christian Cole, Jonathan D. Barber and Geoffrey J. Barton as "The Jpred 3 secondary structure prediction server" in Nucl. Acids Res. The Jnet algorithm predicts the secondary structure and the solvent accessibility of a protein with the help of alignment profiles. For this, it uses the position-specific scoring matrix (PSSM) created by PSI-BLAST and a hidden Markov model. The final prediction is made with a neural network.<ref>http://nar.oxfordjournals.org/content/36/suppl_2/W197.full</ref>
As input, either the protein sequence or a multiple sequence alignment is taken. We decided to take the protein sequence.
Usage
- Webserver: http://www.compbio.dundee.ac.uk/www-jpred/index.html
- Input: protein sequence in fasta format
Results
Jpred3 predicts the majority of the secondary structure elements correctly: sheets and helices are rarely confused, and it is equally rare that a secondary structure element is missed or that a region is wrongly assigned to a structure. The prediction used many BLAST hits with an E-value of 0 and one hit with an E-value of 2e-52: 2wkl, 3keh, 3ke0, 3gxm, 3gxi, 3gxf, 3gxd, 2wcg, 2vt0, 2v3f, 2v3e, 2v3d, 2nt1, 2nt0, 2nsx, 2j25, 2f61, 1y7v, 1ogs (self-hit), 2wnw. Of these proteins, all but 2wnw are glucocerebrosidase structures from Homo sapiens. The latter is a hydrolase activated by a transcription factor from Salmonella typhimurium, which seems to be a glucocerebrosidase as well. Because of the high sequence identity, the prediction is very good. Some examples of the proteins used are shown in Figures 2 to 5.
OrigSeq | MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANH
|
Jnet | -----------------HHHHHHHHHHHHHHHHHHHH---------------EEEEE-----------------EEEEEEE------------------
|
jhmm | ----------------HHHHHHHHHHHHHHHHHHHHH---------------EEEEE-----------------EEEEEEE------------------
|
jpssm | ------------------HHHHHHHHHHHHHHHHHH---------------EEEEEE-----------------EEEEEEE------------------
|
uniprot | ------------------------------------------------EEEE-EEEEEE--------------EEEEEEEE-----EEEEEEE-EEE--
|
OrigSeq | TGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIP
|
Jnet | ----EEEEEE----EEEEEEEEEEHHHHHHHHH----HHHHHHHHHHH-------EEEEEE----------------------------------HHHH
|
jhmm | ----EEEEE-----EEEEEEEEEEEHHHHHHHH----HHHHHHHHHHH-------EEEEEE----------------------------------HHHH
|
jpssm | ----EEEEEE----EEEEEEEE-HHHHHHHHHH----HHHHHHHHHHH------EEEEEEEE--------------------------------HHHHH
|
uniprot | --EEEEEEEEEEEEEE--EEEEE--HHHHHHH----HHHHHHHHHHHH-CCCC---EEEEEEE--EEEEE------EEE----------HHHH--HHHH
|
OrigSeq | LIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRD
|
Jnet | HHHHHHHH----EEEEE--------EE-------------------HHHHHHHHHHHHHHHHH----EEEEE---------------------HHHHHH
|
jhmm | HHHHHHHH----EEEEE-----------------------------HHHHHHHHHHHHHHHHH----EEEEE---------------------HHHHHH
|
jpssm | HHHHHHHHH---EEEEE--------EEE-----------------HHHHHHHHHHHHHHHHHH----EEEEEE--------------------HHHHHH
|
uniprot | HHHHHHH-----EEEEEEE---HHH----EEEEE-EEEE----HHHHHHHHHHHHHHHHHHH-----EEEEE-----HHH------------HHHHHHH
|
OrigSeq | FIARDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSW
|
Jnet | HHHHHHHHHHHH-----EEEEEE--------HHHHHHH--HHHHHHHHEEEE----------HHHHHHHHHH-----EEEEEEEE--------------
|
jhmm | HHHHHHHHHHHH------EEEEE-------HHHHHHHH--H-HHHHHHEEEE----------HHHHHHHHHH-----EEEEEEEE--------------
|
jpssm | HHHHHHHHHHHH-----EEEEEEE-------HHHHHH----HHHHHH--EE-----------HHHHHHHHHH-----EEEEEEE--------------H
|
uniprot | HHH-HHHHHH--CCCCEEEEEEEEEHHH--HHHHHHH--HHHH----EEEEEEE------HHHHHHHHHHH---EEEEEEEEE--------------HH
|
OrigSeq | DRGMQYSHSIITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAVALMHPDGSA
|
Jnet | HHHHHHHHHHHHHHHHHHHHHHHHHHH----------------EEEEE----EEEE---HHHHHHHH-------EEEEE-------EEEEEEE-----E
|
jhmm | -HHHHHHHHHHHHHHHHHHHHHHHHHH----------------EEEEE----EEEE---HHHHHHHH-------EEEE--------EEEEEEE-----E
|
jpssm | HHHHHHHHHHHHHHHHHHHHHHHHHHE----------------EEEEE----EEEE--HHHHHHHH--------EEEEEE------EEEEEEEE----E
|
uniprot | HHHHHHHHHHHHHH--EEEEEEEEE-----------------EEEEEHHH-EEEE-HHHHHHHHHH-------EEEEEEEEE--EEEEEEEE-----EE
|
OrigSeq | VVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ
|
Jnet | EEEEEE-----EEEEEEE---EEEEEEE----EEEEEEE--
|
jhmm | EEEEEE-----EEEEEEE---EEEEEEE----EEEEEEE--
|
jpssm | EEEEEE----EEEEEEEE---EEEEEEE---EEEEEEE---
|
uniprot | EEEEE-EEE-EEEEEEECCCEEEEEEE---EEEEEEE----
|
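The rows of such a table are easier to compare when each string is reduced to a list of elements with start and end positions. The following Python sketch collapses a secondary-structure string into runs. The function name is made up, and positions are counted from 1 within the string that is passed in, so an offset has to be added for later blocks of the table.

```python
from itertools import groupby


def segments(ss, states="HE"):
    """Return (state, start, end) for every run of a state listed in `states`.

    Positions are 1-based and inclusive, relative to the start of `ss`.
    """
    result = []
    position = 1
    for state, run in groupby(ss):
        length = len(list(run))
        if state in states:
            result.append((state, position, position + length - 1))
        position += length
    return result


# Hypothetical short string in the notation used by the table above.
for state, start, end in segments("--HHHH--EEE-"):
    print(state, start, end)
```

Two predictions can then be compared element by element, for example by checking whether every strand in one list overlaps a strand in the other.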
Comparison with DSSP
DSSP, which stands for Define Secondary Structure of Proteins, was published by Wolfgang Kabsch and Chris Sander in 1983 with the title "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features".<ref>http://swift.cmbi.ru.nl/gv/dssp/</ref>
The DSSP algorithm recognises the secondary structure of a protein by identifying backbone hydrogen bonds with an electrostatic criterion. The patterns formed by these hydrogen bonds determine which of the eight possible secondary structure types is assigned. DSSP is therefore a secondary structure assignment tool, rather than a prediction tool. <ref>http://en.wikipedia.org/wiki/DSSP_%28protein%29</ref>
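The electrostatic definition from the paper by Kabsch and Sander can be written as E = q1 · q2 · (1/r(ON) + 1/r(CH) - 1/r(OH) - 1/r(CN)) · f, with q1 = 0.42e, q2 = 0.20e and f = 332, which gives E in kcal/mol for distances in Ångström. A hydrogen bond is assumed if E is below -0.5 kcal/mol. The Python sketch below only evaluates this formula for four given distances. The function names and the example distances are illustrative assumptions, and the sketch does not reproduce the rest of DSSP (placing the hydrogen atoms, detecting bond patterns).

```python
Q1 = 0.42        # partial charge on the C=O group, in units of e
Q2 = 0.20        # partial charge on the N-H group, in units of e
F = 332.0        # conversion factor, gives kcal/mol for distances in Angstrom
CUTOFF = -0.5    # kcal/mol; more negative energies count as hydrogen bonds


def hbond_energy(r_on, r_ch, r_oh, r_cn):
    """Electrostatic energy between a C=O and an N-H group (distances in Angstrom)."""
    return Q1 * Q2 * F * (1.0 / r_on + 1.0 / r_ch - 1.0 / r_oh - 1.0 / r_cn)


def is_hbond(r_on, r_ch, r_oh, r_cn):
    """Apply the energy cutoff to decide whether the pair is hydrogen bonded."""
    return hbond_energy(r_on, r_ch, r_oh, r_cn) < CUTOFF


# Illustrative distances for a close C=O ... H-N pair (not measured from 1ogs).
energy = hbond_energy(r_on=2.9, r_ch=3.3, r_oh=1.9, r_cn=4.1)
print(f"E = {energy:.2f} kcal/mol, hydrogen bond: {energy < CUTOFF}")
```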
Usage
- Webserver: http://swift.cmbi.ru.nl/servers/html/ (Protein Analysis --> Secondary Structure, symmetry and accessibility)
- Input: pdb-File
Results
The result of DSSP consists of four lines: the first line gives the amino acid sequence, the second the secondary structure, the third the residues involved in symmetry contacts (marked with an asterisk (*)) and the fourth the solvent-accessible residues, which are marked with an A. The secondary structure consists of the elements H, T, 3 and S, where H indicates an alpha helix, T a hydrogen-bonded turn, 3 a residue in an isolated beta-bridge and S a bend, which is a region of high curvature.
10 20 30 40 50 60
| | | | | |
1 - 60 ARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANH
1 - 60 SSS TTTTSSSSSSTT TTSSSSSSSSTTT TSSSSSS TT
1 - 60 * *
1 - 60 AAA AAAAAAAA A AA A A A AA A AA A AAAA
70 80 90 100 110 120
| | | | | |
61 - 120 TGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIR
61 - 120 TSSSSSSSSSSSSS SSSSS HHHHHHHTTT HHHHHHHHHHHHTTTTT SSS
61 - 120 * * *
61 - 120 A A A A AAAA A A A AAA A A AA
130 140 150 160 170 180
| | | | | |
121 - 180 VPMASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWT
121 - 180 SSST TTTTT T TTT TT TT HHHHTTHHHHHHHHHHH TT SSSSSST
121 - 180 ** * **
121 - 180 AAA AAAA AA AA A AA A AA AAA A A
190 200 210 220 230 240
| | | | | |
181 - 240 SPTWLKTNGAVNGKGSLKGQPGDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGL
181 - 240 333 TT TTTTT TT TTTHHHHHHHHHHHHHHHHHHHTT TSSST TTTT333
181 - 240 ***** *
181 - 240 AAAAAA A AAA AAA A A A A AAA A A
250 260 270 280 290 300
| | | | | |
241 - 300 LSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPE
241 - 300 TTT T HHHHHHHHHHTHHHHHHTTTTTTTSSSSSSSS333TTHHHHHHHTTHH
241 - 300 ** * *
241 - 300 AAAAA A AA A A A AA A AA A A A A A A AA
310 320 330 340 350 360
| | | | | |
301 - 360 AAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRG
301 - 360 HHTT SSSSSSSTTT HHHHHHHHHHH TTTSSSSSSSS TTT T TT HHHH
301 - 360 ** * * **
301 - 360 AA A AA A AA A AA AA AAA A A A
370 380 390 400 410 420
| | | | | |
361 - 420 MQYSHSIITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHL
361 - 420 HHHHHHHHHHHHTTSSSSSSSST TTT TT TSSSS333TSSSS HHHHHH
361 - 420 *** *** **
361 - 420 A AA AAA A A A AAA
430 440 450 460 470 480
| | | | | |
421 - 480 GHFSKFIPEGSQRVGLVASQKNDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFL
421 - 480 HHHHTT TT SSSSSSSTT TSSSSSSS TTT SSSSSSS TTT SSSSSSSTTTSSS
421 - 480 ***** * *
421 - 480 A AAAAAAA AAA A AAA A A AA A
490
|
481 - 497 ETISPGYSIHTYLWHRQ
481 - 497 SSSS TTSSSSSSS
481 - 497 *
481 - 497 A A A AA
500 510 520 530 540 550
| | | | | |
498 - 557 ARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANH
498 - 557 SS TTTT SSSS TT T TTSSSSSSSSTTT TSSSSSS T
498 - 557 * * * * ****** *
498 - 557 AAA AAAAAAAA A AAAAAA AA A AA A A A A AAAA
560 570 580 590 600 610
| | | | | |
558 - 617 TGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIR
558 - 617 TTSSSSSSSSSSSSS SSSSS HHHHHHHHTT HHHHHHHHHHHHTTTTT SSS
558 - 617 * *
558 - 617 AAAA A A AAAA A A AA AAA A A A
620 630 640 650 660 670
| | | | | |
618 - 677 VPMASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWT
618 - 677 SSST TTTTT T TTT TT TT HHHHTTHHHHHHHHHHH TT SSSSSST
618 - 677 *** * *
618 - 677 AAA AAAA A AA A A AAA AA AAA AAA
680 690 700 710 720 730
| | | | | |
678 - 737 SPTWLKTNGAVNGKGSLKGQPGDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGL
678 - 737 333 TT TTTTT TT TTTHHHHHHHHHHHHHHHHHHHTT TSSST TT33333
678 - 737 *
678 - 737 A AAA A A AAA A A A A AAA A A
740 750 760 770 780 790
| | | | | |
738 - 797 LSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPE
738 - 797 TTT T HHHHHHHHHHTHHHHHHTTTTTTTSSSSSSSS333TTHHHHHHHTTHH
738 - 797 *** **
738 - 797 AAA A A A AA A A AA A AA A A A A A A AA
800 810 820 830 840 850
| | | | | |
798 - 857 AAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRG
798 - 857 HHTT SSSSSSSTTT HHHHHHHHHHH TTTSSSSSSSST TTTT T TT HHHH
798 - 857 ** * ** * **
798 - 857 A A AA A AA A AAA AA AAA A A A
860 870 880 890 900 910
| | | | | |
858 - 917 MQYSHSIITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHL
858 - 917 HHHHHHHHHHHHTTSSSSSSSST TTT TT TSSSS333TSSSS HHHHHH
858 - 917 * ** **
858 - 917 A AA AA AAAA AAA
920 930 940 950 960 970
| | | | | |
918 - 977 GHFSKFIPEGSQRVGLVASQKNDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFL
918 - 977 HHHHTT TT SSSSSSSTT TSSSSSSS TTT SSSSSSS TTT SSSSSSSTTTSSS
918 - 977
918 - 977 A AAAAAAA AA A AAA A A A A
980 990
| |
978 - 994 ETISPGYSIHTYLWHRQ
978 - 994 SSSS TTSSSSSSS
978 - 994
978 - 994 A A AA
The sequence differs from the GBA sequence used above, because DSSP takes the PDB file (1ogs) as input, which contains the sequence without the signal peptide and both chains of the protein. In the following table, the secondary structure of chain A, as assigned by DSSP, is compared to the secondary structure listed in Uniprot. The DSSP result agrees well with Uniprot: most helices are assigned correctly, and the beta sheets are mostly marked as bends. Only rarely are helices and sheets mixed up with turns.
uniprot | ----------EEEE-EEEEEE--------------EEEEEEEE-----EEEEEEE-EEE-
|
DSSP | -----SSS-TTTTSSSSSSTT------------TTSSSSSSSSTTT--TSSSSSS--TT-
|
uniprot | ---EEEEEEEEEEEEEE--EEEEE--HHHHHHH----HHHHHHHHHHHH-CCCC---EEE
|
DSSP | TSSSSSSSSSSSSS--SSSSS--HHHHHHHTTT-HHHHHHHHHHHHTTTTT---SSS
|
uniprot | EEEE--EEEEE------EEE----------HHHH--HHHHHHHHHHH-----EEEEEEE-
|
DSSP | SSST--TTTTT---T--TTT-TT-TT----HHHHTTHHHHHHHHHHH-TT--SSSSSST-
|
uniprot | --HHH----EEEEE-EEEE----HHHHHHHHHHHHHHHHHHH-----EEEEE-----HHH
|
DSSP | --333-TT-TTTTT---TT-TTTHHHHHHHHHHHHHHHHHHHTT---TSSST-TTTT333
|
uniprot | ------------HHHHHHHHHH-HHHHHH--CCCCEEEEEEEEEHHH--HHHHHHH--HH
|
DSSP | TTT--T------HHHHHHHHHHTHHHHHHTTTTTTTSSSSSSSS333TTHHHHHHHTTHH
|
uniprot | HH----EEEEEEE------HHHHHHHHHHH---EEEEEEEEE--------------HHHH
|
DSSP | HHTT--SSSSSSSTTT---HHHHHHHHHHH-TTTSSSSSSSS----TTT-T--TT-HHHH
|
uniprot | HHHHHHHHHHHH--EEEEEEEEE-----------------EEEEEHHH-EEEE-HHHHHH
|
DSSP | HHHHHHHHHHHHTTSSSSSSSST---TTT---TT------TSSSS333TSSSS-HHHHHH
|
uniprot | HHHH-------EEEEEEEEE--EEEEEEEE-----EEEEEEE-EEE-EEEEEEECCCEEE
|
DSSP | HHHHTT--TT-SSSSSSSTT--TSSSSSSS-TTT-SSSSSSS-TTT-SSSSSSSTTTSSS
|
uniprot | EEEE---EEEEEEE---
|
DSSP | SSSS-TTSSSSSSS---
|
Discussion
Comparing PSIPRED, Jpred3 and DSSP, the last two show the best results. The good result of Jpred3 is probably due to the many hits with an E-value of 0 that it uses: these are sequences with the same structure, so a good prediction is to be expected. DSSP also gives a very good result that is highly consistent with the annotation in Uniprot, as one would expect from a tool that assigns the secondary structure from a solved structure instead of predicting it.
Prediction of disordered regions
General
Disordered regions are regions with no fixed secondary and therefore no fixed tertiary structure. The Ramachandran angles of these regions are very flexible, which means that they can adopt several different secondary structures, often depending on binding to a substrate.
There are two types of disordered regions: extended (random-coil-like) ones and collapsed (molten-globule-like) ones. They have many different functions; for example, they are responsible for DNA/RNA/protein recognition or for the specificity and affinity of binding. They are often involved in regulatory functions.<ref>http://www.pondr.com/pondr-tut1.html</ref>
DISOPRED
DISOPRED was published by Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF and Jones DT in 2004 in the Journal of Molecular Biology with the title "Prediction and functional analysis of native disorder in proteins from the three kingdoms of life".<ref>http://bioinf.cs.ucl.ac.uk/disopred/</ref>
The method is based on a neural network which was trained with the SVMlight support vector machine package. DISOPRED first runs PSI-BLAST against a filtered sequence database and then uses the position-specific scoring matrix from the final iteration to generate its inputs.<ref>http://cms.cs.ucl.ac.uk/typo3/fileadmin/bioinf/Disopred/disopred_help.html</ref>
Usage
- Webserver: http://bioinf.cs.ucl.ac.uk/disopred/
- Input: protein sequence in fasta format, false positive threshold
Results
DISOPRED predictions for a false positive rate threshold of 5% are shown in the table below. Asterisks (*) represent disorder predictions and dots (.) predictions of order. The estimated confidence gives a rough indication of the probability that each residue is disordered.
The signal peptide is marked as a disordered region with very high confidence, which is quite interesting. In addition, part of a beta sheet was predicted as a disordered region, which may indicate that the prediction is wrong at least in this part.
conf | 999999999988777630000000000000000000000000000000000000000000
|
pred | ****************............................................
|
AA | MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNAT
|
conf | 000000000000000000000001456676677776543210000000000000000000
|
pred | .........................************.......................
|
AA | YCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGF
|
conf | 000000000000000000000000000000000000000000000000000000000000
|
pred | ............................................................
|
AA | GGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDD
|
conf | 000000000000000000000000000000000000000000000000000021000000
|
pred | ............................................................
|
AA | FQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQP
|
conf | 000000000000000000000000000000000000000000000000000000000000
|
pred | ............................................................
|
AA | GDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIA
|
conf | 000000000000000000000000000000000000000000000000000000000000
|
pred | ............................................................
|
AA | RDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAK
|
conf | 000000000000000000000000000000000000000000000000000000000000
|
pred | ............................................................
|
AA | ATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHSIITNLLYHVVGWTDW
|
conf | 000000000000000000000000000000000000000000000000000000022331
|
pred | ............................................................
|
AA | NLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQK
|
conf | 00000000000000000000000000000000000000000000000000000004
|
pred | ........................................................
|
AA | NDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ
|
POODLE
POODLE stands for Prediction Of Order and Disorder by machine LEarning. It was developed by S. Hirose, K. Shimizu, N. Inoue, S. Kanai and T. Noguchi and published in 2008 in the CASP8 Proceedings as "Disordered region prediction by integrating POODLE series".
It consists of three predictions:
- short disorder regions prediction (POODLE-S: short disorder regions, missing region in X ray structure or high B-factor region)
- long disorder regions prediction (POODLE-L: mainly longer than 40 consecutive amino acids)
- unfolded protein prediction
POODLE-I is based on a work-flow approach. POODLE uses a machine learning approach and only needs the amino acid sequence for prediction.<ref>http://mbs.cbrc.jp/poodle/help.html</ref>
Usage
- Webserver: http://mbs.cbrc.jp/poodle/poodle.html
- Input: protein sequence in fasta format
POODLE-I
The amino acids predicted as disordered are listed in the table below. Furthermore a visualization of the disorder probability of the amino acids is shown in Figure 6 to the right. Two regions have been predicted as disordered: the first region once again is part of the signal peptide and the second region is part of a beta sheet.
no. | AA | ORD/DIS | Prob. |
1 | M | D | 0.851 |
2 | E | D | 0.813 |
3 | F | D | 0.781 |
4 | S | D | 0.753 |
5 | S | D | 0.757 |
6 | P | D | 0.813 |
7 | S | D | 0.777 |
8 | R | D | 0.743 |
9 | E | D | 0.699 |
10 | E | D | 0.707 |
11 | C | D | 0.711 |
12 | P | D | 0.656 |
13 | K | D | 0.633 |
14 | P | D | 0.609 |
15 | L | D | 0.581 |
16 | S | D | 0.553 |
17 | R | D | 0.535 |
18 | V | D | 0.51 |
... | |||
95 | I | D | 0.58 |
96 | Q | D | 0.812 |
97 | A | D | 0.892 |
98 | N | D | 0.744 |
99 | H | D | 0.551 |
POODLE-S
The residues predicted as disordered are listed in the table below, where D indicates a disordered residue and O an ordered residue. Figures 7 and 8 to the right illustrate the different disorder probabilities. POODLE-S also predicts a part of the signal peptide as disordered. One prediction is based on looking for regions that are missing in the X-ray structure, the other on regions with a high B-factor. With these, some further disordered regions are found in the rear part of the protein. The first one is located in a region without a defined secondary structure element in Uniprot, the second one is part of a helix.
no. | AA | ORD/DIS/xray | Prob./xray | ORD/DIS/B-factor | Prob./B-factor |
1 | M | D | 0.764 | D | 1 |
2 | E | D | 0.707 | D | 0.901 |
3 | F | D | 0.731 | D | 0.882 |
4 | S | D | 0.759 | D | 0.875 |
5 | S | D | 0.764 | D | 0.847 |
6 | P | D | 0.75 | D | 0.825 |
7 | S | D | 0.782 | D | 0.811 |
8 | R | D | 0.751 | D | 0.791 |
9 | E | D | 0.707 | D | 0.711 |
10 | E | D | 0.692 | D | 0.661 |
11 | C | D | 0.679 | D | 0.568 |
12 | P | D | 0.66 | D | 0.596 |
13 | K | D | 0.633 | D | 0.557 |
14 | P | D | 0.607 | D | 0.517 |
15 | L | D | 0.588 | O | 0.493 |
16 | S | D | 0.55 | O | 0.489 |
17 | R | D | 0.523 | O | 0.424 |
18 | V | D | 0.534 | O | 0.391 |
... | |||||
100 | T | O | 0.368 | D | 0.571 |
101 | G | O | 0.29 | D | 0.54 |
102 | T | O | 0.243 | D | 0.555 |
... | |||||
360 | K | D | 0.527 | O | 0.133 |
361 | A | D | 0.534 | O | 0.138 |
362 | T | D | 0.577 | O | 0.162 |
363 | L | D | 0.594 | O | 0.162 |
364 | G | D | 0.513 | O | 0.163 |
POODLE-L
POODLE-L did not predict any disordered regions (cf. Figure 9), which indicates that there are no disordered regions with a length of at least 40 residues in the protein.
IUPred
IUPred is a method established by Zsuzsanna Dosztányi, Veronika Csizmók, Péter Tompa and István Simon, which was published in 2005 in Bioinformatics with the title "IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content".
The idea of IUPred is to estimate the ability of polypeptides to form stabilizing contacts. This ability depends on the surrounding sequence and its chemical properties. Intrinsically unstructured regions (IUs) cannot form such contacts. A 20 by 20 energy predictor matrix is used to calculate the energies and, from them, the probability of IUs.<ref>http://iupred.enzim.hu/Theory.html</ref>
For IUPred one of the three following options can be chosen<ref>http://iupred.enzim.hu/Help.html</ref>:
- long disorder: predicts context-independent global disorder with at least 30 consecutive residues of predicted disorder (neighbourhood of 100 residues is considered)
- short disorder: predicts short, probably context-dependent, disordered regions, e.g. missing residues in the X-ray structure (neighbourhood of 25 residues is considered)
- structured domains: predicts putative structured domains
Usage
- Webserver: http://iupred.enzim.hu/index.html
- Input: protein sequence in fasta format, prediction type
IUPred: short disordered regions
The results of IUPred for short disordered regions are listed in the table below and illustrated in Figure 10 to the right. The tool predicts disordered regions with high probability at the beginning of the sequence, which may be explained by the presence of the signal peptide at this position. It is hard to say whether the other regions are disordered as well, because their probabilities are not very high. A disordered region of only two residues is very short and therefore not very probable. For the end of the sequence, no secondary structure is given in Uniprot, which indicates that a disordered region could be present there.
Position | Residue | Disorder Tendency |
1 | M | 0.9753 |
2 | E | 0.9369 |
3 | F | 0.9280 |
4 | S | 0.9009 |
5 | S | 0.8857 |
6 | P | 0.7869 |
7 | S | 0.7418 |
8 | R | 0.7034 |
9 | E | 0.5992 |
10 | E | 0.5549 |
... | ||
85 | G | 0.5126 |
86 | R | 0.4458 |
87 | R | 0.5412 |
88 | M | 0.5173 |
89 | E | 0.5173 |
90 | L | 0.5992 |
91 | S | 0.5846 |
92 | M | 0.5900 |
93 | G | 0.5900 |
94 | P | 0.5900 |
95 | I | 0.5374 |
... | ||
103 | G | 0.5173 |
104 | L | 0.5084 |
... | ||
533 | W | 0.5514 |
534 | R | 0.5992 |
535 | R | 0.6124 |
536 | Q | 0.6474 |
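Per-residue scores like the ones above can be turned into regions by joining consecutive positions that reach a cutoff and, optionally, discarding very short runs. The following Python sketch illustrates this with the scores of positions 85 to 95 from the table. The cutoff of 0.5, the minimum length and the function name are assumptions chosen for illustration, not settings taken from IUPred.

```python
def disordered_regions(scores, threshold=0.5, min_length=1):
    """Group positions with score >= threshold into (start, end) regions.

    `scores` maps 1-based sequence positions to disorder scores.
    Regions shorter than `min_length` residues are dropped.
    """
    hits = sorted(pos for pos, score in scores.items() if score >= threshold)
    regions = []
    for pos in hits:
        if regions and pos == regions[-1][1] + 1:
            regions[-1][1] = pos          # extend the current region
        else:
            regions.append([pos, pos])    # start a new region
    return [(start, end) for start, end in regions if end - start + 1 >= min_length]


# Scores for positions 85 to 95, copied from the table above.
iupred_short = {
    85: 0.5126, 86: 0.4458, 87: 0.5412, 88: 0.5173, 89: 0.5173, 90: 0.5992,
    91: 0.5846, 92: 0.5900, 93: 0.5900, 94: 0.5900, 95: 0.5374,
}
print(disordered_regions(iupred_short))                 # all runs
print(disordered_regions(iupred_short, min_length=3))   # drop very short runs
```

With a minimum length, isolated residues such as position 85 are dropped, which matches the argument above that very short regions are not very probable.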
IUPred: long disordered regions
The results of IUPred for long disordered regions are listed in the table below and are illustrated in Figure 11 to the right. It predicts some very short disordered regions with very low probabilities, so it is unlikely that these regions are really disordered.
Position | Residue | Disorder Tendency |
1 | M | 0.4864 |
2 | E | 0.5017 |
3 | F | 0.5707 |
... | ||
87 | R | 0.4979 |
88 | M | 0.4979 |
89 | E | 0.4979 |
90 | L | 0.6136 |
91 | S | 0.5901 |
92 | M | 0.5992 |
93 | G | 0.5017 |
... | ||
229 | A | 0.5055 |
230 | V | 0.5055 |
231 | N | 0.5211 |
... | ||
235 | S | 0.5055 |
236 | L | 0.5139 |
IUPred: structured regions
In the structured-domain mode, IUPred predicts a globular domain from position 4 to position 536, so all residues except the first three are predicted as structured. An illustration of the prediction can be seen in Figure 12.
META-Disorder
META-Disorder was published in 2009 by Avner Schlessinger, Marco Punta, Guy Yachdav, Laszlo Kajan and Burkhard Rost in PLoS ONE.<ref>http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0004433</ref> It is a method which combines NORSnet<ref>https://www.rostlab.org/owiki/index.php/Norsnet</ref>, PROFbval<ref>https://rostlab.org/owiki/index.php/Profbval</ref> and Ucon<ref>https://www.rostlab.org/owiki/index.php/UCON</ref> to predict disordered regions. As input only the amino acid sequence is needed.<ref>https://www.rostlab.org/owiki/index.php/Metadisorder</ref>
Usage
- Webserver: http://www.predictprotein.org/ (registration needed)
- Input: protein sequence in fasta format
Results
The results of META-Disorder are shown in the table below and in Figure 13 to the right. The tool predicts a disordered region ranging from amino acid 1 to amino acid 6, which is part of the signal peptide. It is hard to say whether it is really a disordered region, as the beginning of a sequence is often mispredicted as a disordered region.
position | residue | score |
1 | M | 0.636 |
2 | E | 0.596 |
3 | F | 0.596 |
4 | S | 0.586 |
5 | S | 0.561 |
6 | P | 0.551 |
Discussion
Glucocerebrosidase (Uniprot-ID: P04062) is not listed in the Database of Protein Disorder (DisProt) <ref>http://www.disprot.org/index.php</ref>, which suggests that the protein has no known disordered regions. The different prediction methods predict possibly disordered regions in different parts of the protein. Most of these lie within a defined structural element and may therefore be wrong. Furthermore, the predictions hardly overlap, apart from the regions in the signal peptide, where almost all tools show at least a few disordered residues. The signal peptide, however, is not present in the mature protein and therefore does not describe a disordered region in the final structure.
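The overlap between the tools can be checked by counting how many of them call each residue disordered. The Python sketch below does this for three of the tools. The residue sets are transcribed from the tables above (POODLE-I, IUPred short disorder, META-Disorder), under the assumptions that residues not listed there were predicted as ordered and that an IUPred score of at least 0.5 counts as disordered. The function name is made up.

```python
from collections import Counter

# Residues predicted as disordered, transcribed from the tables above.
predictions = {
    "POODLE-I": set(range(1, 19)) | set(range(95, 100)),
    "IUPred (short)": (
        set(range(1, 11)) | {85} | set(range(87, 96))
        | set(range(103, 105)) | set(range(533, 537))
    ),
    "META-Disorder": set(range(1, 7)),
}


def consensus(predictions, min_tools):
    """Return the sorted positions called disordered by at least `min_tools` tools."""
    counts = Counter(pos for residues in predictions.values() for pos in residues)
    return sorted(pos for pos, n in counts.items() if n >= min_tools)


print("all three tools:", consensus(predictions, min_tools=3))
print("at least two tools:", consensus(predictions, min_tools=2))
```

Under these assumptions, only residues at the very beginning of the sequence are shared by all three tools, which is in line with the observation that the predictions agree mainly within the signal peptide.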
Prediction of coiled coils
As an additional structural feature, we tested for coiled coils. A coiled coil is a motif that often plays a role in the regulation of gene expression. It is built from several alpha helices. These usually contain a heptad repeat, a pattern hxxhcxc, where h denotes hydrophobic and c charged amino acid residues. This pattern allows the special folding of coiled coils.<ref>http://en.wikipedia.org/wiki/Coiled_coil</ref> In our case we do not expect any coiled coils, because glucocerebrosidase is not a transcription factor and does not show such helix structures. We used three tools to verify this assumption.
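The heptad pattern hxxhcxc can be illustrated with a naive scan that checks every window of seven residues against it. This is only a sketch of the pattern itself and not the method used by COILS, MultiCoil or Paircoil2, which rely on statistical scores. The residue classes (which amino acids count as hydrophobic or charged) and the function names are assumptions made for this example.

```python
HYDROPHOBIC = set("AILMFVWY")   # assumed set of hydrophobic residues
CHARGED = set("DEKR")           # assumed set of charged residues
PATTERN = "hxxhcxc"             # h = hydrophobic, c = charged, x = any residue


def matches_heptad(window):
    """Check whether a 7-residue window fits the pattern hxxhcxc."""
    if len(window) != len(PATTERN):
        return False
    for residue, cls in zip(window.upper(), PATTERN):
        if cls == "h" and residue not in HYDROPHOBIC:
            return False
        if cls == "c" and residue not in CHARGED:
            return False
    return True


def heptad_hits(sequence):
    """Return the 1-based start positions of all windows that fit the pattern."""
    return [
        i + 1
        for i in range(len(sequence) - len(PATTERN) + 1)
        if matches_heptad(sequence[i:i + len(PATTERN)])
    ]


# Hypothetical peptide with one matching window starting at position 3.
print(heptad_hits("GGLAALEAKGG"))
```

A single matching window says little on its own, because a real coiled coil needs several consecutive heptads.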
COILS
COILS compares the given sequence to a database of known coiled coils. From this comparison it calculates a similarity score and derives the probability of a coiled-coil motif.<ref>Lupas, A., Van Dyke, M., and Stock, J. (1991), Predicting Coiled Coils from Protein Sequences, Science 252:1162-1164</ref> The result can be seen in Figure 14 to the right.
Usage
- Webserver: http://www.ch.embnet.org/software/COILS_form.html
- Input: protein sequence in fasta format; the default parameters were used for matrix and window width
MultiCoil
MultiCoil is based on Paircoil, which predicts coiled coils by pairwise residue correlations.<ref>Bonnie Berger, David B. Wilson, Ethan Wolf, Theodore Tonchev, Mari Milla, and Peter S. Kim, "Predicting Coiled Coils by Use of Pairwise Residue Correlations", Proceedings of the National Academy of Sciences USA, vol 92, aug 1995, pp. 8259-8263.</ref> It extends Paircoil to the prediction of two- and three-stranded coiled coils.<ref>Ethan Wolf, Peter S. Kim, and Bonnie Berger, "MultiCoil: A Program for Predicting Two- and Three-Stranded Coiled Coils", Protein Science 6:1179-1189. June 1997.</ref> The result is illustrated in Figure 15.
Usage
- Webserver: http://groups.csail.mit.edu/cb/multicoil/cgi-bin/multicoil.cgi
- Input: protein sequence in fasta format
Paircoil2
Paircoil2 works like Paircoil or MultiCoil, but is an improved version from 2006 that again uses pairwise residue probabilities together with an updated database.<ref>A.V. McDonnell, T. Jiang, A.E. Keating, B. Berger, "Paircoil2: Improved prediction of coiled coils from sequence", Bioinformatics Vol. 22(3) (2006)</ref> The result can be seen in Figure 16 to the right.
Usage
- Webserver: http://groups.csail.mit.edu/cb/paircoil2/paircoil2.html
- Input: protein sequence in fasta format, minimum search window and p-score as standard
Discussion
None of the three tools (COILS, MultiCoil and Paircoil2) predicted a coiled coil motif. This is not surprising, as Glucocerebrosidase does not contain any coiled coils.
Prediction of transmembrane alpha-helices and signal peptides
General
Transmembrane topology
The topology of a membrane protein is characterized by the number of membrane-spanning segments in the protein. The transmembrane regions of the protein are hydrophobic and have a length of approximately 15-30 residues, which is enough to cross the lipid bilayer of the membrane once. The transmembrane regions are connected by hydrophilic loops which are located outside the membrane. These attributes can be used to predict the transmembrane topology of a protein.
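The hydrophobicity and length criteria described above can be illustrated with a simple sliding-window hydropathy scan. This is only a minimal sketch and not the algorithm used by TMHMM or OCTOPUS: it assumes the Kyte-Doolittle hydropathy scale, a window of 19 residues and a mean hydropathy cutoff of 1.6, and the function names and the toy sequence are made up for illustration.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=19):
    """Mean hydropathy of every full window of the given width."""
    scores = [KD[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def hydrophobic_segments(seq, window=19, threshold=1.6):
    """1-based (start, end) ranges covered by windows at or above the threshold."""
    segments = []
    for i, mean in enumerate(window_hydropathy(seq, window)):
        if mean < threshold:
            continue
        start, end = i + 1, i + window
        if segments and start <= segments[-1][1] + 1:
            # Overlapping or adjacent windows are merged into one segment.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


# Toy sequence (not a real protein): polar flanks around a
# 21-residue hydrophobic stretch at positions 11-31.
toy = "DEKRDEKRDE" + "LIVALLIVAFLLIVALLIVAL" + "DEKRDEKRDE"
print(hydrophobic_segments(toy))
```

Such a scan cannot tell a transmembrane helix from the hydrophobic core of a signal peptide, which is exactly the cross-prediction problem that the combined predictors address.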
Predictors: TMHMM, OCTOPUS
Signal peptides
Signal peptides are located at the N-terminus of a protein sequence and direct the transport of a protein to its correct location. Signal sequences have a typical size of 20-30 residues and a tripartite structure: a hydrophobic core region (h-region) which is flanked by a basic n-region and a slightly polar c-region. The sequence variation among signal sequences affects ER targeting, translocation and signal peptidase cleavage. <ref>Hegde R.S. and Bernstein H.D. (2006) The surprising complexity of signal sequences. Trends Biochem Sci 31(10), 563-71</ref>
Predictors: SignalP, TargetP
Combined transmembrane and signal peptide prediction
The high similarity between the hydrophobic region of a transmembrane helix and that of a signal peptide leads to cross-predictions when conventional transmembrane topology and signal peptide predictors such as TMHMM and SignalP are used. Predictors which are based on submodels for both make fewer errors caused by cross-predictions and help to discriminate against false positives. Furthermore, a predicted signal peptide indicates that the N-terminus of the protein is non-cytoplasmic and therefore helps to assign the orientation of the protein. <ref>Käll L., Krogh A., & Sonnhammer E. L. (2007) Advantages of combined transmembrane topology and signal peptide prediction - the Phobius web server. Nucleic Acids Res., Vol. 35, Web server issue, pp. W429-32</ref>
Predictors: Phobius, PolyPhobius, SPOCTOPUS
Topology of Glucocerebrosidase
Glucocerebrosidase is located in the lysosome and does not contain any transmembrane regions. Residues 1 to 39 are part of a signal peptide <ref>http://www.uniprot.org/uniprot/P04062</ref>. Therefore the prediction tools should classify the protein as non-cytoplasmic and find a signal peptide, but no transmembrane helices.
TMHMM
TMHMM is a method to predict the transmembrane topology of membrane-spanning proteins. It is based on a hidden Markov model with an architecture of 7 types of states (helix core, helix caps on both sides, one loop on the cytoplasmic side, two loops on the non-cytoplasmic side and a globular domain in the middle of each loop) which correspond to the biological system. The method was established by Sonnhammer et al. in 1998 <ref>Sonnhammer EL, von Heijne G, Krogh A. A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol. 1998;6:175–182.</ref>.
Usage
- Webserver: http://www.cbs.dtu.dk/services/TMHMM/
- Input: protein sequence in fasta format
Results
sp|P04062|GLCM_HUMAN Length: 536
sp|P04062|GLCM_HUMAN Number of predicted TMHs: 0
sp|P04062|GLCM_HUMAN Exp number of AAs in TMHs: 1.77867
sp|P04062|GLCM_HUMAN Exp number, first 60 AAs: 1.62607
sp|P04062|GLCM_HUMAN Total prob of N-in: 0.06840
sp|P04062|GLCM_HUMAN TMHMM2.0 outside 1 536
TMHMM predicts no transmembrane segments for the sequence of glucocerebrosidase (cf. output and Figure 17), which is correct.
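When TMHMM is run locally on many sequences, its output is easier to evaluate after parsing. The following is a minimal sketch that reads the output shown above; the function name and the regular expressions are illustrative assumptions that were only checked against this one output, and a leading `#` on summary lines is tolerated as an assumption about the raw format.

```python
import re

TMHMM_OUTPUT = """\
sp|P04062|GLCM_HUMAN Length: 536
sp|P04062|GLCM_HUMAN Number of predicted TMHs: 0
sp|P04062|GLCM_HUMAN Exp number of AAs in TMHs: 1.77867
sp|P04062|GLCM_HUMAN Exp number, first 60 AAs: 1.62607
sp|P04062|GLCM_HUMAN Total prob of N-in: 0.06840
sp|P04062|GLCM_HUMAN TMHMM2.0 outside 1 536
"""

# "<id> TMHMM2.0 <location> <start> <end>"
SEGMENT_RE = re.compile(r"^(\S+)\s+TMHMM\S*\s+(inside|outside|TMhelix)\s+(\d+)\s+(\d+)$")
# "<id> <description>: <number>"
SUMMARY_RE = re.compile(r"^(\S+)\s+(.+?):\s+([\d.]+)$")


def parse_tmhmm(text):
    """Return (summary dict, list of (location, start, end)) for one sequence."""
    summary, segments = {}, []
    for raw in text.splitlines():
        line = raw.lstrip("#").strip()
        segment = SEGMENT_RE.match(line)
        if segment:
            _, location, start, end = segment.groups()
            segments.append((location, int(start), int(end)))
            continue
        entry = SUMMARY_RE.match(line)
        if entry:
            _, key, value = entry.groups()
            summary[key] = float(value)
        # Any other line is ignored.
    return summary, segments


summary, segments = parse_tmhmm(TMHMM_OUTPUT)
helices = [seg for seg in segments if seg[0] == "TMhelix"]
print(int(summary["Number of predicted TMHs"]), helices, segments)
```

For glucocerebrosidase the list of helices is empty and the whole chain (1 to 536) is reported as "outside".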
Phobius and PolyPhobius
Phobius is based on a hidden Markov model which contains submodels for both transmembrane helices and signal peptides and therefore discriminates better between the two kinds of segments than predictors for only one of them. This method was presented in 2004 by Käll et al. <ref>Käll L., et al. A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol. 2004;338:1027–1036.</ref> PolyPhobius is a prediction server that additionally uses an algorithm to include homology information. The performance of transmembrane topology and signal peptide prediction is increased by incorporating extra support from homologs. <ref>Käll L., et al. An HMM posterior decoder for sequence feature prediction that includes homology information Bioinformatics, 21 (Suppl 1):i251-i257, June 2005.</ref>
Usage - Phobius
- Webserver: http://phobius.sbc.su.se/
- Input: Protein sequence in fasta format
Results - Phobius
ID sp|P04062|GLCM_HUMAN
FT SIGNAL 1 39
FT REGION 1 19 N-REGION.
FT REGION 20 31 H-REGION.
FT REGION 32 39 C-REGION.
FT TOPO_DOM 40 536 NON CYTOPLASMIC.
Phobius predicts a signal peptide ranging from amino acid 1 to amino acid 39 of the sequence of glucocerebrosidase. This agrees with the information given on Uniprot <ref>http://www.uniprot.org/uniprot/P04062</ref> that the protein has a 39 residue signal sequence. The presence of a signal peptide explains the differences between the sequences of sp|P04062|GLCM_HUMAN and its corresponding PDB structure 1OGS: the sequence of 1OGS is 39 amino acids shorter than the sequence of sp|P04062|GLCM_HUMAN, as the signal peptide is missing in the mature structure. The prediction that the protein is non-cytoplasmic is correct, as glucocerebrosidase is located in the lysosome.
Usage - PolyPhobius
- Webserver: http://phobius.sbc.su.se/poly.html
- Input: Protein sequence in fasta format
Results - PolyPhobius
ID sp|P04062|GLCM_HUMAN
FT SIGNAL 1 39
FT REGION 1 23 N-REGION.
FT REGION 24 34 H-REGION.
FT REGION 35 39 C-REGION.
FT TOPO_DOM 40 536 NON CYTOPLASMIC.
PolyPhobius returns the same overall prediction as Phobius: a 39 residue signal sequence, with slightly different boundaries for the n-, h- and c-regions.
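The two outputs can be compared programmatically. The sketch below parses the FT lines shown above and reports the features whose ranges differ; the function names are hypothetical, and keying the features by their label only works when each label occurs once, as is the case here (a multi-pass membrane protein would need a list instead).

```python
PHOBIUS = """\
ID sp|P04062|GLCM_HUMAN
FT SIGNAL 1 39
FT REGION 1 19 N-REGION.
FT REGION 20 31 H-REGION.
FT REGION 32 39 C-REGION.
FT TOPO_DOM 40 536 NON CYTOPLASMIC.
"""

POLYPHOBIUS = """\
ID sp|P04062|GLCM_HUMAN
FT SIGNAL 1 39
FT REGION 1 23 N-REGION.
FT REGION 24 34 H-REGION.
FT REGION 35 39 C-REGION.
FT TOPO_DOM 40 536 NON CYTOPLASMIC.
"""


def parse_features(text):
    """Map each feature label to its 1-based (start, end) range."""
    features = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "FT":
            continue  # skip the ID line and anything unexpected
        kind, start, end = fields[1], int(fields[2]), int(fields[3])
        # Use the trailing description if present, otherwise the feature key.
        label = " ".join(fields[4:]).rstrip(".") or kind
        features[label] = (start, end)
    return features


def differing_features(first, second):
    """Features present in both predictions but with different ranges."""
    return {
        label: (first[label], second[label])
        for label in first
        if label in second and first[label] != second[label]
    }


phobius = parse_features(PHOBIUS)
polyphobius = parse_features(POLYPHOBIUS)
for label, (a, b) in differing_features(phobius, polyphobius).items():
    print(label, "Phobius:", a, "PolyPhobius:", b)
```

Only the three sub-regions of the signal peptide differ; the signal peptide itself and the non-cytoplasmic domain are identical in both predictions.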
OCTOPUS and SPOCTOPUS
OCTOPUS (obtainer of correct topologies for uncharacterized sequences) was developed in 2007 by Viklund et al. The method combines hidden Markov models and artificial neural networks. Furthermore, OCTOPUS is the first method that integrates the modeling of reentrant, membrane dip and TM hairpin regions. <ref>Viklund, H., and A. Elofsson. 2008. OCTOPUS: improving topology prediction by two-track ANN-based preference scores and an extended topological grammar. Bioinformatics 24:1662-1668.</ref> SPOCTOPUS is an extension of the OCTOPUS algorithm which additionally predicts signal peptides in order to reduce the misprediction of transmembrane regions as signal peptides and vice versa. The method was first described by Viklund et al. in 2008. <ref>Viklund, H., et al. 2008. SPOCTOPUS: a combined predictor of signal peptides and membrane protein topology Bioinformatics 24:2928-2929.</ref>
Usage
- Webserver: http://octopus.cbr.su.se/index.php
- Input: Protein sequence in fasta format
OCTOPUS - Results
OCTOPUS predicts a transmembrane segment ranging from amino acid 16 to amino acid 36. The region from amino acid 1 to 15 is predicted to be cytoplasmic and the remaining amino acids (from amino acid 37) are indicated to be non cytoplasmic. This is an example for a misclassification between transmembrane segments and signal peptides when a prediction tool for only transmembrane segments is used.
SPOCTOPUS - Results
SPOCTOPUS correctly identifies the regions from amino acid 1 to amino acid 39 as a signal peptide. The combination of a signal peptide and transmembrane segment prediction helps to eliminate the misclassification made by a transmembrane segment predictor (cf. results of OCTOPUS).
SignalP
SignalP uses a combination of several artificial neural networks and hidden Markov models to predict the presence and location of signal peptide cleavage sites of three different organism groups: eukaryotes, Gram-negative and Gram-positive bacteria. The current version 3.0 was published in 2004 by Bendtsen et al. <ref>Bendtsen JD, Nielsen H, von Heijne G, Brunak S. Improved prediction of signal peptides: SignalP 3.0. J Mol Biol. 2004;340:783–795.</ref>
Usage
- Webserver: http://www.cbs.dtu.dk/services/SignalP/
- Input: protein sequence in fasta format, organism group, method
Results
>sp|P04062|GLCM_HUMAN
Prediction: Signal peptide
Signal peptide probability: 0.516
Signal anchor probability: 0.001
Max cleavage site probability: 0.423 between pos. 39 and 40
SignalP predicts the signal peptide of glucocerebrosidase correctly. Even the cleavage site is located correctly between amino acid 39 and amino acid 40. The location of the different regions of the signal peptide can be seen in the illustration to the right.
TargetP
TargetP, a neural network-based tool, predicts the subcellular location of eukaryotic proteins based on the predicted presence of N-terminal presequences such as mitochondrial targeting peptides, chloroplast transit peptides and secretory pathway signal peptides. For the latter two, potential cleavage sites can be predicted, as TargetP uses ChloroP and SignalP. This method, which only uses N-terminal sequence information, was published in 2000 by Emanuelsson et al. <ref>Emanuelsson O., et al. Predicting Subcellular Localization of Proteins Based on their N-terminal Amino Acid Sequence. J.Mol.Biol. (2000) 300, 1005-1016.</ref>
Usage
- Webserver: http://www.cbs.dtu.dk/services/TargetP/
- Input: protein sequence in fasta format, cutoff, organism group
Results
Name Len mTP SP other Loc RC TPlen
sp_P04062_GLCM_HUMAN 536 0.091 0.364 0.612 _ 4 -
cutoff 0.000 0.000 0.000
TargetP does not predict the signal peptide of glucocerebrosidase. Although the score for a signal peptide (0.364) is much higher than the one for a mitochondrial targeting peptide (0.091), the highest score is the one for "other" (0.612). TargetP therefore assigns the protein, with low reliability (reliability class 4), to a location other than chloroplast, mitochondrion or the secretory pathway.
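The location call and the reliability class can be reproduced from the three scores. The sketch below assumes that the location is the category with the highest score and that the reliability class (RC) is derived from the difference between the highest and the second highest score, with class boundaries at 0.8, 0.6, 0.4 and 0.2. These thresholds are an assumption based on the TargetP output description and should be checked against the server documentation; the function names are hypothetical.

```python
# One-letter location codes as they appear in the "Loc" column.
LOCATION_CODES = {"mTP": "M", "SP": "S", "other": "_"}


def reliability_class(diff):
    """Map the gap between the two highest scores to RC 1 (best) .. 5 (worst)."""
    for rc, lower_bound in enumerate((0.8, 0.6, 0.4, 0.2), start=1):
        if diff > lower_bound:
            return rc
    return 5


def targetp_call(scores):
    """Return (location code, reliability class) for a dict of scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_name, best), (_, runner_up) = ranked[0], ranked[1]
    return LOCATION_CODES[best_name], reliability_class(best - runner_up)


# Scores of glucocerebrosidase taken from the TargetP output above.
glcm = {"mTP": 0.091, "SP": 0.364, "other": 0.612}
print(targetp_call(glcm))
```

With these scores the gap between "other" and "SP" is 0.248, which gives the location "_" and reliability class 4 shown in the output.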
Further Examples of Use
As the protein glucocerebrosidase does not contain any transmembrane segments, the application of the different tools mentioned above will further be demonstrated with some other proteins: Bacteriorhodopsin (BACR_HALSA)<ref>http://www.uniprot.org/uniprot/P02945</ref>, Retinol-binding protein 4 (RET4_HUMAN)<ref>http://www.uniprot.org/uniprot/P02753</ref>, Insulin-like peptide INSL5 (INSL5_HUMAN)<ref>http://www.uniprot.org/uniprot/Q9Y5Q6</ref>, Lysosome-associated membrane glycoprotein 1 (LAMP1_HUMAN)<ref>http://www.uniprot.org/uniprot/P11279</ref> and Amyloid beta A4 protein (A4_HUMAN)<ref>http://www.uniprot.org/uniprot/P05067</ref>.
Topology
The table below indicates the correct number of transmembrane segments, the localization of the N-terminus and presence or absence of a signal peptide in the five proteins mentioned above. A more detailed description with the exact positions of the membrane segments and the cleavage sites can be seen in the corresponding Uniprot entries.
Protein | # Transmembrane Segments | Signal Peptide | Localization of N-terminus |
---|---|---|---|
BACR_HALSA | 7 | 0 | extracellular |
RET4_HUMAN | 0 | 1 | extracellular |
INSL5_HUMAN | 0 | 1 | extracellular |
LAMP1_HUMAN | 1 | 1 | lumenal |
A4_HUMAN | 1 | 1 | extracellular |
TMHMM
The topology predictions of TMHMM for the different proteins are illustrated in the graphic below. TMHMM predicts 6 transmembrane segments for BACR_HALSA, but indicates that a signal peptide is possible. According to Uniprot, BACR_HALSA consists of 7 transmembrane segments and has no signal peptide. The last transmembrane helix, ranging from amino acid 217 to 236, was not predicted by TMHMM. The prediction that RET4_HUMAN and INSL5_HUMAN do not contain any transmembrane segments agrees with the corresponding Uniprot entries. TMHMM finds 2 transmembrane segments in LAMP1_HUMAN. The first of these is in reality a signal peptide, as indicated in the Uniprot entry, and is therefore another example of the confusion of transmembrane segments and signal peptides. The single transmembrane helix of A4_HUMAN was predicted correctly.
Phobius and PolyPhobius
The results of Phobius and PolyPhobius are identical apart from minor differences in the length of the different regions of a signal peptide. The topology predictions of both methods for the different proteins are illustrated in the graphic below. Phobius and PolyPhobius predicted all of the proteins correctly. Sometimes the transmembrane segments are shifted by a few amino acid positions to the right or to the left. The orientation of the proteins, the presence of signal peptides and the overall topology of the proteins agree with the corresponding Uniprot entries.
OCTOPUS
OCTOPUS makes many false predictions resulting from a confusion between signal peptides and transmembrane regions. Each signal peptide was predicted as either a transmembrane segment or a reentrant/dip region. BACR_HALSA, which does not have a signal peptide, was predicted correctly.
SPOCTOPUS
SPOCTOPUS, which extends the OCTOPUS algorithm with a signal peptide prediction, predicts the overall topology, the orientation of the protein and the presence of signal peptides correctly in each case. This is a very good example showing that a combined signal peptide and transmembrane prediction is more reliable and makes fewer errors than a transmembrane prediction alone.
SignalP
If a signal peptide is present in the protein, SignalP predicts it with very high confidence, both with the neural networks and the hidden Markov models. The presence of a signal peptide is predicted correctly for the proteins RET4_HUMAN, INSL5_HUMAN, LAMP1_HUMAN and A4_HUMAN. The results of the neural networks and the hidden Markov models differ for the protein BACR_HALSA, which does not have a signal peptide. The S-score of the neural networks indicates that there is a signal peptide and that the corresponding cleavage site is between position 38 and 39. In contrast, the hidden Markov models predict a signal anchor.
Protein | Signal Anchor Prob. | Signal Peptide Prob. | Cleavage Site Prob. | Cleavage Site |
---|---|---|---|---|
BACR_HALSA | 0.86 | 0.02 | 0.00 | 15-16 |
RET4_HUMAN | 0.00 | 1.00 | 0.98 | 18-19 |
INSL5_HUMAN | 0.00 | 1.00 | 0.91 | 22-23 |
LAMP1_HUMAN | 0.00 | 1.00 | 0.85 | 28-29 |
A4_HUMAN | 0.00 | 1.00 | 0.99 | 17-18 |
TargetP
TargetP predicts the signal peptides that are present (in RET4_HUMAN, INSL5_HUMAN, LAMP1_HUMAN and A4_HUMAN) correctly and with a high reliability class (1 or 2) in each case. The method predicts a signal peptide for BACR_HALSA as well, but in this case the reliability class is low (4), which indicates that the prediction is not very reliable.
Name Len mTP SP other Loc RC TPlen
sp_P02945_BACR_HALSA 262 0.019 0.897 0.562 S 4 116
sp_P02753_RET4_HUMAN 201 0.242 0.928 0.020 S 2 18
sp_Q9Y5Q6_INSL5_HUMA 135 0.074 0.899 0.037 S 1 22
sp_P11279_LAMP1_HUMA 417 0.043 0.953 0.017 S 1 28
sp_P05067_A4_HUMAN 770 0.035 0.937 0.084 S 1 17
cutoff 0.000 0.000 0.000
Discussion
The application of the different tools for transmembrane region and signal peptide prediction to a variety of proteins shows that predictors which combine both elements are more reliable and make fewer misclassification errors than single predictors. Therefore, if possible, a predictor for both transmembrane segments and signal peptides, such as SPOCTOPUS, PolyPhobius or Phobius, should be used.
Prediction of GO terms
General
The Gene Ontology Consortium aims to unify the terminology of gene and gene product attributes across all species in order to reduce inconsistent descriptions in different databases. It has therefore developed three ontologies: cellular component, molecular function and biological process. <ref>http://www.geneontology.org/GO.doc.shtml</ref> For each prediction, precision (tp/(tp+fp)) and recall (tp/(tp+fn)) are calculated by comparing the paths to the root in the GO tree of the predicted and the reference terms. Comparing the paths to the root means that a predicted parent of a reference GO-term is counted as a true positive rather than a false positive, so the resulting values are more reliable than those obtained by comparing only the predicted with the reference GO-terms. Figure 21 shows an exemplary GO tree with a reference annotation and a prediction. If one compared just the GO terms without taking the whole tree into account, this would result in zero true positives and one false negative. In contrast, if one uses the paths to the root of the tree for both reference annotation and prediction, one gets three true positives and one false negative. For the GO tree, the transitive closure<ref>http://www.geneontology.org/scratch/transitive_closure/go_transitive_closure.links</ref> of June 2008 is used. This file is not up to date and therefore does not contain every GO-term, but it still gives an impression of how good the predictions are.
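The path-based evaluation can be sketched in a few lines. This is a minimal illustration and not the script used for this page: the term names and the small parent map are made up, and the function names are hypothetical. The toy tree is a single chain root, A, B, C with reference term C and predicted term B, which gives the same counts as described for Figure 21 (three true positives, one false negative).

```python
def ancestors(term, parents):
    """Return the term itself plus all terms on its paths to the root."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def precision_recall(predicted, reference, parents):
    """Precision and recall on the ancestor closures of both term sets."""
    pred = set().union(*(ancestors(t, parents) for t in predicted))
    ref = set().union(*(ancestors(t, parents) for t in reference))
    tp = len(pred & ref)   # terms on both the predicted and the reference paths
    fp = len(pred - ref)   # predicted terms that are not on any reference path
    fn = len(ref - pred)   # reference terms that were not reached
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if ref else 0.0
    return precision, recall


# Hypothetical toy ontology: child -> list of parents.
toy_parents = {"A": ["root"], "B": ["A"], "C": ["B"]}

# Path-based comparison: B is a parent of the reference term C.
print(precision_recall(["B"], ["C"], toy_parents))   # (1.0, 0.75)

# Flat comparison (no tree): the predicted term simply does not match.
print(precision_recall(["B"], ["C"], {}))            # (0.0, 0.0)
```

In the real evaluation the parent map would be replaced by the transitive closure file, which already lists all ancestors of each term.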
Annotation of Glucocerebrosidase
The following GO-term annotations are taken from Uniprot<ref>http://www.ebi.ac.uk/QuickGO/GProtein?ac=P04062</ref> and are used as reference for a comparison with the different prediction tools.
Accession | Term | Ontology |
---|---|---|
GO:0005975 | carbohydrate metabolic process | biological process |
GO:0008219 | cell death | biological process |
GO:0006629 | lipid metabolic process | biological process |
GO:0007040 | lysosome organization | biological process |
GO:0006665 | sphingolipid metabolic process | biological process |
GO:0008152 | metabolic process | biological process |
GO:0005765 | lysosomal membrane | cellular component |
GO:0016020 | membrane | cellular component |
GO:0005764 | lysosome | cellular component |
GO:0043169 | cation binding | molecular function |
GO:0003824 | catalytic activity | molecular function |
GO:0004348 | glucosylceramidase activity | molecular function |
GO:0016798 | hydrolase activity, acting on glycosyl bonds | molecular function |
GO:0016787 | hydrolase activity | molecular function |
GO:0005515 | protein binding | molecular function |
GOPET
GOPET, a Gene Ontology term Prediction and Evaluation Tool, uses homology searches and Support Vector Machines to predict the molecular function GO-terms for sequences of any organism. It was made public in 2006 by Vinayagam et al. <ref>Vinayagam A., et al. GOPET: A tool for automated predictions of Gene Ontology terms. BMC Bioinformatics. 2006; 7: 161.</ref>
Usage
- Webserver: http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar
- Input: protein sequence in fasta format
Results
GOPET predicts 3 different GO-terms for glucocerebrosidase. The annotations made by GOPET are correct: each predicted GO-term is listed in the corresponding Uniprot entry of glucocerebrosidase. GOPET does not predict two molecular function GO-terms: protein binding and cation binding. In total, GOPET achieves a recall of 0.67 and a precision of 1.00 if only the GO-terms corresponding to molecular function are taken into account.
GOid | Aspect | Confidence | GO term |
---|---|---|---|
GO-ID:0016787 | F | 98% | hydrolase activity |
GO-ID:0004348 | F | 97% | glucosylceramidase activity |
GO-ID:0016798 | F | 97% | hydrolase activity acting on glycosyl bonds |
Pfam
The database Pfam contains protein families and domains and is based on hidden Markov models. It consists of two parts: Pfam-A is curated and therefore contains high quality data whereas Pfam-B is generated automatically. Pfam was presented by Sonnhammer et al. in 1997. <ref>Sonnhammer E., et al., Pfam: A Comprehensive Database of Protein Domain Families Based on Seed Alignments. PROTEINS: Structure, Function, and Genetics 28:405-420(1997)</ref>
Usage
- Database search: http://pfam.sanger.ac.uk/search
- Input: protein sequence in fasta format
- use families of significant Pfam-A matches to search in the pfam2go<ref>http://www.geneontology.org/external2go/pfam2go</ref> file for corresponding GO-terms
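The lookup in the pfam2go file can be automated. The sketch below assumes the usual external2go line layout (`Pfam:<accession> <name> > GO:<term name> ; GO:<id>`, with comment lines starting with `!`). The sample lines are hand-written in that layout from the four GO terms reported for this family on this page and are not copied from the file; the accession used for the family is an assumption that should be checked against the Pfam entry, and the function name is hypothetical.

```python
# Hand-written sample in the assumed pfam2go layout (see lead-in).
SAMPLE_PFAM2GO = """\
!example comment line
Pfam:PF02055 Glyco_hydro_30 > GO:glucosylceramidase activity ; GO:0004348
Pfam:PF02055 Glyco_hydro_30 > GO:sphingolipid metabolic process ; GO:0006665
Pfam:PF02055 Glyco_hydro_30 > GO:lysosome organization ; GO:0007040
Pfam:PF02055 Glyco_hydro_30 > GO:lysosome ; GO:0005764
"""


def parse_pfam2go(lines):
    """Map each Pfam accession to a list of (GO id, GO term name) tuples."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        left, right = line.split(" > ", 1)
        accession = left.split()[0].split(":", 1)[1]
        name, go_id = right.rsplit(" ; ", 1)
        if name.startswith("GO:"):
            name = name[3:]
        mapping.setdefault(accession, []).append((go_id, name))
    return mapping


pfam2go = parse_pfam2go(SAMPLE_PFAM2GO.splitlines())
for go_id, name in pfam2go["PF02055"]:
    print(go_id, name)
```

For a real search, the lines of the downloaded pfam2go file would be passed to the function, and the accessions of the significant Pfam-A matches would be used as keys.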
Results
Pfam assigns glucocerebrosidase to the "O-Glycosyl hydrolase family 30". To retrieve the GO-annotations for this family, the pfam2go file <ref>http://www.geneontology.org/external2go/pfam2go</ref> of the Gene Ontology website had to be used. Each GO-term listed in Pfam is listed in the corresponding Uniprot site as well. This results in a precision of 1.00 and a recall of 0.65.
Accession | Term | Ontology |
---|---|---|
GO:0004348 | glucosylceramidase activity | molecular function |
GO:0006665 | sphingolipid metabolic process | biological process |
GO:0007040 | lysosome organization | biological process |
GO:0005764 | lysosome | cellular component |
ProtFun 2.2
ProtFun is an ab initio prediction server for protein function which is based on sequence-derived protein features such as predicted post-translational modifications, protein sorting signals and physical/chemical properties that have been calculated from the amino acid composition of the input sequence. <ref>Jensen L., et al. Ab initio prediction of human orphan protein function from post-translational modifications and localization features. J. Mol. Biol., 319:1257-1265, 2002</ref>
Usage
- Webserver: http://www.cbs.dtu.dk/services/ProtFun/
- Input: protein sequence in fasta format
Results
ProtFun assigns glucocerebrosidase to immune response (GO:0006955), which is not correct. Therefore the results of ProtFun have a recall and a precision of 0.00.
Functional category Prob Odds
Amino_acid_biosynthesis 0.035 1.593
Biosynthesis_of_cofactors 0.182 2.528
Cell_envelope => 0.504 8.262
Cellular_processes 0.032 0.438
Central_intermediary_metabolism 0.382 6.063
Energy_metabolism 0.067 0.740
Fatty_acid_metabolism 0.027 2.088
Purines_and_pyrimidines 0.538 2.213
Regulatory_functions 0.031 0.191
Replication_and_transcription 0.126 0.471
Translation 0.082 1.863
Transport_and_binding 0.560 1.365
Enzyme/nonenzyme Prob Odds
Enzyme => 0.773 2.698
Nonenzyme 0.227 0.318
Enzyme class Prob Odds
Oxidoreductase (EC 1.-.-.-) 0.083 0.399
Transferase (EC 2.-.-.-) 0.228 0.660
Hydrolase (EC 3.-.-.-) 0.272 0.859
Lyase (EC 4.-.-.-) 0.045 0.961
Isomerase (EC 5.-.-.-) 0.011 0.345
Ligase (EC 6.-.-.-) 0.017 0.332
Gene Ontology category Prob Odds
Signal_transducer 0.054 0.251
Receptor 0.027 0.158
Hormone 0.001 0.206
Structural_protein 0.002 0.087
Transporter 0.024 0.222
Ion_channel 0.018 0.307
Voltage-gated_ion_channel 0.004 0.195
Cation_channel 0.012 0.268
Transcription 0.070 0.550
Transcription_regulation 0.030 0.237
Stress_response 0.085 0.962
Immune_response => 0.153 1.804
Growth_factor 0.005 0.376
Metal_ion_transport 0.009 0.020
Further Examples of Use
The different prediction tools have also been applied to the additional proteins already used in the transmembrane and signal peptide prediction section. The reference annotations and the predictions are listed in the following PDF file: File:GO term prediction for several different proteins.pdf. The resulting precision and recall values are listed in the table below. This time, recall and precision of GOPET have been calculated by a comparison to all annotated GO-terms of the corresponding proteins, not only to the ones describing a molecular function.
The Pfam database search is the only method which returns only correct GO-terms for each of the 5 proteins, although the coverage (recall) is only high for INSL5. Neither GOPET nor ProtFun predicts any correct GO-terms for LAMP1, which is notable, as the precision for the other proteins is quite high (apart from A4_HUMAN: in this case, the predicted gene ontology category of ProtFun could not be mapped to a specific GO-term, and therefore neither precision nor recall could be calculated). The low recall values of ProtFun and GOPET have straightforward reasons: GOPET only predicts GO-terms of the ontology "molecular function", and ProtFun only assigns the protein to one gene ontology category.
Method / measure | Bacteriorhodopsin | Retinol-binding protein 4 | Insulin-like peptide INSL5 | Lysosome-associated membrane glycoprotein 1 | Amyloid beta A4 protein |
---|---|---|---|---|---|
GOPET | | | | | |
Precision | 0.62 | 0.62 | 1.00 | 0.00 | 0.70 |
Recall | 0.21 | 0.04 | 0.57 | 0.00 | 0.05 |
Pfam | |||||
Precision | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Recall | 0.26 | 0.01 | 0.71 | 0.07 | 0.03 |
ProtFun 2.2 | |||||
Precision | 1.00 | 1.00 | 1.00 | 0.00 | - |
Recall | 0.02 | 0.02 | 0.57 | 0.00 | - |
PDF-File containing the results of the different methods and the correct GO-term annotations: File:GO term prediction for several different proteins.pdf
Discussion
GOPET predicts GO terms corresponding to molecular function very well. The recall values for the further examples of use are quite low because all GO-terms, and not only the ones for molecular function, have been used as reference. The GO-terms retrieved with Pfam are very accurate: in each prediction, only correct GO-terms have been predicted, which results in a precision of 1.0. ProtFun 2.2 differs from the other two prediction methods, as it only assigns the protein to one gene ontology category. This results in very low recall values, and in several cases the prediction was not correct.
None of the three methods was able to predict all of the annotated GO-terms. This is not surprising, as the Gene Ontology consists of a very large number of terms. GOPET and Pfam are nevertheless a good way to get a broad overview of the different functions and localizations of a protein of interest.
References
<references />