Glucocerebrosidase sequence based prediction
Secondary structure prediction
General
The secondary structure of a protein describes the local three-dimensional conformation of segments of the polypeptide chain, in contrast to the tertiary structure, which describes the fold of the whole chain. These local structures are stabilised by weak chemical forces such as hydrogen bonds and are characterised by the values of the backbone angles φ and ψ. The main types are α-helices and parallel and anti-parallel β-sheets. Rarer structures are π-helices and 3₁₀-helices. The remaining residues form coils, which are irregularly formed elements.
A protein consists of several secondary structure elements, which together build the tertiary structure. <ref>http://en.wikipedia.org/wiki/Biomolecular_structure#Secondary_structure</ref>
PSIPRED
PSIPRED is a method by David T. Jones, published in 1999 in the Journal of Molecular Biology (JMB) as "Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices". PSIPRED uses a two-stage neural network to predict secondary structure. The network input is the position-specific scoring matrix generated by a preceding PSI-BLAST run.<ref>David T. Jones, Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices, JMB, 1999</ref>
As input only the protein sequence is needed.
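How a position-specific scoring matrix can be turned into per-residue network input is sketched below. This is a minimal illustration and not PSIPRED's own code: the function name `window_features`, the zero padding at the termini and the toy three-column profile are assumptions made for the example, and the default window length of 15 residues is taken from the description in Jones (1999).

```python
def window_features(pssm, width=15):
    """Return one flattened window of PSSM rows per residue.

    pssm  : list of rows, one row of scores per residue
    width : odd window length centred on the residue (15 assumed here)
    """
    if width % 2 == 0:
        raise ValueError("width must be odd")
    half = width // 2
    n_cols = len(pssm[0]) if pssm else 0
    pad = [0.0] * n_cols  # zero row beyond the termini (an assumption)
    features = []
    for i in range(len(pssm)):
        window = []
        for j in range(i - half, i + half + 1):
            row = pssm[j] if 0 <= j < len(pssm) else pad
            window.extend(row)
        features.append(window)
    return features


# Toy profile with 3 columns instead of 20, for illustration only.
toy_pssm = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
feats = window_features(toy_pssm, width=3)
print(len(feats), len(feats[0]))  # 4 9
```

Each residue is described by its own profile row plus the rows of its neighbours, so the classifier sees local sequence context rather than a single position.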
We ran both the online and the local version of PSIPRED and got different results. In the following, both are compared to the secondary structure given in UniProt<ref>http://www.uniprot.org/uniprot/P04062</ref>.
Conf: | 988898954488887622315999999999998641038968865325999649995388
|
online: | CCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCEEEEECCC
|
Conf: | 987898955489988742200466888998986410038977877777863169974474
|
local: | CCCCCCCCCCCCCCCCCEEEEHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCEEEEEECCC
|
uniprot: | ------------------------------------------------EEEE-EEEEEE-
|
AA: | MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNAT
|
Conf: | 558889998889992599996377885421237645688875108995378301079247
|
online: | CCCCCCCCCCCCCCCEEEEEECCCCCCCCCCCCCCCCCCCCCCCEEEECCCCCEEEEEEE
|
Conf: | 148899998788875431100345640022100111177897107840966454557422
|
local: | CCCCCCCCCCCCCCCCCCCCCCCCCCCEEEEEEEEECCCCCCCCEEEECCCCCCCEEEEE
|
uniprot: | -------------EEEEEEEE-----EEEEEEE-EEE----EEEEEEEEEEEEEE--EEE
|
AA: | YCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGF
|
Conf: | 300233899997249999999999860597882105999750588999986666899999
|
online: | EECCCHHHHHHHHCCCHHHHHHHHHHCCCCCCCEEEEEEEEECCCCCCCCCCCCCCCCCC
|
Conf: | 011335889987508927898998851396893001358621344677653324799999
|
local: | CCCCCHHHHHHHHHCCHHHHHHHHHHHCCCCCCEEEEEEEECCCCCCCCCCCCCCCCCCC
|
uniprot: | EE--HHHHHHH----HHHHHHHHHHHH-CCCC---EEEEEEE--EEEEE------EEE--
|
AA: | GGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDD
|
Conf: | 689999994100245289999999971999389971377785612147247999889999
|
online: | CCCCCCCCCHHCCCCCHHHHHHHHHHCCCCCEEEECCCCCCCCCEECCCCCCCCCCCCCC
|
Conf: | 721111368543220024799998733999689957899974220056347854325899
|
local: | CCCCCCCCCCCCCCCHHHHHHHHHHHCCCCCEEEECCCCCCCCCCCCCCCCCCCCCCCCC
|
uniprot: | --------HHHH--HHHHHHHHHHH-----EEEEEEE---HHH----EEEEE-EEEE---
|
AA: | FQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQP
|
Conf: | 922699999999999999975490786872012579899999999986349999999999
|
online: | CCHHHHHHHHHHHHHHHHHHHCCEEEEEEECCCCCCCCCCCCCCCCCCCCCHHHHHHHHH
|
Conf: | 971468799999999967663395143898112789787678873222114422121122
|
local: | CCHHHHHHHHHHHHHHHHHHHCCCCEEEEEEECCCCCCCCCCCCCCCCCCCCCCCCHHHH
|
uniprot: | -HHHHHHHHHHHHHHHHHHH-----EEEEE-----HHH------------HHHHHHHHHH
|
AA: | GDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIA
|
Conf: | 955799851689972999944888873334664149955640224689831699998033
|
online: | HHHHHHHHCCCCCCEEEEEECCCCCCHHHHHHHHCCCHHHHCCCCEEEEEECCCCCCHHH
|
Conf: | 111332310577410134212544556520222238976651151878702212236320
|
local: | HHHHHHHHCCCCCCCEEEEECCCCCCCCCCHHHHCCCHHHHHCCEEEEEECCCCCCCCCC
|
uniprot: | -HHHHHH--CCCCEEEEEEEEEHHH--HHHHHHH--HHHH----EEEEEEE------HHH
|
AA: | RDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAK
|
Conf: | 412688750999509994343699998866567831444255999999996402335772
|
online: | HHHHHHHHCCCCCEEEEECCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHCCEEEEEE
|
Conf: | 011111126898001101210389653344579861210143212566552001100000
|
local: | CCCCCCCCCCCCCCEEHHHHCCCCCCCCCCCCCCCHHHHCCCCHHHHHHHHHHHHHHEEE
|
uniprot: | HHHHHHHH---EEEEEEEEE--------------HHHHHHHHHHHHHHHH--EEEEEEEE
|
AA: | ATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHSIITNLLYHVVGWTDW
|
Conf: | 000169999986689878535895679769986202333102244469939999542389
|
online: | EECCCCCCCCCCCCCCCCCCEEEECCCCEEEECCHHHHHHHHCCCCCCCCEEEEEEECCC
|
Conf: | 023699999860001325228998208703226821232123444679927984435078
|
local: | CCCCCCCCCCCCEECCCCCCEEEEECCCCEEECCCEEEECCCCCCCCCCCEEEEEEEECC
|
uniprot: | E-----------------EEEEEHHH-EEEE-HHHHHHHHHH-------EEEEEEEEE--
|
AA: | NLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQK
|
Conf: | 99028999928998999999099993779999099304998518951899999309
|
online: | CCCEEEEEECCCCCEEEEEECCCCCCEEEEEEECCCCEEEEECCCCEEEEEEEEEC
|
Conf: | 99538995649997799999146898214741998642000389842568774139
|
local: | CCCCEEEEECCCCCEEEEEEECCCCCEEEEECCCCCCCCCCCCCCCEEEEEEEECC
|
uniprot: | EEEEEEEE-----EEEEEEE-EEE-EEEEEEECCCEEEEEEE---EEEEEEE----
|
AA: | NDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ
|
The two results differ considerably. The online version of PSIPRED seems to use different parameters than the local version, which would explain the differences. Compared to the secondary structure given in UniProt, both runs predict many regions incorrectly.
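To put a number on how much two prediction runs differ, the fraction of identical positions can be computed. The sketch below is only an illustration: the function name `agreement` is made up, and the two short strings are toy data, not the PSIPRED output above. The 60-column state lines of the listing can be passed in the same way.

```python
def agreement(pred_a, pred_b):
    """Fraction of positions at which two equal-length state strings agree."""
    if len(pred_a) != len(pred_b):
        raise ValueError("strings must have the same length")
    if not pred_a:
        return 0.0
    same = sum(1 for a, b in zip(pred_a, pred_b) if a == b)
    return same / len(pred_a)


# Toy three-state strings (H = helix, E = strand, C = coil).
run_1 = "CCHHHHCCEE"
run_2 = "CCEHHHCCCE"
print(agreement(run_1, run_2))  # 0.8
```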
Jpred3
Jpred3 was published in 2008 by Christian Cole, Jonathan D. Barber and Geoffrey J. Barton as "The Jpred 3 secondary structure prediction server" in Nucleic Acids Research. Its Jnet algorithm predicts secondary structure and solvent accessibility from alignment profiles. For this it uses the position-specific scoring matrix (PSSM) from PSI-BLAST and a hidden Markov model. The prediction itself is made with a neural network.<ref>http://nar.oxfordjournals.org/content/36/suppl_2/W197.full</ref>
As input only the protein sequence is needed. Alternatively, a multiple sequence alignment can be used.
OrigSeq | MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANH
|
Jnet | -----------------HHHHHHHHHHHHHHHHHHHH---------------EEEEE-----------------EEEEEEE------------------
|
jhmm | ----------------HHHHHHHHHHHHHHHHHHHHH---------------EEEEE-----------------EEEEEEE------------------
|
jpssm | ------------------HHHHHHHHHHHHHHHHHH---------------EEEEEE-----------------EEEEEEE------------------
|
uniprot | ------------------------------------------------EEEE-EEEEEE--------------EEEEEEEE-----EEEEEEE-EEE--
|
OrigSeq | TGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIP
|
Jnet | ----EEEEEE----EEEEEEEEEEHHHHHHHHH----HHHHHHHHHHH-------EEEEEE----------------------------------HHHH
|
jhmm | ----EEEEE-----EEEEEEEEEEEHHHHHHHH----HHHHHHHHHHH-------EEEEEE----------------------------------HHHH
|
jpssm | ----EEEEEE----EEEEEEEE-HHHHHHHHHH----HHHHHHHHHHH------EEEEEEEE--------------------------------HHHHH
|
uniprot | --EEEEEEEEEEEEEE--EEEEE--HHHHHHH----HHHHHHHHHHHH-CCCC---EEEEEEE--EEEEE------EEE----------HHHH--HHHH
|
OrigSeq | LIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRD
|
Jnet | HHHHHHHH----EEEEE--------EE-------------------HHHHHHHHHHHHHHHHH----EEEEE---------------------HHHHHH
|
jhmm | HHHHHHHH----EEEEE-----------------------------HHHHHHHHHHHHHHHHH----EEEEE---------------------HHHHHH
|
jpssm | HHHHHHHHH---EEEEE--------EEE-----------------HHHHHHHHHHHHHHHHHH----EEEEEE--------------------HHHHHH
|
uniprot | HHHHHHH-----EEEEEEE---HHH----EEEEE-EEEE----HHHHHHHHHHHHHHHHHHH-----EEEEE-----HHH------------HHHHHHH
|
OrigSeq | FIARDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSW
|
Jnet | HHHHHHHHHHHH-----EEEEEE--------HHHHHHH--HHHHHHHHEEEE----------HHHHHHHHHH-----EEEEEEEE--------------
|
jhmm | HHHHHHHHHHHH------EEEEE-------HHHHHHHH--H-HHHHHHEEEE----------HHHHHHHHHH-----EEEEEEEE--------------
|
jpssm | HHHHHHHHHHHH-----EEEEEEE-------HHHHHH----HHHHHH--EE-----------HHHHHHHHHH-----EEEEEEE--------------H
|
uniprot | HHH-HHHHHH--CCCCEEEEEEEEEHHH--HHHHHHH--HHHH----EEEEEEE------HHHHHHHHHHH---EEEEEEEEE--------------HH
|
OrigSeq | DRGMQYSHSIITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAVALMHPDGSA
|
Jnet | HHHHHHHHHHHHHHHHHHHHHHHHHHH----------------EEEEE----EEEE---HHHHHHHH-------EEEEE-------EEEEEEE-----E
|
jhmm | -HHHHHHHHHHHHHHHHHHHHHHHHHH----------------EEEEE----EEEE---HHHHHHHH-------EEEE--------EEEEEEE-----E
|
jpssm | HHHHHHHHHHHHHHHHHHHHHHHHHHE----------------EEEEE----EEEE--HHHHHHHH--------EEEEEE------EEEEEEEE----E
|
uniprot | HHHHHHHHHHHHHH--EEEEEEEEE-----------------EEEEEHHH-EEEE-HHHHHHHHHH-------EEEEEEEEE--EEEEEEEE-----EE
|
OrigSeq | VVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ
|
Jnet | EEEEEE-----EEEEEEE---EEEEEEE----EEEEEEE--
|
jhmm | EEEEEE-----EEEEEEE---EEEEEEE----EEEEEEE--
|
jpssm | EEEEEE----EEEEEEEE---EEEEEEE---EEEEEEE---
|
uniprot | EEEEE-EEE-EEEEEEECCCEEEEEEE---EEEEEEE----
|
Jpred3 predicts most of the protein correctly. Some helices and β-sheets annotated in UniProt are not predicted, and only once does it predict a sheet where UniProt has a helix.
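A statement such as "most helices are found" can be checked per state. The following sketch counts, for one state, how many of the positions annotated with that state in the reference are also predicted with it. The function name `state_recovery` and the toy strings are assumptions for illustration. A "-" in the reference row means "no annotation" and not necessarily coil, which is why only annotated positions are counted.

```python
def state_recovery(pred, ref, state):
    """Fraction of reference positions in `state` that are predicted as `state`."""
    if len(pred) != len(ref):
        raise ValueError("strings must have the same length")
    annotated = [i for i, s in enumerate(ref) if s == state]
    if not annotated:
        return None  # the state does not occur in the reference
    hits = sum(1 for i in annotated if pred[i] == state)
    return hits / len(annotated)


# Toy rows in the style of the listing above (not the real data).
ref  = "--HHHH--EEEE-"
pred = "---HHH--EEHH-"
print(state_recovery(pred, ref, "H"))  # 0.75
print(state_recovery(pred, ref, "E"))  # 0.5
```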
For the prediction Jpred3 used many BLAST hits with an E-value of 0 and one hit with an E-value of 2e-52:
2wkl, 3keh, 3ke0, 3gxm, 3gxi, 3gxf, 3gxd, 2wcg, 2vt0, 2v3f, 2v3e, 2v3d, 2nt1, 2nt0, 2nsx, 2j25, 2f61, 1y7v, 1ogs, 2wnw.
All but 2wnw are glucosylceramidase structures from Homo sapiens; 1ogs is exactly the one we use. 2wnw is a hydrolase activated by a transcription factor from Salmonella typhimurium, which also seems to be a glucosylceramidase. Some examples of the used proteins are shown in the following pictures.
Comparison with DSSP
DSSP, which stands for Define Secondary Structure of Proteins, was developed by Wolfgang Kabsch and Chris Sander, who published it in 1983 in Biopolymers under the title "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features".<ref>http://swift.cmbi.ru.nl/gv/dssp/</ref>
The DSSP algorithm assigns secondary structure to a protein whose three-dimensional structure is already known. To do so, it identifies hydrogen bonds using an electrostatic definition. Each characteristic pattern of hydrogen bonds is then assigned to one of eight possible secondary structure types. DSSP is therefore not a prediction tool.<ref>http://en.wikipedia.org/wiki/DSSP_%28protein%29</ref>
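The electrostatic hydrogen-bond criterion can be written out in a few lines. The constants (partial charges of 0.42e and 0.20e, the factor 332 and the cutoff of -0.5 kcal/mol) are those given by Kabsch and Sander (1983); the function names and the example distances are illustrative assumptions and not taken from the DSSP program.

```python
def hbond_energy(r_on, r_ch, r_oh, r_cn):
    """Electrostatic energy (kcal/mol) between a C=O and an N-H group.

    All distances are in angstroms: O...N, C...H, O...H and C...N.
    """
    q1, q2, f = 0.42, 0.20, 332.0
    return q1 * q2 * f * (1.0 / r_on + 1.0 / r_ch - 1.0 / r_oh - 1.0 / r_cn)


def is_hbond(r_on, r_ch, r_oh, r_cn, cutoff=-0.5):
    """A hydrogen bond is assigned if the energy is below the cutoff."""
    return hbond_energy(r_on, r_ch, r_oh, r_cn) < cutoff


# Made-up distances for a close C=O / N-H pair and a distant one.
print(round(hbond_energy(3.0, 3.0, 2.0, 4.0), 2))  # about -2.32
print(is_hbond(3.0, 3.0, 2.0, 4.0))                # True
print(is_hbond(10.0, 10.0, 10.0, 10.0))            # False
```

Helices and sheets are then recognised from repeating patterns of such bonds, for example bonds from residue i to residue i+4 in an α-helix.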
10 20 30 40 50 60
| | | | | |
1 - 60 ARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANH
1 - 60 SSS TTTTSSSSSSTT TTSSSSSSSSTTT TSSSSSS TT
1 - 60 * *
1 - 60 AAA AAAAAAAA A AA A A A AA A AA A AAAA
70 80 90 100 110 120
| | | | | |
61 - 120 TGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIR
61 - 120 TSSSSSSSSSSSSS SSSSS HHHHHHHTTT HHHHHHHHHHHHTTTTT SSS
61 - 120 * * *
61 - 120 A A A A AAAA A A A AAA A A AA
130 140 150 160 170 180
| | | | | |
121 - 180 VPMASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWT
121 - 180 SSST TTTTT T TTT TT TT HHHHTTHHHHHHHHHHH TT SSSSSST
121 - 180 ** * **
121 - 180 AAA AAAA AA AA A AA A AA AAA A A
190 200 210 220 230 240
| | | | | |
181 - 240 SPTWLKTNGAVNGKGSLKGQPGDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGL
181 - 240 333 TT TTTTT TT TTTHHHHHHHHHHHHHHHHHHHTT TSSST TTTT333
181 - 240 ***** *
181 - 240 AAAAAA A AAA AAA A A A A AAA A A
250 260 270 280 290 300
| | | | | |
241 - 300 LSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPE
241 - 300 TTT T HHHHHHHHHHTHHHHHHTTTTTTTSSSSSSSS333TTHHHHHHHTTHH
241 - 300 ** * *
241 - 300 AAAAA A AA A A A AA A AA A A A A A A AA
310 320 330 340 350 360
| | | | | |
301 - 360 AAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRG
301 - 360 HHTT SSSSSSSTTT HHHHHHHHHHH TTTSSSSSSSS TTT T TT HHHH
301 - 360 ** * * **
301 - 360 AA A AA A AA A AA AA AAA A A A
370 380 390 400 410 420
| | | | | |
361 - 420 MQYSHSIITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHL
361 - 420 HHHHHHHHHHHHTTSSSSSSSST TTT TT TSSSS333TSSSS HHHHHH
361 - 420 *** *** **
361 - 420 A AA AAA A A A AAA
430 440 450 460 470 480
| | | | | |
421 - 480 GHFSKFIPEGSQRVGLVASQKNDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFL
421 - 480 HHHHTT TT SSSSSSSTT TSSSSSSS TTT SSSSSSS TTT SSSSSSSTTTSSS
421 - 480 ***** * *
421 - 480 A AAAAAAA AAA A AAA A A AA A
490
|
481 - 497 ETISPGYSIHTYLWHRQ
481 - 497 SSSS TTSSSSSSS
481 - 497 *
481 - 497 A A A AA
500 510 520 530 540 550
| | | | | |
498 - 557 ARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANH
498 - 557 SS TTTT SSSS TT T TTSSSSSSSSTTT TSSSSSS T
498 - 557 * * * * ****** *
498 - 557 AAA AAAAAAAA A AAAAAA AA A AA A A A A AAAA
560 570 580 590 600 610
| | | | | |
558 - 617 TGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIR
558 - 617 TTSSSSSSSSSSSSS SSSSS HHHHHHHHTT HHHHHHHHHHHHTTTTT SSS
558 - 617 * *
558 - 617 AAAA A A AAAA A A AA AAA A A A
620 630 640 650 660 670
| | | | | |
618 - 677 VPMASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWT
618 - 677 SSST TTTTT T TTT TT TT HHHHTTHHHHHHHHHHH TT SSSSSST
618 - 677 *** * *
618 - 677 AAA AAAA A AA A A AAA AA AAA AAA
680 690 700 710 720 730
| | | | | |
678 - 737 SPTWLKTNGAVNGKGSLKGQPGDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGL
678 - 737 333 TT TTTTT TT TTTHHHHHHHHHHHHHHHHHHHTT TSSST TT33333
678 - 737 *
678 - 737 A AAA A A AAA A A A A AAA A A
740 750 760 770 780 790
| | | | | |
738 - 797 LSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPE
738 - 797 TTT T HHHHHHHHHHTHHHHHHTTTTTTTSSSSSSSS333TTHHHHHHHTTHH
738 - 797 *** **
738 - 797 AAA A A A AA A A AA A AA A A A A A A AA
800 810 820 830 840 850
| | | | | |
798 - 857 AAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRG
798 - 857 HHTT SSSSSSSTTT HHHHHHHHHHH TTTSSSSSSSST TTTT T TT HHHH
798 - 857 ** * ** * **
798 - 857 A A AA A AA A AAA AA AAA A A A
860 870 880 890 900 910
| | | | | |
858 - 917 MQYSHSIITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHL
858 - 917 HHHHHHHHHHHHTTSSSSSSSST TTT TT TSSSS333TSSSS HHHHHH
858 - 917 * ** **
858 - 917 A AA AA AAAA AAA
920 930 940 950 960 970
| | | | | |
918 - 977 GHFSKFIPEGSQRVGLVASQKNDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFL
918 - 977 HHHHTT TT SSSSSSSTT TSSSSSSS TTT SSSSSSS TTT SSSSSSSTTTSSS
918 - 977
918 - 977 A AAAAAAA AA A AAA A A A A
980 990
| |
978 - 994 ETISPGYSIHTYLWHRQ
978 - 994 SSSS TTSSSSSSS
978 - 994
978 - 994 A A AA
The DSSP result has four lines per block: the first gives the amino acids, the second the secondary structure, the third the residues involved in symmetry contacts, which are marked with an asterisk (*), and the fourth the solvent-accessible residues, which are marked with an A.
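The secondary structure line uses the symbols H, T, 3 and S, where H indicates an α-helix and T a hydrogen-bonded turn. We read 3 as a residue in an isolated β-bridge and S as a bend, which is a region of high curvature. Note, however, that in this listing S lines up with the β-strands and 3 with the short helices annotated in UniProt, so in this rendering the two symbols may instead stand for β-strand and 3₁₀-helix.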
The listing differs from our GBA sequence because it was computed from the PDB file, which contains the sequence without the signal peptide at the beginning and contains the protein twice, as chains A and B (residues 1-497 and 498-994 in the listing). In the following we compare the secondary structure of chain A from DSSP with the secondary structure in UniProt.
uniprot | ----------EEEE-EEEEEE--------------EEEEEEEE-----EEEEEEE-EEE-
|
DSSP | -----SSS-TTTTSSSSSSTT------------TTSSSSSSSSTTT--TSSSSSS--TT-
|
uniprot | ---EEEEEEEEEEEEEE--EEEEE--HHHHHHH----HHHHHHHHHHHH-CCCC---EEE
|
DSSP | TSSSSSSSSSSSSS--SSSSS--HHHHHHHTTT-HHHHHHHHHHHHTTTTT---SSS
|
uniprot | EEEE--EEEEE------EEE----------HHHH--HHHHHHHHHHH-----EEEEEEE-
|
DSSP | SSST--TTTTT---T--TTT-TT-TT----HHHHTTHHHHHHHHHHH-TT--SSSSSST-
|
uniprot | --HHH----EEEEE-EEEE----HHHHHHHHHHHHHHHHHHH-----EEEEE-----HHH
|
DSSP | --333-TT-TTTTT---TT-TTTHHHHHHHHHHHHHHHHHHHTT---TSSST-TTTT333
|
uniprot | ------------HHHHHHHHHH-HHHHHH--CCCCEEEEEEEEEHHH--HHHHHHH--HH
|
DSSP | TTT--T------HHHHHHHHHHTHHHHHHTTTTTTTSSSSSSSS333TTHHHHHHHTTHH
|
uniprot | HH----EEEEEEE------HHHHHHHHHHH---EEEEEEEEE--------------HHHH
|
DSSP | HHTT--SSSSSSSTTT---HHHHHHHHHHH-TTTSSSSSSSS----TTT-T--TT-HHHH
|
uniprot | HHHHHHHHHHHH--EEEEEEEEE-----------------EEEEEHHH-EEEE-HHHHHH
|
DSSP | HHHHHHHHHHHHTTSSSSSSSST---TTT---TT------TSSSS333TSSSS-HHHHHH
|
uniprot | HHHH-------EEEEEEEEE--EEEEEEEE-----EEEEEEE-EEE-EEEEEEECCCEEE
|
DSSP | HHHHTT--TT-SSSSSSSTT--TSSSSSSS-TTT-SSSSSSS-TTT-SSSSSSSTTTSSS
|
uniprot | EEEE---EEEEEEE---
|
DSSP | SSSS-TTSSSSSSS---
|
The agreement with DSSP is good. Most helices match, and the β-sheets annotated in UniProt are mostly labelled S in the DSSP output. In some places helices or β-sheets from UniProt appear as turns in DSSP, but overall the two fit quite well.
Prediction of disordered regions
General
Disordered regions are regions without a fixed secondary, and therefore tertiary, structure. The Ramachandran angles of these regions are very flexible, which means that they can adopt more than one secondary structure, often depending on binding to a substrate.
There are two types of disordered regions: extended, random-coil-like ones and collapsed, molten-globule-like ones. They have many different functions; for example, they are responsible for DNA, RNA and protein recognition and for the specificity and affinity of binding. They are often involved in regulatory functions.<ref>http://www.pondr.com/pondr-tut1.html</ref>
DISOPRED
DISOPRED was published by Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF and Jones DT in 2004 in the Journal of Molecular Biology with the title "Prediction and functional analysis of native disorder in proteins from the three kingdoms of life".<ref>http://bioinf.cs.ucl.ac.uk/disopred/</ref>
The method is based on a support vector machine classifier that was trained with the SVMlight package. DISOPRED first runs PSI-BLAST against a filtered sequence database and then uses the position-specific scoring matrix of the final iteration to generate the input for the classifier.<ref>http://cms.cs.ucl.ac.uk/typo3/fileadmin/bioinf/Disopred/disopred_help.html</ref>
DISOPRED predictions for a false positive rate threshold of: 5%
conf | 999999999988777630000000000000000000000000000000000000000000
|
pred | ****************............................................
|
AA | MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNAT
|
conf | 000000000000000000000001456676677776543210000000000000000000
|
pred | .........................************.......................
|
AA | YCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGF
|
conf | 000000000000000000000000000000000000000000000000000000000000
|
pred | ............................................................
|
AA | GGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDD
|
conf | 000000000000000000000000000000000000000000000000000021000000
|
pred | ............................................................
|
AA | FQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQP
|
conf | 000000000000000000000000000000000000000000000000000000000000
|
pred | ............................................................
|
AA | GDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIA
|
conf | 000000000000000000000000000000000000000000000000000000000000
|
pred | ............................................................
|
AA | RDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAK
|
conf | 000000000000000000000000000000000000000000000000000000000000
|
pred | ............................................................
|
AA | ATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHSIITNLLYHVVGWTDW
|
conf | 000000000000000000000000000000000000000000000000000000022331
|
pred | ............................................................
|
AA | NLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQK
|
conf | 00000000000000000000000000000000000000000000000000000004
|
pred | ........................................................
|
AA | NDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ
|
Asterisks (*) represent disorder predictions and dots (.) prediction of order. The confidence estimates give a rough indication of the probability that each residue is disordered.
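To turn such a prediction line into residue ranges, consecutive asterisks can be grouped. The sketch below is a small helper written for this page, not part of DISOPRED; the function name `disordered_segments` and the toy input are assumptions. For the real output, the `pred` lines of all blocks would first be joined into one string.

```python
from itertools import groupby


def disordered_segments(pred, mark="*"):
    """Return 1-based (start, end) pairs for each run of `mark` in pred."""
    segments = []
    pos = 1
    for symbol, run in groupby(pred):
        length = len(list(run))
        if symbol == mark:
            segments.append((pos, pos + length - 1))
        pos += length
    return segments


# Toy prediction line: '*' = disordered, '.' = ordered.
print(disordered_segments("***.....**.."))  # [(1, 3), (9, 10)]
```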
Interestingly, the signal peptide is predicted as a disordered region with very high confidence. The second region predicted as disordered is part of a β-sheet, so the prediction may be wrong there.
POODLE
POODLE stands for Prediction Of Order and Disorder by machine LEarning. It was developed by S. Hirose, K. Shimizu, N. Inoue, S. Kanai and T. Noguchi and published in 2008 in the CASP8 Proceedings as "Disordered region prediction by integrating POODLE series".
It offers three kinds of prediction:
- prediction of short disordered regions (POODLE-S: regions missing in the X-ray structure or regions with a high B-factor)
- prediction of long disordered regions (POODLE-L: mainly longer than 40 consecutive amino acids)
- prediction of unfolded proteins
POODLE-I is based on a workflow approach. POODLE uses machine learning and needs only the amino acid sequence for prediction.<ref>http://mbs.cbrc.jp/poodle/help.html</ref>
POODLE-I
Predicted as disordered:
no. | AA | ORD/DIS | Prob. |
1 | M | D | 0.851 |
2 | E | D | 0.813 |
3 | F | D | 0.781 |
4 | S | D | 0.753 |
5 | S | D | 0.757 |
6 | P | D | 0.813 |
7 | S | D | 0.777 |
8 | R | D | 0.743 |
9 | E | D | 0.699 |
10 | E | D | 0.707 |
11 | C | D | 0.711 |
12 | P | D | 0.656 |
13 | K | D | 0.633 |
14 | P | D | 0.609 |
15 | L | D | 0.581 |
16 | S | D | 0.553 |
17 | R | D | 0.535 |
18 | V | D | 0.51 |
... | |||
95 | I | D | 0.58 |
96 | Q | D | 0.812 |
97 | A | D | 0.892 |
98 | N | D | 0.744 |
99 | H | D | 0.551 |
The first region predicted as disordered is part of the signal peptide. The second region predicted as disordered is part of a β-sheet.
POODLE-S
Predicted as disordered:
no. | AA | ORD/DIS/xray | Prob./xray | ORD/DIS/B-factor | Prob./B-factor |
1 | M | D | 0.764 | D | 1 |
2 | E | D | 0.707 | D | 0.901 |
3 | F | D | 0.731 | D | 0.882 |
4 | S | D | 0.759 | D | 0.875 |
5 | S | D | 0.764 | D | 0.847 |
6 | P | D | 0.75 | D | 0.825 |
7 | S | D | 0.782 | D | 0.811 |
8 | R | D | 0.751 | D | 0.791 |
9 | E | D | 0.707 | D | 0.711 |
10 | E | D | 0.692 | D | 0.661 |
11 | C | D | 0.679 | D | 0.568 |
12 | P | D | 0.66 | D | 0.596 |
13 | K | D | 0.633 | D | 0.557 |
14 | P | D | 0.607 | D | 0.517 |
15 | L | D | 0.588 | O | 0.493 |
16 | S | D | 0.55 | O | 0.489 |
17 | R | D | 0.523 | O | 0.424 |
18 | V | D | 0.534 | O | 0.391 |
... | |||||
100 | T | O | 0.368 | D | 0.571 |
101 | G | O | 0.29 | D | 0.54 |
102 | T | O | 0.243 | D | 0.555 |
... | |||||
360 | K | D | 0.527 | O | 0.133 |
361 | A | D | 0.534 | O | 0.138 |
362 | T | D | 0.577 | O | 0.162 |
363 | L | D | 0.594 | O | 0.162 |
364 | G | D | 0.513 | O | 0.163 |
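POODLE-S also predicts part of the signal peptide as disordered, both when it looks for regions that are missing in X-ray structures and when it looks for regions with a high B-factor. Further along the sequence the two criteria disagree: residues 100-102 are predicted as disordered only by the B-factor criterion and residues 360-364 only by the X-ray criterion. The first of these regions has no annotated secondary structure in UniProt, the second is part of a helix.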
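Where the two POODLE-S criteria agree can be read off the table programmatically. The parser below is a sketch that assumes the pipe-separated layout shown above; the function name `parse_poodle_s` and the dictionary keys are made up for this example, and the three rows are copied from the table.

```python
def parse_poodle_s(lines):
    """Parse table rows of the form 'no | AA | call | prob | call | prob |'."""
    rows = []
    for line in lines:
        fields = [f.strip() for f in line.strip().strip("|").split("|")]
        if len(fields) != 6 or not fields[0].isdigit():
            continue  # skip the header and the '...' separator rows
        no, aa, xray, p_xray, bfac, p_bfac = fields
        rows.append({
            "no": int(no), "aa": aa,
            "xray": xray, "p_xray": float(p_xray),
            "bfactor": bfac, "p_bfactor": float(p_bfac),
        })
    return rows


table = [
    "no. | AA | ORD/DIS/xray | Prob./xray | ORD/DIS/B-factor | Prob./B-factor |",
    "14 | P | D | 0.607 | D | 0.517 |",
    "15 | L | D | 0.588 | O | 0.493 |",
    "... | |||||",
    "100 | T | O | 0.368 | D | 0.571 |",
]
rows = parse_poodle_s(table)
both = [r["no"] for r in rows if r["xray"] == "D" and r["bfactor"] == "D"]
print(both)  # [14]
```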
POODLE-L
POODLE-L predicted no disordered regions, so the protein probably contains no disordered region longer than 40 amino acids.
IUPRED
Prediction of transmembrane alpha-helices and signal peptides
General
Transmembrane topology
Signal peptides
Combined transmembrane and signal peptide prediction
The high similarity between the hydrophobic region of a transmembrane helix and that of a signal peptide leads to cross-predictions when conventional transmembrane topology and signal peptide predictors such as TMHMM and SignalP are used. Predictors that are based on submodels for both make fewer cross-prediction errors and help to discriminate against false positives. Furthermore, a predicted signal peptide indicates that the N-terminus of the protein is non-cytoplasmic, which helps to assign the orientation of the protein. <ref>Käll L, Krogh A, & Sonnhammer EL (2007) Advantages of combined transmembrane topology and signal peptide prediction - the Phobius web server. Nucleic Acids Res., Vol. 35, Web server issue, pp. W429-32</ref>
TMHMM
TMHMM is a method to predict the transmembrane topology of membrane-spanning proteins. It is based on a hidden Markov model with an architecture of 7 types of states (helix core, helix caps on both sides, one loop on the cytoplasmic side, two loops on the non-cytoplasmic side and a globular domain in the middle of each loop), which correspond to the biological system. The method was established by Sonnhammer et al. in 1998 <ref>Sonnhammer EL, von Heijne G, Krogh A. A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol. 1998;6:175–182.</ref>.
Results
Output of TMHMM:
# sp|P04062|GLCM_HUMAN Length: 536
- sp|P04062|GLCM_HUMAN Number of predicted TMHs: 0
- sp|P04062|GLCM_HUMAN Exp number of AAs in TMHs: 1.77867
- sp|P04062|GLCM_HUMAN Exp number, first 60 AAs: 1.62607
- sp|P04062|GLCM_HUMAN Total prob of N-in: 0.06840
sp|P04062|GLCM_HUMAN TMHMM2.0 outside 1 536
TMHMM predicts no transmembrane segments for the glucocerebrosidase sequence, which is correct. Furthermore, it places the whole protein (residues 1-536) outside, that is on the non-cytosolic side, which is also correct.
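The TMHMM result can also be read by a script, for example when many sequences are checked. The parser below is a sketch that assumes the layout shown above, with summary lines starting with "#" or "-" and one topology line per region; the function name `parse_tmhmm` is made up for this page and is not part of TMHMM.

```python
def parse_tmhmm(text):
    """Split TMHMM output into summary values and (location, start, end) regions."""
    summary, regions = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line[0] in "#-":
            # Summary line: "<marker> <id> <description>: <value>"
            body = line[1:].strip()
            _, _, rest = body.partition(" ")
            key, _, value = rest.rpartition(":")
            summary[key.strip()] = float(value)
        else:
            # Topology line: "<id> <source> <location> <start> <end>"
            _seq_id, _source, location, start, end = line.split()
            regions.append((location, int(start), int(end)))
    return summary, regions


output = """\
# sp|P04062|GLCM_HUMAN Length: 536
- sp|P04062|GLCM_HUMAN Number of predicted TMHs: 0
- sp|P04062|GLCM_HUMAN Exp number of AAs in TMHs: 1.77867
- sp|P04062|GLCM_HUMAN Exp number, first 60 AAs: 1.62607
- sp|P04062|GLCM_HUMAN Total prob of N-in: 0.06840
sp|P04062|GLCM_HUMAN TMHMM2.0 outside 1 536
"""
summary, regions = parse_tmhmm(output)
print(summary["Number of predicted TMHs"])  # 0.0
print(regions)                              # [('outside', 1, 536)]
```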
Phobius and PolyPhobius
Phobius
PolyPhobius
OCTOPUS and SPOCTOPUS
OCTOPUS
SPOCTOPUS
SignalP
TargetP
Prediction of GO terms
General
GOPET
Pfam
ProtFun 2.2
References
<references />