Difference between revisions of "Sequence-based predictions"
(→POODLE-S) |
(→TODO) |
||
(306 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
+ | <sup>by [[User:Greil|Robert Greil]] and [[User:Landerer|Cedric Landerer]]</sup> |
||
− | =Sequence-based predictions= |
||
+ | =Secondary structure prediction= |
||
− | |||
+ | For the secondary structure prediction, we used a reference sequence obtained from UniProt. We also used the annotations of the UniProt sequence as a reference to compare the predicted secondary structures. |
||
− | ==1. Secondary structure prediction== |
||
===PSIPRED=== |
===PSIPRED=== |
||
− | [[File:Hfe psipred res coml.png|200px|thumb|right|Secondary Structure predicted by PSIPRED]] |
+ | [[File:Hfe psipred res coml.png|200px|thumb|right|Figure 1: Secondary Structure predicted by PSIPRED<br>Source: http://bioinf.cs.ucl.ac.uk/psipred/]] |
+ | |||
+ | PSIPRED uses the PSI-BLAST output as input for a feed-forward neural network with a single hidden layer, trained by back-propagation, to predict the secondary structure. |
||
+ | <br><br> |
||
+ | '''Results'''<br> |
||
+ | PSIPRED predicts an alpha/beta structure. The transmembrane region is predicted as a beta region. A graphical representation of the result is shown in Figure 1. |
||
+ | |||
PSIPRED HFORMAT (PSIPRED V3.0) |
PSIPRED HFORMAT (PSIPRED V3.0) |
||
− | Conf: 999851589999999877513567886245556456636899750389988756755687 |
+ | Conf: 999851589999999877513567886245556456636899750389988756755687 |
− | Pred: CCCCCHHHHHHHHHHHHHHHCCCCCCCEEEEEEEEEEECCCCCCCEEEEEEEECCEEEEE |
+ | Pred: CCCCCHHHHHHHHHHHHHHHCCCCCCCEEEEEEEEEEECCCCCCCEEEEEEEECCEEEEE |
− | AA: MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
+ | AA: MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
− | 10 20 30 40 50 60 |
+ | 10 20 30 40 50 60 |
− | Conf: 318998225536664688990669998865311211002358577441156788603899 |
+ | Conf: 318998225536664688990669998865311211002358577441156788603899 |
− | Pred: ECCCCCCEEECCCCCCCCCCHHHHHHHHHHHHCCCCCHHHHHHHHHHHCCCCCCCCEEEE |
+ | Pred: ECCCCCCEEECCCCCCCCCCHHHHHHHHHHHHCCCCCHHHHHHHHHHHCCCCCCCCEEEE |
− | AA: YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV |
+ | AA: YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV |
− | 70 80 90 100 110 120 |
+ | 70 80 90 100 110 120 |
− | Conf: 987799319835459889765910588728988756689786135787788899999876 |
+ | Conf: 987799319835459889765910588728988756689786135787788899999876 |
− | Pred: EEEEEEECCCEEEEEEEEEECCCEEEEECCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHH |
+ | Pred: EEEEEEECCCEEEEEEEEEECCCEEEEECCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHH |
− | AA: ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR |
+ | AA: ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR |
− | 130 140 150 160 170 180 |
+ | 130 140 150 160 170 180 |
− | Conf: 310271499889888616322000378810000468999601699981450765189996 |
+ | Conf: 310271499889888616322000378810000468999601699981450765189996 |
− | Pred: HHHCCCHHHHHHHHHHCCCCCCCCCCCCCEEEECCCCCCCEEEEEEEEEECCCCEEEEEE |
+ | Pred: HHHCCCHHHHHHHHHHCCCCCCCCCCCCCEEEECCCCCCCEEEEEEEEEECCCCEEEEEE |
− | AA: AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL |
+ | AA: AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL |
− | 190 200 210 220 230 240 |
+ | 190 200 210 220 230 240 |
− | Conf: 288106667520025355899875899999965999872169986699998826885259 |
+ | Conf: 288106667520025355899875899999965999872169986699998826885259 |
− | Pred: ECCEECCCCCCCCCCCEECCCCCEEEEEEEEECCCCCCCEEEEEECCCCCCCEEEEEECC |
+ | Pred: ECCEECCCCCCCCCCCEECCCCCEEEEEEEEECCCCCCCEEEEEECCCCCCCEEEEEECC |
− | AA: KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS |
+ | AA: KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS |
− | 250 260 270 280 290 300 |
+ | 250 260 270 280 290 300 |
− | Conf: 999711124320001367777622367764115889887620212359 |
+ | Conf: 999711124320001367777622367764115889887620212359 |
− | Pred: CCCCCEEEEEEEEEEEEEEEEEEEEEEEEEECCCCCCCCCCEEECCCC |
+ | Pred: CCCCCEEEEEEEEEEEEEEEEEEEEEEEEEECCCCCCCCCCEEECCCC |
− | AA: PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE |
+ | AA: PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE |
− | 310 320 330 340 |
+ | 310 320 330 340 |
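The horizontal PSIPRED report above can be turned into a list of secondary structure segments with a few lines of Python. This is a minimal sketch, not part of the original analysis: the function names `parse_psipred_horiz` and `segments` are hypothetical, and only the last block of the report (residues 301-348) is used as input.

```python
import re
from itertools import groupby


def parse_psipred_horiz(text):
    """Concatenate the Conf/Pred/AA rows of a PSIPRED horizontal report."""
    rows = {"Conf": [], "Pred": [], "AA": []}
    for line in text.splitlines():
        match = re.match(r"\s*(Conf|Pred|AA):\s*(\S+)", line)
        if match:
            rows[match.group(1)].append(match.group(2))
    return {key: "".join(parts) for key, parts in rows.items()}


def segments(pred, offset=0):
    """Return (state, start, end) runs with 1-based inclusive positions."""
    result, pos = [], 1 + offset
    for state, run in groupby(pred):
        length = len(list(run))
        result.append((state, pos, pos + length - 1))
        pos += length
    return result


# Last block of the report shown above (residues 301-348).
report = """
Conf: 999711124320001367777622367764115889887620212359
Pred: CCCCCEEEEEEEEEEEEEEEEEEEEEEEEEECCCCCCCCCCEEECCCC
AA: PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
"""

parsed = parse_psipred_horiz(report)
for state, start, end in segments(parsed["Pred"], offset=300):
    print(state, start, end)
```

Feeding the complete report instead of the excerpt (with `offset=0`) yields the segment boundaries for the whole sequence.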
===Jpred3=== |
===Jpred3=== |
||
+ | Jpred uses the Jnet algorithm, which provides "a three-state (α-helix, β-strand and coil) prediction of secondary structure at an accuracy of 81.5%" <ref>http://nar.oxfordjournals.org/content/36/suppl_2/W197.abstract</ref>. |
||
+ | <br><br> |
||
+ | '''Results'''<br> |
||
+ | In its first BLAST search, Jpred found many homologous hits with E-values ranging from e-163 to 4e-44. Some self-hits are included. We continued to the prediction, which is: |
||
+ | |||
Seq: MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDD |
Seq: MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDD |
||
SS: ------HHHHHHHHHHHHH---------EEEEEEEEE-------EEEEEEEEE-- |
SS: ------HHHHHHHHHHHHH---------EEEEEEEEE-------EEEEEEEEE-- |
||
Line 44: | Line 55: | ||
Seq: TTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPG |
Seq: TTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPG |
||
− | SS:-EEEEEEE------EEEEEEE----------EE----------EEEEEEEEE--- |
+ | SS: -EEEEEEE------EEEEEEE----------EE----------EEEEEEEEE--- |
Seq: EEQRYTCQVEHPGLDQPLIVIWEPSPSGTLVIGVISGIAVFVVILFIGILFIILR |
Seq: EEQRYTCQVEHPGLDQPLIVIWEPSPSGTLVIGVISGIAVFVVILFIGILFIILR |
||
Line 52: | Line 63: | ||
===Comparison with DSSP=== |
===Comparison with DSSP=== |
||
+ | DSSP was designed by Wolfgang Kabsch and Chris Sander to provide a standard for secondary structure assignment. DSSP assigns the secondary structure from the atomic coordinates of PDB structures, mainly by detecting hydrogen-bonding patterns. |
||
− | ==2. Prediction of disordered regions== |
||
+ | <br><br> |
||
− | ===DISOPRED=== |
||
+ | '''Results'''<br> |
||
+ | Because the PDB sequence is not complete, the DSSP assignment is also incomplete. The interesting parts, the signal peptide and the cytoplasmic part, which are predicted as disordered, are not covered by DSSP. PSIPRED and JPred predicted the transmembrane region well but assigned the N- and C-terminus, which are predicted as disordered, as helical or beta-sheet regions. UniProt also assigns no structure to these regions. Therefore, these regions may be unstructured and not yet recognized as disordered regions. |
||
+ | |||
+ | UniProt: ---------------------------E<font color="#008000">EEEEEEEEE</font>E----EEE--<font color="#008000">EEEEEE</font>--<font color="#008000">EEEEE</font> |
||
+ | DSSP: --E<font color="#008000">EEEEEEEEE</font>B-SS-SSB--<font color="#008000">EEEEEE</font>TT<font color="#008000">EEEEE</font> |
||
+ | PSIPRED: CCCCCHHHHHHHHHHHHHHHCCCCCCCE<font color="#008000">EEEEEEEEE</font>ECCCCCCCEE<font color="#008000">EEEEEE</font>CC<font color="#008000">EEEEE</font> |
||
+ | JPred: ------HHHHHHHHHHHHH---------<font color="#008000">EEEEEEEEE</font>-------EEE<font color="#008000">EEEEEE</font>--<font color="#008000">EEEEE</font> |
||
+ | AA: MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
||
+ | DSSPSeq: RSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
||
+ | 10 20 30 40 50 60 |
||
+ | UniProt: EEEEE--<font color="#008000">EEE</font>--------TTTH<font color="#008000">HHHHHHHHHH</font>HHHHH<font color="#008000">HHHHHHHHH</font>HTTT-EEE--<font color="#008000">EEEE</font> |
||
+ | DSSP: EESSS--<font color="#008000">EEE</font>-STTS-SSTTTT<font color="#008000">HHHHHHHHHH</font>HHHHH<font color="#008000">HHHHHHHHH</font>HTTT-SSS--<font color="#008000">EEEE</font> |
||
+ | PSIPRED: ECCCCCC<font color="#008000">EEE</font>CCCCCCCCCCHH<font color="#008000">HHHHHHHHHH</font>CCCCC<font color="#008000">HHHHHHHHH</font>HHCCCCCCCC<font color="#008000">EEEE</font> |
||
+ | JPred: E-----E<font color="#008000">EEE</font>----------HH<font color="#008000">HHHHHHHHHH</font>HHHHH<font color="#008000">HHHHHHHHH</font>---------E<font color="#008000">EEEE</font> |
||
+ | AA: YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV |
||
+ | DSSPSEQ: YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV |
||
+ | 70 80 90 100 110 120 |
||
+ | UniProt: <font color="#008000">EEEEE</font>E-----<font color="#008000">EEEEEEE</font>EE--E<font color="#008000">EEEEE</font>EHHH-EEEEEE---H<font color="#008000">HHHHHHH</font>---<font color="#008000">HHHHHHH</font> |
||
+ | DSSP: <font color="#008000">EEEEE</font>E-TTS-<font color="#008000">EEEEEEE</font>EETTE<font color="#008000">EEEEE</font>EGGGTEEEESSGGGH<font color="#008000">HHHHHHH</font>SST<font color="#008000">HHHHHHH</font> |
||
+ | PSIPRED: <font color="#008000">EEEEE</font>EECCCE<font color="#008000">EEEEEEE</font>EECCC<font color="#008000">EEEEE</font>CCCCCCCCCCCCCCH<font color="#008000">HHHHHHH</font>HHH<font color="#008000">HHHHHHH</font> |
||
+ | JPred: <font color="#008000">EEEEE</font>------<font color="#008000">EEEEEEE</font>-----<font color="#008000">EEEEE</font>E----EEE-------<font color="#008000">HHHHHHH</font>--H<font color="#008000">HHHHHHH</font> |
||
+ | AA: ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR |
||
+ | DSSPSEQ: ILGaEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR |
||
+ | 130 140 150 160 170 180 |
||
+ | UniProt: <font color="#008000">HH</font>HH-HHH<font color="#008000">HHHHHHHH</font>HHTTT-------E<font color="#008000">EEE</font>EEEE----E<font color="#008000">EEEEEEE</font>EEEEE--<font color="#008000">EEEEE</font> |
||
+ | DSSP: <font color="#008000">HH</font>HHTHHH<font color="#008000">HHHHHHHH</font>HHTTTSS--B--E<font color="#008000">EEE</font>EEEE-SS-E<font color="#008000">EEEEEEE</font>EEBSS--<font color="#008000">EEEEE</font> |
||
+ | PSIPRED: <font color="#008000">HH</font>HCCCHH<font color="#008000">HHHHHHHH</font>CCCCCCCCCCCCC<font color="#008000">EEE</font>ECCCCCCCE<font color="#008000">EEEEEEE</font>EECCCCE<font color="#008000">EEEEE</font> |
||
+ | JPred: <font color="#008000">HH</font>------<font color="#008000">HHHHHHHH</font>HH-H-------EE<font color="#008000">EEE</font>---------<font color="#008000">EEEEEEE</font>------E<font color="#008000">EEEEE</font> |
||
+ | AA: AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL |
||
+ | DSSPSEQ: AYLERDaPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRbRALNYYPQNITMKWL |
||
+ | 190 200 210 220 230 240 |
||
+ | UniProt: <font color="#008000">E</font>------HHH----EEEE-----<font color="#008000">EEEEEEEEE</font>---HHHH<font color="#008000">EEEEEE</font>---EEE-<font color="#008000">EEE</font>E---- |
||
+ | DSSP: <font color="#008000">E</font>TTEE--GGGS---EEEE-TTS-<font color="#008000">EEEEEEEEE</font>-TTGGGG<font color="#008000">EEEEEE</font>-TTSSS-<font color="#008000">EEE</font>E- |
||
+ | PSIPRED: <font color="#008000">E</font>CCEECCCCCCCCCCCEECCCCC<font color="#008000">EEEEEEEEE</font>CCCCCCC<font color="#008000">EEEEEE</font>CCCCCCC<font color="#008000">EEE</font>EEECC |
||
+ | JPred: <font color="#008000">E</font>----------EE----------<font color="#008000">EEEEEEEEE</font>------E<font color="#008000">EEEEEE</font>E------<font color="#008000">EEE</font>EE--- |
||
+ | AA: KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS |
||
+ | DSSPSEQ: KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTbQVEHPGLDQPLIVIW |
||
+ | 250 260 270 280 290 300 |
||
+ | UniProt: ------------------------------------------------ |
||
+ | DSSP: |
||
+ | PSIPRED: CCCCCEEEEEEEEEEEEEEEEEEEEEEEEEECCCCCCCCCCEEECCCC |
||
+ | JPred: ------HHHHHHHHHHHHHHHHHHHHHHHHHH---------------- |
||
+ | AA: PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE |
||
+ | DSSPSEQ: |
||
+ | 310 320 330 340 |
||
+ | |||
+ | We see a good overlap between the results of the different methods, except for the region at the C-terminus. Here, the PDB file does not provide a sequence, and UniProt does not assign a secondary structure to this region. Interestingly, PSIPRED predicts a beta-sheet in this region while JPred predicts a helix. As we will show later, this region is proposed to be disordered by most prediction tools. This could be the reason for the different assignments. |
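The visual comparison can be complemented by a per-residue agreement score. The sketch below reduces the eight DSSP states to three (H, E, C) and computes the fraction of matching positions, often called Q3. The reduction table is one common convention and an assumption here, as are the function names. The example strings are residues 61-120 transcribed from the alignment above; a blank in the DSSP row marks a residue without coordinates and is skipped.

```python
# Common 8-to-3 state reduction; other conventions exist, so treat it as an assumption.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp):
    """Map DSSP states to H/E/C; a blank marks a residue without coordinates."""
    return "".join(" " if s == " " else DSSP_TO_3.get(s, "C") for s in dssp)


def q3(reference, predicted):
    """Fraction of covered positions where both three-state labels agree."""
    pairs = [(r, p) for r, p in zip(reference, predicted) if r != " "]
    if not pairs:
        return 0.0
    return sum(r == p for r, p in pairs) / len(pairs)


# Residues 61-120 from the alignment above (colour markup removed).
dssp_61_120 = "EESSS--EEE-STTS-SSTTTT" + "H" * 25 + "TTT-SSS--" + "EEEE"
psipred_61_120 = (
    "ECCCCCCEEECCCCCCCCCC" + "H" * 12 + "CCCCC" + "H" * 11 + "CCCCCCCC" + "EEEE"
)

score = q3(reduce_dssp(dssp_61_120), psipred_61_120)
print(f"Q3 over residues 61-120: {score:.2f}")
```

For a JPred row, the `-` characters would first have to be replaced by `C` (for example with `jpred.replace("-", "C")`) so that both strings use the same alphabet.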
||
+ | |||
+ | =Prediction of disordered regions= |
||
+ | The HFE protein is not yet known to be disordered. It is not contained in the DisProt<ref>http://www.disprot.org/</ref> database. |
||
+ | The disorder predictors find several disordered regions in the protein, but most of them lie within secondary structure elements. Only the predicted disordered regions at the N- and C-terminus might really be unstructured but not yet experimentally recognized, because these regions have no structural assignment. Based on this, we cannot tell which tool works best for our case. |
||
+ | <br><br> |
||
+ | The predictions are shown below. |
||
+ | |||
+ | ==DISOPRED== |
||
+ | |||
+ | |||
+ | For the prediction, we used the DISOPRED-Server at http://bioinf.cs.ucl.ac.uk/disopred/ <br> |
||
+ | DISOPRED is a prediction tool for disordered regions based on a linear SVM. The SVM is trained with 750 non-redundant sequences with high resolution X-ray structures. "Disorder was identified with those residues that appear in the sequence records but with coordinates missing from the electron density map." <ref>http://bioinf.cs.ucl.ac.uk/index.php?id=806</ref> For each protein, a sequence profile was generated by using PSI-BLAST search against a filtered database. The PSI-BLAST profiles were used as input vectors for the SVM. |
||
+ | <br><br> |
||
+ | '''Result'''<br> |
||
+ | DISOPRED predicts two disordered residues in the signal peptide and a disordered region at the end of the sequence, which is located inside the cell. |
||
+ | [[File:Disopred_profile_hfe.png|200px|thumb|right|Figure 2: DISOPRED prediction profile for the HFE protein<br>Source: http://bioinf.cs.ucl.ac.uk/disopred/]] |
||
AA:Target sequence |
AA:Target sequence |
||
Pred:Residue disorder prediction(.)= ordered residue(*)=Disordered residue |
Pred:Residue disorder prediction(.)= ordered residue(*)=Disordered residue |
||
Line 83: | Line 157: | ||
DISOPRED predictions for a false positive rate threshold of: 2% |
DISOPRED predictions for a false positive rate threshold of: 2% |
||
− | + | ==POODLE== |
|
POODLE stands for ''Prediction Of Order and Disorder by machine LEarning''. |
POODLE stands for ''Prediction Of Order and Disorder by machine LEarning''. |
||
+ | |||
+ | POODLE provides three different predictions: |
||
+ | * POODLE-S: prediction of short disordered regions |
||
+ | * POODLE-L: prediction of long disordered regions (longer than 40 residues) |
||
+ | * prediction of unfolded proteins |
||
+ | <br> |
||
+ | All POODLE variants predict a disordered region at the end of the protein, which contains a transmembrane region (pos: 307-330). This is evidence for a disordered region at the C-terminus. All variants also predict a short disordered region at the beginning of the sequence, which is part of the signal peptide (pos: 1-22). The score distribution over the sequence is shown in Figures 3 to 6. The threshold to distinguish between an ordered and a disordered region is 0.5. |
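How a per-residue score profile becomes a list of disordered regions can be sketched as follows. This is an illustration only: the function name is hypothetical, the scores are made-up values rather than POODLE output, and treating a score exactly at the threshold as disordered is an assumption.

```python
def disordered_regions(scores, threshold=0.5, min_length=1):
    """Return 1-based inclusive (start, end) runs with score >= threshold."""
    regions, start = [], None
    for index, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = index  # a new disordered run begins here
        elif start is not None:
            if index - start >= min_length:
                regions.append((start, index - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


toy_scores = [0.7, 0.6, 0.2, 0.1, 0.55, 0.9, 0.8]  # made-up values, not POODLE output
print(disordered_regions(toy_scores))
print(disordered_regions(toy_scores, min_length=3))
```

The `min_length` parameter mimics the distinction between short and long disorder predictions; the cut-off of 40 residues mentioned for POODLE-L would correspond to a larger `min_length`.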
||
+ | |||
+ | ===POODLE-I=== |
||
+ | POODLE-I (series only) predicted 4 disordered regions within the protein sequence. |
||
+ | |||
+ | [[File:Poodle_I.PNG|200px|thumb|right|Figure 3: Distribution of disordered regions over the amino acid sequence predicted by POODLE-I<br>Source: http://mbs.cbrc.jp/poodle/poodle.html]] |
||
+ | |||
+ | MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
||
+ | **************---------------------------------------------- |
||
+ | YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV |
||
+ | -------**********---******------*--------------------------- |
||
+ | ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR |
||
+ | ------------------------------------------------------------ |
||
+ | AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL |
||
+ | ---------------------***************------------------------ |
||
+ | KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS |
||
+ | ----*********----------------------------------------******* |
||
+ | PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE |
||
+ | ************************************************ |
||
===POODLE-S=== |
===POODLE-S=== |
||
− | POODLE-S (using missing residues) |
+ | POODLE-S (using missing residues) predicts 6 short disordered regions within the protein sequence. |
− | [[File:Poodle_S.PNG|200px|thumb|right|Distribution of |
+ | [[File:Poodle_S.PNG|200px|thumb|right|Figure 4: Distribution of disordered regions over the amino acid sequence predicted by POODLE-S (missing residues)<br>Source: http://mbs.cbrc.jp/poodle/poodle.html]] |
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
||
Line 104: | Line 203: | ||
*--------------------------------********------- |
*--------------------------------********------- |
||
− | POODLE-S (using High B-Factor residues) |
+ | POODLE-S (using High B-Factor residues) predicts 2 short disordered regions within the protein sequence. |
− | [[File:Poodle_S_B.PNG|200px|thumb|right|Distribution of |
+ | [[File:Poodle_S_B.PNG|200px|thumb|right|Figure 5: Distribution of disordered regions over the amino acid sequence predicted by POODLE-S (high B-factor residues)<br>Source: http://mbs.cbrc.jp/poodle/poodle.html]] |
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
||
Line 122: | Line 221: | ||
===POODLE-L=== |
===POODLE-L=== |
||
+ | POODLE-L predicts a disordered region from 296 to the end. |
||
− | [[File:Poodle_L.PNG|200px|thumb|right|Distribution of disordert region over the AS-Sequence predicted by POODLE-L]] |
||
+ | [[File:Poodle_L.PNG|200px|thumb|right|Figure 6: Distribution of disordered regions over the amino acid sequence predicted by POODLE-L<br>Source: http://mbs.cbrc.jp/poodle/poodle.html]] |
||
− | POODLE-L predicted a disorderd region from 296 to the end. |
||
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
||
Line 139: | Line 238: | ||
************************************************ |
************************************************ |
||
+ | ==IUPRED== |
||
− | ==3. Prediction of transmembrane alpha-helices and signal peptides== |
||
+ | IUPRED uses estimated pairwise interaction energies to recognize unstructured regions within protein sequences. It assumes that globular proteins have an amino acid composition that gives them the potential to form a large number of favorable interactions, whereas disordered regions lack this potential. The score distribution over the sequence is shown in Figures 7 to 9. IUPRED uses 0.5 as the threshold to distinguish between a disordered and an ordered region. |
||
− | ===TMHMM=== |
||
+ | <br><br> |
||
− | ===Phobius and PolyPhobius=== |
||
+ | '''Results'''<br> |
||
− | ===OCTOPUS and SPOCTOPUS=== |
||
+ | The short-disorder prediction finds 5 short regions. There are also disordered residues at the beginning of the sequence, in the signal peptide. |
||
− | ===SignalP=== |
||
+ | [[File:Uipred short.PNG|200px|thumb|right|Figure 7: IUPRED prediction of short regions<br>Source: http://iupred.enzim.hu/]] |
||
− | ===TargetP=== |
||
+ | [[File:Uipred_long.PNG|200px|thumb|right|Figure 8: IUPRED prediction of long regions<br>Source: http://iupred.enzim.hu/]] |
||
− | ==4. Prediction of GO terms== |
||
+ | [[File:Uipred_structured.PNG|200px|thumb|right|Figure 9: IUPRED prediction of structured regions<br>Source: http://iupred.enzim.hu/]] |
||
− | ===Generel=== |
||
+ | |||
− | HFE is annotated with 27 different GO Terms which are <ref>http://www.ebi.ac.uk/QuickGO/GProtein?ac=Q30201</ref>: |
||
+ | MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
||
+ | ***--------------------------------------------------------- |
||
+ | YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV |
||
+ | ------------------------------------------------------------ |
||
+ | ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR |
||
+ | ------------------------------------------------------------ |
||
+ | AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL |
||
+ | ------------------------------------------------------------ |
||
+ | KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS |
||
+ | ---------********----------***--------*-****---------------- |
||
+ | PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE |
||
+ | ---------------------------------------------*** |
||
+ | |||
+ | |||
+ | The long-disorder prediction finds 7 disordered residues, but only one short region. |
||
+ | |||
+ | MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF |
||
+ | ------------------------------------------------------------ |
||
+ | YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV |
||
+ | ------------------------------------------------------------ |
||
+ | ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR |
||
+ | ------------------------------------------------------------ |
||
+ | AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL |
||
+ | ------------------------------------------------------------ |
||
+ | KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS |
||
+ | ---------******-------------------------*------------------- |
||
+ | PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE |
||
+ | ------------------------------------------------ |
||
+ | |||
+ | |||
+ | <br> |
||
+ | <br> |
||
+ | The prediction of structured regions finds one globular domain from 1-348 (Figure 9). This means that the whole protein is predicted to be structured. This contradicts the POODLE prediction but, given the weak evidence from the other IUPRED methods, it does not really contradict the other IUPRED results. |
||
+ | |||
+ | ==META-Disorder== |
||
+ | |||
+ | For this task, we used the PredictProtein Server at [https://www.predictprotein.org https://www.predictprotein.org]. META-Disorder, published in 2009 by Avner Schlessinger, Marco Punta, Guy Yachdav, Laszlo Kajan and Burkhard Rost, uses a combined prediction of NORSnet, PROFbval and Ucon. |
||
+ | |||
+ | '''predicted secondary structure composition''' |
||
+ | {| border="1" style="text-align:center; border-spacing:0;" |
||
+ | !sec str type |
||
+ | !H |
||
+ | !E |
||
+ | !L |
||
+ | |- |
||
+ | |% in protein || 27.30 || 28.74 || 43.97 |
||
+ | |- |
||
+ | |} |
||
+ | <br> |
||
+ | '''Prediction''' of disordered residues by META-Disorder (last column) |
||
+ | Number Residue NORSnet NORS2st PROFbval bval2st Ucon Ucon2st MD_raw MD_rel MD2st |
||
+ | |||
+ | .... |
||
+ | 242 D 0.13 - 0.70 D 0.76 D 0.444 2 - |
||
+ | 243 K 0.13 - 0.69 D 0.76 D 0.480 1 - |
||
+ | 244 Q 0.13 - 0.66 D 0.93 D 0.531 0 D |
||
+ | 245 P 0.17 - 0.73 D 0.92 D 0.520 0 D |
||
+ | 246 M 0.28 - 0.65 D 0.90 D 0.525 0 D |
||
+ | 247 D 0.31 - 0.68 D 0.87 D 0.515 0 - |
||
+ | 248 A 0.36 - 0.70 D 0.87 D 0.520 0 D |
||
+ | 249 K 0.40 - 0.69 D 0.89 D 0.485 1 - |
||
+ | .... |
||
+ | 344 L 0.37 - 0.59 D 0.18 - 0.515 0 - |
||
+ | 345 A 0.35 - 0.74 D 0.17 - 0.515 0 - |
||
+ | 346 E 0.31 - 0.89 D 0.17 - 0.520 0 D |
||
+ | 347 R 0.35 - 0.91 D 0.23 - 0.525 0 D |
||
+ | 348 E 0.34 - 0.92 D 0.38 - 0.520 0 D |
||
+ | |||
+ | Key for output |
||
+ | ---------------- |
||
+ | Number - residue number |
||
+ | Residue - amino-acid type |
||
+ | NORSnet - raw score by NORSnet (prediction of unstructured loops) |
||
+ | NORS2st - two-state prediction by NORSnet; D=disordered |
||
+ | PROFbval - raw score by PROFbval (prediction of residue flexibility from sequence) |
||
+ | Bval2st - two-state prediction by PROFbval |
||
+ | Ucon - raw score by Ucon (prediction of protein disorder using predicted internal contacts) |
||
+ | Ucon2st - two-state prediction by Ucon |
||
+ | MD - raw score by MD (prediction of protein disorder using orthogonal sources) |
||
+ | MD_rel - reliability of the prediction by MD; values range from 0-9. 9=strong prediction |
||
+ | MD2st - two-state prediction by MD |
||
+ | |||
+ | META-Disorder predicts a very short disordered region of 3 residues at the end of the protein, but with weak evidence (raw scores of around 0.5). Therefore, judging by this method alone, a disordered region at the C-terminus is quite unlikely. |
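The table rows above can be read programmatically to list the residues that META-Disorder calls disordered. The following is a minimal sketch that assumes whitespace-separated rows with the eleven columns described in the key; the function name `parse_md_rows` is hypothetical, and the excerpt contains only the last five rows shown above.

```python
COLUMNS = [
    "Number", "Residue", "NORSnet", "NORS2st", "PROFbval", "bval2st",
    "Ucon", "Ucon2st", "MD_raw", "MD_rel", "MD2st",
]


def parse_md_rows(text):
    """Parse META-Disorder rows; header and filler lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue
        row = dict(zip(COLUMNS, fields))
        row["Number"] = int(row["Number"])
        for key in ("NORSnet", "PROFbval", "Ucon", "MD_raw"):
            row[key] = float(row[key])
        row["MD_rel"] = int(row["MD_rel"])
        rows.append(row)
    return rows


# Last five rows of the output shown above.
excerpt = """
Number Residue NORSnet NORS2st PROFbval bval2st Ucon Ucon2st MD_raw MD_rel MD2st
344 L 0.37 - 0.59 D 0.18 - 0.515 0 -
345 A 0.35 - 0.74 D 0.17 - 0.515 0 -
346 E 0.31 - 0.89 D 0.17 - 0.520 0 D
347 R 0.35 - 0.91 D 0.23 - 0.525 0 D
348 E 0.34 - 0.92 D 0.38 - 0.520 0 D
"""

rows = parse_md_rows(excerpt)
disordered = [row["Number"] for row in rows if row["MD2st"] == "D"]
print(disordered)
print(max(row["MD_rel"] for row in rows))
```

Printing the reliability next to the two-state call makes the weak support for the C-terminal region visible at a glance.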
||
+ | |||
+ | ==Discussion== |
||
+ | Most tools predict a disordered region at the C-terminus of the protein. The predicted region is not part of the PDB structure and is mostly a transmembrane region. The major difficulty in rating this prediction as correct is that the annotation for this region in particular is incomplete. As this is a transmembrane region, it is not surprising that this part is not annotated. A disordered region only forms a structure when it interacts with a binding partner (such as another protein, DNA or a small molecule), so it is possible that the region is disordered and forms a structure when it is inserted into the membrane. It is also possible that a defined structure exists but is not yet known, because membrane regions are hard to solve. What makes us doubt the predictions is that the disordered regions predicted within the protein are mostly loops. Without further evidence, and because only one tool predicts a larger disordered region at the C-terminus, we assume that the protein is not disordered. |
||
+ | |||
+ | =Prediction of transmembrane alpha-helices and signal peptides= |
||
+ | ===General=== |
||
+ | We were given five additional proteins for which to predict transmembrane regions, signal peptides and GO terms. This was done because most of the proteins in the practical are not membrane proteins and would therefore produce only "no membrane" results. Thus the three membrane proteins [[http://www.uniprot.org/uniprot/P02945.fasta BACR_HALSA]], [[http://www.uniprot.org/uniprot/P11279.fasta LAMP1_HUMAN]] and [[http://www.uniprot.org/uniprot/P05067.fasta A4_HUMAN]] were provided; our HFE protein [[http://www.uniprot.org/uniprot/Q30201.fasta HFE_HUMAN]] is also a membrane protein. |
||
+ | |||
+ | The following table gives a quick overview of the protein properties: |
||
+ | {| border="1" style="text-align:center;" |
||
+ | |- |
||
+ | |Accession |
||
+ | |Entry name |
||
+ | |Organism |
||
+ | |Subcellular location |
||
+ | |Signal peptide |
||
+ | |- |
||
+ | |Q30201 |
||
+ | |HFE_HUMAN |
||
+ | |Homo sapiens (Human) |
||
+ | |Membrane; Single-pass type I membrane protein |
||
+ | |1-22 |
||
+ | |- |
||
+ | |P02945 |
||
+ | |BACR_HALSA |
||
+ | |Halobacterium salinarium / (Halobacterium halobium) |
||
+ | |Cell membrane; Multi-pass membrane protein |
||
+ | |no |
||
+ | |- |
||
+ | |P02753 |
||
+ | |RET4_HUMAN |
||
+ | |Homo sapiens (Human) |
||
+ | |Secreted |
||
+ | |1-18 |
||
+ | |- |
||
+ | |Q9Y5Q6 |
||
+ | |INSL5_HUMAN |
||
+ | |Homo sapiens (Human) |
||
+ | |Secreted |
||
+ | |1-22 |
||
+ | |- |
||
+ | |P11279 |
||
+ | |LAMP1_HUMAN |
||
+ | |Homo sapiens (Human) |
||
+ | |Cell membrane; Single-pass type I membrane protein [...] |
||
+ | |1-28 |
||
+ | |- |
||
+ | |P05067 |
||
+ | |A4_HUMAN |
||
+ | |Homo sapiens (Human) |
||
+ | |Membrane; Single-pass type I membrane protein |
||
+ | |1-17 |
||
+ | |- |
||
+ | |} |
||
+ | |||
+ | We are going to predict transmembrane regions and signal peptides for these six proteins using different tools. Because our usual protein HFE_HUMAN is itself a membrane protein and therefore lets us assess the prediction accuracy, we give a graphical and detailed overview only for the results of HFE_HUMAN and summarize the additional proteins in textual form. |
||
+ | |||
+ | We use the UniProt entries as the ground truth and briefly compare the prediction results with them. |
||
+ | |||
+ | |||
+ | Why is the prediction of transmembrane helices and signal peptides grouped together here? |
||
+ | |||
+ | Transmembrane helices and signal peptides are very hard to distinguish if a method predicts only one of them, because both consist almost entirely of hydrophobic residues. This produces many false positives. For that reason, many methods combine the prediction of transmembrane helices and signal peptides, which drastically reduces the false positives. |
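The shared hydrophobicity of both region types can be illustrated with the Kyte-Doolittle hydropathy scale. This sketch is not one of the tools used in this section: it simply averages the scale over three HFE_HUMAN segments whose boundaries follow the UniProt annotation quoted on this page (signal peptide 1-22, transmembrane helix 307-330, cytoplasmic tail 331-348). The function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(segment):
    """Average Kyte-Doolittle value over a sequence segment."""
    return sum(KYTE_DOOLITTLE[aa] for aa in segment) / len(segment)


# HFE_HUMAN segments, boundaries taken from the UniProt annotation.
signal_peptide = "MGPRARPALLLLMLLQTAVLQG"    # residues 1-22
tm_helix = "IGVISGIAVFVVILFIGILFIILR"        # residues 307-330
cytoplasmic_tail = "KRQGSRGAMGHYVLAERE"      # residues 331-348

for name, segment in [
    ("signal peptide", signal_peptide),
    ("TM helix", tm_helix),
    ("cytoplasmic tail", cytoplasmic_tail),
]:
    print(f"{name}: {mean_hydropathy(segment):.2f}")
```

Both the signal peptide and the transmembrane helix average out as hydrophobic, while the cytoplasmic tail does not, which is why a predictor that knows only one of the two region types can confuse them.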
||
+ | |||
+ | There are different types of signal peptides, but they all act as import or export signals, for example: |
||
+ | * Import into the peroxisome |
||
+ | * Import into the nucleus |
||
+ | * Export from the nucleus |
||
+ | * Import into the mitochondrion |
||
+ | |||
+ | ==TMHMM== |
||
+ | [[File:Hfe_t3_tmhmm.jpg|400px|thumb|right|Figure 10: TMHMM posterior probabilities]] |
||
+ | TMHMM is a tool for predicting membrane topology (transmembrane helices) in proteins based on a hidden Markov model with different states. |
||
+ | It divides the sequence into the regions "inside", "outside" and "TMhelix". However, TMHMM cannot predict signal peptides. |
||
+ | |||
+ | TMHMM was run locally on our Linux machine, after correcting some path issues in its config files. |
||
+ | |||
+ | The command we used was: |
||
+ | * tmhmm x.fasta > x.tmhmm |
||
+ | |||
+ | where 'x' stands for the UniProt entry name of one of the proteins. Afterwards we tried to plot the result for HFE_HUMAN with gnuplot, but this did not work at first either, because of path issues inside the gnuplot script that TMHMM creates automatically. After correcting these paths as well, gnuplot worked fine and produced the graphical output. |
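The resulting `x.tmhmm` files can be read back for the comparison with UniProt. The sketch below assumes that each region line consists of five whitespace-separated fields (id, version, region, start, end), as in the tables in this section, and that comment lines start with `#`. The sample text is reconstructed from the HFE_HUMAN values reported here, not copied from the raw output file, and the function name is hypothetical.

```python
def parse_tmhmm(text):
    """Return (sequence id, region, start, end) tuples from TMHMM region lines."""
    regions = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()
        if len(fields) != 5:
            continue  # not a region line in the assumed layout
        seq_id, _version, region, start, end = fields
        regions.append((seq_id, region, int(start), int(end)))
    return regions


# Reconstructed from the HFE_HUMAN values reported in this section.
sample = """\
sp|Q30201|HFE_HUMAN TMHMM2.0 outside 1 306
sp|Q30201|HFE_HUMAN TMHMM2.0 TMhelix 307 329
sp|Q30201|HFE_HUMAN TMHMM2.0 inside 330 348
"""

regions = parse_tmhmm(sample)
helices = [(start, end) for _, region, start, end in regions if region == "TMhelix"]
print(helices)
```

To process a real file, the text would be read with `open("x.tmhmm").read()` and passed to `parse_tmhmm` in the same way.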
||
+ | |||
+ | A color code is applied to provide easier reading: |
||
+ | * green: The predicted region matches the experimentally resolved region from UniProt (+/- 5 residues allowed). TMHMM succeeded. |
||
+ | * yellow: The predicted region only partially matches the experimental region (mostly errors with signal peptides). TMHMM has to be improved. |
||
+ | * red: The predicted region does not match the experimental region at all. TMHMM failed. |
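One possible way to apply this color code automatically is sketched below. It is an interpretation, not the procedure used for the tables: "match" is taken to mean that both boundaries agree within the tolerance, "partial match" that the two intervals merely overlap, and the function name `classify` is hypothetical.

```python
def classify(predicted, reference, tolerance=5):
    """Return 'green', 'yellow' or 'red' for two (start, end) intervals."""
    (p_start, p_end), (r_start, r_end) = predicted, reference
    if abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance:
        return "green"   # both boundaries agree within the tolerance
    if p_start <= r_end and r_start <= p_end:
        return "yellow"  # the intervals overlap, but a boundary is off
    return "red"         # no overlap at all


# HFE_HUMAN transmembrane helix: TMHMM 307-329 versus UniProt 307-330.
print(classify((307, 329), (307, 330)))
# HFE_HUMAN outside region: TMHMM 1-306 versus UniProt extracellular 23-306.
print(classify((1, 306), (23, 306)))
```

Because a single predicted region can span several UniProt regions, a strict per-interval rule like this one will not necessarily reproduce every manually assigned color.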
||
+ | |||
+ | |||
+ | {| border="1" style="text-align:center; border-spacing:0;" |
||
+ | ! |
||
+ | !colspan=4|TMHMM |
||
+ | !colspan=3|UniProt |
||
+ | |- |
||
+ | !id |
||
+ | !version |
||
+ | !region |
||
+ | !start |
||
+ | !end |
||
+ | !region |
||
+ | !start |
||
+ | !end |
||
+ | |- |
||
+ | |sp|Q30201|HFE_HUMAN ||rowspan=2|TMHMM2.0 || bgcolor=yellow rowspan=2|outside || rowspan=2|1 || rowspan=2|306 ||Signal peptide || 1 || 22 |
||
+ | |- |
||
+ | |sp|Q30201|HFE_HUMAN || Extracellular ||23 || 306 |
||
+ | |- |
||
+ | |sp|Q30201|HFE_HUMAN || TMHMM2.0 || bgcolor=lightgreen| TMhelix || 307 || 329 ||Helical || 307 || 330 |
||
+ | |- |
||
+ | |sp|Q30201|HFE_HUMAN || TMHMM2.0 || bgcolor=lightgreen| inside || 330 || 348 ||Cytoplasmic || 331 || 348 |
||
+ | |- |
||
+ | |} |
||
+ | |||
+ | TMHMM clearly misses the signal peptide and counts the whole region 1-306 as outside; apart from the signal peptide (1-22), this agrees with UniProt. The TMhelix (307-329) and the inside region (330-348) are also placed correctly, with a deviation of only one amino acid, which is insignificant. Therefore TMHMM was very successful in predicting the correct regions. The results are shown in Figure 10. |
||
+ | |||
+ | {| border="1" style="text-align:center; border-spacing:0;" |
||
+ | ! |
||
+ | !colspan=4|TMHMM |
||
+ | !colspan=3|UniProt |
||
+ | |- |
||
+ | !id |
||
+ | !version |
||
+ | !region |
||
+ | !start |
||
+ | !end |
||
+ | !region |
||
+ | !start |
||
+ | !end |
||
+ | |- |
||
+ | |sp_P02945_BACR_HALSA || TMHMM2.0 || bgcolor=lightgreen|outside || 1 || 22||Extracellular||14||23 |
||
+ | |- |
||
+ | |sp_P02945_BACR_HALSA || TMHMM2.0 || bgcolor=lightgreen|TMhelix || 23 || 42||Helical; Name=Helix A||24||42 |
||
+ | |- |
||
+ | |sp_P02945_BACR_HALSA || TMHMM2.0 || bgcolor=lightgreen|inside || 43 || 54||Cytoplasmic||43||56 |
||
+ | |- |
||
+ | |sp_P02945_BACR_HALSA || TMHMM2.0 || bgcolor=lightgreen|TMhelix || 55 || 77||Helical; Name=Helix B||57||75 |
||
+ | |- |
||
+ | |sp_P02945_BACR_HALSA || TMHMM2.0 || bgcolor=lightgreen|outside || 78 || 91||Extracellular||76||91 |
||
+ | |- |
||
+ | |sp_P02945_BACR_HALSA || TMHMM2.0 || bgcolor=lightgreen|TMhelix || 92 || 114||Helical; Name=Helix C||92||109 |
||
+ | |- |
||
+ | |sp_P02945_BACR_HALSA || TMHMM2.0 || bgcolor=lightgreen|inside || 115 || 120||Cytoplasmic||110||120 |
||
+ | |- |
||
+ | |sp_P02945_BACR_HALSA || TMHMM2.0 || bgcolor=lightgreen|TMhelix || 121 || 143||Helical; Name=Helix D||121||140 |
||
+ | |- |
||
+ | |sp_P02945_BACR_HALSA || TMHMM2.0 || bgcolor=lightgreen|outside || 144 || 147||Extracellular||141||147 |
||
+ | |- |
||
+ | |sp_P02945_BACR_HALSA || TMHMM2.0 || bgcolor=lightgreen|TMhelix || 148 || 170||Helical; Name=Helix E||148||167 |
||
+ | |- |
||
+ | |sp_P02945_BACR_HALSA || TMHMM2.0 || bgcolor=lightgreen|inside || 171 || 189||Cytoplasmic||168||185 |
||
+ | |- |
||
+ | |sp_P02945_BACR_HALSA || TMHMM2.0 || bgcolor=lightgreen|TMhelix || 190 || 212||Helical; Name=Helix F||186||204 |
||
+ | |- |
||
+ | |rowspan=3|sp_P02945_BACR_HALSA || rowspan=3|TMHMM2.0 || bgcolor=red rowspan=3|outside || rowspan=3|213 || rowspan=3|262 ||Extracellular||205||216 |
||
+ | |- |
||
+ | |Helical; Name=Helix G||217||236 |
||
+ | |- |
||
+ | |Cytoplasmic||237||262 |
||
+ | |} |
||
+ | |||
+ | As is clearly visible, TMHMM is also able to predict the regions of a non-eukaryotic protein almost completely correctly. Only the last helical and cytoplasmic regions are missed.
||
+ | |||
+ | {| border="1" style="text-align:center; border-spacing:0;" |
||
+ | ! |
||
+ | !colspan=4|TMHMM |
||
+ | !colspan=3|UniProt |
||
+ | |- |
||
+ | !id |
||
+ | !version |
||
+ | !region |
||
+ | !start |
||
+ | !end |
||
+ | !region |
||
+ | !start |
||
+ | !end |
||
+ | |- |
||
+ | |sp_P02753_RET4_HUMAN || TMHMM2.0 || bgcolor=lightgreen|outside || 1 || 201 ||Signal peptide||1||18 |
||
+ | |} |
||
+ | |||
+ | TMHMM misses the signal peptide, but the rest of the prediction is correct, because RET4_HUMAN is a secreted protein.
||
+ | |||
+ | {| border="1" style="text-align:center; border-spacing:0;" |
||
+ | ! |
||
+ | !colspan=4|TMHMM |
||
+ | !colspan=3|UniProt |
||
+ | |- |
||
+ | !id |
||
+ | !version |
||
+ | !region |
||
+ | !start |
||
+ | !end |
||
+ | !region |
||
+ | !start |
||
+ | !end |
||
+ | |- |
||
+ | |sp_Q9Y5Q6_INSL5_HUMAN || TMHMM2.0 || bgcolor=lightgreen|outside || 1 || 135||Signal peptide||1||22 |
||
+ | |} |
||
+ | |||
+ | As before, TMHMM misses the signal peptide but is otherwise accurate. INSL5_HUMAN is a hormone and is thus secreted into the extracellular region.
||
+ | |||
+ | {| border="1" style="text-align:center; border-spacing:0;" |
||
+ | ! |
||
+ | !colspan=4|TMHMM |
||
+ | !colspan=3|UniProt |
||
+ | |- |
||
+ | !id |
||
+ | !version |
||
+ | !region |
||
+ | !start |
||
+ | !end |
||
+ | !region |
||
+ | !start |
||
+ | !end |
||
+ | |- |
||
+ | |sp_P11279_LAMP1_HUMAN || TMHMM2.0 || bgcolor=yellow|inside || 1 || 10||rowspan=2|Signal peptide||rowspan=2|1||rowspan=2|28 |
||
+ | |- |
||
+ | |sp_P11279_LAMP1_HUMAN || TMHMM2.0 || bgcolor=red|TMhelix || 11 || 33 |
||
+ | |- |
||
+ | |sp_P11279_LAMP1_HUMAN || TMHMM2.0 || bgcolor=red|outside || 34 || 383||Lumenal||29||382 |
||
+ | |- |
||
+ | |sp_P11279_LAMP1_HUMAN || TMHMM2.0 || bgcolor=lightgreen|TMhelix || 384 || 406||Helical||383||405 |
||
+ | |- |
||
+ | |sp_P11279_LAMP1_HUMAN || TMHMM2.0 || bgcolor=lightgreen|inside || 407 || 417||Cytoplasmic||406||417 |
||
+ | |} |
||
+ | |||
+ | TMHMM does not predict this protein well. It labels the signal peptide region as 'inside' and 'TMhelix' and is wrong for the lumenal region, which it labels 'outside'. The rest is predicted well.
||
+ | |||
+ | {| border="1" style="text-align:center; border-spacing:0;" |
||
+ | ! |
||
+ | !colspan=4|TMHMM |
||
+ | !colspan=3|UniProt |
||
+ | |- |
||
+ | !id |
||
+ | !version |
||
+ | !region |
||
+ | !start |
||
+ | !end |
||
+ | !region |
||
+ | !start |
||
+ | !end |
||
+ | |- |
||
+ | |rowspan=2|sp_P05067_A4_HUMAN || rowspan=2|TMHMM2.0 || bgcolor=yellow rowspan=2|outside || rowspan=2|1 || rowspan=2|700||Signal peptide||1||17 |
||
+ | |- |
||
+ | |Extracellular||18||699 |
||
+ | |- |
||
+ | |sp_P05067_A4_HUMAN || TMHMM2.0 || bgcolor=lightgreen|TMhelix || 701 || 723||Helical||700||723 |
||
+ | |- |
||
+ | |sp_P05067_A4_HUMAN || TMHMM2.0 || bgcolor=lightgreen|inside || 724 || 770||Cytoplasmic||724||770 |
||
+ | |} |
||
+ | |||
+ | TMHMM again predicted almost everything correctly and missed only the N-terminal signal peptide, which it cannot predict.
||
+ | |||
+ | ==Phobius and PolyPhobius== |
||
+ | For Phobius and PolyPhobius, we used the webservice<ref>http://www.ncbi.nlm.nih.gov/pubmed/17483518?dopt=Abstract</ref> at [http://phobius.sbc.su.se/ http://phobius.sbc.su.se/] with standard settings. |
||
+ | |||
+ | '''Phobius''' is a combined predictor for transmembrane protein topology and signal peptides. It models the different regions of the sequence as a series of interconnected states of an HMM.<ref>http://www.ncbi.nlm.nih.gov/pubmed/15111065?dopt=Abstract</ref>
||
+ | <br> |
||
+ | '''PolyPhobius''' is a hidden Markov model (HMM) decoding algorithm. It combines the probabilities for sequence features of homologs by averaging the posterior label probability at each position of a global sequence alignment. PolyPhobius is benchmarked against Phobius. <ref>http://www.ncbi.nlm.nih.gov/pubmed/15961464?dopt=Abstract</ref>
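+ | The averaging idea described for PolyPhobius can be sketched in Python. This is a toy illustration under our own assumptions, not the PolyPhobius implementation: the function name <code>average_posteriors</code>, the labels "M" and "o", and the rule that gapped sequences are simply left out of a column's average are all hypothetical.

```python
def average_posteriors(aligned_posteriors):
    """Average per-label posterior probabilities column by column.

    aligned_posteriors: one list per homolog, all of alignment length.
    Each entry is a dict {label: probability}, or None where the homolog
    has a gap. Returns one dict per column (None if every homolog has a gap).
    """
    n_columns = len(aligned_posteriors[0])
    averaged = []
    for col in range(n_columns):
        # Collect the posteriors of all homologs that are not gapped here.
        present = [seq[col] for seq in aligned_posteriors if seq[col] is not None]
        if not present:
            averaged.append(None)
            continue
        labels = set().union(*present)
        averaged.append(
            {label: sum(p.get(label, 0.0) for p in present) / len(present)
             for label in labels}
        )
    return averaged


# Two toy homologs over a two-column alignment; the second has a gap.
homolog_a = [{"M": 0.9, "o": 0.1}, {"M": 0.6, "o": 0.4}]
homolog_b = [{"M": 0.7, "o": 0.3}, None]
print(average_posteriors([homolog_a, homolog_b]))
```

+ | A decoder would then pick the most probable label path from these averaged columns; that step is omitted here.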
||
+ | |||
+ | ===Phobius=== |
||
+ | [[File:Phobius.PNG|200px|thumb|right|Figure 11: predicted regions by Phobius<br> Source: http://phobius.sbc.su.se/]] |
||
+ | |||
+ | Phobius predicts very accurately, as seen below. The predicted transmembrane region lies just 1-2 residues upstream of the annotated region. The same holds for the topological domains before and after the transmembrane region. The signal peptide is also predicted correctly, within one residue of the annotation. The probability distribution is shown in Figure 11.
||
+ | |||
+ | PREDICTED ANNOTATION |
||
+ | ID sp|Q30201|HFE_HUMAN |
||
+ | FT SIGNAL 1 21 | 1-20 |
||
+ | FT REGION 1 7 N-REGION. |
||
+ | FT REGION 8 16 H-REGION. |
||
+ | FT REGION 17 21 C-REGION. |
||
+ | FT TOPO_DOM 22 304 NON CYTOPLASMIC. | 23-306 |
||
+ | FT TRANSMEM 305 329 | 307-330 |
||
+ | FT TOPO_DOM 330 348 CYTOPLASMIC. | 331-348 |
||
+ | |||
+ | ===PolyPhobius=== |
||
+ | [[File:Poly_phobius.PNG|200px|thumb|right|Figure 12: predicted regions by PolyPhobius<br> Source: http://phobius.sbc.su.se/]] |
||
+ | |||
+ | PolyPhobius also predicts very accurately, although in our case slightly less accurately than Phobius; the difference is small. The probability distribution is shown in Figure 12.
||
+ | |||
+ | PREDICTED ANNOTATION |
||
+ | ID sp|Q30201|HFE_HUMAN |
||
+ | FT SIGNAL 1 23 | 1-20 |
||
+ | FT REGION 1 5 N-REGION. |
||
+ | FT REGION 6 19 H-REGION. |
||
+ | FT REGION 20 23 C-REGION. |
||
+ | FT TOPO_DOM 24 304 NON CYTOPLASMIC. | 23-306 |
||
+ | FT TRANSMEM 305 329 | 307-330 |
||
+ | FT TOPO_DOM 330 348 CYTOPLASMIC. | 331-348 |
||
+ | |||
+ | |||
+ | '''Additional proteins''' |
||
+ | * BACR_HALSA: Phobius and PolyPhobius give almost the same result. There is only a slight difference in the domain lengths, and both methods predicted the membrane topology correctly.
||
+ | * RET4_HUMAN: Phobius and PolyPhobius predicted the position of the signal peptide correctly; the overall prediction is correct, too.
||
+ | * INSL5_HUMAN: Phobius and PolyPhobius are again very accurate. The signal peptide and the extracellular region are at the correct positions.
||
+ | * LAMP1_HUMAN: Phobius and PolyPhobius predicted the membrane topology correctly.
||
+ | * A4_HUMAN: Phobius and PolyPhobius predicted the position of the signal peptide correctly.
||
+ | |||
+ | In summary, Phobius and PolyPhobius predict the membrane topology very accurately. PolyPhobius takes much more time to produce a result but does not give noticeably better results; sometimes they are even slightly worse. For that reason, we consider Phobius the more convenient of the two.
||
+ | |||
+ | ==OCTOPUS and SPOCTOPUS== |
||
+ | '''OCTOPUS''' combines HMMs and artificial neural networks. OCTOPUS first creates a sequence profile by a homology search with BLAST. The profile is used as the input to a set of neural networks, which predict the preferred location of each residue. Each residue is predicted to be either inside or outside the cell and to be located in a transmembrane (M), interface (I), close loop (L) or globular loop (G) environment.
||
+ | <br> |
||
+ | '''SPOCTOPUS''' is an extended version of OCTOPUS that can also predict signal peptides. It uses a neural network to predict a signal peptide if the score for each of the 70 N-terminal residues is high enough.
||
+ | |||
+ | |||
+ | Both OCTOPUS and SPOCTOPUS predict the signal peptide and the transmembrane region correctly, as can be seen in the images below. Both methods also predict a signal peptide of the correct length at the N-terminus. Figure 13 and Figure 14 show the prediction for the HFE protein.
||
+ | {| class="centered" |
||
+ | |[[File:Octopus_g4.PNG|200px|thumb|right|Figure 13: predicted regions by OCTOPUS<br>Source: http://octopus.cbr.su.se/]] |
||
+ | |[[File:Spoctopus_g4.PNG|200px|thumb|right|Figure 14: predicted regions by SPOCTOPUS<br>Source: http://octopus.cbr.su.se/]] |
||
+ | |} |
||
+ | |||
+ | '''Additional proteins''' |
||
+ | The probability distribution for the additional proteins is shown in Figures 15 to 19 below. |
||
+ | |||
+ | {| class="centered" |
||
+ | |[[File:T3_bacr_halsa_octo_spocto.png|150px|thumb|right|Figure 15: BACR_HALSA by left: OCTO- right: SPOCTOPUS, source: http://OCTOPUS.cbr.su.se/index.php]] |
||
+ | |[[File:T3_ret4_human_octo_spocto.png|150px|thumb|right|Figure 16: RET4_HUMAN by left: OCTO- right: SPOCTOPUS, source: http://OCTOPUS.cbr.su.se/index.php]] |
||
+ | |[[File:T3_insl5_human_octo_spocto.png|150px|thumb|right|Figure 17: INSL5_HUMAN by left: OCTO- right: SPOCTOPUS, source: http://OCTOPUS.cbr.su.se/index.php]] |
||
+ | |[[File:T3_lamp1_human_octo_spocto.png|150px|thumb|right|Figure 18: LAMP1_HUMAN by left: OCTO- right: SPOCTOPUS, source: http://OCTOPUS.cbr.su.se/index.php]] |
||
+ | |[[File:T3_a4_human_octo_spocto.png|150px|thumb|right|Figure 19: A4_HUMAN by left: OCTO- right: SPOCTOPUS, source: http://OCTOPUS.cbr.su.se/index.php]] |
||
+ | |} |
||
+ | |||
+ | * BACR_HALSA: OCTOPUS and SPOCTOPUS give the same prediction. Both are correct, because BACR_HALSA does not contain a signal peptide.
||
+ | * RET4_HUMAN: OCTOPUS predicts a TM helix instead of the signal peptide; SPOCTOPUS corrects this error.
||
+ | * INSL5_HUMAN: OCTOPUS makes the same error as for RET4_HUMAN; SPOCTOPUS again corrects it.
||
+ | * LAMP1_HUMAN: OCTOPUS and SPOCTOPUS both predict the TM helix correctly. OCTOPUS again makes the same error as for RET4_HUMAN and INSL5_HUMAN, predicting a TM helix instead of the signal peptide; SPOCTOPUS corrects that.
||
+ | * A4_HUMAN: OCTOPUS predicts a reentrant/dip region instead of the signal peptide region; SPOCTOPUS corrects that.
||
+ | |||
+ | ==SignalP== |
||
+ | To use SignalP locally on our Linux box, we again had to correct some path issues.
||
+ | |||
+ | The command we used was: |
||
+ | |||
+ | * signalp -t y x.fasta > x.signalp |
||
+ | |||
+ | where 'x' again stands for the UniProt entry name of the protein. 'y' was chosen according to the organism of the protein: for all human proteins 'y' was set to 'euk' (eukaryotes) and for the bacterial protein P02945 to 'gram-' (Gram-negative). This switch selects the neural networks and hidden Markov models, which are trained separately for the different organism groups.
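+ | For several proteins, the command above can be assembled in a small script. The following Python sketch only builds and runs the command line quoted above (<code>signalp -t y x.fasta > x.signalp</code>); the dictionary and function names are our own, and it assumes that <code>signalp</code> is on the PATH and that the FASTA files are named after the UniProt entry names.

```python
import subprocess

# Organism type per protein, as described in the text:
# 'euk' for the human proteins, 'gram-' for P02945 (BACR_HALSA).
ORGANISM_TYPE = {
    "HFE_HUMAN": "euk",
    "BACR_HALSA": "gram-",
    "RET4_HUMAN": "euk",
    "INSL5_HUMAN": "euk",
    "LAMP1_HUMAN": "euk",
    "A4_HUMAN": "euk",
}


def signalp_command(entry_name):
    """Build the argument list for: signalp -t <type> <entry>.fasta"""
    return ["signalp", "-t", ORGANISM_TYPE[entry_name], f"{entry_name}.fasta"]


def run_signalp(entry_name):
    """Run SignalP and redirect its output to <entry>.signalp."""
    with open(f"{entry_name}.signalp", "w") as out:
        subprocess.run(signalp_command(entry_name), stdout=out, check=True)


# Usage (requires a local SignalP installation):
#     for name in ORGANISM_TYPE:
#         run_signalp(name)
```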
||
+ | |||
+ | For the graphical output of HFE_HUMAN we used the SignalP server from: http://www.cbs.dtu.dk/services/SignalP |
||
+ | |||
+ | The SignalP-NN prediction shown in Figure 20 reports three scores:
||
+ | * C-score: 'cleavage site': raw cleavage site prediction |
||
+ | * S-mean-score: 'average of the S-score': discrimination of secretory and non-secretory proteins |
||
+ | * Y-max-score: 'combination of C-score with S-score': better cleavage site prediction
||
+ | |||
+ | [[File:Hfe_t3_slp_nn.gif|300px|thumb|right|Figure 20: SignalP-NN prediction (source: signalp)]] |
||
+ | |||
+ | SignalP-NN result: |
||
+ | sp|Q30201|HFE_HUMAN length = 348 |
||
+ | Measure Position Value Cutoff signal peptide? |
||
+ | max. C 23 0.534 0.32 YES |
||
+ | max. Y 23 0.599 0.33 YES |
||
+ | max. S 16 0.995 0.87 YES |
||
+ | mean S 1-22 0.935 0.48 YES |
||
+ | D 1-22 0.767 0.43 YES |
||
+ | Most likely cleavage site between pos. 22 and 23: LQG-RL |
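+ | The D value in the NN output above is not explained in the score list. Assuming it is defined as in SignalP 3.0, namely as the average of the mean S and the max. Y score, it can be recomputed from the reported values. The function name <code>d_score</code> is our own; the numbers are copied from the output above.

```python
def d_score(mean_s, max_y):
    """D score as the plain average of the mean S and max. Y scores
    (assumed SignalP 3.0 definition)."""
    return (mean_s + max_y) / 2


# Values copied from the SignalP-NN result for HFE_HUMAN.
mean_s, max_y, reported_d = 0.935, 0.599, 0.767

recomputed = d_score(mean_s, max_y)
print(f"recomputed D = {recomputed:.3f}, reported D = {reported_d:.3f}")
```

+ | Small differences in the last digit are to be expected, because the printed scores are already rounded.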
||
+ | |||
+ | [[File:Hfe_t3_slp_hmm.gif|300px|thumb|right|Figure 21: SignalP-HMM prediction (source: signalp)]] |
||
+ | |||
+ | |||
+ | |||
+ | SignalP-HMM result: |
||
+ | >sp|Q30201|HFE_HUMAN |
||
+ | Prediction: Signal peptide |
||
+ | Signal peptide probability: 0.998 |
||
+ | Signal anchor probability: 0.000 |
||
+ | Max cleavage site probability: 0.297 between pos. 22 and 23 |
||
+ | |||
+ | SignalP-HMM predicts a signal peptide probability of almost 1.0 and a signal anchor probability of 0. The most likely cleavage site lies between positions 22 and 23 (Figure 21).
||
+ | |||
+ | According to UniProt, there is a signal peptide spanning positions 1 to 22, which means that SignalP predicted the signal peptide and the cleavage site exactly.
||
+ | |||
+ | |||
+ | SignalP-NN result: |
||
+ | >sp_P02945_BACR_HALSA length = 70 |
||
+ | Measure Position Value Cutoff signal peptide? |
||
+ | max. C 16 0.331 0.52 NO |
||
+ | max. Y 43 0.066 0.33 NO |
||
+ | max. S 32 0.948 0.92 YES |
||
+ | mean S 1-42 0.216 0.49 NO |
||
+ | D 1-42 0.141 0.44 NO |
||
+ | Most likely cleavage site between pos. 42 and 43: FLV-KG |
||
+ | |||
+ | SignalP-HMM result: |
||
+ | >sp|P02945|BACR_HALSA |
||
+ | Prediction: Non-secretory protein |
||
+ | Signal peptide probability: 0.000 |
||
+ | Max cleavage site probability: 0.000 between pos. 15 and 16 |
||
+ | |||
+ | For BACR_HALSA, the SignalP-NN max. S score wrongly exceeds its cutoff, because this protein does not contain a signal peptide; all other NN measures and the HMM prediction indicate no signal peptide.
||
+ | |||
+ | |||
+ | SignalP-NN result: |
||
+ | >sp_P02753_RET4_HUMAN length = 70 |
||
+ | Measure Position Value Cutoff signal peptide? |
||
+ | max. C 19 0.929 0.32 YES |
||
+ | max. Y 19 0.901 0.33 YES |
||
+ | max. S 1 0.994 0.87 YES |
||
+ | mean S 1-18 0.938 0.48 YES |
||
+ | D 1-18 0.920 0.43 YES |
||
+ | Most likely cleavage site between pos. 18 and 19: GRA-ER |
||
+ | |||
+ | SignalP-HMM result: |
||
+ | >sp_P02753_RET4_HUMAN |
||
+ | Prediction: Signal peptide |
||
+ | Signal peptide probability: 1.000 |
||
+ | Signal anchor probability: 0.000 |
||
+ | Max cleavage site probability: 0.979 between pos. 18 and 19 |
||
+ | |||
+ | SignalP predicted the cleavage site very well: according to UniProt, the signal peptide spans positions 1 to 18 and is cleaved directly afterwards.
||
+ | |||
+ | |||
+ | SignalP-NN result: |
||
+ | >sp_Q9Y5Q6_INSL5_HUMA length = 70 |
||
+ | Measure Position Value Cutoff signal peptide? |
||
+ | max. C 23 0.855 0.32 YES |
||
+ | max. Y 23 0.778 0.33 YES |
||
+ | max. S 13 0.987 0.87 YES |
||
+ | mean S 1-22 0.852 0.48 YES |
||
+ | D 1-22 0.815 0.43 YES |
||
+ | Most likely cleavage site between pos. 22 and 23: VRS-KE |
||
+ | |||
+ | SignalP-HMM result: |
||
+ | >sp_Q9Y5Q6_INSL5_HUMAN |
||
+ | Prediction: Signal peptide |
||
+ | Signal peptide probability: 0.999 |
||
+ | Signal anchor probability: 0.000 |
||
+ | Max cleavage site probability: 0.911 between pos. 22 and 23 |
||
+ | |||
+ | This result is also predicted correctly (UniProt: signal peptide from position 1 to 22).
||
+ | |||
+ | SignalP-NN result: |
||
+ | >sp_P11279_LAMP1_HUMA length = 70 |
||
+ | Measure Position Value Cutoff signal peptide? |
||
+ | max. C 29 0.978 0.32 YES |
||
+ | max. Y 29 0.903 0.33 YES |
||
+ | max. S 19 0.999 0.87 YES |
||
+ | mean S 1-28 0.960 0.48 YES |
||
+ | D 1-28 0.932 0.43 YES |
||
+ | Most likely cleavage site between pos. 28 and 29: ASA-AM |
||
+ | |||
+ | SignalP-HMM result: |
||
+ | >sp_P11279_LAMP1_HUMAN |
||
+ | Prediction: Signal peptide |
||
+ | Signal peptide probability: 1.000 |
||
+ | Signal anchor probability: 0.000 |
||
+ | Max cleavage site probability: 0.847 between pos. 28 and 29 |
||
+ | |||
+ | SignalP again predicted the cleavage site correctly.
||
+ | |||
+ | |||
+ | SignalP-NN result: |
||
+ | >sp_P05067_A4_HUMAN length = 70 |
||
+ | Measure Position Value Cutoff signal peptide? |
||
+ | max. C 18 0.891 0.32 YES |
||
+ | max. Y 18 0.850 0.33 YES |
||
+ | max. S 2 0.992 0.87 YES |
||
+ | mean S 1-17 0.967 0.48 YES |
||
+ | D 1-17 0.909 0.43 YES |
||
+ | Most likely cleavage site between pos. 17 and 18: ARA-LE |
||
+ | |||
+ | SignalP-HMM result: |
||
+ | >sp_P05067_A4_HUMAN |
||
+ | Prediction: Signal peptide |
||
+ | Signal peptide probability: 1.000 |
||
+ | Signal anchor probability: 0.000 |
||
+ | Max cleavage site probability: 0.993 between pos. 17 and 18 |
||
+ | |||
+ | That prediction is accurate, too. |
||
+ | |||
+ | In short, SignalP predicts the position of the signal peptide cleavage site very accurately, but is less reliable for the non-eukaryotic BACR_HALSA.
||
+ | |||
+ | ==TargetP== |
||
+ | TargetP predicts a signal peptide with high probability for each of the proteins. Apart from P02945, which is a bacterial protein and has no signal peptide, the method seems to be pretty accurate. It assigns the P02945 prediction a reliability class (RC) of 4, which marks it as barely reliable (1 is the highest confidence, 5 the lowest).
||
+ | |||
+ | ### targetp v1.1 prediction results ################################## |
||
+ | Number of query sequences: 6 |
||
+ | Cleavage site predictions included. |
||
+ | Using NON-PLANT networks. |
||
+ | Name Len mTP SP other Loc RC TPlen |
||
+ | ---------------------------------------------------------------------- |
||
+ | sp_Q30201_HFE_HUMAN 348 0.433 0.912 0.004 S 3 22 |
||
+ | sp_P02945_BACR_HALSA 262 0.019 0.897 0.562 S 4 116 |
||
+ | sp_P02753_RET4_HUMAN 201 0.242 0.928 0.020 S 2 18 |
||
+ | sp_Q9Y5Q6_INSL5_HUMA 135 0.074 0.899 0.037 S 1 22 |
||
+ | sp_P11279_LAMP1_HUMA 417 0.043 0.953 0.017 S 1 28 |
||
+ | sp_P05067_A4_HUMAN 770 0.035 0.937 0.084 S 1 17 |
||
+ | ---------------------------------------------------------------------- |
||
+ | cutoff 0.000 0.000 0.000 |
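+ | The RC column in the output above can be reproduced from the three scores. In TargetP, the reliability class is derived from the difference between the highest and the second-highest output score, in steps of 0.2. The Python sketch below applies this rule to the rows above; the function name <code>reliability_class</code> is our own.

```python
def reliability_class(scores):
    """Map the gap between the two best scores to RC 1 (best) .. 5 (worst)."""
    top, second = sorted(scores, reverse=True)[:2]
    diff = top - second
    if diff > 0.8:
        return 1
    if diff > 0.6:
        return 2
    if diff > 0.4:
        return 3
    if diff > 0.2:
        return 4
    return 5


# (mTP, SP, other) and the reported RC, copied from the TargetP output.
rows = {
    "sp_Q30201_HFE_HUMAN": ((0.433, 0.912, 0.004), 3),
    "sp_P02945_BACR_HALSA": ((0.019, 0.897, 0.562), 4),
    "sp_P02753_RET4_HUMAN": ((0.242, 0.928, 0.020), 2),
    "sp_Q9Y5Q6_INSL5_HUMA": ((0.074, 0.899, 0.037), 1),
    "sp_P11279_LAMP1_HUMA": ((0.043, 0.953, 0.017), 1),
    "sp_P05067_A4_HUMAN": ((0.035, 0.937, 0.084), 1),
}

for name, (scores, reported_rc) in rows.items():
    print(name, reliability_class(scores), reported_rc)
```

+ | For BACR_HALSA the low reliability comes from the high 'other' score (0.562), which is close to the SP score.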
||
+ | |||
+ | ==Discussion== |
||
+ | After using different tools with dissimilar approaches to identify transmembrane regions and signal peptides, we can state that tools which combine both entities in one prediction are mostly more reliable. They also produce fewer misclassification errors than predictors of only one entity when both entities are present.
||
+ | |||
+ | In our test cases we also found that the single-entity predictors excel at the entity they were designed for and only misclassify when both entities are present in the protein.
||
+ | |||
+ | We would therefore advise using a single predictor such as SignalP or TargetP and double-checking the result with a combined predictor such as SPOCTOPUS or Phobius.
||
+ | |||
+ | =Prediction of GO terms= |
||
+ | ==General== |
||
+ | GO terms classify protein functions. Each GO term describes a different protein function, so assigning GO terms to a protein means predicting its functions.
||
+ | |||
+ | HFE_HUMAN is annotated with 27 different GO Terms which are <ref>http://www.ebi.ac.uk/QuickGO/GProtein?ac=Q30201</ref>: |
||
{| border="1" style="text-align:center; border-spacing:0;" |
{| border="1" style="text-align:center; border-spacing:0;" |
||
Line 210: | Line 898: | ||
|} |
|} |
||
− | + | ==GOPET== |
|
− | Gopet predicted 2 GO-Terms which have no |
+ | GOPET predicted 2 GO terms for HFE_HUMAN, neither of which overlaps with the annotation.
{| border="1" style="text-align:center; border-spacing:0;" |
{| border="1" style="text-align:center; border-spacing:0;" |
||
!GOID |
!GOID |
||
Line 224: | Line 912: | ||
|} |
|} |
||
+ | '''Additional proteins:''' |
||
− | ===Pfam=== |
||
+ | |||
− | ===ProtFun 2.2=== |
||
+ | BACR_HALSA |
||
+ | |||
+ | 3 GO terms were predicted but only the one with the highest confidence of 77% is really connected to the protein: |
||
+ | * ion channel activity |
||
+ | |||
+ | RET4_HUMAN |
||
+ | |||
+ | GOPET predicted 8 GO terms with confidences between 60% and 90%; 5 of them are linked to the protein:
||
+ | * binding |
||
+ | * retinoid binding |
||
+ | * retinol binding |
||
+ | * transporter activity |
||
+ | * retinal binding |
||
+ | |||
+ | INSL5_HUMAN |
||
+ | |||
+ | Only 1 GO term with a confidence of 80% is predicted by GOPET and it is also linked to the protein: |
||
+ | * hormone activity |
||
+ | |||
+ | LAMP1_HUMAN |
||
+ | |||
+ | GOPET has predicted 2 GO terms with 60% confidence each, but none is linked to the protein. |
||
+ | |||
+ | A4_HUMAN |
||
+ | |||
+ | GOPET predicted 13 GO terms in a range of 87% to 67% confidence, but only 7 of them are really connected to the protein: |
||
+ | * serine-type endopeptidase inhibitor activity |
||
+ | * peptidase inhibitor activity |
||
+ | * binding |
||
+ | * protein binding |
||
+ | * metal ion binding |
||
+ | * DNA binding |
||
+ | * heparin binding |
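+ | The GOPET results above can be summarised as the fraction of predicted GO terms that are actually linked to each protein. The short Python sketch below does this arithmetic; the counts are taken from the text of this section, while the names <code>gopet_counts</code> and <code>precision</code> are our own.

```python
# protein: (number of predicted GO terms, number linked to the protein)
gopet_counts = {
    "HFE_HUMAN": (2, 0),
    "BACR_HALSA": (3, 1),
    "RET4_HUMAN": (8, 5),
    "INSL5_HUMAN": (1, 1),
    "LAMP1_HUMAN": (2, 0),
    "A4_HUMAN": (13, 7),
}


def precision(predicted, confirmed):
    """Fraction of predicted terms that are confirmed (0.0 if none predicted)."""
    return confirmed / predicted if predicted else 0.0


for protein, (predicted, confirmed) in gopet_counts.items():
    print(f"{protein}: {confirmed}/{predicted} = {precision(predicted, confirmed):.2f}")

total_predicted = sum(p for p, _ in gopet_counts.values())
total_confirmed = sum(c for _, c in gopet_counts.values())
print(f"overall: {total_confirmed}/{total_predicted}")
```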
||
+ | |||
+ | ==Pfam== |
||
+ | Pfam is a database that contains protein domains and families. For our search we used the webserver at http://pfam.sanger.ac.uk/search with standard values. |
||
+ | |||
+ | Afterwards we used the pfam2go mapping to find the GO entries that match the Pfam families.
||
+ | |||
+ | Pfam classifies the HFE_Human protein into two families (Figure 22): |
||
+ | |||
+ | [[File:Hfe_t3_pfam_canvas.png|400px|thumb|left|Figure 22: Pfam classification of protein families (source: pfam)]] |
||
+ | |||
+ | |||
+ | |||
+ | * Family: MHC_I (PF00129) |
||
+ | * Family: C1-set (PF07654) |
||
+ | |||
+ | The pfam2go data contain four hits for the PF00129 family:
||
+ | |||
+ | Pfam:PF00129 MHC_I > GO:immune response ; GO:0006955 |
||
+ | Pfam:PF00129 MHC_I > GO:antigen processing and presentation ; GO:0019882 |
||
+ | Pfam:PF00129 MHC_I > GO:membrane ; GO:0016020 |
||
+ | Pfam:PF00129 MHC_I > GO:MHC class I protein complex ; GO:0042612 |
||
+ | |||
+ | [[File:Hfe_t3_pfam.png|400px|thumb|right|Figure 23: Significant Pfam-A matches (source: pfam)]] |
||
+ | |||
+ | All of these GO entries are listed in the UniProt entry for HFE_HUMAN, so this family assignment is correct.
||
+ | |||
+ | The pfam2go data contain no entries for the PF07654 family, so there are no cross-links to UniProt that could be validated; perhaps this family is not yet included in the pfam2go data.
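+ | The lookup of Pfam families in the pfam2go data can be automated. The following Python sketch parses lines in the format shown above and returns the GO terms per Pfam accession; the function name <code>parse_pfam2go</code> is our own, and the example lines are copied from the hits listed above.

```python
from collections import defaultdict


def parse_pfam2go(lines):
    """Parse pfam2go lines into {pfam_accession: [(go_id, go_name), ...]}."""
    mapping = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("Pfam:"):
            continue  # skip comments and blank lines
        left, right = line.split(" > ", 1)
        pfam_id = left.split()[0][len("Pfam:"):]
        # The right-hand side looks like "GO:<name> ; GO:<7-digit id>".
        go_name, go_id = (part.strip() for part in right.rsplit(";", 1))
        mapping[pfam_id].append((go_id, go_name[len("GO:"):]))
    return mapping


pfam2go_lines = [
    "Pfam:PF00129 MHC_I > GO:immune response ; GO:0006955",
    "Pfam:PF00129 MHC_I > GO:antigen processing and presentation ; GO:0019882",
    "Pfam:PF00129 MHC_I > GO:membrane ; GO:0016020",
    "Pfam:PF00129 MHC_I > GO:MHC class I protein complex ; GO:0042612",
]

mapping = parse_pfam2go(pfam2go_lines)
for family in ("PF00129", "PF07654"):
    # .get() avoids creating an empty entry for families without a mapping.
    print(family, mapping.get(family, []))
```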
||
+ | |||
+ | For a more detailed picture, see Figure 23, which shows the Pfam-A matches together with their alignments.
||
+ | |||
+ | '''Additional proteins:''' |
||
+ | |||
+ | |||
+ | >BACR_HALSA |
||
+ | Pfam:PF01036 Bac_rhodopsin > GO:ion channel activity ; GO:0005216 |
||
+ | Pfam:PF01036 Bac_rhodopsin > GO:ion transport ; GO:0006811 |
||
+ | Pfam:PF01036 Bac_rhodopsin > GO:membrane ; GO:0016020 |
||
+ | |||
+ | >RET4_HUMAN |
||
+ | Pfam:PF00061 Lipocalin > GO:binding ; GO:0005488 |
||
+ | |||
+ | >INSL5_HUMAN |
||
+ | Pfam:PF00049 Insulin > GO:hormone activity ; GO:0005179 |
||
+ | Pfam:PF00049 Insulin > GO:extracellular region ; GO:0005576 |
||
+ | |||
+ | >LAMP1_HUMAN |
||
+ | Pfam:PF01299 Lamp > GO:membrane ; GO:0016020 |
||
+ | |||
+ | >A4_HUMAN |
||
+ | Pfam:PF02177 APP_N > GO:binding ; GO:0005488 |
||
+ | Pfam:PF02177 APP_N > GO:integral to membrane ; GO:0016021 |
||
+ | Family: APP_Cu_bd (PF12924) --> no match at pfam2go |
||
+ | Pfam:PF00014 Kunitz_BPTI > GO:serine-type endopeptidase inhibitor activity ; GO:0004867 |
||
+ | Family: APP_E2 (PF12925) --> no match at pfam2go |
||
+ | Pfam:PF03494 Beta-APP > GO:binding ; GO:0005488 |
||
+ | Pfam:PF03494 Beta-APP > GO:integral to membrane ; GO:0016021 |
||
+ | Family: APP_amyloid (PF10515) --> no match at pfam2go |
||
+ | |||
+ | In summary, all predicted GO terms are correct and are cross-linked in the corresponding UniProt entries. However, the predicted GO terms are not exhaustive; UniProt lists many more.
||
+ | |||
+ | ==ProtFun 2.2== |
||
+ | [http://www.cbs.dtu.dk/services/ProtFun/ ProtFun] is an ab initio prediction server. |
||
+ | |||
+ | '''Results'''<br> |
||
+ | ProtFun assigned immune response (GO:0006955; Process) to HFE, which is correct. However, this is the only correct GO term that ProtFun predicts for the HFE gene.
||
+ | |||
<code> |
<code> |
||
Functional category Prob Odds |
Functional category Prob Odds |
||
Line 273: | Line 1,060: | ||
Metal_ion_transport 0.009 0.02 |
Metal_ion_transport 0.009 0.02 |
||
</code> |
</code> |
||
+ | |||
+ | |||
+ | '''Additional proteins:''' |
||
+ | >sp_P02945_BACR_HALSA |
||
+ | |||
+ | # Functional category Prob Odds |
||
+ | Amino_acid_biosynthesis 0.033 1.495 |
||
+ | Biosynthesis_of_cofactors 0.186 2.589 |
||
+ | Cell_envelope 0.029 0.483 |
||
+ | Cellular_processes 0.051 0.694 |
||
+ | Central_intermediary_metabolism 0.045 0.711 |
||
+ | Energy_metabolism 0.138 1.537 |
||
+ | Fatty_acid_metabolism 0.016 1.265 |
||
+ | Purines_and_pyrimidines 0.302 1.244 |
||
+ | Regulatory_functions 0.013 0.080 |
||
+ | Replication_and_transcription 0.019 0.073 |
||
+ | Translation 0.059 1.339 |
||
+ | Transport_and_binding => 0.791 1.929 |
||
+ | |||
+ | # Enzyme/nonenzyme Prob Odds |
||
+ | Enzyme 0.199 0.696 |
||
+ | Nonenzyme => 0.801 1.122 |
||
+ | |||
+ | # Enzyme class Prob Odds |
||
+ | Oxidoreductase (EC 1.-.-.-) 0.114 0.549 |
||
+ | Transferase (EC 2.-.-.-) 0.031 0.091 |
||
+ | Hydrolase (EC 3.-.-.-) 0.057 0.180 |
||
+ | Lyase (EC 4.-.-.-) 0.020 0.430 |
||
+ | Isomerase (EC 5.-.-.-) 0.010 0.321 |
||
+ | Ligase (EC 6.-.-.-) 0.017 0.326 |
||
+ | |||
+ | # Gene Ontology category Prob Odds |
||
+ | Signal_transducer 0.258 1.205 |
||
+ | Receptor 0.355 2.087 |
||
+ | Hormone 0.001 0.206 |
||
+ | Structural_protein 0.006 0.200 |
||
+ | Transporter => 0.440 4.036 |
||
+ | Ion_channel 0.010 0.169 |
||
+ | Voltage-gated_ion_channel 0.004 0.172 |
||
+ | Cation_channel 0.078 1.689 |
||
+ | Transcription 0.026 0.205 |
||
+ | Transcription_regulation 0.028 0.226 |
||
+ | Stress_response 0.012 0.139 |
||
+ | Immune_response 0.011 0.128 |
||
+ | Growth_factor 0.010 0.727 |
||
+ | Metal_ion_transport 0.049 0.106 |
||
+ | |||
+ | ProtFun correctly predicted the functional category, the enzyme/nonenzyme classification and the gene ontology category. All three predictions are correct.
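+ | To compare the ProtFun results across proteins, the rows marked with <code>=></code> can be extracted from the output automatically. The Python sketch below reads the marker from the output instead of recomputing it, because the marked row is not always the one with the highest Prob or Odds value (in the output above, Biosynthesis_of_cofactors has higher odds than the marked Transport_and_binding). The regular expression and function names are our own; the example rows are an excerpt of the BACR_HALSA output.

```python
import re

# name, optional "=>" marker, probability, odds
ROW = re.compile(r"^\s*(.+?)\s+(=>)?\s*(\d+\.\d+)\s+(\d+\.\d+)\s*$")


def parse_protfun(lines):
    """Return (name, is_marked, prob, odds) for every data row."""
    rows = []
    for line in lines:
        match = ROW.match(line)
        if match is None:
            continue  # header lines such as "# Enzyme class Prob Odds"
        name, marker, prob, odds = match.groups()
        rows.append((name, marker is not None, float(prob), float(odds)))
    return rows


def marked_categories(rows):
    """Names of the rows that ProtFun flagged with '=>'."""
    return [name for name, marked, _, _ in rows if marked]


excerpt = [
    "# Functional category Prob Odds",
    "Biosynthesis_of_cofactors 0.186 2.589",
    "Transport_and_binding => 0.791 1.929",
    "# Enzyme/nonenzyme Prob Odds",
    "Enzyme 0.199 0.696",
    "Nonenzyme => 0.801 1.122",
    "# Enzyme class Prob Odds",
    "Oxidoreductase (EC 1.-.-.-) 0.114 0.549",
    "# Gene Ontology category Prob Odds",
    "Transporter => 0.440 4.036",
]

print(marked_categories(parse_protfun(excerpt)))
```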
||
+ | |||
+ | >sp_P02753_RET4_HUMAN |
||
+ | |||
+ | # Functional category Prob Odds |
||
+ | Amino_acid_biosynthesis 0.017 0.751 |
||
+ | Biosynthesis_of_cofactors 0.044 0.610 |
||
+ | Cell_envelope => 0.804 13.186 |
||
+ | Cellular_processes 0.075 1.021 |
||
+ | Central_intermediary_metabolism 0.197 3.128 |
||
+ | Energy_metabolism 0.043 0.475 |
||
+ | Fatty_acid_metabolism 0.016 1.265 |
||
+ | Purines_and_pyrimidines 0.275 1.131 |
||
+ | Regulatory_functions 0.013 0.080 |
||
+ | Replication_and_transcription 0.022 0.084 |
||
+ | Translation 0.032 0.721 |
||
+ | Transport_and_binding 0.800 1.951 |
||
+ | |||
+ | # Enzyme/nonenzyme Prob Odds |
||
+ | Enzyme => 0.544 1.900 |
||
+ | Nonenzyme 0.456 0.639 |
||
+ | |||
+ | # Enzyme class Prob Odds |
||
+ | Oxidoreductase (EC 1.-.-.-) 0.095 0.458 |
||
+ | Transferase (EC 2.-.-.-) 0.038 0.109 |
||
+ | Hydrolase (EC 3.-.-.-) 0.235 0.742 |
||
+ | Lyase (EC 4.-.-.-) => 0.059 1.264 |
||
+ | Isomerase (EC 5.-.-.-) 0.010 0.321 |
||
+ | Ligase (EC 6.-.-.-) 0.017 0.326 |
||
+ | |||
+ | # Gene Ontology category Prob Odds |
||
+ | Signal_transducer 0.202 0.942 |
||
+ | Receptor 0.147 0.862 |
||
+ | Hormone 0.004 0.667 |
||
+ | Structural_protein 0.002 0.058 |
||
+ | Transporter 0.025 0.232 |
||
+ | Ion_channel 0.016 0.288 |
||
+ | Voltage-gated_ion_channel 0.003 0.148 |
||
+ | Cation_channel 0.010 0.215 |
||
+ | Transcription 0.027 0.207 |
||
+ | Transcription_regulation 0.025 0.196 |
||
+ | Stress_response 0.161 1.829 |
||
+ | Immune_response => 0.239 2.813 |
||
+ | Growth_factor 0.023 1.617 |
||
+ | Metal_ion_transport 0.009 0.020 |
||
+ | |||
+ | ProtFun's prediction is mostly incorrect: according to UniProt, the functional category, the enzyme/nonenzyme classification and the enzyme class are wrong. Only the gene ontology category is correct.
||
+ | |||
+ | >sp_Q9Y5Q6_INSL5_HUMAN |
||
+ | |||
+ | # Functional category Prob Odds |
||
+ | Amino_acid_biosynthesis 0.011 0.484 |
||
+ | Biosynthesis_of_cofactors 0.040 0.558 |
||
+ | Cell_envelope => 0.756 12.393 |
||
+ | Cellular_processes 0.033 0.448 |
||
+ | Central_intermediary_metabolism 0.048 0.755 |
||
+ | Energy_metabolism 0.036 0.397 |
||
+ | Fatty_acid_metabolism 0.016 1.265 |
||
+ | Purines_and_pyrimidines 0.144 0.592 |
||
+ | Regulatory_functions 0.014 0.087 |
||
+ | Replication_and_transcription 0.020 0.075 |
||
+ | Translation 0.032 0.735 |
||
+ | Transport_and_binding 0.834 2.033 |
||
+ | |||
+ | # Enzyme/nonenzyme Prob Odds |
||
+ | Enzyme 0.209 0.729 |
||
+ | Nonenzyme => 0.791 1.109 |
||
+ | |||
+ | # Enzyme class Prob Odds |
||
+ | Oxidoreductase (EC 1.-.-.-) 0.056 0.268 |
||
+ | Transferase (EC 2.-.-.-) 0.031 0.091 |
||
+ | Hydrolase (EC 3.-.-.-) 0.062 0.195 |
||
+ | Lyase (EC 4.-.-.-) 0.020 0.430 |
||
+ | Isomerase (EC 5.-.-.-) 0.010 0.321 |
||
+ | Ligase (EC 6.-.-.-) 0.017 0.327 |
||
+ | |||
+ | # Gene Ontology category Prob Odds |
||
+ | Signal_transducer 0.374 1.746 |
||
+ | Receptor 0.128 0.750 |
||
+ | Hormone => 0.247 37.936 |
||
+ | Structural_protein 0.001 0.041 |
||
+ | Transporter 0.025 0.228 |
||
+ | Ion_channel 0.010 0.168 |
||
+ | Voltage-gated_ion_channel 0.003 0.131 |
||
+ | Cation_channel 0.010 0.215 |
||
+ | Transcription 0.054 0.425 |
||
+ | Transcription_regulation 0.091 0.724 |
||
+ | Stress_response 0.099 1.128 |
||
+ | Immune_response 0.178 2.090 |
||
+ | Growth_factor 0.061 4.379 |
||
+ | Metal_ion_transport 0.009 0.020 |
||
+ | |||
+ | ProtFun does not predict everything correctly. The functional category is incorrect, but the enzyme/nonenzyme and gene ontology category predictions are correct again.
||
+ | |||
+ | >sp_P11279_LAMP1_HUMAN |
||
+ | |||
+ | # Functional category Prob Odds |
||
+ | Amino_acid_biosynthesis 0.011 0.484 |
||
+ | Biosynthesis_of_cofactors 0.053 0.735 |
||
+ | Cell_envelope => 0.804 13.186 |
||
+ | Cellular_processes 0.027 0.373 |
||
+ | Central_intermediary_metabolism 0.138 2.188 |
||
+ | Energy_metabolism 0.037 0.411 |
||
+ | Fatty_acid_metabolism 0.016 1.265 |
||
+ | Purines_and_pyrimidines 0.533 2.195 |
||
+ | Regulatory_functions 0.015 0.090 |
||
+ | Replication_and_transcription 0.019 0.073 |
||
+ | Translation 0.027 0.613 |
||
+ | Transport_and_binding 0.834 2.033 |
||
+ | |||
+ | # Enzyme/nonenzyme Prob Odds |
||
+ | Enzyme 0.276 0.965 |
||
+ | Nonenzyme => 0.724 1.014 |
||
+ | |||
+ | # Enzyme class Prob Odds |
||
+ | Oxidoreductase (EC 1.-.-.-) 0.039 0.187 |
||
+ | Transferase (EC 2.-.-.-) 0.046 0.134 |
||
+ | Hydrolase (EC 3.-.-.-) 0.058 0.184 |
||
+ | Lyase (EC 4.-.-.-) 0.020 0.430 |
||
+ | Isomerase (EC 5.-.-.-) 0.010 0.321 |
||
+ | Ligase (EC 6.-.-.-) 0.017 0.326 |
||
+ | |||
+ | # Gene Ontology category Prob Odds |
||
+ | Signal_transducer 0.396 1.849 |
||
+ | Receptor 0.282 1.659 |
||
+ | Hormone 0.001 0.206 |
||
+ | Structural_protein 0.011 0.408 |
||
+ | Transporter 0.024 0.222 |
||
+ | Ion_channel 0.008 0.147 |
||
+ | Voltage-gated_ion_channel 0.002 0.111 |
||
+ | Cation_channel 0.010 0.215 |
||
+ | Transcription 0.032 0.247 |
||
+ | Transcription_regulation 0.018 0.142 |
||
+ | Stress_response 0.246 2.795 |
||
+ | Immune_response => 0.371 4.368 |
||
+ | Growth_factor 0.013 0.956 |
||
+ | Metal_ion_transport 0.009 0.020 |
||
+ | |||
+ | ProtFun does not predict everything correctly. The functional category is incorrect, but the enzyme/nonenzyme and gene ontology category predictions are correct again.
||
+ | |||
+ | >sp_P05067_A4_HUMAN |
||
+ | |||
+ | # Functional category Prob Odds |
||
+ | Amino_acid_biosynthesis 0.020 0.921 |
||
+ | Biosynthesis_of_cofactors 0.261 3.623 |
||
+ | Cell_envelope => 0.804 13.186 |
||
+ | Cellular_processes 0.053 0.730 |
||
+ | Central_intermediary_metabolism 0.184 2.920 |
||
+ | Energy_metabolism 0.023 0.259 |
||
+ | Fatty_acid_metabolism 0.016 1.265 |
||
+ | Purines_and_pyrimidines 0.417 1.716 |
||
+ | Regulatory_functions 0.013 0.084 |
||
+ | Replication_and_transcription 0.029 0.109 |
||
+ | Translation 0.027 0.613 |
||
+ | Transport_and_binding 0.827 2.016 |
||
+ | |||
+ | # Enzyme/nonenzyme Prob Odds |
||
+ | Enzyme => 0.392 1.368 |
||
+ | Nonenzyme 0.608 0.852 |
||
+ | |||
+ | # Enzyme class Prob Odds |
||
+ | Oxidoreductase (EC 1.-.-.-) 0.024 0.114 |
||
+ | Transferase (EC 2.-.-.-) 0.208 0.603 |
||
+ | Hydrolase (EC 3.-.-.-) 0.190 0.600 |
||
+ | Lyase (EC 4.-.-.-) 0.020 0.430 |
||
+ | Isomerase (EC 5.-.-.-) 0.010 0.324 |
||
+ | Ligase (EC 6.-.-.-) 0.048 0.946 |
||
+ | |||
+ | # Gene Ontology category Prob Odds |
||
+ | Signal_transducer 0.126 0.586 |
||
+ | Receptor 0.036 0.211 |
||
+ | Hormone 0.001 0.206 |
||
+ | Structural_protein => 0.034 1.205 |
||
+ | Transporter 0.024 0.222 |
||
+ | Ion_channel 0.009 0.162 |
||
+ | Voltage-gated_ion_channel 0.002 0.108 |
||
+ | Cation_channel 0.010 0.215 |
||
+ | Transcription 0.043 0.335 |
||
+ | Transcription_regulation 0.018 0.143 |
||
+ | Stress_response 0.076 0.862 |
||
+ | Immune_response 0.016 0.183 |
||
+ | Growth_factor 0.005 0.372 |
||
+ | Metal_ion_transport 0.009 0.020 |
||
+ | |||
+ | ProtFun failed completely here: the predictions for the functional category, for enzyme/nonenzyme and for the Gene Ontology category are all wrong. |
||
+ | |||
+ | == References == |
||
+ | <references /> |
||
+ | |||
+ | [[Category : Hemochromatosis]] |
Latest revision as of 00:01, 1 September 2011
by Robert Greil and Cedric Landerer
Secondary structure prediction
For the secondary structure prediction, we used a reference sequence obtained from UniProt. We also used the annotations of the UniProt sequence as a reference to compare the predicted secondary structures.
PSIPRED
PSIPRED uses the PSI-BLAST output as input for a neural network with a single hidden layer and a feed-forward back-propagation architecture to predict the secondary structure.
Results
PSIPRED predicts an alpha/beta structure. The transmembrane region is predicted as a beta region. A graphical representation of the result is shown in Figure 1.
PSIPRED HFORMAT (PSIPRED V3.0)

Conf: 999851589999999877513567886245556456636899750389988756755687
Pred: CCCCCHHHHHHHHHHHHHHHCCCCCCCEEEEEEEEEEECCCCCCCEEEEEEEECCEEEEE
  AA: MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
              10        20        30        40        50        60

Conf: 318998225536664688990669998865311211002358577441156788603899
Pred: ECCCCCCEEECCCCCCCCCCHHHHHHHHHHHHCCCCCHHHHHHHHHHHCCCCCCCCEEEE
  AA: YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
              70        80        90       100       110       120

Conf: 987799319835459889765910588728988756689786135787788899999876
Pred: EEEEEEECCCEEEEEEEEEECCCEEEEECCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHH
  AA: ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
             130       140       150       160       170       180

Conf: 310271499889888616322000378810000468999601699981450765189996
Pred: HHHCCCHHHHHHHHHHCCCCCCCCCCCCCEEEECCCCCCCEEEEEEEEEECCCCEEEEEE
  AA: AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
             190       200       210       220       230       240

Conf: 288106667520025355899875899999965999872169986699998826885259
Pred: ECCEECCCCCCCCCCCEECCCCCEEEEEEEEECCCCCCCEEEEEECCCCCCCEEEEEECC
  AA: KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
             250       260       270       280       290       300

Conf: 999711124320001367777622367764115889887620212359
Pred: CCCCCEEEEEEEEEEEEEEEEEEEEEEEEEECCCCCCCCCCEEECCCC
  AA: PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
             310       320       330       340
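To read such a per-residue prediction string programmatically, it can be collapsed into segments. The following is a minimal sketch, not part of PSIPRED itself; the function name `segments` is an assumption made for illustration, and the example string is the `Pred:` line for residues 1-60 shown above.

```python
from itertools import groupby


def segments(pred, start=1):
    """Collapse a per-residue state string into (state, first, last) runs.

    Positions are 1-based; `start` is the sequence position of pred[0].
    """
    runs = []
    pos = start
    for state, group in groupby(pred):
        length = len(list(group))
        runs.append((state, pos, pos + length - 1))
        pos += length
    return runs


# Pred line for residues 1-60 of the PSIPRED output.
pred_1_60 = "CCCCCHHHHHHHHHHHHHHHCCCCCCCEEEEEEEEEEECCCCCCCEEEEEEEECCEEEEE"

# Print only helix (H) and strand (E) segments, skipping coil (C).
for state, first, last in segments(pred_1_60):
    if state != "C":
        print(state, first, last)
```

The same helper works for the Jpred `SS:` lines if `-` is treated as the coil state.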
Jpred3
Jpred uses the Jnet algorithm, which provides "a three-state (α-helix, β-strand and coil) prediction of secondary structure at an accuracy of 81.5%" <ref>http://nar.oxfordjournals.org/content/36/suppl_2/W197.abstract</ref>.
Results
In its first BLAST search, Jpred found many homologous hits with E-values ranging from e-163 to 4e-44, including some self hits. We continued to the prediction, which is:
Seq: MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDD
SS:  ------HHHHHHHHHHHHH---------EEEEEEEEE-------EEEEEEEEE--

Seq: QLFVFYDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHN
SS:  EEEEEE-----EEEE----------HHHHHHHHHHHHHHHHHHHHHHHHHH----

Seq: HSKESHTLQVILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPT
SS:  -----EEEEEEEEEE------EEEEEEE-----EEEEEE----EEE-------HH

Seq: KLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSV
SS:  HHHHH--HHHHHHHHHH------HHHHHHHHHH-H-------EEEEE--------

Seq: TTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPG
SS:  -EEEEEEE------EEEEEEE----------EE----------EEEEEEEEE---

Seq: EEQRYTCQVEHPGLDQPLIVIWEPSPSGTLVIGVISGIAVFVVILFIGILFIILR
SS:  ---EEEEEEEE------EEEEE---------HHHHHHHHHHHHHHHHHHHHHHHH

Seq: KRQGSRGAMGHYVLAERE
SS:  HH----------------
Comparison with DSSP
DSSP was designed by Wolfgang Kabsch and Chris Sander to provide a standard for secondary structure assignment. DSSP assigns the secondary structure from the atomic coordinates of PDB structures, mainly by identifying hydrogen-bonding patterns.
Results
Because the PDB sequence is not complete, the DSSP assignment is also incomplete. The interesting parts, the signal peptide and the cytoplasmic part, which are predicted as disordered, are not covered by DSSP. PSIPRED and JPred predicted the transmembrane region well but assigned the N- and C-termini, which are predicted as disordered, as helical or beta-sheet regions. The UniProt annotation likewise assigns no structure to these regions. Therefore, these regions may be unstructured and not yet recognized as disordered regions.
UniProt: ---------------------------EEEEEEEEEEE----EEE--EEEEEE--EEEEE
DSSP:                             --EEEEEEEEEEB-SS-SSB--EEEEEETTEEEEE
PSIPRED: CCCCCHHHHHHHHHHHHHHHCCCCCCCEEEEEEEEEEECCCCCCCEEEEEEEECCEEEEE
JPred:   ------HHHHHHHHHHHHH---------EEEEEEEEE-------EEEEEEEEE--EEEEE
AA:      MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
DSSPSeq:                          RSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
                 10        20        30        40        50        60

UniProt: EEEEE--EEE--------TTTHHHHHHHHHHHHHHHHHHHHHHHHHHTTT-EEE--EEEE
DSSP:    EESSS--EEE-STTS-SSTTTTHHHHHHHHHHHHHHHHHHHHHHHHHTTT-SSS--EEEE
PSIPRED: ECCCCCCEEECCCCCCCCCCHHHHHHHHHHHHCCCCCHHHHHHHHHHHCCCCCCCCEEEE
JPred:   E-----EEEE----------HHHHHHHHHHHHHHHHHHHHHHHHHH---------EEEEE
AA:      YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
DSSPSeq: YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
                 70        80        90       100       110       120

UniProt: EEEEEE-----EEEEEEEEE--EEEEEEEHHH-EEEEEE---HHHHHHHH---HHHHHHH
DSSP:    EEEEEE-TTS-EEEEEEEEETTEEEEEEEGGGTEEEESSGGGHHHHHHHHSSTHHHHHHH
PSIPRED: EEEEEEECCCEEEEEEEEEECCCEEEEECCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHH
JPred:   EEEEE------EEEEEEE-----EEEEEE----EEE-------HHHHHHH--HHHHHHHH
AA:      ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
DSSPSeq: ILGaEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
                130       140       150       160       170       180

UniProt: HHHH-HHHHHHHHHHHHHTTT-------EEEEEEEE----EEEEEEEEEEEEE--EEEEE
DSSP:    HHHHTHHHHHHHHHHHHHTTTSS--B--EEEEEEEE-SS-EEEEEEEEEEBSS--EEEEE
PSIPRED: HHHCCCHHHHHHHHHHCCCCCCCCCCCCCEEEECCCCCCCEEEEEEEEEECCCCEEEEEE
JPred:   HH------HHHHHHHHHH-H-------EEEEE---------EEEEEEE------EEEEEE
AA:      AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
DSSPSeq: AYLERDaPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRbRALNYYPQNITMKWL
                190       200       210       220       230       240

UniProt: E------HHH----EEEE-----EEEEEEEEE---HHHHEEEEEE---EEE-EEEE----
DSSP:    ETTEE--GGGS---EEEE-TTS-EEEEEEEEE-TTGGGGEEEEEE-TTSSS-EEEE-
PSIPRED: ECCEECCCCCCCCCCCEECCCCCEEEEEEEEECCCCCCCEEEEEECCCCCCCEEEEEECC
JPred:   E----------EE----------EEEEEEEEE------EEEEEEEE------EEEEE---
AA:      KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
DSSPSeq: KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTbQVEHPGLDQPLIVIW
                250       260       270       280       290       300

UniProt: ------------------------------------------------
DSSP:
PSIPRED: CCCCCEEEEEEEEEEEEEEEEEEEEEEEEEECCCCCCCCCCEEECCCC
JPred:   ------HHHHHHHHHHHHHHHHHHHHHHHHHH----------------
AA:      PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
DSSPSeq:
                310       320       330       340
We see a good overlap between the results of all methods except for the region at the C-terminus. Here, the PDB file does not provide a sequence, and UniProt does not assign a secondary structure to this region. Interestingly, PSIPRED predicts a beta-sheet in this region while JPred predicts a helix. As we will show later, this region is proposed to be disordered by most prediction tools, which could be the reason for the different assignments.
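The "overlap" between a prediction and the DSSP assignment can be quantified as the fraction of residues with the same state. The sketch below assumes the common reduction of the DSSP codes to three states (H, G, I to helix; E, B to strand; everything else to coil), which is a convention and not something the tools above prescribe. The function names are hypothetical; the example strings are the DSSP and PSIPRED lines for residues 61-120 from the comparison above.

```python
def to_three_state(states):
    """Reduce DSSP-style codes to helix (H), strand (E) and coil (C).

    Assumed convention: H, G, I -> H; E, B -> E; anything else
    (T, S, '-', C, ...) -> C.
    """
    mapping = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}
    return "".join(mapping.get(code, "C") for code in states)


def agreement(pred, ref):
    """Fraction of aligned positions where both three-state strings agree.

    Positions where the reference has a blank (no assignment) are ignored.
    """
    pairs = [(p, r) for p, r in zip(pred, ref) if r != " "]
    if not pairs:
        return 0.0
    pred3 = to_three_state("".join(p for p, _ in pairs))
    ref3 = to_three_state("".join(r for _, r in pairs))
    return sum(a == b for a, b in zip(pred3, ref3)) / len(pairs)


# Residues 61-120: DSSP assignment and PSIPRED prediction.
dssp = "EESSS--EEE-STTS-SSTTTTHHHHHHHHHHHHHHHHHHHHHHHHHTTT-SSS--EEEE"
psipred = "ECCCCCCEEECCCCCCCCCCHHHHHHHHHHHHCCCCCHHHHHHHHHHHCCCCCCCCEEEE"
print(f"three-state agreement 61-120: {agreement(psipred, dssp):.2f}")
```

Because residues missing from the PDB structure carry no DSSP state, they should be passed as blanks so that they do not count as disagreements.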
Prediction of disordered regions
The HFE protein is not yet known to be disordered: it is not contained in the DisProt<ref>http://www.disprot.org/</ref> database.
The disorder predictors find several disordered regions in the protein, but most of them lie within secondary structure elements. Only the predicted disordered regions at the N- and C-termini might really be unstructured but not yet experimentally recognized, because these regions have no structural assignment. Based on this alone, we cannot tell which tool works best for our case.
The predictions are shown below.
DISOPRED
For the prediction, we used the DISOPRED-Server at http://bioinf.cs.ucl.ac.uk/disopred/
DISOPRED is a prediction tool for disordered regions based on a linear SVM. The SVM was trained on 750 non-redundant sequences with high-resolution X-ray structures. "Disorder was identified with those residues that appear in the sequence records but with coordinates missing from the electron density map." <ref>http://bioinf.cs.ucl.ac.uk/index.php?id=806</ref> For each protein, a sequence profile was generated using a PSI-BLAST search against a filtered database. The PSI-BLAST profiles were used as input vectors for the SVM.
Result
DISOPRED predicts two disordered residues in the signal peptide and a disordered region at the end of the sequence, which is located inside the cell.
AA:   Target sequence
Pred: Residue disorder prediction; (.) = ordered residue, (*) = disordered residue

conf: 997600000000000000000000000000000000000000000000000000000000
pred: **..........................................................
AA:   MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
              10        20        30        40        50        60

conf: 000120011000000000000000000000000000000000000000000000000000
pred: ............................................................
AA:   YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
              70        80        90       100       110       120

conf: 000000000000000000000000000000000000000000000000000000000000
pred: ............................................................
AA:   ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
             130       140       150       160       170       180

conf: 000000000000000000000002456777878777766530000000000000000000
pred: ..............................*.*...........................
AA:   AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
             190       200       210       220       230       240

conf: 000035555545543000000000000000000000000000000000000001354667
pred: ............................................................
AA:   KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
             250       260       270       280       290       300

conf: 777766643300000000000000047889999999999999898999
pred: ...........................*********************
AA:   PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
             310       320       330       340

DISOPRED predictions for a false positive rate threshold of: 2%
POODLE
POODLE stands for Prediction Of Order and Disorder by machine LEarning.
POODLE provides three different predictions:
- POODLE-S: short disorder regions prediction
- POODLE-L: long disorder regions prediction (longer than 40 residues)
- unfolded protein prediction
Most POODLE variants predict a disordered region at the end of the protein, which contains a transmembrane region (pos: 307-330). This is evidence for a disordered region at the C-terminus. Most variants also predict a short disordered region at the beginning of the sequence, which is part of the signal peptide (pos: 1-22). The score distribution over the sequence is shown in Figures 3 to 6. The threshold to distinguish between an ordered and a disordered region is 0.5.
POODLE-I
POODLE-I (series only) predicted 4 disordered regions within the protein sequence.
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
**************----------------------------------------------
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
-------**********---******------*---------------------------
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
------------------------------------------------------------
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
---------------------***************------------------------
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
----*********----------------------------------------*******
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
************************************************
POODLE-S
POODLE-S (using missing residues) predicts 6 short disordered regions within the protein sequence.
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
-**************---------------------------------------------
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
-------**********---******----------------------------------
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
------------------------------------------------------------
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
---------------------***************------------------------
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
----*********----------------------------------------*******
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
*--------------------------------********-------
POODLE-S (using High B-Factor residues) predicts 2 short disordered regions within the protein sequence.
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
-*-***------------------------------------------------------
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
------------------------------------------------------------
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
------------------------------------------------******------
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
------------------------------------------------------------
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
------------------------------------------------------------
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
------------------------------------------------
POODLE-L
POODLE-L predicts a disordered region from 296 to the end.
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
------------------------------------------------------------
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
------------------------------------------------------------
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
------------------------------------------------------------
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
------------------------------------------------------------
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
------------------------------------------------------******
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
************************************************
IUPRED
IUPRED uses estimated pairwise energies to recognize unstructured regions within protein sequences. It relies on the assumption that globular proteins have an amino acid composition that gives them the potential to form a large number of favorable interactions. The score distribution over the sequence is shown in Figures 7 to 9. IUPRED uses 0.5 as the threshold to distinguish between a disordered and an ordered region.
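Turning per-residue scores into the region lists discussed in this section only needs the threshold. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the scores in the example are made up for illustration, and treating a score exactly equal to the threshold as ordered is an assumption.

```python
def disordered_regions(scores, threshold=0.5):
    """Return 1-based (first, last) ranges where the score exceeds the threshold.

    Assumption: a residue counts as disordered only if score > threshold.
    """
    regions = []
    start = None
    for pos, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = pos              # a new disordered run begins here
        elif start is not None:
            regions.append((start, pos - 1))
            start = None
    if start is not None:                # run extends to the last residue
        regions.append((start, len(scores)))
    return regions


# Illustrative scores only, not taken from any server output.
example_scores = [0.9, 0.7, 0.3, 0.2, 0.55, 0.6, 0.1, 0.8]
print(disordered_regions(example_scores))
```

The same function applies to the POODLE scores, since both tools use 0.5 as the cutoff.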
Results
The short-disorder prediction yields 5 short regions. There are also disordered residues at the beginning, within the signal peptide.
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
***---------------------------------------------------------
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
------------------------------------------------------------
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
------------------------------------------------------------
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
------------------------------------------------------------
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
---------********----------***--------*-****----------------
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
---------------------------------------------***
The long-disorder prediction yields 7 disordered residues, but only one short region.
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
------------------------------------------------------------
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
------------------------------------------------------------
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
------------------------------------------------------------
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
------------------------------------------------------------
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
---------******-------------------------*-------------------
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
------------------------------------------------
The prediction of structured regions yields one globular domain spanning residues 1-348 (Figure 9), which means that the whole protein is structured. This contradicts the POODLE predictions, but given the weak evidence from the other IUPRED methods it does not really contradict the other IUPRED results.
META-Disorder
For this task, we used the PredictProtein server at https://www.predictprotein.org. META-Disorder, published in 2009 by Avner Schlessinger, Marco Punta, Guy Yachdav, Laszlo Kajan and Burkhard Rost, uses a combined prediction of NORSnet, PROFbval and Ucon.
Predicted secondary structure composition
sec str type | H | E | L |
---|---|---|---|
% in protein | 27.30 | 28.74 | 43.97 |
Prediction of disordered residues by META-Disorder (last column)
Number Residue NORSnet NORS2st PROFbval bval2st Ucon Ucon2st MD_raw MD_rel MD2st
....
242 D 0.13 - 0.70 D 0.76 D 0.444 2 -
243 K 0.13 - 0.69 D 0.76 D 0.480 1 -
244 Q 0.13 - 0.66 D 0.93 D 0.531 0 D
245 P 0.17 - 0.73 D 0.92 D 0.520 0 D
246 M 0.28 - 0.65 D 0.90 D 0.525 0 D
247 D 0.31 - 0.68 D 0.87 D 0.515 0 -
248 A 0.36 - 0.70 D 0.87 D 0.520 0 D
249 K 0.40 - 0.69 D 0.89 D 0.485 1 -
....
344 L 0.37 - 0.59 D 0.18 - 0.515 0 -
345 A 0.35 - 0.74 D 0.17 - 0.515 0 -
346 E 0.31 - 0.89 D 0.17 - 0.520 0 D
347 R 0.35 - 0.91 D 0.23 - 0.525 0 D
348 E 0.34 - 0.92 D 0.38 - 0.520 0 D
Key for output
----------------
Number - residue number
Residue - amino-acid type
NORSnet - raw score by NORSnet (prediction of unstructured loops)
NORS2st - two-state prediction by NORSnet; D=disordered
PROFbval - raw score by PROFbval (prediction of residue flexibility from sequence)
Bval2st - two-state prediction by PROFbval
Ucon - raw score by Ucon (prediction of protein disorder using predicted internal contacts)
Ucon2st - two-state prediction by Ucon
MD - raw score by MD (prediction of protein disorder using orthogonal sources)
MD_rel - reliability of the prediction by MD; values range from 0-9. 9=strong prediction
MD2st - two-state prediction by MD
META-Disorder predicts a very short disordered region of 3 residues at the end of the protein, but only with weak evidence (raw scores of around 0.5). Judging by this method alone, a disordered region at the C-terminus is therefore quite unlikely.
Discussion
Most tools predict a disordered region at the C-terminus of the protein. The predicted region is not part of the PDB structure and lies mostly in a transmembrane region. The major obstacle to rating this prediction as correct is that the annotation, especially for this region, is incomplete. As this is a transmembrane region, however, it is not surprising that this part is not annotated. Since a disordered region is one that only adopts a structure when interacting with a binding partner (such as another protein, DNA or small molecules), it is possible that the region is disordered and only becomes structured when it is inserted into the membrane. It is equally possible that a defined structure exists and is simply not known yet, because membrane regions are hard to solve. What makes us doubtful is that the disordered regions predicted within the protein are mostly loops. Without further evidence, and because only one tool predicts a larger disordered region at the C-terminus, we assume that the protein is not disordered.
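A statement such as "most tools predict" can be made explicit with a per-residue majority vote over the `*`/`-` masks printed above. This is only a sketch: the function name and the simple-majority rule are assumptions, and the three masks in the example are the final lines (residues 301-348) of the POODLE-I, POODLE-L and IUPRED short-disorder outputs.

```python
def consensus(masks, min_votes=None):
    """Per-residue vote over equally long '*'/'-' disorder masks.

    A position is marked '*' if at least `min_votes` masks mark it as
    disordered; the default is a simple majority (assumption).
    """
    length = len(masks[0])
    if any(len(mask) != length for mask in masks):
        raise ValueError("all masks must have the same length")
    if min_votes is None:
        min_votes = len(masks) // 2 + 1
    return "".join(
        "*" if sum(mask[i] == "*" for mask in masks) >= min_votes else "-"
        for i in range(length)
    )


# Residues 301-348 as predicted by three of the tools above.
poodle_i = "*" * 48
poodle_l = "*" * 48
iupred_short = "-" * 45 + "***"
print(consensus([poodle_i, poodle_l, iupred_short]))
```

For DISOPRED, the `.` characters would first have to be replaced by `-` so that all masks use the same alphabet.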
Prediction of transmembrane alpha-helices and signal peptides
General
We were given five additional proteins for which to predict transmembrane regions, signal peptides and GO terms. This was done because most of the proteins in the practical are not membrane proteins and therefore yield only "no membrane" results. For this reason, the three membrane proteins [BACR_HALSA], [LAMP1_HUMAN] and [A4_HUMAN] were provided; our HFE protein [HFE_HUMAN] is a membrane protein as well.
The following table gives a quick overview of the protein properties:
Accession | Entry name | Organism | Subcellular location | Signal peptide |
---|---|---|---|---|
Q30201 | HFE_HUMAN | Homo sapiens (Human) | Membrane; Single-pass type I membrane protein | 1-22 |
P02945 | BACR_HALSA | Halobacterium salinarium / (Halobacterium halobium) | Cell membrane; Multi-pass membrane protein | no |
P02753 | RET4_HUMAN | Homo sapiens (Human) | Secreted | 1-18 |
Q9Y5Q6 | INSL5_HUMAN | Homo sapiens (Human) | Secreted | 1-22 |
P11279 | LAMP1_HUMAN | Homo sapiens (Human) | Cell membrane; Single-pass type I membrane protein [...] | 1-28 |
P05067 | A4_HUMAN | Homo sapiens (Human) | Membrane; Single-pass type I membrane protein | 1-17 |
We predict transmembrane regions and signal peptides for these six proteins using different tools. Because HFE_HUMAN, the protein we normally work with, is itself a membrane protein and thus lets us judge the prediction accuracy, we give a graphical and detailed overview only for HFE_HUMAN and summarize the results for the additional proteins in text form.
We use the UniProt entries as ground truth and briefly compare the prediction results with them.
Why is the prediction of transmembrane helices and signal peptides grouped together here?
Transmembrane helices and signal peptides are very hard to tell apart if a method predicts only one of them, because both consist almost entirely of hydrophobic residues. This produces many false positives. For that reason, many methods combine the prediction of transmembrane helices and signal peptides, which drastically reduces the number of false positives.
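The hydrophobicity argument can be checked on HFE itself. The sketch below uses the Kyte-Doolittle hydropathy scale, which is an assumption made for illustration (none of the tools in this section is claimed to use it), and compares the annotated signal peptide (residues 1-22) with the annotated transmembrane helix (residues 307-330). The function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(segment):
    """Average Kyte-Doolittle hydropathy of a sequence segment."""
    return sum(KYTE_DOOLITTLE[aa] for aa in segment) / len(segment)


# Segments of HFE_HUMAN according to the UniProt annotation.
signal_peptide = "MGPRARPALLLLMLLQTAVLQG"    # residues 1-22
tm_helix = "IGVISGIAVFVVILFIGILFIILR"        # residues 307-330

print(f"signal peptide: {mean_hydropathy(signal_peptide):.2f}")
print(f"TM helix:       {mean_hydropathy(tm_helix):.2f}")
```

Both segments come out hydrophobic on average, which illustrates why a method that looks only for hydrophobic stretches can mistake a signal peptide for a transmembrane helix.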
There are different types of signal peptides, but they all act as import or export signals, for example:
- Import into the peroxisome
- Import into the nucleus
- Export from the nucleus
- Import into the mitochondrion
TMHMM
TMHMM is a tool for predicting membrane topology (transmembrane helices) in proteins based on a hidden Markov model with different states. It classifies each region as "inside", "outside" or "TMhelix". However, TMHMM cannot predict signal peptides.
We ran TMHMM locally on our Linux machine after correcting some path issues in its configuration files.
The command we used was:
- tmhmm x.fasta > x.tmhmm
where 'x' stands for the UniProt entry name of one of the proteins. Afterwards we tried to plot the result for HFE_HUMAN with gnuplot, but this failed as well because of path issues in the gnuplot script that tmhmm generates automatically. After correcting these paths, gnuplot worked and produced the graphical output.
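Running the command for all six proteins can be scripted. The following is a minimal sketch that wraps exactly the command shown above; it assumes that the `tmhmm` executable is on the PATH and that one FASTA file per protein exists. The function names and the file names in the usage comment are hypothetical.

```python
import subprocess
from pathlib import Path


def tmhmm_command(fasta):
    """Argument list equivalent to `tmhmm x.fasta`."""
    return ["tmhmm", str(fasta)]


def run_tmhmm(fasta):
    """Run tmhmm on one FASTA file and write its output to x.tmhmm.

    Mirrors `tmhmm x.fasta > x.tmhmm`; raises CalledProcessError
    if tmhmm exits with an error.
    """
    fasta = Path(fasta)
    out_path = fasta.with_suffix(".tmhmm")
    with open(out_path, "w") as out:
        subprocess.run(tmhmm_command(fasta), stdout=out, check=True)
    return out_path


# Usage (file names are placeholders):
# for name in ["HFE_HUMAN", "BACR_HALSA", "RET4_HUMAN"]:
#     run_tmhmm(f"{name}.fasta")
```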
A color code is applied to provide easier reading:
- green: The predicted region matches the experimentally resolved region from UniProt (+/- 5 residues allowed). TMHMM succeeded.
- yellow: The predicted region only partially matches the experimental region (mostly errors with signal peptides). TMHMM has to be improved.
- red: The predicted region does not match the experimental region at all. TMHMM failed.
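The color code can be written down as a small rule. In this sketch, "green" follows the +/- 5 residue tolerance stated above; interpreting "partial match" as any overlap between the two regions is an assumption, and the function name is hypothetical.

```python
def rate_region(predicted, annotated, tolerance=5):
    """Classify a predicted (start, end) region against the annotation.

    green:  both boundaries lie within `tolerance` residues
    yellow: the regions overlap, but a boundary is off (assumed meaning
            of a partial match)
    red:    the regions do not overlap at all
    """
    pred_start, pred_end = predicted
    ref_start, ref_end = annotated
    if (abs(pred_start - ref_start) <= tolerance
            and abs(pred_end - ref_end) <= tolerance):
        return "green"
    if pred_start <= ref_end and ref_start <= pred_end:
        return "yellow"
    return "red"


# HFE_HUMAN: TMhelix 307-329 vs. annotated helix 307-330
print(rate_region((307, 329), (307, 330)))
# HFE_HUMAN: outside 1-306 vs. annotated extracellular region 23-306
print(rate_region((1, 306), (23, 306)))
```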
TMHMM | UniProt | ||||||
---|---|---|---|---|---|---|---|
id | version | region | start | end | region | start | end |
Q30201|HFE_HUMAN | TMHMM2.0 | outside | 1 | 306 | Signal peptide | 1 | 22 |
Q30201|HFE_HUMAN | Extracellular | 23 | 306 | ||||
Q30201|HFE_HUMAN | TMHMM2.0 | TMhelix | 307 | 329 | Helical | 307 | 330 |
Q30201|HFE_HUMAN | TMHMM2.0 | inside | 330 | 348 | Cytoplasmic | 331 | 348 |
TMHMM clearly misses the signal peptide and counts the whole region 1-306 as outside, which otherwise agrees with UniProt. The TMhelix (307-329) and the inside region (330-348) are also placed correctly, with a deviation of only one residue, which is insignificant. TMHMM was therefore very successful in predicting the correct regions. The results are shown in Figure 10.
TMHMM | UniProt | ||||||
---|---|---|---|---|---|---|---|
id | version | region | start | end | region | start | end |
sp_P02945_BACR_HALSA | TMHMM2.0 | outside | 1 | 22 | Extracellular | 14 | 23 |
sp_P02945_BACR_HALSA | TMHMM2.0 | TMhelix | 23 | 42 | Helical; Name=Helix A | 24 | 42 |
sp_P02945_BACR_HALSA | TMHMM2.0 | inside | 43 | 54 | Cytoplasmic | 43 | 56 |
sp_P02945_BACR_HALSA | TMHMM2.0 | TMhelix | 55 | 77 | Helical; Name=Helix B | 57 | 75 |
sp_P02945_BACR_HALSA | TMHMM2.0 | outside | 78 | 91 | Extracellular | 76 | 91 |
sp_P02945_BACR_HALSA | TMHMM2.0 | TMhelix | 92 | 114 | Helical; Name=Helix C | 92 | 109 |
sp_P02945_BACR_HALSA | TMHMM2.0 | inside | 115 | 120 | Cytoplasmic | 110 | 120 |
sp_P02945_BACR_HALSA | TMHMM2.0 | TMhelix | 121 | 143 | Helical; Name=Helix D | 121 | 140 |
sp_P02945_BACR_HALSA | TMHMM2.0 | outside | 144 | 147 | Extracellular | 141 | 147 |
sp_P02945_BACR_HALSA | TMHMM2.0 | TMhelix | 148 | 170 | Helical; Name=Helix E | 148 | 167 |
sp_P02945_BACR_HALSA | TMHMM2.0 | inside | 171 | 189 | Cytoplasmic | 168 | 185 |
sp_P02945_BACR_HALSA | TMHMM2.0 | TMhelix | 190 | 212 | Helical; Name=Helix F | 186 | 204 |
sp_P02945_BACR_HALSA | TMHMM2.0 | outside | 213 | 262 | Extracellular | 205 | 216 |
Helical; Name=Helix G | 217 | 236 | |||||
Cytoplasmic | 237 | 262 |
As is clearly visible, TMHMM is also able to predict the regions of non-eukaryotic proteins almost completely correctly. Only the last helical and cytoplasmic regions are missed.
TMHMM | UniProt | ||||||
---|---|---|---|---|---|---|---|
id | version | region | start | end | region | start | end |
sp_P02753_RET4_HUMAN | TMHMM2.0 | outside | 1 | 201 | Signal peptide | 1 | 18 |
TMHMM misses the signal peptide, but the rest of the prediction is correct, because RET4_HUMAN is a secreted protein.
TMHMM | UniProt | ||||||
---|---|---|---|---|---|---|---|
id | version | region | start | end | region | start | end |
sp_Q9Y5Q6_INSL5_HUMAN | TMHMM2.0 | outside | 1 | 135 | Signal peptide | 1 | 22 |
As usual, TMHMM misses the signal peptide but is otherwise accurate. INSL5_HUMAN is a hormone and is thus secreted into the extracellular space.
TMHMM | UniProt | ||||||
---|---|---|---|---|---|---|---|
id | version | region | start | end | region | start | end |
sp_P11279_LAMP1_HUMAN | TMHMM2.0 | inside | 1 | 10 | Signal peptide | 1 | 28 |
sp_P11279_LAMP1_HUMAN | TMHMM2.0 | TMhelix | 11 | 33 | |||
sp_P11279_LAMP1_HUMAN | TMHMM2.0 | outside | 34 | 383 | Lumenal | 29 | 382 |
sp_P11279_LAMP1_HUMAN | TMHMM2.0 | TMhelix | 384 | 406 | Helical | 383 | 405 |
sp_P11279_LAMP1_HUMAN | TMHMM2.0 | inside | 407 | 417 | Cytoplasmic | 406 | 417 |
TMHMM does not predict this protein as well. It mistakes the signal peptide region for 'inside' and 'TMhelix' segments and is wrong for the lumenal region, which it labels 'outside'. The rest is predicted well.
TMHMM | UniProt | ||||||
---|---|---|---|---|---|---|---|
id | version | region | start | end | region | start | end |
sp_P05067_A4_HUMAN | TMHMM2.0 | outside | 1 | 700 | Signal peptide | 1 | 17 |
Extracellular | 18 | 699 | |||||
sp_P05067_A4_HUMAN | TMHMM2.0 | TMhelix | 701 | 723 | Helical | 700 | 723 |
sp_P05067_A4_HUMAN | TMHMM2.0 | inside | 724 | 770 | Cytoplasmic | 724 | 770 |
TMHMM again predicted almost correctly and missed only the signal peptide at the N-terminus, which it cannot predict.
Phobius and PolyPhobius
For Phobius and PolyPhobius, we used the web service<ref>http://www.ncbi.nlm.nih.gov/pubmed/17483518?dopt=Abstract</ref> at http://phobius.sbc.su.se/ with standard settings.
Phobius is a combined predictor for transmembrane protein topology and signal peptides. Phobius models the different regions of the sequence as a series of interconnected states of an HMM.<ref>http://www.ncbi.nlm.nih.gov/pubmed/15111065?dopt=Abstract</ref>
PolyPhobius is a hidden Markov model (HMM) decoding algorithm. It combines the probabilities for sequence features of homologs by considering the average of the posterior label probability at each position in a global sequence alignment. PolyPhobius is benchmarked against Phobius. <ref>http://www.ncbi.nlm.nih.gov/pubmed/15961464?dopt=Abstract</ref>
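The averaging step can be illustrated with a toy example. This is a simplified sketch and not the PolyPhobius implementation: the data layout (one dict of label probabilities per alignment column, `None` for gaps), the function names and the final per-column argmax, which stands in for the actual HMM decoding, are all assumptions.

```python
def average_posteriors(aligned_rows):
    """Average posterior label probabilities per alignment column.

    aligned_rows: one list per homolog; each entry is a dict
    {label: probability} or None where the homolog has a gap.
    """
    n_cols = len(aligned_rows[0])
    averaged = []
    for col in range(n_cols):
        cells = [row[col] for row in aligned_rows if row[col] is not None]
        labels = set().union(*cells) if cells else set()
        averaged.append({
            label: sum(cell.get(label, 0.0) for cell in cells) / len(cells)
            for label in labels
        })
    return averaged


def most_probable_labels(averaged):
    """Pick the label with the highest averaged probability per column."""
    return [max(col, key=col.get) if col else None for col in averaged]


# Toy alignment: two homologs, two columns; labels M (membrane), o (outside).
rows = [
    [{"M": 0.9, "o": 0.1}, {"M": 0.2, "o": 0.8}],
    [{"M": 0.5, "o": 0.5}, None],               # gap in the second column
]
print(most_probable_labels(average_posteriors(rows)))
```

The point of the averaging is that a weak signal in the query sequence can be reinforced (or overruled) by its homologs before the topology is decoded.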
Phobius
Phobius predicts very accurately, as seen below. The transmembrane region is predicted just 1-2 residues upstream of the annotated region. The same holds for the topological domains before and after the transmembrane region. The signal peptide is also predicted correctly. The probability distribution is shown in Figure 11.
                                           PREDICTED | ANNOTATION
ID   sp|Q30201|HFE_HUMAN
FT   SIGNAL        1     21                          | 1-20
FT   REGION        1      7       N-REGION.
FT   REGION        8     16       H-REGION.
FT   REGION       17     21       C-REGION.
FT   TOPO_DOM     22    304       NON CYTOPLASMIC.   | 23-306
FT   TRANSMEM    305    329                          | 307-330
FT   TOPO_DOM    330    348       CYTOPLASMIC.       | 331-348
PolyPhobius
PolyPhobius also predicts very accurately, but in our case not quite as accurately as Phobius. The difference is small, however. The probability distribution is shown in Figure 12.
                                           PREDICTED | ANNOTATION
ID   sp|Q30201|HFE_HUMAN
FT   SIGNAL        1     23                          | 1-20
FT   REGION        1      5       N-REGION.
FT   REGION        6     19       H-REGION.
FT   REGION       20     23       C-REGION.
FT   TOPO_DOM     24    304       NON CYTOPLASMIC.   | 23-306
FT   TRANSMEM    305    329                          | 307-330
FT   TOPO_DOM    330    348       CYTOPLASMIC.       | 331-348
Additional proteins
- BACR_HALSA: Phobius and PolyPhobius give almost the same result. There is only a slight difference in the domain lengths, and both methods predicted the membrane topology correctly.
- RET4_HUMAN: Phobius and PolyPhobius predicted the position of the signal peptide correctly; the overall prediction is correct, too.
- INSL5_HUMAN: Phobius and PolyPhobius again predict accurately. The signal peptide and the extracellular region are at the correct positions.
- LAMP1_HUMAN: Phobius and PolyPhobius predicted the membrane topology correctly.
- A4_HUMAN: The signal peptide position predicted by Phobius and PolyPhobius is correct.
Summing up, Phobius and PolyPhobius predict the membrane topology very accurately. PolyPhobius takes much longer to produce a result but does not give a noticeably better one, and sometimes even a slightly worse one. We therefore consider Phobius the more convenient tool.
OCTOPUS and SPOCTOPUS
OCTOPUS combines HMMs and artificial neural networks. OCTOPUS first creates a sequence profile by a homology search using BLAST. The profile is used as input to a set of neural networks, which predict the preferred location of each residue. Each residue is predicted to be either inside or outside the cell and to be located in a transmembrane (M), interface (I), close loop (L) or globular loop (G) environment.
SPOCTOPUS is an extended version of OCTOPUS that can also predict signal peptides. It uses a neural network to predict a signal peptide if the score for each of the 70 N-terminal residues is high enough.
Both OCTOPUS and SPOCTOPUS predict the signal peptide and the transmembrane region correctly, as can be seen in the images below. Both methods also predict a signal peptide of the correct length at the N-terminus. Figure 13 and Figure 14 show the prediction for the HFE protein.
Additional proteins
The probability distribution for the additional proteins is shown in Figures 15 to 19 below.
- BACR_HALSA:
OCTOPUS and SPOCTOPUS give the same prediction. Both are correct, because BACR_HALSA does not contain a signal peptide.
- RET4_HUMAN:
OCTOPUS predicts a TM helix instead of the signal peptide; SPOCTOPUS corrects this error.
- INSL5_HUMAN:
OCTOPUS makes the same error as for RET4_HUMAN; SPOCTOPUS again corrects it.
- LAMP1_HUMAN:
OCTOPUS and SPOCTOPUS both predict the TM helix correctly. OCTOPUS, however, makes the same error as for RET4_HUMAN and INSL5_HUMAN and predicts a TM helix instead of the signal peptide; SPOCTOPUS corrects that.
- A4_HUMAN:
OCTOPUS predicts a reentrant/dip region instead of the signal peptide region; SPOCTOPUS corrects that.
SignalP
To use SignalP locally on our Linux box, we again had to correct some path issues.
The command we used was:
- signalp -t y x.fasta > x.signalp
where 'x' again stands for the UniProt entry name of the protein. 'y' was chosen according to the organism of the protein: for all human proteins 'y' was set to eukaryotes ('euk'), and for the prokaryotic protein P02945 (BACR_HALSA) to Gram-negative ('gram-'). This switch selects the neural networks and hidden Markov models, which are trained separately for the different organism groups.
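The per-protein calls can be scripted. The sketch below only assembles the command line shown above for each protein; the helper names are hypothetical, and it assumes that the FASTA files are named after the UniProt entry names and that `signalp` is on the PATH.

```python
import subprocess

# Organism switch (-t) per UniProt entry name, following the choice described above.
ORGANISM_TYPE = {
    "HFE_HUMAN": "euk",
    "RET4_HUMAN": "euk",
    "INSL5_HUMAN": "euk",
    "LAMP1_HUMAN": "euk",
    "A4_HUMAN": "euk",
    "BACR_HALSA": "gram-",
}


def build_signalp_command(entry_name):
    """Return the argument list for: signalp -t y x.fasta"""
    return ["signalp", "-t", ORGANISM_TYPE[entry_name], f"{entry_name}.fasta"]


def run_signalp(entry_name):
    """Run SignalP and redirect its standard output to x.signalp."""
    with open(f"{entry_name}.signalp", "w") as output_file:
        subprocess.run(build_signalp_command(entry_name), stdout=output_file, check=True)


# Print the commands instead of running them, so the sketch works without SignalP installed.
for entry in ORGANISM_TYPE:
    print(" ".join(build_signalp_command(entry)), ">", f"{entry}.signalp")
```

Calling `run_signalp(entry)` inside the loop instead of `print` would execute the predictions.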
For the graphical output of HFE_HUMAN we used the SignalP server from: http://www.cbs.dtu.dk/services/SignalP
Three scores of the SignalP-NN prediction are shown in Figure 20:
- C-score ('cleavage site'): raw cleavage site prediction
- S-mean score ('average of the S-score'): discrimination between secretory and non-secretory proteins
- Y-max score ('combination of C-score and S-score'): better cleavage site prediction
SignalP-NN result: sp|Q30201|HFE_HUMAN, length = 348

Measure | Position | Value | Cutoff | Signal peptide? |
---|---|---|---|---|
max. C | 23 | 0.534 | 0.32 | YES |
max. Y | 23 | 0.599 | 0.33 | YES |
max. S | 16 | 0.995 | 0.87 | YES |
mean S | 1-22 | 0.935 | 0.48 | YES |
D | 1-22 | 0.767 | 0.43 | YES |

Most likely cleavage site between pos. 22 and 23: LQG-RL

SignalP-HMM result: sp|Q30201|HFE_HUMAN
- Prediction: Signal peptide
- Signal peptide probability: 0.998
- Signal anchor probability: 0.000
- Max cleavage site probability: 0.297 between pos. 22 and 23
SignalP predicts a signal peptide probability of almost 1.0 and, accordingly, a signal anchor probability of 0. The cleavage site is predicted between pos. 22 and 23 (Figure 21).
According to UniProt, the signal peptide spans pos. 1 to 22, so SignalP predicted both the signal peptide and the cleavage site exactly.
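The YES/NO column of the SignalP-NN output follows from comparing each value with its cutoff. The sketch below reproduces this for the HFE_HUMAN values listed above. It also recomputes the D-score as the average of mean S and max. Y; that relation is our assumption about this SignalP version and is not stated in the text, although it agrees with the values reported in this section.

```python
# SignalP-NN measures for HFE_HUMAN, copied from the output above: (value, cutoff).
HFE_NN = {
    "max. C": (0.534, 0.32),
    "max. Y": (0.599, 0.33),
    "max. S": (0.995, 0.87),
    "mean S": (0.935, 0.48),
    "D": (0.767, 0.43),
}


def exceeds_cutoff(measures):
    """Return True for every measure whose value lies above its cutoff."""
    return {name: value > cutoff for name, (value, cutoff) in measures.items()}


def d_score(mean_s, max_y):
    """Assumed definition: D is the average of the mean S-score and the max. Y-score."""
    return (mean_s + max_y) / 2


calls = exceeds_cutoff(HFE_NN)
recomputed_d = d_score(HFE_NN["mean S"][0], HFE_NN["max. Y"][0])
print(calls)
print(round(recomputed_d, 3))
```

For HFE_HUMAN every measure lies above its cutoff, and the recomputed D-score matches the reported 0.767.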
SignalP-NN result: sp_P02945_BACR_HALSA, length = 70

Measure | Position | Value | Cutoff | Signal peptide? |
---|---|---|---|---|
max. C | 16 | 0.331 | 0.52 | NO |
max. Y | 43 | 0.066 | 0.33 | NO |
max. S | 32 | 0.948 | 0.92 | YES |
mean S | 1-42 | 0.216 | 0.49 | NO |
D | 1-42 | 0.141 | 0.44 | NO |

Most likely cleavage site between pos. 42 and 43: FLV-KG

SignalP-HMM result: sp|P02945|BACR_HALSA
- Prediction: Non-secretory protein
- Signal peptide probability: 0.000
- Max cleavage site probability: 0.000 between pos. 15 and 16
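For BACR_HALSA the SignalP-NN result is wrong in part: the max. S score exceeds its cutoff and a cleavage site between pos. 42 and 43 is reported, although this protein does not contain a signal peptide. The D-score and the HMM, however, do not indicate a signal peptide.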
SignalP-NN result: sp_P02753_RET4_HUMAN, length = 70

Measure | Position | Value | Cutoff | Signal peptide? |
---|---|---|---|---|
max. C | 19 | 0.929 | 0.32 | YES |
max. Y | 19 | 0.901 | 0.33 | YES |
max. S | 1 | 0.994 | 0.87 | YES |
mean S | 1-18 | 0.938 | 0.48 | YES |
D | 1-18 | 0.920 | 0.43 | YES |

Most likely cleavage site between pos. 18 and 19: GRA-ER

SignalP-HMM result: sp_P02753_RET4_HUMAN
- Prediction: Signal peptide
- Signal peptide probability: 1.000
- Signal anchor probability: 0.000
- Max cleavage site probability: 0.979 between pos. 18 and 19
SignalP predicted the cleavage site very well: according to UniProt, the signal peptide spans pos. 1 to 18 and is cleaved off after it.
SignalP-NN result: sp_Q9Y5Q6_INSL5_HUMA, length = 70

Measure | Position | Value | Cutoff | Signal peptide? |
---|---|---|---|---|
max. C | 23 | 0.855 | 0.32 | YES |
max. Y | 23 | 0.778 | 0.33 | YES |
max. S | 13 | 0.987 | 0.87 | YES |
mean S | 1-22 | 0.852 | 0.48 | YES |
D | 1-22 | 0.815 | 0.43 | YES |

Most likely cleavage site between pos. 22 and 23: VRS-KE

SignalP-HMM result: sp_Q9Y5Q6_INSL5_HUMAN
- Prediction: Signal peptide
- Signal peptide probability: 0.999
- Signal anchor probability: 0.000
- Max cleavage site probability: 0.911 between pos. 22 and 23
This prediction is also correct (UniProt: signal peptide at pos. 1-22).
SignalP-NN result: sp_P11279_LAMP1_HUMA, length = 70

Measure | Position | Value | Cutoff | Signal peptide? |
---|---|---|---|---|
max. C | 29 | 0.978 | 0.32 | YES |
max. Y | 29 | 0.903 | 0.33 | YES |
max. S | 19 | 0.999 | 0.87 | YES |
mean S | 1-28 | 0.960 | 0.48 | YES |
D | 1-28 | 0.932 | 0.43 | YES |

Most likely cleavage site between pos. 28 and 29: ASA-AM

SignalP-HMM result: sp_P11279_LAMP1_HUMAN
- Prediction: Signal peptide
- Signal peptide probability: 1.000
- Signal anchor probability: 0.000
- Max cleavage site probability: 0.847 between pos. 28 and 29
SignalP again predicted the cleavage site correctly.
SignalP-NN result: sp_P05067_A4_HUMAN, length = 70

Measure | Position | Value | Cutoff | Signal peptide? |
---|---|---|---|---|
max. C | 18 | 0.891 | 0.32 | YES |
max. Y | 18 | 0.850 | 0.33 | YES |
max. S | 2 | 0.992 | 0.87 | YES |
mean S | 1-17 | 0.967 | 0.48 | YES |
D | 1-17 | 0.909 | 0.43 | YES |

Most likely cleavage site between pos. 17 and 18: ARA-LE

SignalP-HMM result: sp_P05067_A4_HUMAN
- Prediction: Signal peptide
- Signal peptide probability: 1.000
- Signal anchor probability: 0.000
- Max cleavage site probability: 0.993 between pos. 17 and 18
That prediction is accurate, too.
In short, SignalP predicts the position of the signal peptide cleavage site very accurately, but is less reliable for the non-eukaryotic BACR_HALSA.
TargetP
TargetP predicts a signal peptide with high probability for each of the proteins. Apart from P02945 (BACR_HALSA), which is not eukaryotic and has no signal peptide, the method seems to be quite accurate. TargetP assigns the P02945 prediction a reliability class (RC) of 4, which marks it as barely reliable (1 is the highest confidence, 5 the lowest).
targetp v1.1 prediction results
- Number of query sequences: 6
- Cleavage site predictions included.
- Using NON-PLANT networks.

Name | Len | mTP | SP | other | Loc | RC | TPlen |
---|---|---|---|---|---|---|---|
sp_Q30201_HFE_HUMAN | 348 | 0.433 | 0.912 | 0.004 | S | 3 | 22 |
sp_P02945_BACR_HALSA | 262 | 0.019 | 0.897 | 0.562 | S | 4 | 116 |
sp_P02753_RET4_HUMAN | 201 | 0.242 | 0.928 | 0.020 | S | 2 | 18 |
sp_Q9Y5Q6_INSL5_HUMA | 135 | 0.074 | 0.899 | 0.037 | S | 1 | 22 |
sp_P11279_LAMP1_HUMA | 417 | 0.043 | 0.953 | 0.017 | S | 1 | 28 |
sp_P05067_A4_HUMAN | 770 | 0.035 | 0.937 | 0.084 | S | 1 | 17 |
cutoff | | 0.000 | 0.000 | 0.000 | | | |
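The TargetP presequence length (TPlen) can be compared with the cleavage positions from the SignalP-NN outputs above. The following sketch parses the TargetP rows and checks, per accession, whether both tools place the cleavage after the same residue. The parsing helper is hypothetical and only handles rows in which every column is filled, as is the case here.

```python
# TargetP result rows, copied from the output above.
TARGETP_ROWS = """\
sp_Q30201_HFE_HUMAN 348 0.433 0.912 0.004 S 3 22
sp_P02945_BACR_HALSA 262 0.019 0.897 0.562 S 4 116
sp_P02753_RET4_HUMAN 201 0.242 0.928 0.020 S 2 18
sp_Q9Y5Q6_INSL5_HUMA 135 0.074 0.899 0.037 S 1 22
sp_P11279_LAMP1_HUMA 417 0.043 0.953 0.017 S 1 28
sp_P05067_A4_HUMAN 770 0.035 0.937 0.084 S 1 17
"""

# Last residue before the "most likely cleavage site" in the SignalP-NN outputs above.
SIGNALP_LAST_RESIDUE = {
    "Q30201": 22, "P02945": 42, "P02753": 18,
    "Q9Y5Q6": 22, "P11279": 28, "P05067": 17,
}


def parse_targetp(text):
    """Map UniProt accession -> predicted location, reliability class and TPlen."""
    records = {}
    for line in text.splitlines():
        name, _length, _mtp, _sp, _other, loc, rc, tplen = line.split()
        accession = name.split("_")[1]
        records[accession] = {"loc": loc, "rc": int(rc), "tplen": int(tplen)}
    return records


records = parse_targetp(TARGETP_ROWS)
agreement = {acc: rec["tplen"] == SIGNALP_LAST_RESIDUE[acc] for acc, rec in records.items()}
for accession, same in agreement.items():
    print(accession, "RC", records[accession]["rc"], "agrees with SignalP:", same)
```

Both tools agree for the five human proteins; they disagree only for P02945, the entry with reliability class 4.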
Discussion
After using different tools with dissimilar approaches to identify transmembrane regions and signal peptides, we can state that tools which combine both features in one prediction are mostly more reliable. They also produce fewer misclassifications than predictors of a single feature when both features are present.
In our test cases we also found that the single-feature predictors excel at the feature they were designed for and only misclassify if both features are present in the protein.
We would therefore advise using a single-feature predictor like SignalP or TargetP and double-checking the result with a combined predictor like SPOCTOPUS or Phobius.
Prediction of GO terms
General
GO terms classify protein functions. Each GO term describes a different function, so assigning GO terms to a protein means predicting its functions.
HFE_HUMAN is annotated with 27 different GO terms <ref>http://www.ebi.ac.uk/QuickGO/GProtein?ac=Q30201</ref>:
GOID | GO Term | Aspect |
---|---|---|
GO:0002474 | antigen processing and presentation of peptide antigen via MHC class I | Process |
GO:0005515 | protein binding | Function |
GO:0005737 | cytoplasm | Component |
GO:0005769 | early endosome | Component |
GO:0005886 | plasma membrane | Component |
GO:0005887 | integral to plasma membrane | Component |
GO:0006461 | protein complex assembly | Process |
GO:0006810 | transport | Process |
GO:0006811 | ion transport | Process |
GO:0006826 | iron ion transport | Process |
GO:0006879 | cellular iron ion homeostasis | Process |
GO:0006898 | receptor-mediated endocytosis | Process |
GO:0006955 | immune response | Process |
GO:0007565 | female pregnancy | Process |
GO:0010106 | cellular response to iron ion starvation | Process |
GO:0016020 | membrane | Component |
GO:0016021 | integral to membrane | Component |
GO:0019882 | antigen processing and presentation | Process |
GO:0031410 | cytoplasmic vesicle | Component |
GO:0042446 | hormone biosynthetic process | Process |
GO:0042612 | MHC class I protein complex | Component |
GO:0045177 | apical part of cell | Component |
GO:0045178 | basal part of cell | Component |
GO:0048471 | perinuclear region of cytoplasm | Component |
GO:0055037 | recycling endosome | Component |
GO:0055072 | iron ion homeostasis | Process |
GO:0060586 | multicellular organismal iron ion homeostasis | Process |
GOPET
GOPET predicted 2 GO terms for HFE_HUMAN, neither of which overlaps with the annotation.
GOID | Aspect | Confidence | GO Term |
---|---|---|---|
GO:0004872 | Molecular Function | 91% | receptor activity |
GO:0030106 | Molecular Function | 88% | MHC class I receptor activity |
Additional proteins:
BACR_HALSA
GOPET predicted 3 GO terms, but only the one with the highest confidence (77%) is really connected to the protein:
- ion channel activity
RET4_HUMAN
GOPET predicted 8 GO terms with confidences from 90% down to 60%; 5 of them are linked to the protein:
- binding
- retinoid binding
- retinol binding
- transporter activity
- retinal binding
INSL5_HUMAN
GOPET predicted only 1 GO term, with a confidence of 80%, and it is linked to the protein:
- hormone activity
LAMP1_HUMAN
GOPET predicted 2 GO terms with 60% confidence each, but neither is linked to the protein.
A4_HUMAN
GOPET predicted 13 GO terms with confidences from 87% down to 67%, but only 7 of them are really connected to the protein:
- serine-type endopeptidase inhibitor activity
- peptidase inhibitor activity
- binding
- protein binding
- metal ion binding
- DNA binding
- heparin binding
Pfam
Pfam is a database of protein domains and families. For our search we used the web server at http://pfam.sanger.ac.uk/search with default values.
Afterwards we used the pfam2go mapping to find the GO entries matching the Pfam families.
Pfam classifies the HFE_HUMAN protein into two families (Figure 22):
- Family: MHC_I (PF00129)
- Family: C1-set (PF07654)
The pfam2go data contain four hits for the PF00129 family:
- Pfam:PF00129 MHC_I > GO:immune response ; GO:0006955
- Pfam:PF00129 MHC_I > GO:antigen processing and presentation ; GO:0019882
- Pfam:PF00129 MHC_I > GO:membrane ; GO:0016020
- Pfam:PF00129 MHC_I > GO:MHC class I protein complex ; GO:0042612
All of these GO entries are listed in the UniProt entry of HFE_HUMAN, so this family assignment is correct.
The pfam2go data contain no entries for the PF07654 family, so there are no cross-links to validate against UniProt; this family may not yet be included in pfam2go.
For a more detailed picture, see Figure 23, which shows the Pfam-A matches with their alignments.
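The lookup in the pfam2go data and the comparison with the UniProt annotation can be sketched as follows. The parser assumes the line layout of the pfam2go hits shown above; the function name and the regular expression are our own, and the annotated set is copied from the GO table of HFE_HUMAN in the section on GO terms.

```python
import re

# pfam2go hits for PF00129, copied from the text above.
PFAM2GO_LINES = [
    "Pfam:PF00129 MHC_I > GO:immune response ; GO:0006955",
    "Pfam:PF00129 MHC_I > GO:antigen processing and presentation ; GO:0019882",
    "Pfam:PF00129 MHC_I > GO:membrane ; GO:0016020",
    "Pfam:PF00129 MHC_I > GO:MHC class I protein complex ; GO:0042612",
]

# The 27 GO IDs annotated for HFE_HUMAN (see the table in the General section).
HFE_ANNOTATED = {
    "GO:0002474", "GO:0005515", "GO:0005737", "GO:0005769", "GO:0005886",
    "GO:0005887", "GO:0006461", "GO:0006810", "GO:0006811", "GO:0006826",
    "GO:0006879", "GO:0006898", "GO:0006955", "GO:0007565", "GO:0010106",
    "GO:0016020", "GO:0016021", "GO:0019882", "GO:0031410", "GO:0042446",
    "GO:0042612", "GO:0045177", "GO:0045178", "GO:0048471", "GO:0055037",
    "GO:0055072", "GO:0060586",
}

LINE_PATTERN = re.compile(r"^Pfam:(PF\d+) (\S+) > GO:(.+) ; (GO:\d{7})$")


def parse_pfam2go(lines):
    """Map Pfam accession -> list of (GO ID, GO term); non-matching lines are skipped."""
    mapping = {}
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue
        pfam_id, _family, term, go_id = match.groups()
        mapping.setdefault(pfam_id, []).append((go_id, term))
    return mapping


mapping = parse_pfam2go(PFAM2GO_LINES)
for family in ("PF00129", "PF07654"):
    predicted = {go_id for go_id, _term in mapping.get(family, [])}
    confirmed = predicted & HFE_ANNOTATED
    print(family, len(predicted), "GO terms,", len(confirmed), "confirmed by UniProt")
```

PF07654 yields an empty set here because no pfam2go line for it was found, which corresponds to the missing cross-links described above.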
Additional proteins:
BACR_HALSA
- Pfam:PF01036 Bac_rhodopsin > GO:ion channel activity ; GO:0005216
- Pfam:PF01036 Bac_rhodopsin > GO:ion transport ; GO:0006811
- Pfam:PF01036 Bac_rhodopsin > GO:membrane ; GO:0016020

RET4_HUMAN
- Pfam:PF00061 Lipocalin > GO:binding ; GO:0005488

INSL5_HUMAN
- Pfam:PF00049 Insulin > GO:hormone activity ; GO:0005179
- Pfam:PF00049 Insulin > GO:extracellular region ; GO:0005576

LAMP1_HUMAN
- Pfam:PF01299 Lamp > GO:membrane ; GO:0016020

A4_HUMAN
- Pfam:PF02177 APP_N > GO:binding ; GO:0005488
- Pfam:PF02177 APP_N > GO:integral to membrane ; GO:0016021
- Family: APP_Cu_bd (PF12924) --> no match at pfam2go
- Pfam:PF00014 Kunitz_BPTI > GO:serine-type endopeptidase inhibitor activity ; GO:0004867
- Family: APP_E2 (PF12925) --> no match at pfam2go
- Pfam:PF03494 Beta-APP > GO:binding ; GO:0005488
- Pfam:PF03494 Beta-APP > GO:integral to membrane ; GO:0016021
- Family: APP_amyloid (PF10515) --> no match at pfam2go
In summary, all predicted GO terms are correct and are cross-linked in the corresponding UniProt entries. However, the predicted GO terms are not exhaustive; UniProt lists many more.
ProtFun 2.2
ProtFun is an ab initio prediction server.
Results
ProtFun assigned immune response (GO:0006955; Process) to HFE, which is correct. However, this is the only correct GO term that ProtFun predicts for the HFE gene.
Functional category Prob Odds
Amino_acid_biosynthesis 0.011 0.484
Biosynthesis_of_cofactors 0.105 1.452
Cell_envelope => 0.633 10.377
Cellular_processes 0.095 1.297
Central_intermediary_metabolism 0.231 3.663
Energy_metabolism 0.059 0.659
Fatty_acid_metabolism 0.016 1.265
Purines_and_pyrimidines 0.583 2.400
Regulatory_functions 0.013 0.079
Replication_and_transcription 0.019 0.073
Translation 0.079 1.801
Transport_and_binding 0.732 1.785
Enzyme/nonenzyme Prob Odds
Enzyme 0.208 0.727
Nonenzyme => 0.792 1.110
Enzyme class Prob Odds
Oxidoreductase (EC 1.-.-.-) 0.084 0.404
Transferase (EC 2.-.-.-) 0.062 0.179
Hydrolase (EC 3.-.-.-) 0.135 0.425
Lyase (EC 4.-.-.-) 0.049 1.054
Isomerase (EC 5.-.-.-) 0.010 0.321
Ligase (EC 6.-.-.-) 0.042 0.827
Gene Ontology category Prob Odds
Signal_transducer 0.201 0.939
Receptor 0.353 2.076
Hormone 0.002 0.365
Structural_protein 0.005 0.190
Transporter 0.024 0.219
Ion_channel 0.008 0.147
Voltage-gated_ion_channel 0.002 0.085
Cation_channel 0.010 0.221
Transcription 0.036 0.283
Transcription_regulation 0.018 0.147
Stress_response 0.274 3.108
Immune_response => 0.381 4.486
Growth_factor 0.013 0.943
Metal_ion_transport 0.009 0.02
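When reading the ProtFun tables, note that the row marked with "=>" is not necessarily the row with the highest probability. The sketch below parses the functional-category rows of HFE_HUMAN from the table above and extracts the marked row, the row with the highest probability and the row with the highest odds. It only reads the marker and makes no claim about how ProtFun chooses it; the parsing helper is hypothetical.

```python
# Functional-category rows for HFE_HUMAN, copied from the ProtFun table above.
HFE_FUNCTIONAL_CATEGORY = """\
Amino_acid_biosynthesis 0.011 0.484
Biosynthesis_of_cofactors 0.105 1.452
Cell_envelope => 0.633 10.377
Cellular_processes 0.095 1.297
Central_intermediary_metabolism 0.231 3.663
Energy_metabolism 0.059 0.659
Fatty_acid_metabolism 0.016 1.265
Purines_and_pyrimidines 0.583 2.400
Regulatory_functions 0.013 0.079
Replication_and_transcription 0.019 0.073
Translation 0.079 1.801
Transport_and_binding 0.732 1.785
"""


def parse_protfun_rows(text):
    """Parse rows of the form 'Name [=>] Prob Odds' into dictionaries."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 3:
            continue  # skip empty or malformed lines
        rows.append({
            "name": tokens[0],
            "marked": "=>" in tokens,
            "prob": float(tokens[-2]),
            "odds": float(tokens[-1]),
        })
    return rows


rows = parse_protfun_rows(HFE_FUNCTIONAL_CATEGORY)
marked = [row["name"] for row in rows if row["marked"]]
highest_prob = max(rows, key=lambda row: row["prob"])["name"]
highest_odds = max(rows, key=lambda row: row["odds"])["name"]
print("marked:", marked)
print("highest probability:", highest_prob)
print("highest odds:", highest_odds)
```

For HFE_HUMAN the marked category is Cell_envelope, which has the highest odds, while Transport_and_binding has the highest probability.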
Additional proteins:
sp_P02945_BACR_HALSA
Functional category Prob Odds
Amino_acid_biosynthesis 0.033 1.495
Biosynthesis_of_cofactors 0.186 2.589
Cell_envelope 0.029 0.483
Cellular_processes 0.051 0.694
Central_intermediary_metabolism 0.045 0.711
Energy_metabolism 0.138 1.537
Fatty_acid_metabolism 0.016 1.265
Purines_and_pyrimidines 0.302 1.244
Regulatory_functions 0.013 0.080
Replication_and_transcription 0.019 0.073
Translation 0.059 1.339
Transport_and_binding => 0.791 1.929
Enzyme/nonenzyme Prob Odds
Enzyme 0.199 0.696
Nonenzyme => 0.801 1.122
Enzyme class Prob Odds
Oxidoreductase (EC 1.-.-.-) 0.114 0.549
Transferase (EC 2.-.-.-) 0.031 0.091
Hydrolase (EC 3.-.-.-) 0.057 0.180
Lyase (EC 4.-.-.-) 0.020 0.430
Isomerase (EC 5.-.-.-) 0.010 0.321
Ligase (EC 6.-.-.-) 0.017 0.326
Gene Ontology category Prob Odds
Signal_transducer 0.258 1.205
Receptor 0.355 2.087
Hormone 0.001 0.206
Structural_protein 0.006 0.200
Transporter => 0.440 4.036
Ion_channel 0.010 0.169
Voltage-gated_ion_channel 0.004 0.172
Cation_channel 0.078 1.689
Transcription 0.026 0.205
Transcription_regulation 0.028 0.226
Stress_response 0.012 0.139
Immune_response 0.011 0.128
Growth_factor 0.010 0.727
Metal_ion_transport 0.049 0.106
ProtFun correctly predicted the functional category, the enzyme/nonenzyme classification and the gene ontology category.
sp_P02753_RET4_HUMAN
Functional category Prob Odds
Amino_acid_biosynthesis 0.017 0.751
Biosynthesis_of_cofactors 0.044 0.610
Cell_envelope => 0.804 13.186
Cellular_processes 0.075 1.021
Central_intermediary_metabolism 0.197 3.128
Energy_metabolism 0.043 0.475
Fatty_acid_metabolism 0.016 1.265
Purines_and_pyrimidines 0.275 1.131
Regulatory_functions 0.013 0.080
Replication_and_transcription 0.022 0.084
Translation 0.032 0.721
Transport_and_binding 0.800 1.951
Enzyme/nonenzyme Prob Odds
Enzyme => 0.544 1.900
Nonenzyme 0.456 0.639
Enzyme class Prob Odds
Oxidoreductase (EC 1.-.-.-) 0.095 0.458
Transferase (EC 2.-.-.-) 0.038 0.109
Hydrolase (EC 3.-.-.-) 0.235 0.742
Lyase (EC 4.-.-.-) => 0.059 1.264
Isomerase (EC 5.-.-.-) 0.010 0.321
Ligase (EC 6.-.-.-) 0.017 0.326
Gene Ontology category Prob Odds
Signal_transducer 0.202 0.942
Receptor 0.147 0.862
Hormone 0.004 0.667
Structural_protein 0.002 0.058
Transporter 0.025 0.232
Ion_channel 0.016 0.288
Voltage-gated_ion_channel 0.003 0.148
Cation_channel 0.010 0.215
Transcription 0.027 0.207
Transcription_regulation 0.025 0.196
Stress_response 0.161 1.829
Immune_response => 0.239 2.813
Growth_factor 0.023 1.617
Metal_ion_transport 0.009 0.020
ProtFun's prediction is mostly incorrect: according to UniProt, the functional category, the enzyme/nonenzyme classification and the enzyme class are wrong. Only the gene ontology category is correct.
sp_Q9Y5Q6_INSL5_HUMAN
Functional category Prob Odds
Amino_acid_biosynthesis 0.011 0.484
Biosynthesis_of_cofactors 0.040 0.558
Cell_envelope => 0.756 12.393
Cellular_processes 0.033 0.448
Central_intermediary_metabolism 0.048 0.755
Energy_metabolism 0.036 0.397
Fatty_acid_metabolism 0.016 1.265
Purines_and_pyrimidines 0.144 0.592
Regulatory_functions 0.014 0.087
Replication_and_transcription 0.020 0.075
Translation 0.032 0.735
Transport_and_binding 0.834 2.033
Enzyme/nonenzyme Prob Odds
Enzyme 0.209 0.729
Nonenzyme => 0.791 1.109
Enzyme class Prob Odds
Oxidoreductase (EC 1.-.-.-) 0.056 0.268
Transferase (EC 2.-.-.-) 0.031 0.091
Hydrolase (EC 3.-.-.-) 0.062 0.195
Lyase (EC 4.-.-.-) 0.020 0.430
Isomerase (EC 5.-.-.-) 0.010 0.321
Ligase (EC 6.-.-.-) 0.017 0.327
Gene Ontology category Prob Odds
Signal_transducer 0.374 1.746
Receptor 0.128 0.750
Hormone => 0.247 37.936
Structural_protein 0.001 0.041
Transporter 0.025 0.228
Ion_channel 0.010 0.168
Voltage-gated_ion_channel 0.003 0.131
Cation_channel 0.010 0.215
Transcription 0.054 0.425
Transcription_regulation 0.091 0.724
Stress_response 0.099 1.128
Immune_response 0.178 2.090
Growth_factor 0.061 4.379
Metal_ion_transport 0.009 0.020
ProtFun does not predict everything correctly. The functional category is incorrect, but the enzyme/nonenzyme classification and the gene ontology category are correct.
sp_P11279_LAMP1_HUMAN
Functional category Prob Odds
Amino_acid_biosynthesis 0.011 0.484
Biosynthesis_of_cofactors 0.053 0.735
Cell_envelope => 0.804 13.186
Cellular_processes 0.027 0.373
Central_intermediary_metabolism 0.138 2.188
Energy_metabolism 0.037 0.411
Fatty_acid_metabolism 0.016 1.265
Purines_and_pyrimidines 0.533 2.195
Regulatory_functions 0.015 0.090
Replication_and_transcription 0.019 0.073
Translation 0.027 0.613
Transport_and_binding 0.834 2.033
Enzyme/nonenzyme Prob Odds
Enzyme 0.276 0.965
Nonenzyme => 0.724 1.014
Enzyme class Prob Odds
Oxidoreductase (EC 1.-.-.-) 0.039 0.187
Transferase (EC 2.-.-.-) 0.046 0.134
Hydrolase (EC 3.-.-.-) 0.058 0.184
Lyase (EC 4.-.-.-) 0.020 0.430
Isomerase (EC 5.-.-.-) 0.010 0.321
Ligase (EC 6.-.-.-) 0.017 0.326
Gene Ontology category Prob Odds
Signal_transducer 0.396 1.849
Receptor 0.282 1.659
Hormone 0.001 0.206
Structural_protein 0.011 0.408
Transporter 0.024 0.222
Ion_channel 0.008 0.147
Voltage-gated_ion_channel 0.002 0.111
Cation_channel 0.010 0.215
Transcription 0.032 0.247
Transcription_regulation 0.018 0.142
Stress_response 0.246 2.795
Immune_response => 0.371 4.368
Growth_factor 0.013 0.956
Metal_ion_transport 0.009 0.020
ProtFun does not predict everything correctly. The functional category is incorrect, but the enzyme/nonenzyme classification and the gene ontology category are again correct.
sp_P05067_A4_HUMAN
Functional category Prob Odds
Amino_acid_biosynthesis 0.020 0.921
Biosynthesis_of_cofactors 0.261 3.623
Cell_envelope => 0.804 13.186
Cellular_processes 0.053 0.730
Central_intermediary_metabolism 0.184 2.920
Energy_metabolism 0.023 0.259
Fatty_acid_metabolism 0.016 1.265
Purines_and_pyrimidines 0.417 1.716
Regulatory_functions 0.013 0.084
Replication_and_transcription 0.029 0.109
Translation 0.027 0.613
Transport_and_binding 0.827 2.016
Enzyme/nonenzyme Prob Odds
Enzyme => 0.392 1.368
Nonenzyme 0.608 0.852
Enzyme class Prob Odds
Oxidoreductase (EC 1.-.-.-) 0.024 0.114
Transferase (EC 2.-.-.-) 0.208 0.603
Hydrolase (EC 3.-.-.-) 0.190 0.600
Lyase (EC 4.-.-.-) 0.020 0.430
Isomerase (EC 5.-.-.-) 0.010 0.324
Ligase (EC 6.-.-.-) 0.048 0.946
Gene Ontology category Prob Odds
Signal_transducer 0.126 0.586
Receptor 0.036 0.211
Hormone 0.001 0.206
Structural_protein => 0.034 1.205
Transporter 0.024 0.222
Ion_channel 0.009 0.162
Voltage-gated_ion_channel 0.002 0.108
Cation_channel 0.010 0.215
Transcription 0.043 0.335
Transcription_regulation 0.018 0.143
Stress_response 0.076 0.862
Immune_response 0.016 0.183
Growth_factor 0.005 0.372
Metal_ion_transport 0.009 0.020
ProtFun failed completely: the predictions of the functional category, the enzyme/nonenzyme classification and the gene ontology category are all wrong.
References
<references />