Revision as of 17:57, 30 August 2011

TODO

add all references/quotes

Discussion of the disordered regions

Secondary structure prediction

For the secondary structure prediction, we used a reference sequence obtained from UniProt. We also used the annotations of the UniProt sequence as a reference to compare the predicted secondary structures.

PSIPRED

Figure 1: Secondary Structure predicted by PSIPRED
Source: http://bioinf.cs.ucl.ac.uk/psipred/

PSIPRED uses the PSI-BLAST output as input for a neural network that has a single hidden layer and a feed-forward back-propagation architecture to predict the secondary structure.

Results
PSIPRED predicts a mixed alpha/beta structure. The transmembrane region is predicted as a beta strand. A graphical representation of the result is shown in Figure 1.

PSIPRED HFORMAT (PSIPRED V3.0)
Conf: 999851589999999877513567886245556456636899750389988756755687
Pred: CCCCCHHHHHHHHHHHHHHHCCCCCCCEEEEEEEEEEECCCCCCCEEEEEEEECCEEEEE
  AA: MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
             10        20        30        40        50        60
Conf: 318998225536664688990669998865311211002358577441156788603899
Pred: ECCCCCCEEECCCCCCCCCCHHHHHHHHHHHHCCCCCHHHHHHHHHHHCCCCCCCCEEEE
  AA: YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
             70        80        90       100       110       120
Conf: 987799319835459889765910588728988756689786135787788899999876
Pred: EEEEEEECCCEEEEEEEEEECCCEEEEECCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHH
  AA: ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
            130       140       150       160       170       180
Conf: 310271499889888616322000378810000468999601699981450765189996
Pred: HHHCCCHHHHHHHHHHCCCCCCCCCCCCCEEEECCCCCCCEEEEEEEEEECCCCEEEEEE
  AA: AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
            190       200       210       220       230       240
Conf: 288106667520025355899875899999965999872169986699998826885259
Pred: ECCEECCCCCCCCCCCEECCCCCEEEEEEEEECCCCCCCEEEEEECCCCCCCEEEEEECC
  AA: KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
            250       260       270       280       290       300
Conf: 999711124320001367777622367764115889887620212359
Pred: CCCCCEEEEEEEEEEEEEEEEEEEEEEEEEECCCCCCCCCCEEECCCC
  AA: PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
            310       320       330       340
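
The PSIPRED output above can be post-processed with a few lines of Python. The following is a minimal sketch, assuming the output is available as text in the layout shown above; the function names `read_pred_string` and `segments` are hypothetical helpers and not part of PSIPRED.

```python
from itertools import groupby


def read_pred_string(horiz_text):
    """Concatenate all 'Pred:' lines of a PSIPRED horizontal output."""
    parts = []
    for line in horiz_text.splitlines():
        line = line.strip()
        if line.startswith("Pred:"):
            parts.append(line.split(":", 1)[1].strip())
    return "".join(parts)


def segments(states):
    """Return (state, start, end) tuples with 1-based inclusive positions."""
    result = []
    position = 1
    for state, run in groupby(states):
        length = len(list(run))
        result.append((state, position, position + length - 1))
        position += length
    return result


# First block of the output shown above.
example = """\
Conf: 999851589999999877513567886245556456636899750389988756755687
Pred: CCCCCHHHHHHHHHHHHHHHCCCCCCCEEEEEEEEEEECCCCCCCEEEEEEEECCEEEEE
  AA: MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
"""

pred = read_pred_string(example)
for state, start, end in segments(pred):
    if state != "C":  # list only helices (H) and strands (E)
        print(state, start, end)
```

Listing the segments this way makes it easier to compare the predicted elements with the UniProt annotation position by position.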

Jpred3

Jpred uses the Jnet algorithm, which provides "a three-state (a-helix, ß-strand and coil) prediction of secondary structure at an accuracy of 81.5%" <ref>http://nar.oxfordjournals.org/content/36/suppl_2/W197.abstract</ref>.

Results
In its first BLAST search, Jpred found many homologous hits with E-values ranging from e-163 to 4e-44, including some self hits. We continued to the prediction, which is:

Seq: MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDD
 SS: ------HHHHHHHHHHHHH---------EEEEEEEEE-------EEEEEEEEE--

Seq: QLFVFYDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHN
 SS: EEEEEE-----EEEE----------HHHHHHHHHHHHHHHHHHHHHHHHHH----

Seq: HSKESHTLQVILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPT
 SS: -----EEEEEEEEEE------EEEEEEE-----EEEEEE----EEE-------HH

Seq: KLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSV
 SS: HHHHH--HHHHHHHHHH------HHHHHHHHHH-H-------EEEEE--------

Seq: TTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPG
 SS: -EEEEEEE------EEEEEEE----------EE----------EEEEEEEEE---

Seq: EEQRYTCQVEHPGLDQPLIVIWEPSPSGTLVIGVISGIAVFVVILFIGILFIILR
 SS: ---EEEEEEEE------EEEEE---------HHHHHHHHHHHHHHHHHHHHHHHH
Seq: KRQGSRGAMGHYVLAERE
 SS: HH----------------

Comparison with DSSP

DSSP was designed by Wolfgang Kabsch and Chris Sander to provide a standard for secondary structure assignment. DSSP calculates the secondary structure from the atomic coordinates of a PDB structure by identifying hydrogen-bond patterns in the backbone.
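
In DSSP, the interatomic distances enter through an electrostatic hydrogen-bond energy between a backbone C=O group and a backbone N-H group (Kabsch and Sander, 1983). The following sketch shows that criterion under the assumption that the four distances are already known in Ångström; the function names are illustrative and not part of the DSSP program.

```python
COULOMB_FACTOR = 332.0          # kcal * Angstrom / (mol * e^2)
PARTIAL_CHARGES = 0.42 * 0.20   # q1 * q2 in units of e^2
HBOND_CUTOFF = -0.5             # kcal/mol; lower energies count as a bond


def hbond_energy(r_on, r_ch, r_oh, r_cn):
    """Electrostatic energy between a C=O group and an N-H group.

    r_on: distance O...N, r_ch: distance C...H,
    r_oh: distance O...H, r_cn: distance C...N (all in Angstrom).
    """
    return PARTIAL_CHARGES * COULOMB_FACTOR * (
        1.0 / r_on + 1.0 / r_ch - 1.0 / r_oh - 1.0 / r_cn
    )


def is_hbond(r_on, r_ch, r_oh, r_cn):
    """True if the energy is below the cutoff of -0.5 kcal/mol."""
    return hbond_energy(r_on, r_ch, r_oh, r_cn) < HBOND_CUTOFF


# Roughly linear C=O...H-N arrangement (illustrative distances).
print(is_hbond(r_on=3.0, r_ch=3.23, r_oh=2.0, r_cn=4.23))
```

Helices and sheets are then assigned from repeating patterns of such hydrogen bonds, which is why DSSP can only cover residues that have coordinates in the PDB file.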

Results
Because the PDB sequence is not complete, the DSSP assignment is also incomplete. The interesting parts, the signal peptide and the cytoplasmic part, which are predicted as disordered, are not covered by DSSP. JPred predicted the transmembrane region as a helix, whereas PSIPRED predicted a beta strand there; both assigned helical or beta-sheet structure to the N- and C-terminus, which are predicted as disordered. UniProt, however, assigns no structure to these regions either. Therefore, these regions may be unstructured and not yet recognized as disordered regions.

UniProt: ---------------------------EEEEEEEEEEE----EEE--EEEEEE--EEEEE
   DSSP:                          --EEEEEEEEEEB-SS-SSB--EEEEEETTEEEEE
PSIPRED: CCCCCHHHHHHHHHHHHHHHCCCCCCCEEEEEEEEEEECCCCCCCEEEEEEEECCEEEEE
  JPred: ------HHHHHHHHHHHHH---------EEEEEEEEE-------EEEEEEEEE--EEEEE
     AA: MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
DSSPSEQ:                          RSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
                10        20        30        40        50        60
UniProt: EEEEE--EEE--------TTTHHHHHHHHHHHHHHHHHHHHHHHHHHTTT-EEE--EEEE
   DSSP: EESSS--EEE-STTS-SSTTTTHHHHHHHHHHHHHHHHHHHHHHHHHTTT-SSS--EEEE
PSIPRED: ECCCCCCEEECCCCCCCCCCHHHHHHHHHHHHCCCCCHHHHHHHHHHHCCCCCCCCEEEE
  JPred: E-----EEEE----------HHHHHHHHHHHHHHHHHHHHHHHHHH---------EEEEE
     AA: YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
DSSPSEQ: YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
                70        80        90       100       110       120
UniProt: EEEEEE-----EEEEEEEEE--EEEEEEEHHH-EEEEEE---HHHHHHHH---HHHHHHH
   DSSP: EEEEEE-TTS-EEEEEEEEETTEEEEEEEGGGTEEEESSGGGHHHHHHHHSSTHHHHHHH
PSIPRED: EEEEEEECCCEEEEEEEEEECCCEEEEECCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHH
  JPred: EEEEE------EEEEEEE-----EEEEEE----EEE-------HHHHHHH--HHHHHHHH
     AA: ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
DSSPSEQ: ILGaEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
               130       140       150       160       170       180
UniProt: HHHH-HHHHHHHHHHHHHTTT-------EEEEEEEE----EEEEEEEEEEEEE--EEEEE
   DSSP: HHHHTHHHHHHHHHHHHHTTTSS--B--EEEEEEEE-SS-EEEEEEEEEEBSS--EEEEE
PSIPRED: HHHCCCHHHHHHHHHHCCCCCCCCCCCCCEEEECCCCCCCEEEEEEEEEECCCCEEEEEE
  JPred: HH------HHHHHHHHHH-H-------EEEEE---------EEEEEEE------EEEEEE
     AA: AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
DSSPSEQ: AYLERDaPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRbRALNYYPQNITMKWL
               190       200       210       220       230       240
UniProt: E------HHH----EEEE-----EEEEEEEEE---HHHHEEEEEE---EEE-EEEE----
   DSSP: ETTEE--GGGS---EEEE-TTS-EEEEEEEEE-TTGGGGEEEEEE-TTSSS-EEEE-
PSIPRED: ECCEECCCCCCCCCCCEECCCCCEEEEEEEEECCCCCCCEEEEEECCCCCCCEEEEEECC
  JPred: E----------EE----------EEEEEEEEE------EEEEEEEE------EEEEE---
     AA: KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
DSSPSEQ: KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTbQVEHPGLDQPLIVIW
               250       260       270       280       290       300
UniProt: ------------------------------------------------
   DSSP:
PSIPRED: CCCCCEEEEEEEEEEEEEEEEEEEEEEEEEECCCCCCCCCCEEECCCC
  JPred: ------HHHHHHHHHHHHHHHHHHHHHHHHHH----------------
     AA: PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
DSSPSEQ: 
               310       320       330       340

We see a good overlap between the results of the different methods, except for the region at the C-terminus. Here, the PDB file did not provide a sequence, and UniProt did not assign a secondary structure to this region. Interestingly, PSIPRED predicts a beta-sheet in this region while JPred predicts a helix. As we will show later, this region is proposed to be disordered by most prediction tools. This could be the reason for the different assignments.
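
The overlap between a prediction and the DSSP assignment can be quantified as the fraction of matching residues (often called Q3). The sketch below is one possible way to do this and rests on two assumptions that are not taken from the tools themselves: the eight DSSP states are reduced to three with the common mapping H/G/I to helix and E/B to strand, and residues without DSSP coverage (blank) are skipped.

```python
# Assumed reduction of DSSP states to three classes; everything else is coil.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp):
    """Map a DSSP string to H/E/C, keeping blanks for uncovered residues."""
    return "".join(" " if s == " " else DSSP_TO_3.get(s, "C") for s in dssp)


def q3(reference, prediction):
    """Fraction of covered positions where reference and prediction agree.

    '-' in the prediction (JPred style) is treated as coil. Positions beyond
    the shorter of the two strings are ignored.
    """
    pairs = [
        (r, "C" if p == "-" else p)
        for r, p in zip(reference, prediction)
        if r != " "
    ]
    if not pairs:
        return None
    return sum(r == p for r, p in pairs) / len(pairs)


# Residues 61-80 of the alignment above (DSSP and PSIPRED rows).
dssp_fragment = "EESSS--EEE-STTS-SSTT"
psipred_fragment = "ECCCCCCEEECCCCCCCCCC"
print(q3(reduce_dssp(dssp_fragment), psipred_fragment))
```

A per-method score of this kind would have to be computed over all residues covered by the PDB structure before drawing conclusions about which predictor fits better.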

Prediction of disordered regions

The HFE protein is not known to be disordered; it is not contained in the DisProt<ref>http://www.disprot.org/</ref> database. The disorder predictors report several disordered regions in the protein, but most of them lie within secondary structure elements. Only the predicted disordered regions at the N- and C-terminus might really be unstructured but not yet experimentally recognized, because these regions have no structural assignment. Based on this, however, we cannot tell which tool works best in our case.

The predictions are shown below.

DISOPRED

For the prediction, we used the DISOPRED-Server at http://bioinf.cs.ucl.ac.uk/disopred/
DISOPRED is a prediction tool for disordered regions based on a linear SVM. The SVM is trained with 750 non-redundant sequences with high resolution X-ray structures. "Disorder was identified with those residues that appear in the sequence records but with coordinates missing from the electron density map." <ref>http://bioinf.cs.ucl.ac.uk/index.php?id=806</ref> For each protein, a sequence profile was generated by using PSI-BLAST search against a filtered database. The PSI-BLAST profiles were used as input vectors for the SVM.

Result
DISOPRED predicts two disordered residues in the signal peptide and a disordered region at the end of the sequence, which is located inside the cell.

Figure 2: DISOPRED prediction profile for the HFE protein
Source: http://bioinf.cs.ucl.ac.uk/disopred/

AA: target sequence
pred: residue disorder prediction; (.) = ordered residue, (*) = disordered residue
conf:997600000000000000000000000000000000000000000000000000000000
pred:**..........................................................
  AA:MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
             10        20        30        40        50        60
conf:000120011000000000000000000000000000000000000000000000000000
pred:............................................................
  AA:YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
             70        80        90       100       110       120
conf:000000000000000000000000000000000000000000000000000000000000
pred:............................................................
  AA:ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
            130       140       150       160       170       180
conf:000000000000000000000002456777878777766530000000000000000000
pred:..............................*.*...........................
  AA:AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
            190       200       210       220       230       240
conf:000035555545543000000000000000000000000000000000000001354667
pred:............................................................
  AA:KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
            250       260       270       280       290       300
conf:777766643300000000000000047889999999999999898999
pred:...........................*********************
  AA:PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
            310       320       330       340
DISOPRED predictions for a false positive rate threshold of: 2%
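
A per-residue mask like the `pred` rows above can be turned into a list of regions with a regular expression. This is a minimal sketch, assuming the mask rows have been joined into one string; `disordered_regions` is a hypothetical helper, not part of DISOPRED.

```python
import re


def disordered_regions(mask, symbol="*"):
    """Return 1-based inclusive (start, end) pairs for runs of `symbol`."""
    pattern = re.escape(symbol) + "+"
    return [(m.start() + 1, m.end()) for m in re.finditer(pattern, mask)]


# Short illustrative mask, not the full DISOPRED output.
print(disordered_regions("**..*.***"))
```

The same helper also works for masks that use `-` for ordered residues, because only the runs of the disorder symbol are matched.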

POODLE

POODLE stands for Prediction Of Order and Disorder by machine LEarning.

POODLE provides three different predictions:

POODLE-S: prediction of short disordered regions
POODLE-L: prediction of long disordered regions (longer than 40 residues)
prediction of unfolded proteins

Most POODLE variants predict a disordered region at the end of the protein, which contains the transmembrane region (pos: 307-330); this is evidence for a disordered region at the C-terminus. Only POODLE-S (using high B-factor residues) predicts no disorder there. Except for POODLE-L, the variants also predict a short disordered region at the beginning of the sequence, which is part of the signal peptide (pos: 1-22). The score distribution over the sequence is shown in Figures 3 to 6. The threshold to distinguish between an ordered and a disordered region is 0.5.

POODLE-I

POODLE-I (series only) predicted 4 disordered regions within the protein sequence.

Figure 3: Distribution of disordered regions over the amino acid sequence predicted by POODLE-I
Source: http://mbs.cbrc.jp/poodle/poodle.html

MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
**************----------------------------------------------
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
-------**********---******------*---------------------------
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
------------------------------------------------------------
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
---------------------***************------------------------
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
----*********----------------------------------------*******
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
************************************************

POODLE-S

POODLE-S (using missing residues) predicts 6 short disordered regions within the protein sequence.

Figure 4: Distribution of disordered regions over the amino acid sequence predicted by POODLE-S (missing residues)
Source: http://mbs.cbrc.jp/poodle/poodle.html

MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
-**************---------------------------------------------
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
-------**********---******----------------------------------
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
------------------------------------------------------------
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
---------------------***************------------------------
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
----*********----------------------------------------*******
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
*--------------------------------********-------

POODLE-S (using High B-Factor residues) predicts 2 short disordered regions within the protein sequence.

Figure 5: Distribution of disordered regions over the amino acid sequence predicted by POODLE-S (high B-factor residues)
Source: http://mbs.cbrc.jp/poodle/poodle.html

MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
-*-***------------------------------------------------------
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
------------------------------------------------------------
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
------------------------------------------------******------
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
------------------------------------------------------------
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
------------------------------------------------------------
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
------------------------------------------------

POODLE-L

POODLE-L predicts a disordered region from 296 to the end.

Figure 6: Distribution of disordered regions over the amino acid sequence predicted by POODLE-L
Source: http://mbs.cbrc.jp/poodle/poodle.html

MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF 
------------------------------------------------------------
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV 
------------------------------------------------------------
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR 
------------------------------------------------------------
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL 
------------------------------------------------------------
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS 
------------------------------------------------------******
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
************************************************

IUPRED

IUPRED uses estimated pairwise interaction energies to recognize unstructured regions within protein sequences. It assumes that globular proteins have an amino acid composition that allows them to form a large number of favorable interactions, whereas disordered regions do not. The score distribution over the sequence is shown in Figures 7 to 9. IUPRED uses 0.5 as the threshold to distinguish between a disordered and an ordered region.
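
How a per-residue score profile becomes a set of disordered regions can be sketched as follows. The scores in the example are made up for illustration, and treating a score exactly equal to the threshold as disordered is an assumption of this sketch.

```python
def call_regions(scores, threshold=0.5):
    """Return 1-based inclusive (start, end) pairs with score >= threshold."""
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = position          # a new region opens here
        elif start is not None:
            regions.append((start, position - 1))
            start = None
    if start is not None:                 # region runs to the last residue
        regions.append((start, len(scores)))
    return regions


# Hypothetical score profile, not real IUPRED output.
print(call_regions([0.1, 0.6, 0.7, 0.2, 0.9]))
```

With a fixed cutoff, scores that hover around 0.5 produce many short regions, which matches the fragmented predictions listed below.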

Results
The short-disorder prediction yields 5 short regions. There are also disordered residues at the very beginning of the sequence, in the signal peptide.

Figure 7: IUPRED prediction of short regions
Source: http://iupred.enzim.hu/

Figure 8: IUPRED prediction of long regions
Source: http://iupred.enzim.hu/

Figure 9: IUPRED prediction of structured regions
Source: http://iupred.enzim.hu/

MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF 
***---------------------------------------------------------
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV 
------------------------------------------------------------
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR 
------------------------------------------------------------
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL 
------------------------------------------------------------
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS 
---------********----------***--------*-****----------------
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
---------------------------------------------***

The long-disorder prediction yields 7 disordered residues, but only one short region.

MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF 
------------------------------------------------------------
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV 
------------------------------------------------------------
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR 
------------------------------------------------------------
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL 
------------------------------------------------------------
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS 
---------******-------------------------*-------------------
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
------------------------------------------------

The prediction of structured regions yields one globular domain spanning residues 1-348 (Figure 9). This means that the whole protein is predicted to be structured. This contradicts the POODLE prediction, but given the weak evidence from the other IUPRED methods, it does not really contradict the other IUPRED results.

META-Disorder

For this task, we used the PredictProtein server at https://www.predictprotein.org. META-Disorder, published in 2009 by Avner Schlessinger, Marco Punta, Guy Yachdav, Laszlo Kajan and Burkhard Rost, combines the predictions of NORSnet, PROFbval and Ucon.

Predicted secondary structure composition

sec str type	H	E	L
% in protein	27.30	28.74	43.97

Prediction of disordered residues by META-Disorder (last column)

Number Residue NORSnet NORS2st PROFbval bval2st Ucon Ucon2st MD_raw   MD_rel  MD2st

                                 ....
 242	D	0.13	-	0.70	D	0.76	D	0.444	2	-
 243	K	0.13	-	0.69	D	0.76	D	0.480	1	-
 244	Q	0.13	-	0.66	D	0.93	D	0.531	0	D
 245	P	0.17	-	0.73	D	0.92	D	0.520	0	D
 246	M	0.28	-	0.65	D	0.90	D	0.525	0	D
 247	D	0.31	-	0.68	D	0.87	D	0.515	0	-
 248	A	0.36	-	0.70	D	0.87	D	0.520	0	D
 249	K	0.40	-	0.69	D	0.89	D	0.485	1	-
                                 ....
 344	L	0.37	-	0.59	D	0.18	-	0.515	0	-
 345	A	0.35	-	0.74	D	0.17	-	0.515	0	-
 346	E	0.31	-	0.89	D	0.17	-	0.520	0	D
 347	R	0.35	-	0.91	D	0.23	-	0.525	0	D
 348	E	0.34	-	0.92	D	0.38	-	0.520	0	D

Key for output
----------------
Number - residue number
Residue - amino-acid type
NORSnet - raw score by NORSnet (prediction of unstructured loops)
NORS2st - two-state prediction by NORSnet; D=disordered
PROFbval - raw score by PROFbval (prediction of residue flexibility from sequence)
bval2st - two-state prediction by PROFbval
Ucon - raw score by Ucon (prediction of protein disorder using predicted internal contacts)
Ucon2st - two-state prediction by Ucon
MD_raw - raw score by MD (prediction of protein disorder using orthogonal sources)
MD_rel - reliability of the prediction by MD; values range from 0-9. 9=strong prediction
MD2st - two-state prediction by MD
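
The table above can be read programmatically to list the residues that META-Disorder calls disordered. This is a minimal sketch, assuming whitespace-separated rows with the eleven columns described in the key; `parse_md_rows` is a hypothetical helper and not part of PredictProtein.

```python
def parse_md_rows(text):
    """Parse META-Disorder rows into dictionaries; skip all other lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 11 or not fields[0].isdigit():
            continue  # header, separator ('....') or empty line
        rows.append({
            "number": int(fields[0]),
            "residue": fields[1],
            "md_raw": float(fields[8]),
            "md_rel": int(fields[9]),
            "md2st": fields[10],
        })
    return rows


# Last five rows of the table above.
table = """\
 344 L 0.37 - 0.59 D 0.18 - 0.515 0 -
 345 A 0.35 - 0.74 D 0.17 - 0.515 0 -
 346 E 0.31 - 0.89 D 0.17 - 0.520 0 D
 347 R 0.35 - 0.91 D 0.23 - 0.525 0 D
 348 E 0.34 - 0.92 D 0.38 - 0.520 0 D
"""

disordered = [row["number"] for row in parse_md_rows(table) if row["md2st"] == "D"]
print(disordered)
```

Keeping `md_rel` alongside the two-state call makes it easy to filter out calls with low reliability.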

META-Disorder predicts a very short disordered region of 3 residues at the end of the protein, but with weak evidence: the raw score is only around 0.5 and the reliability is 0. Judging by this method alone, a disordered region at the C-terminus is therefore quite unlikely.

Discussion

Most tools predict a disordered region at the C-terminus of the protein.
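
One way to make "most tools" concrete is a per-residue majority vote over the disorder masks of the individual predictors. The sketch below uses short made-up masks and hypothetical tool names; it assumes that every mask has the same length and marks disorder with `*`.

```python
def vote_counts(masks, symbol="*"):
    """Count, for each position, how many masks mark it as disordered."""
    lengths = {len(mask) for mask in masks.values()}
    if len(lengths) != 1:
        raise ValueError("all masks must be non-empty in number and equal in length")
    size = lengths.pop()
    return [sum(mask[i] == symbol for mask in masks.values()) for i in range(size)]


def majority_mask(masks, symbol="*"):
    """Mark a position as disordered if more than half of the masks agree."""
    needed = len(masks) / 2
    return "".join("*" if c > needed else "-" for c in vote_counts(masks, symbol))


# Illustrative masks only; not the real tool outputs.
masks = {"tool_a": "--***", "tool_b": "---**", "tool_c": "*--**"}
print(majority_mask(masks))
```

A simple vote treats all predictors as equally reliable, which is itself an assumption, since the tools were trained on different definitions of disorder.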

Prediction of transmembrane alpha-helices and signal peptides

General

We were given five additional proteins for which to predict transmembrane regions, signal peptides and GO terms. This was done because most proteins in the practical are not membrane proteins and would therefore produce only "no membrane" results. Thus the three membrane proteins [BACR_HALSA], [LAMP1_HUMAN] and [A4_HUMAN] were provided; our HFE protein [HFE_HUMAN] is also a membrane protein.

The following table gives a quick overview of the protein properties:

Accession	Entry name	Organism	Subcellular location	Signal peptide
Q30201	HFE_HUMAN	Homo sapiens (Human)	Membrane; Single-pass type I membrane protein	1-22
P02945	BACR_HALSA	Halobacterium salinarium / (Halobacterium halobium)	Cell membrane; Multi-pass membrane protein	no
P02753	RET4_HUMAN	Homo sapiens (Human)	Secreted	1-18
Q9Y5Q6	INSL5_HUMAN	Homo sapiens (Human)	Secreted	1-22
P11279	LAMP1_HUMAN	Homo sapiens (Human)	Cell membrane; Single-pass type I membrane protein [...]	1-28
P05067	A4_HUMAN	Homo sapiens (Human)	Membrane; Single-pass type I membrane protein	1-17

We predict transmembrane regions and signal peptides for these six proteins using different tools. Because HFE_HUMAN, the protein we normally work with, is itself a membrane protein and therefore lets us judge the prediction accuracy, we present its results graphically and in detail and summarize the results for the additional proteins in text form.

We use the UniProt annotations as ground truth and briefly compare the prediction results with them.

Why is the prediction of transmembrane helices and signal peptides grouped together here?

Transmembrane helices and signal peptides are very hard to distinguish if a method predicts only one of them, because both consist mostly of hydrophobic residues; this produces many false positives. Many methods therefore combine the prediction of transmembrane helices and signal peptides, which drastically reduces the number of false positives.
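
The shared hydrophobicity can be illustrated with the Kyte-Doolittle hydropathy scale. The sketch below compares the annotated signal peptide (1-22) and transmembrane region (307-330) of HFE_HUMAN with a soluble stretch (241-255) from the sequence shown earlier; choosing this scale and these segments is an assumption made for illustration and is not how the predictors below work internally.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(segment):
    """Average Kyte-Doolittle value of a sequence segment."""
    return sum(KYTE_DOOLITTLE[aa] for aa in segment) / len(segment)


def window_profile(sequence, window=9):
    """Mean hydropathy for every full window along the sequence."""
    return [
        mean_hydropathy(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]


signal_peptide = "MGPRARPALLLLMLLQTAVLQG"      # residues 1-22
transmembrane = "IGVISGIAVFVVILFIGILFIILR"     # residues 307-330
soluble = "KDKQPMDAKEFEPKD"                    # residues 241-255

for name, segment in [("signal peptide", signal_peptide),
                      ("transmembrane", transmembrane),
                      ("soluble", soluble)]:
    print(name, round(mean_hydropathy(segment), 2))
```

Both the signal peptide and the transmembrane segment come out hydrophobic on average, so a method that looks at hydrophobicity alone cannot tell them apart.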

There are different types of signal peptides, but they all act as import or export signals:

Import into the peroxisome
Import into the nucleus
Export from the nucleus
Import into the mitochondrion

TMHMM

Figure 10: TMHMM posterior probabilities

TMHMM is a tool for predicting the membrane topology (transmembrane helices) of proteins based on a hidden Markov model with different states. It divides the sequence into the regions "inside", "outside" and "TMhelix". However, TMHMM cannot predict signal peptides.

We ran TMHMM locally on our Linux machine after correcting some path issues in its configuration files.

The command we used was:

tmhmm x.fasta > x.tmhmm

where 'x' stands for the UniProt entry name of one of the proteins. Afterwards we tried to plot the result for HFE_HUMAN with gnuplot, but this did not work at first either, because of path issues inside the gnuplot script that tmhmm creates automatically. After correcting these paths as well, gnuplot worked and produced the graphical output.

A color code is applied to provide easier reading:

green: The predicted region matches the experimentally resolved region from UniProt (+/- 5 residues allowed). TMHMM succeeded.
yellow: The predicted region only partially matches the experimental region (mostly errors with signal peptides). TMHMM needs improvement.
red: The predicted region does not match the experimental region at all. TMHMM failed.
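
The color code can be expressed as a small function. This sketch is one possible reading of the three rules and is an assumption: "green" requires both boundaries to lie within 5 residues of the annotation, "yellow" requires at least one shared residue, and everything else is "red".

```python
def classify(pred_start, pred_end, ref_start, ref_end, tolerance=5):
    """Return 'green', 'yellow' or 'red' for a predicted vs. annotated region."""
    if (abs(pred_start - ref_start) <= tolerance
            and abs(pred_end - ref_end) <= tolerance):
        return "green"
    # Number of residues shared by the two inclusive ranges.
    overlap = min(pred_end, ref_end) - max(pred_start, ref_start) + 1
    return "yellow" if overlap > 0 else "red"


# HFE_HUMAN values from the table below: TMhelix 307-329 vs. Helical 307-330,
# and outside 1-306 vs. Signal peptide 1-22.
print(classify(307, 329, 307, 330))
print(classify(1, 306, 1, 22))
```

Under this reading, a region that swallows the signal peptide is rated "yellow" rather than "red", which matches the note that partial matches mostly concern signal peptides.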

	TMHMM				UniProt
id	version	region	start	end	region	start	end
Q30201|HFE_HUMAN	TMHMM2.0	outside	1	306	Signal peptide	1	22
Q30201|HFE_HUMAN	TMHMM2.0	outside	1	306	Extracellular	23	306
Q30201|HFE_HUMAN	TMHMM2.0	TMhelix	307	329	Helical	307	330
Q30201|HFE_HUMAN	TMHMM2.0	inside	330	348	Cytoplasmic	331	348

TMHMM clearly misses the signal peptide and counts the whole region 1-306 as outside; apart from the signal peptide, this is correct according to UniProt. The TMhelix (307-329) and the inside region (330-348) are also placed correctly, with a deviation of only one amino acid, which is insignificant. TMHMM was therefore very successful in predicting the right regions. The results are shown in Figure 10.

	TMHMM				UniProt
id	version	region	start	end	region	start	end
sp_P02945_BACR_HALSA	TMHMM2.0	outside	1	22	Extracellular	14	23
sp_P02945_BACR_HALSA	TMHMM2.0	TMhelix	23	42	Helical; Name=Helix A	24	42
sp_P02945_BACR_HALSA	TMHMM2.0	inside	43	54	Cytoplasmic	43	56
sp_P02945_BACR_HALSA	TMHMM2.0	TMhelix	55	77	Helical; Name=Helix B	57	75
sp_P02945_BACR_HALSA	TMHMM2.0	outside	78	91	Extracellular	76	91
sp_P02945_BACR_HALSA	TMHMM2.0	TMhelix	92	114	Helical; Name=Helix C	92	109
sp_P02945_BACR_HALSA	TMHMM2.0	inside	115	120	Cytoplasmic	110	120
sp_P02945_BACR_HALSA	TMHMM2.0	TMhelix	121	143	Helical; Name=Helix D	121	140
sp_P02945_BACR_HALSA	TMHMM2.0	outside	144	147	Extracellular	141	147
sp_P02945_BACR_HALSA	TMHMM2.0	TMhelix	148	170	Helical; Name=Helix E	148	167
sp_P02945_BACR_HALSA	TMHMM2.0	inside	171	189	Cytoplasmic	168	185
sp_P02945_BACR_HALSA	TMHMM2.0	TMhelix	190	212	Helical; Name=Helix F	186	204
sp_P02945_BACR_HALSA	TMHMM2.0	outside	213	262	Extracellular	205	216
					Helical; Name=Helix G	217	236
					Cytoplasmic	237	262

As is clearly visible, TMHMM can also predict the regions of a non-eukaryotic protein almost completely correctly. Only the last helical region (Helix G) and the last cytoplasmic region are missed.

	TMHMM				UniProt
id	version	region	start	end	region	start	end
sp_P02753_RET4_HUMAN	TMHMM2.0	outside	1	201	Signal peptide	1	18

TMHMM misses the signal peptide, but everything else is correct, because RET4_HUMAN is a secreted protein.

	TMHMM				UniProt
id	version	region	start	end	region	start	end
sp_Q9Y5Q6_INSL5_HUMAN	TMHMM2.0	outside	1	135	Signal peptide	1	22

As usual, TMHMM misses the signal peptide but otherwise predicts accurately. INSL5_HUMAN is a hormone and is thus secreted into the extracellular space.

	TMHMM				UniProt
id	version	region	start	end	region	start	end
sp_P11279_LAMP1_HUMAN	TMHMM2.0	inside	1	10	Signal peptide	1	28
sp_P11279_LAMP1_HUMAN	TMHMM2.0	TMhelix	11	33	Signal peptide	1	28
sp_P11279_LAMP1_HUMAN	TMHMM2.0	outside	34	383	Lumenal	29	382
sp_P11279_LAMP1_HUMAN	TMHMM2.0	TMhelix	384	406	Helical	383	405
sp_P11279_LAMP1_HUMAN	TMHMM2.0	inside	407	417	Cytoplasmic	406	417

TMHMM does not predict this protein so well. It mistakes the signal peptide for an 'inside' region followed by a 'TMhelix', and it labels the lumenal region as 'outside', because its model only distinguishes 'inside', 'outside' and 'TMhelix'. The rest is predicted well.

	TMHMM				UniProt
id	version	region	start	end	region	start	end
sp_P05067_A4_HUMAN	TMHMM2.0	outside	1	700	Signal peptide	1	17
sp_P05067_A4_HUMAN	TMHMM2.0	outside	1	700	Extracellular	18	699
sp_P05067_A4_HUMAN	TMHMM2.0	TMhelix	701	723	Helical	700	723
sp_P05067_A4_HUMAN	TMHMM2.0	inside	724	770	Cytoplasmic	724	770

TMHMM again predicted almost correctly and missed only the signal peptide at the beginning, which it cannot predict.

Phobius and PolyPhobius

For Phobius and PolyPhobius, we used the web service<ref>http://www.ncbi.nlm.nih.gov/pubmed/17483518?dopt=Abstract</ref> at http://phobius.sbc.su.se/ with standard settings.

Phobius is a combined predictor of transmembrane protein topology and signal peptides. It models the different regions of the sequence as a series of interconnected states of an HMM.<ref>http://www.ncbi.nlm.nih.gov/pubmed/15111065?dopt=Abstract</ref>
PolyPhobius is a hidden Markov model (HMM) decoding algorithm. It combines the probabilities for sequence features of homologous sequences by considering the average of the posterior label probability of each position in a global sequence alignment. PolyPhobius was benchmarked against Phobius. <ref>http://www.ncbi.nlm.nih.gov/pubmed/15961464?dopt=Abstract</ref>

Phobius

Figure 11: predicted regions by Phobius
Source: http://phobius.sbc.su.se/

Phobius predicts very accurately, as seen below. The transmembrane region is predicted just 1-2 residues upstream of the annotated region, and the same holds for the topological domains before and after it. The signal peptide is also predicted correctly. The probability distribution is shown in Figure 11.

PREDICTED                                                     ANNOTATION
ID   sp|Q30201|HFE_HUMAN
FT   SIGNAL        1     21                             |  1-22
FT   REGION        1      7       N-REGION.              
FT   REGION        8     16       H-REGION.
FT   REGION       17     21       C-REGION.
FT   TOPO_DOM     22    304       NON CYTOPLASMIC.      |  23-306
FT   TRANSMEM    305    329                             |  307-330
FT   TOPO_DOM    330    348       CYTOPLASMIC.          |  331-348
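
The deviation between prediction and annotation can be read directly from the lines above. The sketch below assumes the layout used in this section, where the annotated range follows a `|` at the end of a Phobius `FT` line; `boundary_offsets` is a hypothetical helper.

```python
def boundary_offsets(ft_line):
    """Return (feature, start offset, end offset) or None for other lines.

    Offsets are predicted minus annotated position, so negative values
    mean the prediction lies upstream of the annotation.
    """
    predicted, _, annotated = ft_line.partition("|")
    fields = predicted.split()
    if len(fields) < 4 or fields[0] != "FT" or not annotated.strip():
        return None
    start, end = int(fields[2]), int(fields[3])
    ref_start, ref_end = (int(x) for x in annotated.strip().split("-"))
    return fields[1], start - ref_start, end - ref_end


lines = [
    "ID   sp|Q30201|HFE_HUMAN",
    "FT   REGION        1      7       N-REGION.",
    "FT   TOPO_DOM     22    304       NON CYTOPLASMIC.      |  23-306",
    "FT   TRANSMEM    305    329                             |  307-330",
    "FT   TOPO_DOM    330    348       CYTOPLASMIC.          |  331-348",
]
for line in lines:
    result = boundary_offsets(line)
    if result is not None:
        print(result)
```

Lines without an annotated range, such as the `ID` line and the `REGION` sub-features, are skipped.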

PolyPhobius

Figure 12: predicted regions by PolyPhobius
Source: http://phobius.sbc.su.se/

PolyPhobius also predicts very accurately, although in our case not quite as accurately as Phobius; the difference is small. The probability distribution is shown in Figure 12.

PREDICTED                                                     ANNOTATION
ID   sp|Q30201|HFE_HUMAN
FT   SIGNAL        1     23                             |  1-22
FT   REGION        1      5       N-REGION.              
FT   REGION        6     19       H-REGION.
FT   REGION       20     23       C-REGION.
FT   TOPO_DOM     24    304       NON CYTOPLASMIC.      |  23-306
FT   TRANSMEM    305    329                             |  307-330
FT   TOPO_DOM    330    348       CYTOPLASMIC.          |  331-348

Additional proteins

BACR_HALSA: Phobius and PolyPhobius give almost the same result. There is only a slight difference in the domain lengths, and both methods predicted the membrane topology correctly.
RET4_HUMAN: Phobius and PolyPhobius predicted the position of the signal peptide correctly; the overall prediction is correct, too.
INSL5_HUMAN: Phobius and PolyPhobius again predict accurately. The signal peptide and the extracellular region are at the correct positions.
LAMP1_HUMAN: Phobius and PolyPhobius predicted the membrane topology correctly.
A4_HUMAN: Phobius and PolyPhobius predicted the position of the signal peptide correctly.

Summing up, Phobius and PolyPhobius are very accurate in their predictions of the membrane topology. PolyPhobius takes much more time to produce a result but does not provide a noticeably better one, and is sometimes even slightly worse. Because of that, we decided that Phobius is the more convenient one.

OCTOPUS and SPOCTOPUS

OCTOPUS combines HMMs and artificial neural networks. OCTOPUS first creates a sequence profile by a homology search using BLAST. The profile is used as input to a set of neural networks that predict the preferred location of each residue. Each residue is predicted to be either inside or outside the cell and to be located in a transmembrane (M), interface (I), close loop (L) or globular loop (G) environment.
SPOCTOPUS is an extended version of OCTOPUS that can also predict signal peptides. It uses a neural network to predict a signal peptide if the score for each of the 70 N-terminal residues is high enough.

Both OCTOPUS and SPOCTOPUS predict the signal peptide and the transmembrane region correctly, as can be seen in the figures below. In both predictions, the signal peptide at the N-terminus has the correct length. Figure 13 and Figure 14 show the predictions for the HFE protein.

Figure 13: predicted regions by OCTOPUS
Source: http://octopus.cbr.su.se/

Figure 14: predicted regions by SPOCTOPUS
Source: http://octopus.cbr.su.se/

Additional proteins

The probability distributions for the additional proteins are shown in Figures 15 to 19 below.

Figure 15: BACR_HALSA predicted by OCTOPUS (left) and SPOCTOPUS (right)
Source: http://OCTOPUS.cbr.su.se/index.php

Figure 16: RET4_HUMAN predicted by OCTOPUS (left) and SPOCTOPUS (right)
Source: http://OCTOPUS.cbr.su.se/index.php

Figure 17: INSL5_HUMAN predicted by OCTOPUS (left) and SPOCTOPUS (right)
Source: http://OCTOPUS.cbr.su.se/index.php

Figure 18: LAMP1_HUMAN predicted by OCTOPUS (left) and SPOCTOPUS (right)
Source: http://OCTOPUS.cbr.su.se/index.php

Figure 19: A4_HUMAN predicted by OCTOPUS (left) and SPOCTOPUS (right)
Source: http://OCTOPUS.cbr.su.se/index.php

BACR_HALSA:

The prediction result of OCTOPUS and SPOCTOPUS is the same. They are both correct, because BACR_HALSA does not contain any signal peptides.

RET4_HUMAN:

OCTOPUS predicts a TM-helix instead of the signal peptide; SPOCTOPUS corrects this error.

INSL5_HUMAN:

OCTOPUS makes the same error as for RET4_HUMAN; SPOCTOPUS again corrects it.

LAMP1_HUMAN:

OCTOPUS and SPOCTOPUS both predict the TM-helix correctly. However, OCTOPUS makes the same error as for RET4_HUMAN and INSL5_HUMAN: it predicts a TM-helix instead of the signal peptide, and SPOCTOPUS corrects that.

A4_HUMAN:

OCTOPUS predicts a reentrant/dip region instead of the signal peptide region; SPOCTOPUS corrects that.

SignalP

To use it locally on our Linux box, we again had to correct some path issues.

The command we used was:

signalp -t y x.fasta > x.signalp

where 'x' again stands for the UniProt entry name of the protein. 'y' was chosen according to the organism of the protein: for all human proteins 'y' was set to eukaryotes ('euk') and for the bacterial protein P02945 to Gram-negative ('gram-'). This switch selects the neural networks and hidden Markov models, which are trained separately for different organism groups.
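Running the command for all six proteins can be scripted. The sketch below only rebuilds the command line shown above; the helper names `signalp_command` and `run_signalp` are hypothetical, and it assumes that `signalp` is on the PATH and that one FASTA file per entry name exists in the working directory.

```python
import subprocess

# Organism type for the -t switch, chosen as described in the text:
# 'euk' for the human proteins, 'gram-' for BACR_HALSA (P02945).
ORGANISM_TYPE = {
    "HFE_HUMAN": "euk",
    "BACR_HALSA": "gram-",
    "RET4_HUMAN": "euk",
    "INSL5_HUMAN": "euk",
    "LAMP1_HUMAN": "euk",
    "A4_HUMAN": "euk",
}

def signalp_command(entry_name):
    """Build the argument list for `signalp -t y x.fasta`."""
    return ["signalp", "-t", ORGANISM_TYPE[entry_name], entry_name + ".fasta"]

def run_signalp(entry_name):
    """Run SignalP and redirect its output to x.signalp.

    Assumes a local SignalP installation; not executed on import.
    """
    with open(entry_name + ".signalp", "w") as out:
        subprocess.run(signalp_command(entry_name), stdout=out, check=True)

if __name__ == "__main__":
    for name in ORGANISM_TYPE:
        print(" ".join(signalp_command(name)))
```

Keeping the organism type in one table makes it easy to see which network set was used for which protein.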

For the graphical output of HFE_HUMAN we used the SignalP server from: http://www.cbs.dtu.dk/services/SignalP

There are three scores for the SignalP-NN prediction shown in Figure 20:

C-score: 'cleavage site': raw cleavage site prediction
S-mean-score: 'average of the S-score': discrimination between secretory and non-secretory proteins
Y-max-score: 'combination of C-score and S-score': better cleavage site prediction

Figure 20: SignalP-NN prediction (source: signalp)

SignalP-NN result:
sp|Q30201|HFE_HUMAN   length = 348
Measure   Position Value   Cutoff signal peptide?
max. C    23       0.534   0.32   YES
max. Y    23       0.599   0.33   YES
max. S    16       0.995   0.87   YES
mean S    1-22     0.935   0.48   YES
     D    1-22     0.767   0.43   YES
Most likely cleavage site between pos. 22 and 23: LQG-RL
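For comparing several proteins it helps to read the summary table into a data structure. The following sketch parses the five measure rows shown above for HFE_HUMAN; the regular expression and the function name `parse_nn_summary` are our own assumptions about the layout and are not part of SignalP.

```python
import re

# The five measure rows of the SignalP-NN summary for HFE_HUMAN (copied from above).
NN_OUTPUT = """\
max. C    23       0.534   0.32   YES
max. Y    23       0.599   0.33   YES
max. S    16       0.995   0.87   YES
mean S    1-22     0.935   0.48   YES
     D    1-22     0.767   0.43   YES
"""

# measure, position (single residue or range), value, cutoff, YES/NO
ROW = re.compile(
    r"^\s*(max\. [CYS]|mean S|D)\s+(\S+)\s+([\d.]+)\s+([\d.]+)\s+(YES|NO)\s*$"
)

def parse_nn_summary(text):
    """Return {measure: {position, value, cutoff, signal_peptide}}."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            measure, position, value, cutoff, call = match.groups()
            rows[measure] = {
                "position": position,
                "value": float(value),
                "cutoff": float(cutoff),
                "signal_peptide": call == "YES",
            }
    return rows

summary = parse_nn_summary(NN_OUTPUT)
for measure, row in summary.items():
    print(measure, row["value"], ">", row["cutoff"], row["signal_peptide"])
```

Lines that do not match the pattern (the header and the cleavage site line) are simply skipped.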

Figure 21: SignalP-HMM prediction (source: signalp)

SignalP-HMM result:
>sp|Q30201|HFE_HUMAN
Prediction: Signal peptide
Signal peptide probability: 0.998
Signal anchor probability: 0.000
Max cleavage site probability: 0.297 between pos. 22 and 23

SignalP predicts a signal peptide probability of almost 1.0 and thus a signal anchor probability of 0. This leads to the prediction of a cleavage site between pos. 22 and 23 (Figure 21).

According to UniProt there is a signal peptide spanning pos. 1 to 22, which means that SignalP has predicted the signal peptide and the cleavage site exactly.

SignalP-NN result:
>sp_P02945_BACR_HALSA  length = 70
Measure  Position  Value  Cutoff  signal peptide?
max. C    16       0.331   0.52   NO
max. Y    43       0.066   0.33   NO
max. S    32       0.948   0.92   YES
mean S     1-42    0.216   0.49   NO
     D     1-42    0.141   0.44   NO
Most likely cleavage site between pos. 42 and 43: FLV-KG

SignalP-HMM result:
>sp|P02945|BACR_HALSA
Prediction: Non-secretory protein
Signal peptide probability: 0.000
Max cleavage site probability: 0.000 between pos. 15 and 16

For BACR_HALSA the result of SignalP is clearly wrong, because this protein does not contain any signal peptide.

SignalP-NN result:
>sp_P02753_RET4_HUMAN  length = 70
Measure  Position  Value  Cutoff  signal peptide?
max. C    19       0.929   0.32   YES
max. Y    19       0.901   0.33   YES
max. S     1       0.994   0.87   YES
mean S     1-18    0.938   0.48   YES
     D     1-18    0.920   0.43   YES
Most likely cleavage site between pos. 18 and 19: GRA-ER

SignalP-HMM result:
>sp_P02753_RET4_HUMAN
Prediction: Signal peptide
Signal peptide probability: 1.000
Signal anchor probability: 0.000
Max cleavage site probability: 0.979 between pos. 18 and 19

SignalP predicted the cleavage site very well: according to UniProt the signal peptide spans pos. 1 to 18, followed by the cleavage site.

SignalP-NN result:
>sp_Q9Y5Q6_INSL5_HUMA  length = 70
Measure  Position  Value  Cutoff  signal peptide?
max. C    23       0.855   0.32   YES
max. Y    23       0.778   0.33   YES
max. S    13       0.987   0.87   YES
mean S     1-22    0.852   0.48   YES
     D     1-22    0.815   0.43   YES
Most likely cleavage site between pos. 22 and 23: VRS-KE

SignalP-HMM result:
>sp_Q9Y5Q6_INSL5_HUMAN
Prediction: Signal peptide
Signal peptide probability: 0.999
Signal anchor probability: 0.000
Max cleavage site probability: 0.911 between pos. 22 and 23

This result is also predicted correctly (UniProt: signal peptide from pos. 1 to 22).

SignalP-NN result:
>sp_P11279_LAMP1_HUMA  length = 70
Measure  Position  Value  Cutoff  signal peptide?
max. C    29       0.978   0.32   YES
max. Y    29       0.903   0.33   YES
max. S    19       0.999   0.87   YES
mean S     1-28    0.960   0.48   YES
     D     1-28    0.932   0.43   YES
Most likely cleavage site between pos. 28 and 29: ASA-AM

SignalP-HMM result:
>sp_P11279_LAMP1_HUMAN
Prediction: Signal peptide
Signal peptide probability: 1.000
Signal anchor probability: 0.000
Max cleavage site probability: 0.847 between pos. 28 and 29

SignalP again predicted the cleavage site correctly.

SignalP-NN result:
>sp_P05067_A4_HUMAN    length = 70
Measure  Position  Value  Cutoff  signal peptide?
max. C    18       0.891   0.32   YES
max. Y    18       0.850   0.33   YES
max. S     2       0.992   0.87   YES
mean S     1-17    0.967   0.48   YES
     D     1-17    0.909   0.43   YES
Most likely cleavage site between pos. 17 and 18: ARA-LE

SignalP-HMM result:
>sp_P05067_A4_HUMAN
Prediction: Signal peptide
Signal peptide probability: 1.000
Signal anchor probability: 0.000
Max cleavage site probability: 0.993 between pos. 17 and 18

That prediction is accurate, too.

In short, SignalP predicts the position of the signal peptide cleavage site very accurately, but fails with the non-eukaryotic BACR_HALSA.

TargetP

TargetP predicts a signal peptide with high probability for each of the proteins. Except for P02945, which is a bacterial protein and has no signal peptide, the method seems to be pretty accurate. It scores the prediction for P02945 with a reliability class of 4, which means that the prediction is almost negligible (1 is the highest confidence, 5 the lowest).

### targetp v1.1 prediction results ##################################
Number of query sequences:  6
Cleavage site predictions included.
Using NON-PLANT networks.
Name                  Len            mTP     SP  other  Loc  RC  TPlen
----------------------------------------------------------------------
sp_Q30201_HFE_HUMAN   348          0.433  0.912  0.004   S    3     22
sp_P02945_BACR_HALSA  262          0.019  0.897  0.562   S    4    116
sp_P02753_RET4_HUMAN  201          0.242  0.928  0.020   S    2     18
sp_Q9Y5Q6_INSL5_HUMA  135          0.074  0.899  0.037   S    1     22
sp_P11279_LAMP1_HUMA  417          0.043  0.953  0.017   S    1     28
sp_P05067_A4_HUMAN    770          0.035  0.937  0.084   S    1     17
----------------------------------------------------------------------
cutoff                             0.000  0.000  0.000
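The reliability class can be used to flag doubtful predictions automatically. The sketch below parses the six result rows of the table above; the threshold of 4 for "low reliability" is our own choice, and the function name `parse_targetp` is hypothetical.

```python
# Result rows copied from the TargetP output above.
TARGETP_ROWS = """\
sp_Q30201_HFE_HUMAN   348          0.433  0.912  0.004   S    3     22
sp_P02945_BACR_HALSA  262          0.019  0.897  0.562   S    4    116
sp_P02753_RET4_HUMAN  201          0.242  0.928  0.020   S    2     18
sp_Q9Y5Q6_INSL5_HUMA  135          0.074  0.899  0.037   S    1     22
sp_P11279_LAMP1_HUMA  417          0.043  0.953  0.017   S    1     28
sp_P05067_A4_HUMAN    770          0.035  0.937  0.084   S    1     17
"""

def parse_targetp(text):
    """Parse rows with the columns Name Len mTP SP other Loc RC TPlen."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skip separators, headers and the cutoff line
        name, length, mtp, sp, other, loc, rc, tplen = fields
        records.append({
            "name": name,
            "len": int(length),
            "mTP": float(mtp),
            "SP": float(sp),
            "other": float(other),
            "loc": loc,
            "rc": int(rc),
            # TPlen is kept as None if it is not a plain number
            "tplen": int(tplen) if tplen.isdigit() else None,
        })
    return records

records = parse_targetp(TARGETP_ROWS)
# Assumed threshold: reliability class 4 or 5 counts as low reliability.
low_reliability = [r["name"] for r in records if r["rc"] >= 4]
print(low_reliability)
```

With this threshold only the BACR_HALSA prediction is flagged, which matches the observation in the text.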

Discussion

Prediction of GO terms

General

GO terms classify protein functions. Each GO term stands for a different protein function; therefore, classifying a protein into GO terms means predicting its functions.

HFE_HUMAN is annotated with 27 different GO terms<ref>http://www.ebi.ac.uk/QuickGO/GProtein?ac=Q30201</ref>:

GOID	GO Term	Aspect
GO:0002474	antigen processing and presentation of peptide antigen via MHC class I	Process
GO:0005515	protein binding	Function
GO:0005737	cytoplasm	Component
GO:0005769	early endosome	Component
GO:0005886	plasma membrane	Component
GO:0005887	integral to plasma membrane	Component
GO:0006461	protein complex assembly	Process
GO:0006810	transport	Process
GO:0006811	ion transport	Process
GO:0006826	iron ion transport	Process
GO:0006879	cellular iron ion homeostasis	Process
GO:0006898	receptor-mediated endocytosis	Process
GO:0006955	immune response	Process
GO:0007565	female pregnancy	Process
GO:0010106	cellular response to iron ion starvation	Process
GO:0016020	membrane	Component
GO:0016021	integral to membrane	Component
GO:0019882	antigen processing and presentation	Process
GO:0031410	cytoplasmic vesicle	Component
GO:0042446	hormone biosynthetic process	Process
GO:0042612	MHC class I protein complex	Component
GO:0045177	apical part of cell	Component
GO:0045178	basal part of cell	Component
GO:0048471	perinuclear region of cytoplasm	Component
GO:0055037	recycling endosome	Component
GO:0055072	iron ion homeostasis	Process
GO:0060586	multicellular organismal iron ion homeostasis	Process

GOPET

GOPET predicted 2 GO terms for HFE_HUMAN, neither of which overlaps with the annotation.

GOID	Aspect	Confidence	GO Term
GO:0004872	Molecular Function	91%	receptor activity
GO:0030106	Molecular Function	88%	MHC class I receptor activity
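The overlap between prediction and annotation can be checked with plain set operations. The sketch below uses the 27 annotated GO IDs and the 2 GOPET predictions listed above; the variable names are our own.

```python
# GO IDs annotated for HFE_HUMAN (copied from the table above).
ANNOTATED = {
    "GO:0002474", "GO:0005515", "GO:0005737", "GO:0005769", "GO:0005886",
    "GO:0005887", "GO:0006461", "GO:0006810", "GO:0006811", "GO:0006826",
    "GO:0006879", "GO:0006898", "GO:0006955", "GO:0007565", "GO:0010106",
    "GO:0016020", "GO:0016021", "GO:0019882", "GO:0031410", "GO:0042446",
    "GO:0042612", "GO:0045177", "GO:0045178", "GO:0048471", "GO:0055037",
    "GO:0055072", "GO:0060586",
}

# GO IDs predicted by GOPET for HFE_HUMAN.
GOPET_PREDICTED = {"GO:0004872", "GO:0030106"}

overlap = ANNOTATED & GOPET_PREDICTED          # predicted and annotated
unsupported = GOPET_PREDICTED - ANNOTATED      # predicted but not annotated

print(len(ANNOTATED), "annotated terms")
print("overlap:", sorted(overlap))
print("not in annotation:", sorted(unsupported))
```

Note that this only compares exact GO IDs; a predicted term that is a parent or child of an annotated term would not be counted as a match here.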

Additional proteins:

BACR_HALSA

3 GO terms were predicted, but only the one with the highest confidence (77%) is really connected to the protein:

ion channel activity

RET4_HUMAN

GOPET predicted 8 GO terms with confidences ranging from 90% down to 60%; 5 of them are linked to the protein:

binding
retinoid binding
retinol binding
transporter activity
retinal binding

INSL5_HUMAN

Only 1 GO term with a confidence of 80% is predicted by GOPET and it is also linked to the protein:

hormone activity

LAMP1_HUMAN

GOPET has predicted 2 GO terms with 60% confidence each, but none is linked to the protein.

A4_HUMAN

GOPET predicted 13 GO terms with confidences ranging from 87% down to 67%, but only 7 of them are really connected to the protein:

serine-type endopeptidase inhibitor activity
peptidase inhibitor activity
binding
protein binding
metal ion binding
DNA binding
heparin binding
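The GOPET results can be summarised as the fraction of predicted terms that are linked to each protein. The sketch below only uses the counts stated in the text above; calling this fraction "precision" is our own choice of wording.

```python
# (linked terms, predicted terms) per protein, as stated in the text.
GOPET_COUNTS = {
    "HFE_HUMAN": (0, 2),
    "BACR_HALSA": (1, 3),
    "RET4_HUMAN": (5, 8),
    "INSL5_HUMAN": (1, 1),
    "LAMP1_HUMAN": (0, 2),
    "A4_HUMAN": (7, 13),
}

# Fraction of predicted terms that are linked to the protein.
precision = {
    name: linked / predicted
    for name, (linked, predicted) in GOPET_COUNTS.items()
}

total_linked = sum(linked for linked, _ in GOPET_COUNTS.values())
total_predicted = sum(predicted for _, predicted in GOPET_COUNTS.values())

for name, value in precision.items():
    print(f"{name}: {value:.2f}")
print(f"overall: {total_linked}/{total_predicted}")
```

Over all six proteins, 14 of the 29 predicted terms are linked to the respective protein.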

Pfam

Pfam is a database that contains protein domains and families. For our search we used the webserver at http://pfam.sanger.ac.uk/search with standard values.

Afterwards we used the pfam2go mapping to find the GO entries matching the Pfam descriptions.

Pfam classifies the HFE_HUMAN protein into two families (Figure 22):

Figure 22: Pfam classification of protein families (source: pfam)

Family: MHC_I (PF00129)
Family: C1-set (PF07654)

For the PF00129 family there are four hits in the pfam2go data:

Pfam:PF00129 MHC_I > GO:immune response ; GO:0006955
Pfam:PF00129 MHC_I > GO:antigen processing and presentation ; GO:0019882
Pfam:PF00129 MHC_I > GO:membrane ; GO:0016020
Pfam:PF00129 MHC_I > GO:MHC class I protein complex ; GO:0042612

Figure 23: Significant Pfam-A matches (source: pfam)

All of these GO entries are part of the UniProt entry for HFE_HUMAN, so this family assignment is correct.
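The lookup in the pfam2go data can be sketched as a small parser for lines of the form shown above. The regular expression and the function name `parse_pfam2go` are our own assumptions; only the four PF00129 lines from the text are used as input.

```python
import re
from collections import defaultdict

# pfam2go lines for PF00129, copied from the text.
PFAM2GO_LINES = """\
Pfam:PF00129 MHC_I > GO:immune response ; GO:0006955
Pfam:PF00129 MHC_I > GO:antigen processing and presentation ; GO:0019882
Pfam:PF00129 MHC_I > GO:membrane ; GO:0016020
Pfam:PF00129 MHC_I > GO:MHC class I protein complex ; GO:0042612
"""

# Pfam accession, family name, GO term name, GO ID
LINE = re.compile(r"^Pfam:(PF\d+)\s+(\S+)\s+>\s+GO:(.+?)\s+;\s+(GO:\d+)\s*$")

def parse_pfam2go(text):
    """Return a list of (pfam_id, family, go_term, go_id) tuples."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            entries.append(match.groups())
    return entries

entries = parse_pfam2go(PFAM2GO_LINES)

# Group the GO IDs by Pfam family; a family without any line
# (such as PF07654) simply yields an empty list.
go_ids_by_family = defaultdict(list)
for pfam_id, family, go_term, go_id in entries:
    go_ids_by_family[pfam_id].append(go_id)

print(go_ids_by_family["PF00129"])
print(go_ids_by_family["PF07654"])
```

The resulting GO IDs can then be compared with the annotated GO terms of the protein.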

For the PF07654 family there are no entries in the pfam2go data and thus no verifiable cross-links to UniProt; maybe this family is not yet included in the pfam2go data.

For a more detailed picture see Figure 23, which shows the Pfam-A matches with alignment.

Additional proteins:

>BACR_HALSA
Pfam:PF01036 Bac_rhodopsin > GO:ion channel activity ; GO:0005216
Pfam:PF01036 Bac_rhodopsin > GO:ion transport ; GO:0006811
Pfam:PF01036 Bac_rhodopsin > GO:membrane ; GO:0016020

>RET4_HUMAN
Pfam:PF00061 Lipocalin > GO:binding ; GO:0005488

>INSL5_HUMAN
Pfam:PF00049 Insulin > GO:hormone activity ; GO:0005179
Pfam:PF00049 Insulin > GO:extracellular region ; GO:0005576

>LAMP1_HUMAN
Pfam:PF01299 Lamp > GO:membrane ; GO:0016020

>A4_HUMAN
Pfam:PF02177 APP_N > GO:binding ; GO:0005488
Pfam:PF02177 APP_N > GO:integral to membrane ; GO:0016021
Family: APP_Cu_bd (PF12924) --> no match at pfam2go
Pfam:PF00014 Kunitz_BPTI > GO:serine-type endopeptidase inhibitor activity ; GO:0004867
Family: APP_E2 (PF12925) --> no match at pfam2go
Pfam:PF03494 Beta-APP > GO:binding ; GO:0005488
Pfam:PF03494 Beta-APP > GO:integral to membrane ; GO:0016021
Family: APP_amyloid (PF10515) --> no match at pfam2go

In summary, all predicted GO terms are correct and are cross-linked to the corresponding UniProt entries. However, the predicted GO terms are not exhaustive; UniProt lists many more.

ProtFun 2.2

ProtFun is an ab initio prediction server.

Results
ProtFun assigned immune response (GO:0006955; Process) to HFE, which is correct. However, this is the only correct GO term that ProtFun predicts for the HFE gene.

 Functional category                  Prob     Odds
 Amino_acid_biosynthesis              0.011    0.484
 Biosynthesis_of_cofactors            0.105    1.452
 Cell_envelope                     => 0.633   10.377
 Cellular_processes                   0.095    1.297
 Central_intermediary_metabolism      0.231    3.663
 Energy_metabolism                    0.059    0.659
 Fatty_acid_metabolism                0.016    1.265
 Purines_and_pyrimidines              0.583    2.400
 Regulatory_functions                 0.013    0.079
 Replication_and_transcription        0.019    0.073
 Translation                          0.079    1.801
 Transport_and_binding                0.732    1.785

 Enzyme/nonenzyme                     Prob     Odds
 Enzyme                               0.208    0.727
 Nonenzyme                         => 0.792    1.110

 Enzyme class                         Prob     Odds
 Oxidoreductase (EC 1.-.-.-)          0.084    0.404
 Transferase    (EC 2.-.-.-)          0.062    0.179
 Hydrolase      (EC 3.-.-.-)          0.135    0.425
 Lyase          (EC 4.-.-.-)          0.049    1.054
 Isomerase      (EC 5.-.-.-)          0.010    0.321
 Ligase         (EC 6.-.-.-)          0.042    0.827

 Gene Ontology category               Prob     Odds
 Signal_transducer                    0.201    0.939
 Receptor                             0.353    2.076
 Hormone                              0.002    0.365
 Structural_protein                   0.005    0.190
 Transporter                          0.024    0.219
 Ion_channel                          0.008    0.147
 Voltage-gated_ion_channel            0.002    0.085
 Cation_channel                       0.010    0.221
 Transcription                        0.036    0.283
 Transcription_regulation             0.018    0.147
 Stress_response                      0.274    3.108
 Immune_response                   => 0.381    4.486
 Growth_factor                        0.013    0.943
 Metal_ion_transport                  0.009    0.02
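In the ProtFun output the chosen category of each block is marked with "=>". The sketch below extracts these marked rows from a shortened copy of the HFE output above; the regular expressions and the function name `marked_categories` are our own assumptions about the text layout.

```python
import re

# Shortened copy of the ProtFun output for HFE (values taken from above).
PROTFUN_OUTPUT = """\
 Functional category                  Prob     Odds
 Cell_envelope                     => 0.633   10.377
 Transport_and_binding                0.732    1.785

 Enzyme/nonenzyme                     Prob     Odds
 Enzyme                               0.208    0.727
 Nonenzyme                         => 0.792    1.110

 Gene Ontology category               Prob     Odds
 Receptor                             0.353    2.076
 Immune_response                   => 0.381    4.486
"""

# Block header, optionally prefixed with '#', e.g. "# Enzyme class  Prob  Odds"
HEADER = re.compile(r"^[#\s]*(.+?)\s+Prob\s+Odds\s*$")
# Data row: name, optional '=>' marker, probability, odds
ROW = re.compile(r"^\s*(.+?)\s+(=>\s+)?(\d+\.\d+)\s+(\d+\.\d+)\s*$")

def marked_categories(text):
    """Return {block name: category marked with '=>'}."""
    marked = {}
    section = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            section = header.group(1)
            continue
        row = ROW.match(line)
        if row and row.group(2) and section is not None:
            marked[section] = row.group(1)
    return marked

hfe_marked = marked_categories(PROTFUN_OUTPUT)
print(hfe_marked)
```

A block without a marked row, as in the enzyme class block of some outputs, is simply missing from the result.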

Additional proteins:

>sp_P02945_BACR_HALSA

# Functional category                  Prob     Odds
 Amino_acid_biosynthesis              0.033    1.495
 Biosynthesis_of_cofactors            0.186    2.589
 Cell_envelope                        0.029    0.483
 Cellular_processes                   0.051    0.694
 Central_intermediary_metabolism      0.045    0.711
 Energy_metabolism                    0.138    1.537
 Fatty_acid_metabolism                0.016    1.265
 Purines_and_pyrimidines              0.302    1.244
 Regulatory_functions                 0.013    0.080
 Replication_and_transcription        0.019    0.073
 Translation                          0.059    1.339
 Transport_and_binding             => 0.791    1.929

# Enzyme/nonenzyme                     Prob     Odds
 Enzyme                               0.199    0.696
 Nonenzyme                         => 0.801    1.122

# Enzyme class                         Prob     Odds
 Oxidoreductase (EC 1.-.-.-)          0.114    0.549
 Transferase    (EC 2.-.-.-)          0.031    0.091
 Hydrolase      (EC 3.-.-.-)          0.057    0.180
 Lyase          (EC 4.-.-.-)          0.020    0.430
 Isomerase      (EC 5.-.-.-)          0.010    0.321
 Ligase         (EC 6.-.-.-)          0.017    0.326

# Gene Ontology category               Prob     Odds
 Signal_transducer                    0.258    1.205
 Receptor                             0.355    2.087
 Hormone                              0.001    0.206
 Structural_protein                   0.006    0.200
 Transporter                       => 0.440    4.036
 Ion_channel                          0.010    0.169
 Voltage-gated_ion_channel            0.004    0.172
 Cation_channel                       0.078    1.689
 Transcription                        0.026    0.205
 Transcription_regulation             0.028    0.226
 Stress_response                      0.012    0.139
 Immune_response                      0.011    0.128
 Growth_factor                        0.010    0.727
 Metal_ion_transport                  0.049    0.106

ProtFun correctly predicted all three: the functional category, the enzyme/nonenzyme classification and the gene ontology category.

>sp_P02753_RET4_HUMAN

# Functional category                  Prob     Odds
 Amino_acid_biosynthesis              0.017    0.751
 Biosynthesis_of_cofactors            0.044    0.610
 Cell_envelope                     => 0.804   13.186
 Cellular_processes                   0.075    1.021
 Central_intermediary_metabolism      0.197    3.128
 Energy_metabolism                    0.043    0.475
 Fatty_acid_metabolism                0.016    1.265
 Purines_and_pyrimidines              0.275    1.131
 Regulatory_functions                 0.013    0.080
 Replication_and_transcription        0.022    0.084
 Translation                          0.032    0.721
 Transport_and_binding                0.800    1.951

# Enzyme/nonenzyme                     Prob     Odds
 Enzyme                            => 0.544    1.900
 Nonenzyme                            0.456    0.639

# Enzyme class                         Prob     Odds
 Oxidoreductase (EC 1.-.-.-)          0.095    0.458
 Transferase    (EC 2.-.-.-)          0.038    0.109
 Hydrolase      (EC 3.-.-.-)          0.235    0.742
 Lyase          (EC 4.-.-.-)       => 0.059    1.264
 Isomerase      (EC 5.-.-.-)          0.010    0.321
 Ligase         (EC 6.-.-.-)          0.017    0.326

# Gene Ontology category               Prob     Odds
 Signal_transducer                    0.202    0.942
 Receptor                             0.147    0.862
 Hormone                              0.004    0.667
 Structural_protein                   0.002    0.058
 Transporter                          0.025    0.232
 Ion_channel                          0.016    0.288
 Voltage-gated_ion_channel            0.003    0.148
 Cation_channel                       0.010    0.215
 Transcription                        0.027    0.207
 Transcription_regulation             0.025    0.196
 Stress_response                      0.161    1.829
 Immune_response                   => 0.239    2.813
 Growth_factor                        0.023    1.617
 Metal_ion_transport                  0.009    0.020

ProtFun did not predict correctly: according to UniProt, the functional category, the enzyme/nonenzyme classification and the enzyme class are wrong. Only the gene ontology category is correct.

>sp_Q9Y5Q6_INSL5_HUMAN

# Functional category                  Prob     Odds
 Amino_acid_biosynthesis              0.011    0.484
 Biosynthesis_of_cofactors            0.040    0.558
 Cell_envelope                     => 0.756   12.393
 Cellular_processes                   0.033    0.448
 Central_intermediary_metabolism      0.048    0.755
 Energy_metabolism                    0.036    0.397
 Fatty_acid_metabolism                0.016    1.265
 Purines_and_pyrimidines              0.144    0.592
 Regulatory_functions                 0.014    0.087
 Replication_and_transcription        0.020    0.075
 Translation                          0.032    0.735
 Transport_and_binding                0.834    2.033

# Enzyme/nonenzyme                     Prob     Odds
 Enzyme                               0.209    0.729
 Nonenzyme                         => 0.791    1.109

# Enzyme class                         Prob     Odds
 Oxidoreductase (EC 1.-.-.-)          0.056    0.268
 Transferase    (EC 2.-.-.-)          0.031    0.091
 Hydrolase      (EC 3.-.-.-)          0.062    0.195
 Lyase          (EC 4.-.-.-)          0.020    0.430
 Isomerase      (EC 5.-.-.-)          0.010    0.321
 Ligase         (EC 6.-.-.-)          0.017    0.327

# Gene Ontology category               Prob     Odds
 Signal_transducer                    0.374    1.746
 Receptor                             0.128    0.750
 Hormone                           => 0.247   37.936
 Structural_protein                   0.001    0.041
 Transporter                          0.025    0.228
 Ion_channel                          0.010    0.168
 Voltage-gated_ion_channel            0.003    0.131
 Cation_channel                       0.010    0.215
 Transcription                        0.054    0.425
 Transcription_regulation             0.091    0.724
 Stress_response                      0.099    1.128
 Immune_response                      0.178    2.090
 Growth_factor                        0.061    4.379
 Metal_ion_transport                  0.009    0.020

ProtFun does not predict everything correctly. The functional category is incorrect, but the enzyme/nonenzyme and gene ontology category predictions are correct again.

>sp_P11279_LAMP1_HUMAN

# Functional category                  Prob     Odds
 Amino_acid_biosynthesis              0.011    0.484
 Biosynthesis_of_cofactors            0.053    0.735
 Cell_envelope                     => 0.804   13.186
 Cellular_processes                   0.027    0.373
 Central_intermediary_metabolism      0.138    2.188
 Energy_metabolism                    0.037    0.411
 Fatty_acid_metabolism                0.016    1.265
 Purines_and_pyrimidines              0.533    2.195
 Regulatory_functions                 0.015    0.090
 Replication_and_transcription        0.019    0.073
 Translation                          0.027    0.613
 Transport_and_binding                0.834    2.033

# Enzyme/nonenzyme                     Prob     Odds
 Enzyme                               0.276    0.965
 Nonenzyme                         => 0.724    1.014

# Enzyme class                         Prob     Odds
 Oxidoreductase (EC 1.-.-.-)          0.039    0.187
 Transferase    (EC 2.-.-.-)          0.046    0.134
 Hydrolase      (EC 3.-.-.-)          0.058    0.184
 Lyase          (EC 4.-.-.-)          0.020    0.430
 Isomerase      (EC 5.-.-.-)          0.010    0.321
 Ligase         (EC 6.-.-.-)          0.017    0.326

# Gene Ontology category               Prob     Odds
 Signal_transducer                    0.396    1.849
 Receptor                             0.282    1.659
 Hormone                              0.001    0.206
 Structural_protein                   0.011    0.408
 Transporter                          0.024    0.222
 Ion_channel                          0.008    0.147
 Voltage-gated_ion_channel            0.002    0.111
 Cation_channel                       0.010    0.215
 Transcription                        0.032    0.247
 Transcription_regulation             0.018    0.142
 Stress_response                      0.246    2.795
 Immune_response                   => 0.371    4.368
 Growth_factor                        0.013    0.956
 Metal_ion_transport                  0.009    0.020

ProtFun does not predict everything correctly. The functional category is incorrect, but the enzyme/nonenzyme and gene ontology category predictions are correct again.

>sp_P05067_A4_HUMAN

# Functional category                  Prob     Odds
 Amino_acid_biosynthesis              0.020    0.921
 Biosynthesis_of_cofactors            0.261    3.623
 Cell_envelope                     => 0.804   13.186
 Cellular_processes                   0.053    0.730
 Central_intermediary_metabolism      0.184    2.920
 Energy_metabolism                    0.023    0.259
 Fatty_acid_metabolism                0.016    1.265
 Purines_and_pyrimidines              0.417    1.716
 Regulatory_functions                 0.013    0.084
 Replication_and_transcription        0.029    0.109
 Translation                          0.027    0.613
 Transport_and_binding                0.827    2.016

# Enzyme/nonenzyme                     Prob     Odds
 Enzyme                            => 0.392    1.368
 Nonenzyme                            0.608    0.852

# Enzyme class                         Prob     Odds
 Oxidoreductase (EC 1.-.-.-)          0.024    0.114
 Transferase    (EC 2.-.-.-)          0.208    0.603
 Hydrolase      (EC 3.-.-.-)          0.190    0.600
 Lyase          (EC 4.-.-.-)          0.020    0.430
 Isomerase      (EC 5.-.-.-)          0.010    0.324
 Ligase         (EC 6.-.-.-)          0.048    0.946

# Gene Ontology category               Prob     Odds
 Signal_transducer                    0.126    0.586
 Receptor                             0.036    0.211
 Hormone                              0.001    0.206
 Structural_protein                => 0.034    1.205
 Transporter                          0.024    0.222
 Ion_channel                          0.009    0.162
 Voltage-gated_ion_channel            0.002    0.108
 Cation_channel                       0.010    0.215
 Transcription                        0.043    0.335
 Transcription_regulation             0.018    0.143
 Stress_response                      0.076    0.862
 Immune_response                      0.016    0.183
 Growth_factor                        0.005    0.372
 Metal_ion_transport                  0.009    0.020

ProtFun failed completely: the predictions of the functional category, the enzyme/nonenzyme classification and the gene ontology category are all wrong.

References

@@ Line 339: / Line 339: @@
 ==Discussion==
+Most tools predict a disordered region at the C-terminus of the protein.
 =Prediction of transmembrane alpha-helices and signal peptides=

Difference between revisions of "Sequence-based predictions"
