Task 3: Sequence-based predictions

Revision as of 08:43, 30 August 2011

Task description

The full description of this task can be found here.

Task 3.1: Secondary structure prediction

PSIPRED

More information on PSIPRED can be found here: PSIPRED

Run with: sudo ./runpsipred reference.fasta

JPred3

More information on JPred can be found here: Jpred3

Used server: http://www.compbio.dundee.ac.uk/www-jpred/index.html

DSSP

More information on DSSP can be found here: DSSP

Run with: dssp 2PAH.pdb 2PAH.dssp

Result

It was necessary to align the sequence of the pdb-file and the fasta sequence of PAH.



Reference

MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAK


1PHZ

------------------GQETSYIEDNSNQNGAISLIFSLKEEVGALAK


DSSP

--------------------------------------------------


PSIPRED

CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEEEEEECCCCCHHHHH


JPred

--HHHH--HHHHHHHHHH---------------EEEEEEEE----HHHHH


Reference

VLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRH


1PHZ

VLRLFEENDINLTHIESRPSRLNKDEYEFFTYLDKRTKPVLGSIIKSLRN


DSSP

--------------------------------------------------


PSIPRED

HHHHHHHCCCCEEEEECCCCCCCCCCEEEEEECCCCCCHHHHHHHHHHCC


JPred

HHHHHHH---EEEEEE----------EEEEEEEE---HHHHHHHHHHHHH


Reference

DIGATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFK


1PHZ

DIGATVHELSRDKEKNTVPWFPRTIQELDRFANQI------LDADHPGFK


DSSP

-----------------.....SBGGGGGGTT.S.------..TTSTTTT


PSIPRED

CCEEEECCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHCCCCCCCCCCCCC


JPred

H-----EEE----------------HHHHHH---EEE-------------


Reference

DPVYRARRKQFADIAYNYRHGQPIPRVEYMEEEKKTWGTVFKTLKSLYKT


1PHZ

DPVYRARRKQFADIAYNYRHGQPIPRVEYTEEEKQTWGTVFRTLKALYKT


DSSP

.HHHHHHHHHHHHHHHH..TTS........HHHHHHHHHHHHHHHHHHHH


PSIPRED

CHHHHHHHHHHHHHHHCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHCC


JPred

-HHHHHHHHHHHH-----------------HHHHHHHHHHHHHHHHH---


Reference

HACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLS


1PHZ

HACYEHNHIFPLLEKYCGFREDNIPQLEDVSQFLQTCTGFRLRPVAGLLS


DSSP

HB.HHHHHHHHHHHHHS..BTTB...HHHHHHHHHHHT..EEEE.SS...


PSIPRED

CHHHHHHHHHHHHHHHCCCCCCCCCCHHHHHHHHHHHHCCEEEECCCCCC


JPred

--HHHHHHHHHHHHHH----------HHHHHHHHHHH---EEEE------


Reference

SRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA


1PHZ

SRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA


DSSP

HHHHHHHHTTTEEEE......TT.TT..SS..HHHHHTTTTTTTTSHHHH


PSIPRED

HHHHHHHCCCCEECCCEEEECCCCCCCCCCCCHHHHHHCCCCCCCCCHHH


JPred

HHHHHHHH----EEEEEEE-----------HHHHHHHH--------HHHH


Reference

QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSS


1PHZ

QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKEGDSIKAYGAGLLSS


DSSP

HHHHHHHHHHTT..HHHHHHHHHHHHTTTTT.EEEETTEEEE..HHHHT.


PSIPRED

HHHHHHHHHCCCCCHHHHHHHHHHEEEEEEEEEECCCCCEEEECCCCCCC


JPred

HHHHHHHHHHH---HHHHHHHHH-HHHEEEEEEEEE---EEEEE------


Reference

FGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVR


1PHZ

FGELQYCLSDKPKLLPLELEKTACQEYSVTEFQPLYYVAESFSDAKEKVR


DSSP

HHHHHHTTSSSS..EE..HHHHTT....SSS..S..EEES.HHHHHHHHH


PSIPRED

HHHHHHHHCCCCCCCCCCHHHHHCCCCCCCCCCEEEEEECCHHHHHHHHH


JPred

HHHHHHHH-----EE---HHHHH-----------EEEE---HHHHHHHHH


Reference

NFAATIPRPFSVRYDPYTQRIEVLDNTQQLKILADSINSEIGILCSALQK


1PHZ

TFAATIPRPFSVRYDPYTQRVEVLDNT-----------------------


DSSP

HHHHTS..SSEEEEETTTTEEEEE.HHHHHHHHHHHHHHHHHHHHHHHHH


PSIPRED

HHHHHCCCCCEEEECCCCCEEEECCCHHHHHHHHHHHHHHHHHHHHHHHH


JPred

HHHHHH------------EEEEE---HHHHHHHHHHHHHHHHHHHHHHHH


Reference

IK


1PHZ

--


DSSP

T.


PSIPRED

HC


JPred

--

Discussion

DSSP can be regarded as a kind of reference for the secondary structure. It assigns the secondary structure from the coordinates of a resolved structure and is therefore much more reliable than PSIPRED or JPred. There are other tools that use the resolved structure, but their assignments do not have to agree. The results of these kinds of methods are good hints for the secondary structure, but they depend strongly on the definitions used for the different secondary structure elements. PSIPRED and JPred predict the secondary structure from the amino acid sequence of the protein alone. The advantage of these methods is that the resolved structure of the protein is not needed. To compare PSIPRED and JPred with the DSSP result, it is necessary to translate the 8-state assignment of DSSP into a three-state assignment. There are several possibilities to do this (see ref).

H (alpha-Helix), G (3_10 helix), I (pi-helix) -> H (alpha-helix)
E (extended strand) -> E (beta-strand)
B (residue in isolated beta-bridge), T (turn), S (bend), . (rest, coil) -> C (loop, coil)
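As a minimal sketch of this reduction, the mapping above could be applied to a DSSP string as follows. The function name reduce_dssp and the decision to pass gap characters ("-") through unchanged are our own assumptions for illustration, not part of DSSP.

```python
# Mapping from the 8 DSSP states to 3 states, as listed above.
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # alpha-, 3_10- and pi-helix
    "E": "E",                        # extended strand
    "B": "C", "T": "C", "S": "C",   # isolated bridge, turn, bend
    ".": "C", " ": "C",              # rest / coil
}


def reduce_dssp(states, gap="-"):
    """Translate a DSSP 8-state string into H/E/C, keeping gap characters."""
    reduced = []
    for state in states:
        if state == gap:
            reduced.append(gap)          # alignment gap: no structure known
        else:
            reduced.append(DSSP_TO_3[state])
    return "".join(reduced)


# One DSSP block from the alignment above.
print(reduce_dssp("HB.HHHHHHHHHHHHHS..BTTB...HHHHHHHHHHHT..EEEE.SS..."))
```

An unknown state letter raises a KeyError, which makes it obvious when an input uses a different alphabet.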

Two measures are used to evaluate secondary structure predictions: Q3 (the percentage of residues predicted in the correct state) and SOV (Segment OVerlap, see ref). In this case the proteinmodel server was used to calculate the scores.

method - score	ALL	HELIX	STRAND	COIL
JPred - Q3	83.9	90.5	67.7	80.0
PSIPRED - Q3	83.9	89.9	71.0	80.0
JPred - SOV	84.9	91.8	66.6	81.0
PSIPRED - SOV	87.3	96.1	78.5	80.2

The difference between JPred and PSIPRED is marginal. Both performed well on our sequence. Probably the training sets of these methods contained enough knowledge about our sequence.

One of the major problems in secondary structure prediction is the misclassification of an observed helix as a strand and vice versa. The following tables show how often each combination of observed and predicted state occurs for the two prediction methods.

JPRED

observed/predicted	frequency
HC	15
HH	143
CC	112
EC	10
EE	21
CE	16
CH	12

PSIPRED

observed/predicted	frequency
HC	14
HH	142
CC	112
HE	2
EC	9
CE	14
EE	22
CH	14

JPred never misclassifies H as E or E as H. PSIPRED misclassifies H as E only twice. The two prediction methods therefore seem to be quite similar in quality.
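The Q3 values can be recomputed from these frequency tables. The sketch below assumes, as the table header indicates, that the first letter of each pair is the observed state and the second the predicted state. The function name q3 is our own choice.

```python
# Observed/predicted pair counts copied from the two tables above.
JPRED = {"HC": 15, "HH": 143, "CC": 112, "EC": 10, "EE": 21, "CE": 16, "CH": 12}
PSIPRED = {"HC": 14, "HH": 142, "CC": 112, "HE": 2, "EC": 9, "CE": 14,
           "EE": 22, "CH": 14}


def q3(counts, state=None):
    """Percentage of correctly predicted residues, overall or per observed state."""
    if state is None:
        correct = sum(n for pair, n in counts.items() if pair[0] == pair[1])
        total = sum(counts.values())
    else:
        correct = counts.get(state + state, 0)
        total = sum(n for pair, n in counts.items() if pair[0] == state)
    return round(100.0 * correct / total, 1)


for name, counts in (("JPred", JPRED), ("PSIPRED", PSIPRED)):
    print(name, q3(counts), [q3(counts, s) for s in "HEC"])
# JPred 83.9 [90.5, 67.7, 80.0]
# PSIPRED 83.9 [89.9, 71.0, 80.0]
```

These numbers agree with the Q3 rows of the score table, so the frequency tables and the reported scores are consistent with each other.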

Task 3.2: Prediction of disordered regions

DISOPRED

Run with: ./rundisopred reference.fasta

Result

Position	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22
Sequence	M	S	T	A	V	L	E	N	P	G	L	G	R	K	L	S	D	F	G	Q	E	T
IUPRED long	0.3840	0.4051	0.4220	0.3356	0.3599	0.3872	0.2918	0.3149	0.3494	0.3807	0.4017	0.4256	0.3529	0.2715	0.3087	0.3740	0.4652	0.3910	0.3910	0.3704	0.3840	0.4652
IUPRED short	0.9447	0.8823	0.8457	0.8074	0.7540	0.6442	0.6035	0.5711	0.4458	0.4037	0.3668	0.4149	0.4116	0.4116	0.3578	0.3225	0.3939	0.3885	0.3885	0.2865	0.2913	0.3630
MD	D	D	D	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	D	D	D	D	D	D	D	D	D	D	D	D	D	D	D	D	D	D	D	D	D	D
Poodle	D	D	D	-	-	-	-	D	D	D	D	D	D	D	D	D	D	D	D	D	D	D
Position	23	24	25	26	27	28	29	30	31	32	33	34	35	36	37	38	39	40	41	42	43	44
Sequence	S	Y	I	E	D	N	C	N	Q	N	G	A	I	S	L	I	F	S	L	K	E	E
IUPRED long	0.4441	0.3704	0.3704	0.3704	0.2988	0.1969	0.2884	0.2129	0.1969	0.2002	0.2064	0.1554	0.2258	0.2783	0.1969	0.1731	0.1554	0.1554	0.0967	0.0888	0.0518	0.0269
IUPRED short	0.3630	0.3668	0.3456	0.3359	0.3359	0.2385	0.2385	0.1602	0.1532	0.0832	0.0771	0.0858	0.1416	0.1532	0.1416	0.1117	0.0643	0.0813	0.0789	0.0502	0.0308	0.0297
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	D	D	D	D	D	D	D	D	D	D	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	D	D	D	D	D	D	D	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	45	46	47	48	49	50	51	52	53	54	55	56	57	58	59	60	61	62	63	64	65	66
Sequence	V	G	A	L	A	K	V	L	R	L	F	E	E	N	D	V	N	L	T	H	I	E
IUPRED long	0.0334	0.0592	0.0662	0.1070	0.1184	0.2094	0.1476	0.2193	0.2258	0.1476	0.1449	0.1942	0.1914	0.2436	0.3117	0.3321	0.2645	0.3182	0.3910	0.3983	0.4864	0.5139
IUPRED short	0.0218	0.0218	0.0231	0.0414	0.0723	0.1266	0.0789	0.1380	0.0935	0.0832	0.0771	0.0677	0.0701	0.0858	0.1322	0.1844	0.1878	0.2385	0.2255	0.2292	0.3053	0.3939
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	67	68	69	70	71	72	73	74	75	76	77	78	79	80	81	82	83	84	85	86	87	88
Sequence	S	R	P	S	R	L	K	K	D	E	Y	E	F	F	T	H	L	D	K	R	S	L
IUPRED long	0.5098	0.3948	0.2817	0.2817	0.3460	0.2575	0.3426	0.3321	0.3249	0.4017	0.3182	0.3494	0.3286	0.2258	0.2328	0.2503	0.2503	0.1823	0.1823	0.1184	0.0719	0.1137
IUPRED short	0.3885	0.3053	0.3096	0.2209	0.2167	0.2122	0.3005	0.2167	0.2122	0.3005	0.2865	0.2913	0.2167	0.2209	0.1998	0.1495	0.2209	0.2292	0.1635	0.1088	0.1088	0.1088
MD	-	D	D	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	D	D	D	D	D	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	89	90	91	92	93	94	95	96	97	98	99	100	101	102	103	104	105	106	107	108	109	110
Sequence	P	A	L	T	N	I	I	K	I	L	R	H	D	I	G	A	T	V	H	E	L	S
IUPRED long	0.1070	0.1852	0.2064	0.2034	0.1759	0.2503	0.1881	0.1942	0.2034	0.1424	0.2034	0.1759	0.2002	0.2680	0.2575	0.2364	0.3149	0.3948	0.3392	0.4409	0.4051	0.3019
IUPRED short	0.0643	0.1060	0.1602	0.1566	0.0991	0.1667	0.1495	0.1456	0.1060	0.1041	0.1635	0.0832	0.0965	0.1416	0.1456	0.1456	0.2041	0.2786	0.2820	0.3535	0.3535	0.4037
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	D
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	D	D	D	D	D	D
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	D
Position	111	112	113	114	115	116	117	118	119	120	121	122	123	124	125	126	127	128	129	130	131	132
Sequence	R	D	K	K	K	D	T	V	P	W	F	P	R	T	I	Q	E	L	D	R	F	A
IUPRED long	0.3215	0.3149	0.3948	0.3149	0.3494	0.3566	0.3392	0.3426	0.3286	0.2988	0.2752	0.2918	0.2988	0.2224	0.1583	0.1611	0.1007	0.1007	0.1229	0.1137	0.1349	0.2292
IUPRED short	0.3146	0.2385	0.3578	0.3399	0.3578	0.2913	0.3762	0.3885	0.3184	0.4116	0.4037	0.3263	0.3053	0.3096	0.3359	0.2748	0.2122	0.2167	0.2167	0.1844	0.2483	0.3399
MD	D	D	D	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	D	D	D	D	D	D	D	D	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	D	D	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	133	134	135	136	137	138	139	140	141	142	143	144	145	146	147	148	149	150	151	152	153	154
Sequence	N	Q	I	L	S	Y	G	A	E	L	D	A	D	H	P	G	F	K	D	P	V	Y
IUPRED long	0.1823	0.1881	0.1852	0.2849	0.2752	0.1643	0.2292	0.2292	0.2575	0.3053	0.2503	0.2364	0.2002	0.2752	0.3494	0.3494	0.4409	0.3321	0.3286	0.3215	0.3182	0.2918
IUPRED short	0.2167	0.2333	0.2041	0.2820	0.2657	0.2963	0.3630	0.2700	0.2786	0.3668	0.4245	0.3578	0.2602	0.3184	0.3491	0.3456	0.4116	0.3992	0.4458	0.3668	0.4333	0.4333
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	155	156	157	158	159	160	161	162	163	164	165	166	167	168	169	170	171	172	173	174	175	176
Sequence	R	A	R	R	K	Q	F	A	D	I	A	Y	N	Y	R	H	G	Q	P	I	P	R
IUPRED long	0.2328	0.2436	0.1671	0.1449	0.1399	0.2328	0.2470	0.2680	0.1702	0.2470	0.3249	0.2645	0.3019	0.2193	0.1852	0.2002	0.1969	0.3149	0.3356	0.3286	0.4119	0.3286
IUPRED short	0.4078	0.3847	0.3096	0.2748	0.1998	0.2786	0.2786	0.2657	0.2432	0.3311	0.3263	0.3535	0.3885	0.3263	0.3263	0.2558	0.1958	0.2602	0.2820	0.2748	0.3311	0.3399
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	177	178	179	180	181	182	183	184	185	186	187	188	189	190	191	192	193	194	195	196	197	198
Sequence	V	E	Y	M	E	E	E	K	K	T	W	G	T	V	F	K	T	L	K	S	L	Y
IUPRED long	0.4119	0.4017	0.4256	0.3215	0.3215	0.3249	0.2364	0.2094	0.2849	0.1852	0.1206	0.1643	0.1583	0.2292	0.2399	0.1476	0.0851	0.0851	0.0506	0.0568	0.0531	0.0662
IUPRED short	0.4037	0.2786	0.3491	0.3491	0.2865	0.1998	0.1878	0.1635	0.1635	0.1349	0.1322	0.1240	0.0771	0.1150	0.1349	0.1240	0.1060	0.0542	0.0316	0.0363	0.0212	0.0464
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	199	200	201	202	203	204	205	206	207	208	209	210	211	212	213	214	215	216	217	218	219	220
Sequence	K	T	H	A	C	Y	E	Y	N	H	I	F	P	L	L	E	K	Y	C	G	F	H
IUPRED long	0.0334	0.0380	0.0341	0.0398	0.0443	0.0424	0.0405	0.0235	0.0244	0.0198	0.0313	0.0356	0.0364	0.0405	0.0287	0.0605	0.1092	0.0618	0.1137	0.1115	0.0765	0.1184
IUPRED short	0.0425	0.0268	0.0182	0.0414	0.0245	0.0128	0.0259	0.0259	0.0137	0.0087	0.0160	0.0084	0.0081	0.0094	0.0128	0.0279	0.0279	0.0308	0.0554	0.0308	0.0376	0.0660
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	221	222	223	224	225	226	227	228	229	230	231	232	233	234	235	236	237	238	239	240	241	242
Sequence	E	D	N	I	P	Q	L	E	D	V	S	Q	F	L	Q	T	C	T	G	F	R	L
IUPRED long	0.2129	0.1048	0.1028	0.1671	0.1583	0.0929	0.1554	0.2399	0.1373	0.2328	0.1643	0.1528	0.1643	0.1070	0.1424	0.1229	0.0704	0.0704	0.0618	0.0568	0.0851	0.0929
IUPRED short	0.0771	0.0621	0.1150	0.1088	0.0554	0.0643	0.1060	0.1266	0.1292	0.2080	0.1380	0.1240	0.0660	0.0643	0.1349	0.0701	0.0526	0.0991	0.0490	0.0268	0.0441	0.0441
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	243	244	245	246	247	248	249	250	251	252	253	254	255	256	257	258	259	260	261	262	263	264
Sequence	R	P	V	A	G	L	L	S	S	R	D	F	L	G	G	L	A	F	R	V	F	H
IUPRED long	0.0424	0.0518	0.0851	0.0799	0.0454	0.0690	0.0320	0.0320	0.0424	0.0218	0.0334	0.0184	0.0165	0.0244	0.0178	0.0117	0.0174	0.0258	0.0269	0.0263	0.0253	0.0275
IUPRED short	0.0414	0.0789	0.0701	0.0395	0.0395	0.0744	0.0464	0.0376	0.0405	0.0425	0.0514	0.0297	0.0200	0.0173	0.0252	0.0274	0.0286	0.0304	0.0304	0.0268	0.0259	0.0259
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	265	266	267	268	269	270	271	272	273	274	275	276	277	278	279	280	281	282	283	284	285	286
Sequence	C	T	Q	Y	I	R	H	G	S	K	P	M	Y	T	P	E	P	D	I	C	H	E
IUPRED long	0.0443	0.0424	0.0433	0.0506	0.0948	0.1323	0.2399	0.1611	0.1323	0.2193	0.2436	0.2645	0.1791	0.1702	0.2470	0.2609	0.2849	0.2094	0.1184	0.1184	0.1323	0.1137
IUPRED short	0.0502	0.1041	0.0813	0.0464	0.0771	0.1117	0.1958	0.2209	0.2963	0.3184	0.2255	0.3184	0.3311	0.2333	0.2483	0.3311	0.3263	0.2786	0.3096	0.2385	0.1667	0.1416
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	287	288	289	290	291	292	293	294	295	296	297	298	299	300	301	302	303	304	305	306	307	308
Sequence	L	L	G	H	V	P	L	F	S	D	R	S	F	A	Q	F	S	Q	E	I	G	L
IUPRED long	0.1349	0.1184	0.0948	0.0851	0.0364	0.0300	0.0300	0.0592	0.0662	0.0690	0.0356	0.0494	0.0817	0.0483	0.0518	0.0646	0.0662	0.1184	0.2292	0.1554	0.0909	0.1028
IUPRED short	0.1958	0.2748	0.2786	0.1698	0.1322	0.1240	0.0660	0.1041	0.1878	0.2167	0.1240	0.1844	0.1635	0.1349	0.1349	0.1088	0.0909	0.1088	0.1958	0.2041	0.2041	0.1322
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	309	310	311	312	313	314	315	316	317	318	319	320	321	322	323	324	325	326	327	328	329	330
Sequence	A	S	L	G	A	P	D	E	Y	I	E	K	L	A	T	I	Y	W	F	T	V	E
IUPRED long	0.1007	0.1229	0.1229	0.1229	0.1449	0.0929	0.0454	0.0205	0.0349	0.0235	0.0433	0.0263	0.0275	0.0275	0.0165	0.0191	0.0174	0.0165	0.0160	0.0248	0.0244	0.0214
IUPRED short	0.0789	0.1456	0.1602	0.0884	0.1178	0.1117	0.0567	0.0286	0.0327	0.0160	0.0327	0.0252	0.0297	0.0304	0.0286	0.0194	0.0078	0.0066	0.0055	0.0090	0.0194	0.0157
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	331	332	333	334	335	336	337	338	339	340	341	342	343	344	345	346	347	348	349	350	351	352
Sequence	F	G	L	C	K	Q	G	D	S	I	K	A	Y	G	A	G	L	L	S	S	F	G
IUPRED long	0.0188	0.0184	0.0218	0.0178	0.0281	0.0287	0.0327	0.0646	0.0605	0.0463	0.0405	0.0929	0.0555	0.0967	0.1007	0.0592	0.0327	0.0320	0.0327	0.0327	0.0662	0.0662
IUPRED short	0.0086	0.0157	0.0157	0.0083	0.0157	0.0226	0.0350	0.0376	0.0200	0.0350	0.0304	0.0308	0.0316	0.0621	0.0677	0.0771	0.0455	0.0268	0.0128	0.0131	0.0308	0.0316
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	353	354	355	356	357	358	359	360	361	362	363	364	365	366	367	368	369	370	371	372	373	374
Sequence	E	L	Q	Y	C	L	S	E	K	P	K	L	L	P	L	E	L	E	K	T	A	I
IUPRED long	0.0483	0.0494	0.0581	0.0424	0.0473	0.0483	0.0929	0.0909	0.0967	0.1501	0.0929	0.0870	0.1501	0.0888	0.1501	0.1671	0.2470	0.2715	0.1501	0.1611	0.1611	0.1048
IUPRED short	0.0526	0.0909	0.0514	0.0395	0.0490	0.0490	0.0991	0.0554	0.0643	0.1088	0.1088	0.0813	0.0813	0.0813	0.1349	0.1495	0.2167	0.1532	0.1322	0.1495	0.0701	0.0789
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	375	376	377	378	379	380	381	382	383	384	385	386	387	388	389	390	391	392	393	394	395	396
Sequence	Q	N	Y	T	V	T	E	F	Q	P	L	Y	Y	V	A	E	S	F	N	D	A	K
IUPRED long	0.1048	0.1028	0.0568	0.0765	0.0765	0.1229	0.0543	0.0618	0.0662	0.0646	0.1048	0.1115	0.0985	0.1070	0.1028	0.1611	0.0870	0.0631	0.1007	0.1007	0.0483	0.0985
IUPRED short	0.1566	0.1698	0.0965	0.1060	0.0567	0.0858	0.0789	0.0832	0.0502	0.0643	0.1205	0.0965	0.0909	0.1456	0.1416	0.1456	0.1456	0.1416	0.1495	0.1205	0.0858	0.1292
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	397	398	399	400	401	402	403	404	405	406	407	408	409	410	411	412	413	414	415	416	417	418
Sequence	E	K	V	R	N	F	A	A	T	I	P	R	P	F	S	V	R	Y	D	P	Y	T
IUPRED long	0.1554	0.2575	0.2164	0.2470	0.1671	0.1702	0.1969	0.1969	0.2064	0.1643	0.1759	0.1611	0.1671	0.1528	0.1671	0.1092	0.1251	0.1583	0.2002	0.2002	0.2951	0.2715
IUPRED short	0.1205	0.2041	0.2786	0.3456	0.2432	0.2483	0.2602	0.2558	0.1766	0.2122	0.2558	0.1805	0.1844	0.2657	0.2657	0.1732	0.2657	0.2209	0.1667	0.1805	0.2657	0.2292
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	419	420	421	422	423	424	425	426	427	428	429	430	431	432	433	434	435	436	437	438	439	440
Sequence	Q	R	I	E	V	L	D	N	T	Q	Q	L	K	I	L	A	D	S	I	N	S	E
IUPRED long	0.1942	0.1643	0.1852	0.1275	0.1554	0.1702	0.2399	0.1501	0.1424	0.2094	0.2193	0.1399	0.1399	0.1424	0.0835	0.0799	0.1275	0.0985	0.0555	0.0568	0.0518	0.0294
IUPRED short	0.2385	0.2122	0.2080	0.2041	0.1878	0.1240	0.1416	0.1456	0.1240	0.1041	0.1041	0.1018	0.1088	0.0991	0.0965	0.0832	0.0813	0.0514	0.0526	0.0425	0.0200	0.0182
MD	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Position	441	442	443	444	445	446	447	448	449	450	451	452
Sequence	I	G	I	L	C	S	A	L	Q	K	I	K
IUPRED long	0.0463	0.0473	0.0662	0.0870	0.0888	0.0618	0.0483	0.0765	0.0494	0.0364	0.0214	0.0356
IUPRED short	0.0363	0.0173	0.0316	0.0744	0.1240	0.1456	0.1766	0.3146	0.3578	0.3992	0.4333	0.5802
MD	-	-	-	-	-	-	-	-	D	D	D	D
DisoPred	-	-	-	-	-	-	-	-	-	-	-	-
Poodle	-	-	-	-	-	-	-	-	-	-	D	D

Discussion

It is difficult to compare the different methods, because there is no reference. In the DisProt database you can find the disordered regions of different proteins. A database search with our reference sequence returned one reliable match, DP00094. If this alignment is right, there should be a disordered region in our protein from residue 30 to 107. IUPRED measures disorder with a score between 0 and 1, and it is hard to define a threshold. Assuming that IUPRED predicts disorder in this region, the threshold is about 0.25. Of course this can only be an approximation.

method	beginning region	central region	end region
MD	1-3	110-113	449-452
DisoPred	1-32	105-118	none
Poodle	1-29	110-112	451-452
IUPRED long	1-29	136-182	none
IUPRED short	1-27	136-181	448-452
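To illustrate how a score threshold turns the per-residue IUPRED values into regions, here is a minimal sketch using the approximate threshold of 0.25. The function name disordered_regions is hypothetical, and no smoothing or minimum region length is applied, so it is not necessarily how the regions in the table were derived.

```python
def disordered_regions(scores, threshold=0.25):
    """Return 1-based (start, end) ranges where the score is >= threshold."""
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score >= threshold and start is None:
            start = position                        # a new region opens
        elif score < threshold and start is not None:
            regions.append((start, position - 1))   # the region closes
            start = None
    if start is not None:                           # region runs to the end
        regions.append((start, len(scores)))
    return regions


# IUPRED short scores for residues 1-30, copied from the result table.
iupred_short = [
    0.9447, 0.8823, 0.8457, 0.8074, 0.7540, 0.6442, 0.6035, 0.5711,
    0.4458, 0.4037, 0.3668, 0.4149, 0.4116, 0.4116, 0.3578, 0.3225,
    0.3939, 0.3885, 0.3885, 0.2865, 0.2913, 0.3630, 0.3630, 0.3668,
    0.3456, 0.3359, 0.3359, 0.2385, 0.2385, 0.1602,
]
print(disordered_regions(iupred_short))   # [(1, 27)]
```

For the first 30 residues this gives the same N-terminal region (1-27) that the table lists for IUPRED short.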

The methods seem to hit something in the neighborhood of the region that is disordered according to DisProt. The data on disorder are not very accurate, and neither are the predictions of the methods trained on these data. Perhaps more accurate experiments to identify disorder will give the prediction methods a better basis.

Following the hints in the oral discussion of this task, we compare the predictions to the secondary structure and to the B-factors of the C-alpha atoms. Most experimental structures known for PAH are limited to residues 110 to 424. For this comparison we used the experimental structure 1J8U of PAH.

Figure 1: This figure shows the residues of the structure 1j8u of PAH colored by the b-factor of their C-alpha atom.

Figure 1 shows the residues of PAH colored by the B-factor of their C-alpha atom. The B-factor is a measure of the flexibility of an atom in an experimental structure. Three regions seem to be of higher flexibility.

region	secondary structure	remark
118-120	coil	begin of the experimental structure
130-135	alpha-helix
422-424	beta-sheet	end of the experimental structure
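The C-alpha B-factors behind such a coloring can be read directly from the ATOM records of a PDB file, which use fixed columns (atom name in columns 13-16, chain in 22, residue number in 23-26, B-factor in 61-66). The following is only a sketch: the function names and the file name 1J8U.pdb are assumptions, and it is not the tool used to produce the figure.

```python
def ca_bfactors(pdb_lines, chain=None):
    """Map residue number -> B-factor of its C-alpha atom (ATOM records only)."""
    bfactors = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue                              # skip HETATM, REMARK, ...
        if line[12:16].strip() != "CA":           # atom name, columns 13-16
            continue
        if chain is not None and line[21] != chain:   # chain ID, column 22
            continue
        residue_number = int(line[22:26])         # columns 23-26
        # Keep the first C-alpha seen per residue (first alternate location).
        bfactors.setdefault(residue_number, float(line[60:66]))  # columns 61-66
    return bfactors


def most_flexible(bfactors, n=5):
    """Residue numbers with the n highest C-alpha B-factors."""
    return sorted(bfactors, key=bfactors.get, reverse=True)[:n]


# Usage, assuming a local copy of the structure (hypothetical file name):
# with open("1J8U.pdb") as handle:
#     print(most_flexible(ca_bfactors(handle)))
```

Restricting the parser to ATOM records avoids confusing calcium ions (also named CA, but stored as HETATM) with C-alpha atoms.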

We were surprised by this visualization. We did not expect to see the greatest flexibility in structured regions such as an alpha-helix or a beta-sheet. The flexibility of the peripheral, solvent-exposed parts of the structure can be explained to some extent, but the flexibility of the alpha-helix residues 130 to 135 is interesting. In nature PAH is a homotetramer. Each monomer consists of a regulatory N-terminal domain (residues 1-117), the catalytic domain (residues 118-427), and a C-terminal domain (residues 428-453) that is responsible for the oligomerization of the identical monomers. The flexible region 422 to 424 is probably more defined in the homotetrameric form. PAH seems to increase in volume during its catalytic activity, and the monomers of the homotetramer are connected by their C-terminal domains. It therefore seems reasonable that the linker residues 422 to 424 are more flexible. The linker is probably not disordered, but the connecting domain can be.

There were four other regions with high flexibility flanking the binding pocket within the catalytic domain.

region	secondary structure
247-248	coil
378-381	coil
145	coil
135	sheet

These regions are probably flexible due to the catalytic function of the domain. They are probably more defined during certain steps of the catalytic activity of the protein. Their flexibility is probably necessary to keep the solvent away from the binding pocket. At least no method predicted these sites to be disordered.

Figure 2: This figure shows the residues of the structure 1phz of PAH colored by the b-factor of their C-alpha atom.

A more complete experimental structure of the rat PAH, 1PHZ, is shown in Figure 2. The structure includes the N-terminal regulatory domain. This domain seems to be very unstructured and flexible. The first 20 amino acids could not be located in the experiment, which indicates a very high flexibility. This domain is usually excluded in experiments, because it is thought to be too flexible and to increase the susceptibility to proteases. Therefore the prediction of disorder in this region is probably right.

In the following we consider the secondary structure of the regions predicted to be disordered.

method	region	secondary structure	secondary structure origin
MD	1-3	CC	JPred
MD	110-113	CCCC	JPred
MD	449-452	HHCC	DSSP
DisoPred	1-32	CCHHHHCCHHHHHHHHHHCCCCCCCCCCCCCC	JPred
DisoPred	105-118	CCEEECCCCCCCCC	JPred
Poodle	1-29	CCHHHHCCHHHHHHHHHHCCCCCCCCCCC	JPred
Poodle	110-112	CCC	JPred
Poodle	451-452	CC	DSSP
IUPRED long	1-29	CCHHHHCCHHHHHHHHHHCCCCCCCCCCC	JPred
IUPRED long	136-182	------CCCCCCCCCCHHHHHHHHHHHHHHHHCCCCCCCCCCCCCH	DSSP
IUPRED short	1-27	CCHHHHCCHHHHHHHHHHCCCCCCCCC	JPred
IUPRED short	136-181	------CCCCCCCCCCHHHHHHHHHHHHHHHHCCCCCCCCCCCCCH	DSSP
IUPRED short	448-452	HHHCC	DSSP

Every method marked parts of the protein that have a defined structure (helix, sheet) as disordered. Where possible, the secondary structure was taken from DSSP, which uses an experimental structure, in our case 1PHZ. 1PHZ contains an Fe atom. Perhaps this is enough to introduce order into otherwise disordered regions of the protein. As seen in the experimental structure 1PHZ, the N-terminal domain is unstructured. Therefore the secondary structure prediction of JPred in this region is probably wrong.

In order to summarize the results:

There is probably disorder in the N-terminal regulatory domain. The first 20 amino acids of 1PHZ could not be detected in the experiment, and this domain is usually excluded from experimental structures because it is too flexible and attracts proteases. This indicates disorder in the N-terminal region, which was also predicted by the methods. The experimental structure 1PHZ shows that this region is mostly unstructured and highly flexible.
There is probably no disorder in the catalytic domain. Some of the surface loops are flexible, but this flexibility is probably part of the catalytic process.
The C-terminal domain, which is necessary to build the homotetramer, may contain disorder, but this is unlikely. The domain seems to be structurally defined. The linker residues 422 to 424 are flexible due to the expansion of the catalytic domain during the processing of phenylalanine.

The prediction of disorder still seems to be in its fledgling stages.

Task 3.3: Prediction of transmembrane alpha-helices and signal peptides

Annotated sequence features

PAH

The phenylalanine-4-hydroxylase has no annotated signal peptide or transmembrane helices.

BACR_HALSA

The bacteriorhodopsin has the following annotated signal peptide and transmembrane helices:

Position	Feature Name	Description
1 - 13	Propeptide
14 – 23	Topological domain	Extracellular
24 - 42	Transmembrane	Helical; Name=Helix A
43 – 56	Topological domain	Cytoplasmic
57 - 75	Transmembrane	Helical; Name=Helix B
76 – 91	Topological domain	Extracellular
92 - 109	Transmembrane	Helical; Name=Helix C
110 – 120	Topological domain	Cytoplasmic
121 - 140	Transmembrane	Helical; Name=Helix D
141 – 147	Topological domain	Extracellular
148 - 167	Transmembrane	Helical; Name=Helix E
168 – 185	Topological domain	Cytoplasmic
186 - 204	Transmembrane	Helical; Name=Helix F
205 – 216	Topological domain	Extracellular
217 - 236	Transmembrane	Helical; Name=Helix G
237 – 262	Topological domain	Cytoplasmic

RET4_HUMAN

The retinol-binding protein 4 has the following annotated signal peptide (no transmembrane helices are annotated):

Position	Feature Name	Description
1 - 18	Signal peptide

INSL5_HUMAN

The Insulin-like peptide INSL5 has the following annotated signal peptide (no transmembrane helices are annotated):

Position	Feature Name	Description
1 - 22	Signal peptide

LAMP1_HUMAN

The lysosome-associated membrane glycoprotein 1 has the following annotated signal peptide and transmembrane helices:

Position	Feature Name	Description
1 - 28	Signal peptide
29 – 382	Topological domain	Lumenal
383 - 405	Transmembrane	Helical;
406 – 417	Topological domain	Cytoplasmic

A4_HUMAN

The Amyloid beta A4 protein has the following annotated signal peptide and transmembrane helices:

Position	Feature Name	Description
1 - 17	Signal peptide
18 – 699	Topological domain	Extracellular
700 - 723	Transmembrane	Helical;
724 – 770	Topological domain	Cytoplasmic

General questions on the prediction of transmembrane alpha-helices and signal peptides

Why is the prediction of transmembrane helices and signal peptides grouped together here?

Methods that predict only transmembrane helices often predict signal peptides as transmembrane helices as well. The reason for this is that both transmembrane helices and signal peptides consist mainly of hydrophobic residues. These false predictions lead to inaccurate topological features and thus to a wrongly annotated protein function. To avoid these cases, most recent methods couple their transmembrane prediction with a signal peptide prediction.
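To make the confusion concrete, the sketch below computes a sliding-window hydropathy profile with the Kyte-Doolittle scale. This is a generic hydrophobicity approach chosen for illustration, not one of the methods discussed on this page. The window of 19 residues and the cutoff of 1.6 are the values commonly quoted for this scale and should be treated as assumptions. Any sufficiently long hydrophobic stretch exceeds the cutoff, whether it belongs to a signal peptide or to a transmembrane helix.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def hydrophobic_windows(sequence, window=19, cutoff=1.6):
    """1-based start positions of windows whose mean hydropathy exceeds the cutoff."""
    profile = hydropathy_profile(sequence, window)
    return [i + 1 for i, mean in enumerate(profile) if mean > cutoff]


# Artificial example: a leucine stretch flanked by lysines.
print(hydrophobic_windows("KKKKKLLLLLKKKKK", window=5))   # [5, 6, 7]
```

The profile alone carries no information about where in the protein the hydrophobic stretch sits or what follows it, which is why a separate signal peptide model is needed.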

Description of different signal peptides

Signal peptides for the import to the endoplasmic reticulum (ER)

The import to the ER is usually required for the secretory pathway (to export proteins out of a cell). The import can occur either co-translationally (the nascent protein chain is translocated together with the ribosome) or post-translationally (only the fully synthesized protein is transported to the ER). In both cases, however, the SEC pathway is mostly used.

The co-translational transport to the ER is done by the signal recognition particle (SRP). This particle recognizes the N-terminal signal-sequence of the nascent polypeptide chain and then transports it to the ER membrane where the complex, consisting of SRP, polypeptide chain and ribosome, is recognized by the ER membrane bound signal recognition particle receptor (SR). After this recognition the polypeptide chain is imported into the ER lumen via the SEC channel in an ATP dependent process.

The post-translational import to the ER lumen is done by chaperones, which guide the polypeptide chain to the SEC channel, through which it is then imported in an ATP-dependent process.

Besides the import to the ER lumen, an import to the ER membrane is possible as well. So far, 5 different types of import to the ER membrane are known.

Type 1 requires an N-terminal signal sequence and an internal stop-transfer anchor sequence, which becomes the part that is inserted in the membrane.

Types 2 and 3 do not require an N-terminal signal sequence; only an internal signal-anchor sequence is required. The difference between type 2 and 3 is that type 2 has positively charged residues before the signal-anchor sequence (on the N-terminal side) and type 3 has positively charged residues after the signal-anchor sequence (on the C-terminal side). These charged residues of a transmembrane protein are always in the cytosol. Thus, type 2 inserted proteins have their N-terminal end in the cytosol, whereas type 3 inserted proteins have their C-terminal end in the cytosol.

Type 4-A and 4-B insertion is also known as multipass membrane insertion. These proteins do not have a single transmembrane helix like the proteins imported via types 1, 2 and 3; instead they have several transmembrane helices. Hence, they contain multiple internal stop-transfer anchor sequences and internal signal-anchor sequences. The difference between type 4-A and 4-B is that in type 4-A the N- and C-terminal ends are both located in the cytosol, whereas type 4-B import results in an N-terminal end residing in the ER lumen and a C-terminal end residing in the cytosol.

In addition to the N-terminal import of transmembrane proteins there is also the possibility of a C-terminal import. Obviously, these proteins are imported post-translationally.

Signal peptides for the import to the mitochondrion

There are several targets for import to the mitochondrion: proteins can be translocated to the matrix, the outer membrane, the inner membrane and the intermembrane space.

Proteins that are destined for the matrix of a mitochondrion have an N-terminal matrix-targeting sequence. The import to the matrix is assisted by chaperones (Hsc70), which guide the protein to the import pore complex of the mitochondrion. The import through the outer membrane is conducted by the TOM complex and the subsequent import through the inner membrane is conducted by the TIM complex. After successful import to the matrix the signal sequence is cleaved off by proteolytic enzymes.

Import to the inner membrane can occur in three ways. The first way is the TIM22 pathway; proteins using this pathway need internal targeting sequences. The second way is the stop-transfer import, for which proteins need a stop-transfer sequence and an N-terminal matrix-targeting sequence. The third way is called conservative sorting. Proteins using this pathway also have an N-terminal targeting sequence and, in addition, internal Oxa1-targeting sequences, which are recognized by Oxa1 proteins that execute the import to the membrane.

Proteins imported to the outer membrane of a mitochondrion usually have PORTA domains which are recognized by the TOB/SAM complex.

Signalpeptides for the import to the chloroplast

Proteins heading to chloroplasts can target different parts of it. For example the stroma, inner and outer membrane, the thylakoids membrane or the thylakoids lumen.

Usually these protein have a N-terminal targeting sequence.

Signalpeptides for the import to the peroxisome

Peroxisomal proteins can be imported to the lumen or to the membrane. Proteins imported to the lumen have either a peroxisomal targeting signal at the C-terminus (also known as PTS1) or a targeting sequence close to the N-terminus (also known as PTS2). Proteins imported to the membrane can have an intrinsic membrane peroxisomal targeting signal (mPTS). However, not all membrane proteins have this mPTS. Those without it are imported to the ER, and from there they bud off together with the maturing peroxisome.

Signal peptides for the import to the nucleus and the export from the nucleus

Proteins which are imported to the nucleus require a nuclear localisation signal (NLS) which is recognized by importin. The NLS containing protein is then imported via the nuclear pore complex (NPC) to the nucleoplasm.

Proteins which are exported from the nucleus require a nuclear export signal (NES), which is recognized and bound by exportin. In addition to exportin a second component, known as Ran*GTP, is required to mediate the export through the NPC.

TMHMM

Details of the method

Author: Sonnhammer, von Heijne & Krogh

Year: 1998

Reference: PubMed

Description

Figure 3: HMM architecture of TMHMM Disclaimer: This file is redistributed from [Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.] . All rights belong to the creator.

This method is based on a hidden Markov model (HMM). The authors tried to model the 'grammar' of transmembrane proteins in order to predict their topology more accurately than methods that rely only on, for example, propensity values and do not consider the topological constraints of this class of proteins.

For each structural feature, the TMHMM HMM defines one or more states that represent it. The transmembrane helix, for example, is modeled by three sub-models: one for the helix core, one for the helix cap on the cytoplasmic side of the membrane and one for the cap on the non-cytoplasmic side. In addition to this helix model, there are sub-models for the cytoplasmic loop and the non-cytoplasmic loops as well as a sub-model for globular regions. Each sub-model corresponds to one or more HMM states: the globular sub-model consists of a single state, whereas the helix core and the caps are modeled by multiple states.

The 'grammar' is incorporated into the HMM by defining the possible transitions from one sub-model to another. For example, a cytoplasmic loop region can only be followed by a cytoplasmic cap region, then by the helix core, then by a non-cytoplasmic cap and after that by either a short or a long non-cytoplasmic loop, and so on.
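The idea of such a grammar can be illustrated with a minimal sketch. It works on whole regions rather than on individual HMM states, and the region names and the `is_valid_topology` helper are assumptions made for this illustration only; they are not part of TMHMM.

```python
# Coarse, region-level illustration of a topology 'grammar'.
# The real TMHMM model splits every region into many HMM states;
# the four region names used here are hypothetical simplifications.
ALLOWED_NEXT = {
    "inside": {"helix_in_to_out"},
    "helix_in_to_out": {"outside"},
    "outside": {"helix_out_to_in"},
    "helix_out_to_in": {"inside"},
}


def is_valid_topology(regions):
    """Return True if every region is known and each step is an allowed transition."""
    if any(region not in ALLOWED_NEXT for region in regions):
        return False
    for current, following in zip(regions, regions[1:]):
        if following not in ALLOWED_NEXT[current]:
            return False
    return True


valid = ["inside", "helix_in_to_out", "outside", "helix_out_to_in", "inside"]
invalid = ["inside", "helix_in_to_out", "inside"]  # a helix must end on the other side

print(is_valid_topology(valid))
print(is_valid_topology(invalid))
```

Because a helix that starts on the inside may only be followed by an outside loop, a prediction that keeps both ends of a single helix on the same side is rejected by construction.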

Predicted features

This method predicts the transmembrane helices and whether the remaining parts of the protein are in the cytoplasm (in) or outside of it (out).

Required information for the prediction

Users only need the amino acid sequence of their query protein. The transition and emission probabilities were derived from 160 transmembrane protein sequences.

Execution

Before we could execute TMHMM we had to change all occurrences of "/usr/local/bin/" to "/usr/bin" in these files: tmhmm, tmhmm.ORIG and tmhmmformat.pl

Then we executed the following command to retrieve the results for all sequences:

tmhmm all.fa > task_33/tmhmm_out.txt

Results and discussion

PAH

Position	Feature Name
1 - 452	outside

As expected, TMHMM predicted no transmembrane helix. However, it predicted the whole protein to be outside (that is, outside of the cytosol). This is wrong according to the UniProt annotation, which places the protein in the cytosol.

BACR_HALSA

Position	Feature Name
1 - 22	outside
23 - 42	TMhelix
43 - 54	inside
55 - 77	TMhelix
78 - 91	outside
92 - 114	TMhelix
115 - 120	inside
121 - 143	TMhelix
144 - 147	outside
148 - 170	TMhelix
171 - 189	inside
190 - 212	TMhelix
213 - 262	outside

The transmembrane helices of BACR_HALSA were predicted almost correctly: the predicted positions differ from the annotated ones by only a few residues. However, TMHMM failed to predict the last helix (helix G, 217 - 236), and hence the C-terminal end was falsely predicted to be outside, although it is actually located inside the cytoplasm.
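A quick way to check such a prediction is to parse the segment table and count the helices. The following is a minimal sketch that reads the table in the layout shown above (not the raw TMHMM output format); the function names `parse_segments` and `summarize` are made up for this example.

```python
import re

# Segment table for BACR_HALSA, copied from the table above.
BACR_HALSA_TABLE = """\
Position Feature Name
1 - 22 outside
23 - 42 TMhelix
43 - 54 inside
55 - 77 TMhelix
78 - 91 outside
92 - 114 TMhelix
115 - 120 inside
121 - 143 TMhelix
144 - 147 outside
148 - 170 TMhelix
171 - 189 inside
190 - 212 TMhelix
213 - 262 outside
"""

SEGMENT = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s+(\S+)\s*$")


def parse_segments(text):
    """Return (start, end, label) tuples; lines that are not segments are skipped."""
    segments = []
    for line in text.splitlines():
        match = SEGMENT.match(line)
        if match:
            segments.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return segments


def summarize(segments):
    """Count predicted helices and report the location of both termini."""
    helices = [(start, end) for start, end, label in segments if label == "TMhelix"]
    return {
        "n_helices": len(helices),
        "n_term": segments[0][2],
        "c_term": segments[-1][2],
        "helices": helices,
    }


summary = summarize(parse_segments(BACR_HALSA_TABLE))
print(summary["n_helices"], summary["n_term"], summary["c_term"])
```

The summary shows six predicted helices with both termini on the outside, which makes the missing seventh helix and the resulting wrong C-terminal location easy to spot.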

RET4_HUMAN

Position	Feature Name
1 - 201	outside

TMHMM predicted no transmembrane helix, which is correct according to UniProt. The predicted cellular location (outside) is also correct.

INSL5_HUMAN

Position	Feature Name
1 - 135	outside

TMHMM predicted no transmembrane helix, which is correct according to UniProt. The predicted cellular location (outside) is also correct.

LAMP1_HUMAN

Position	Feature Name
1 - 10	inside
11 - 33	TMhelix
34 - 383	outside
384 - 406	TMhelix
407 - 417	inside

TMHMM wrongly predicted a first transmembrane helix (positions 11 - 33), which does not exist according to UniProt. The second predicted transmembrane helix, however, is correct. As a consequence of the wrongly predicted first helix, the N-terminal end was predicted to be inside (cytosol), although it is actually outside (lumenal).

A4_HUMAN

Position	Feature Name
1 - 700	outside
701 - 723	TMhelix
724 - 770	inside

TMHMM predicted all transmembrane helices and the topology of this protein correctly.

Phobius

Details of the method

Author: Käll, Krogh, Sonnhammer

Year: 2004

Reference: PubMed

Description

Figure 4: HMM architecture of Phobius Disclaimer: This file is redistributed from [Käll L, Krogh A, Sonnhammer EL. A combined transmembrane topology and signal peptide prediction method. J Mol Biol. 2004 May 14;338(5):1027-36.] . All rights belong to the creator.

Phobius is an HMM-based method for predicting transmembrane helices as well as N-terminal signal peptides. More precisely, it merges the two HMMs of TMHMM and SignalP into a single HMM. This was done to overcome a common problem of transmembrane helix prediction: signal peptides are often wrongly predicted as transmembrane helices. The complete architecture can be seen in figure 4.

Predicted features

Phobius predicts transmembrane helices, signal peptides and the topology of the loops (whether they are inside the cytoplasm or not).

Required information for the prediction

Users only have to enter the amino acid sequence of their query protein in FASTA format.

Execution

We used the default parameters of Phobius and submitted the sequences of all requested proteins as one FASTA file, as shown in figure 5.

Figure 5: A screenshot of the input form of Phobius for PAH.

Results and discussion

Figure 6: A screenshot of the output of Phobius for PAH.

PAH

Phobius predicted the protein to be non-cytoplasmic, which is wrong. According to UniProt, PAH is located in the cytoplasm.

BACR_HALSA

Phobius predicted all transmembrane helices as well as the topology of BACR_HALSA correctly. Only the boundaries differed slightly from the ones annotated in UniProt.

RET4_HUMAN

The signal peptide as well as the topology was predicted correctly by Phobius.

INSL5_HUMAN

The signal peptide as well as the topology was predicted correctly by Phobius.

LAMP1_HUMAN

The signal peptide, the transmembrane helices and the topology were predicted correctly by Phobius.

A4_HUMAN

The signal peptide, the transmembrane helices and the topology were predicted correctly by Phobius.

PolyPhobius

Details of the method

Author: Käll L, Krogh A, Sonnhammer EL

Year: 2005

Reference: PubMed

Description

PolyPhobius is also based on an HMM that constrains the possible transitions from one state to another in order to reflect the 'grammar' of transmembrane proteins. The difference from the ordinary Phobius is that it additionally uses information from sequences homologous to the query sequence to make the prediction more accurate.

To do so, it calculates the posterior label probability (PLP) for each label (e.g. transmembrane helix, in, out) at each sequence position of each homologous sequence. The PLP is defined as "the probability of a label at a certain position in the sequence, given the sequence and the model" (quoted from "Käll L, Krogh A, Sonnhammer EL. An HMM posterior decoder for sequence feature prediction that includes homology information Bioinformatics. 2005 Jun;21 Suppl 1:i251-7."). Then a multiple sequence alignment (MSA) of all homologous sequences is built, and for each position in the MSA an average PLP is calculated. This average PLP is then used by the optimal accuracy algorithm to predict the most likely sequence of states for the query sequence and thus the topology of the transmembrane helices.
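The averaging step can be sketched as follows. This is only an illustration under simplifying assumptions: the per-residue PLPs are taken as given (in PolyPhobius they come from the HMM), gaps are simply skipped, the two toy homologs and their probabilities are invented, and the optimal accuracy decoding that follows is not shown.

```python
from collections import defaultdict


def average_plp(aligned_homologs):
    """Average per-label probabilities column by column over an alignment.

    aligned_homologs: list of (aligned_sequence, plps) pairs, where plps holds
    one {label: probability} dict per ungapped residue of that sequence.
    """
    n_columns = len(aligned_homologs[0][0])
    averaged = []
    for column in range(n_columns):
        sums = defaultdict(float)
        contributing = 0
        for aligned_sequence, plps in aligned_homologs:
            if aligned_sequence[column] == "-":
                continue  # gapped sequences do not contribute to this column
            residue_index = column - aligned_sequence[:column].count("-")
            for label, probability in plps[residue_index].items():
                sums[label] += probability
            contributing += 1
        if contributing:
            averaged.append({label: total / contributing for label, total in sums.items()})
        else:
            averaged.append({})
    return averaged


# Two invented homologs with labels 'i' (inside) and 'M' (membrane).
homolog_1 = ("MK-L", [{"i": 0.75, "M": 0.25}, {"i": 0.5, "M": 0.5}, {"i": 0.25, "M": 0.75}])
homolog_2 = ("MKAL", [{"i": 0.25, "M": 0.75}, {"i": 0.5, "M": 0.5},
                      {"i": 0.0, "M": 1.0}, {"i": 0.25, "M": 0.75}])

averaged = average_plp([homolog_1, homolog_2])
for column, plp in enumerate(averaged):
    print(column, plp)
```

In the third column only the second homolog has a residue, so its PLP is used unchanged; how PolyPhobius itself treats gapped columns is not covered here.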

Predicted features

This method predicts the same features as the ordinary Phobius, which means transmembrane helices, the signal peptide and whether the connecting loops of transmembrane helices are inside or outside.

Required information for the prediction

Users need the amino acid sequence of their protein in FASTA format. Optionally, the homologous sequences can be specified manually. If that is not done, PolyPhobius searches for homologous sequences itself using BLAST.

Execution

We used the default parameters of PolyPhobius and submitted the sequences of all requested proteins as one FASTA file, as shown in figure 7.

Figure 7: A screenshot of the input form of PolyPhobius for PAH.

Results and discussion

Figure 8: A screenshot of the output of PolyPhobius for PAH.

PAH

PolyPhobius predicted the protein to be non-cytoplasmic, which is wrong. According to UniProt, PAH is located in the cytoplasm.

BACR_HALSA

PolyPhobius predicted all transmembrane helices as well as the topology of BACR_HALSA correctly. Only the boundaries differed slightly from the ones annotated in UniProt.

RET4_HUMAN

The signal peptide as well as the topology was predicted correctly by PolyPhobius.

INSL5_HUMAN

The signal peptide as well as the topology was predicted correctly by PolyPhobius.

LAMP1_HUMAN

The signal peptide, the transmembrane helices and the topology were predicted correctly by PolyPhobius.

A4_HUMAN

The signal peptide, the transmembrane helices and the topology were predicted correctly by PolyPhobius.

OCTOPUS

Details of the method

Author: Viklund H, Elofsson A.

Year: 2008

Reference: Bioinformatics

Description

Figure 9: Flowchart of OCTOPUS Disclaimer: This file is redistributed from [Viklund H, Elofsson A. OCTOPUS: improving topology prediction by two-track ANN-based preference scores and an extended topological grammar. Bioinformatics. 2008 Aug 1;24(15):1662-8. Epub 2008 May 12.] . All rights belong to the creator.

OCTOPUS combines two methods to predict the topology of transmembrane proteins: artificial neural networks (ANNs) and hidden Markov models (HMMs). In a first step, BLAST searches for homologous sequences of an input FASTA sequence. From the homologous sequences found, a multiple sequence alignment is built, from which a raw sequence profile and a PSSM-based sequence profile are extracted. These profiles serve as input for two sets of ANNs.

The first set contains four separate ANNs, which predict the residue preferences for M (membrane), I (interface), L (loop) and G (globular). To smooth the predictions for G and M, the output of this first layer of ANNs is used as input for a second ANN. The second set of ANNs predicts the residue preferences for inside/outside residues.

Finally, the output of these two sets of ANNs is used to parameterize the OCTOPUS HMM for the actual topological feature prediction. This HMM is needed to model the 'grammar' of transmembrane proteins, which simply means that only certain state transitions are allowed. For example, from the transmembrane state the model may only move on to a loop state, and so on.

The state sequence that best fits the input sequence is then calculated with the Viterbi algorithm.
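For readers unfamiliar with Viterbi decoding, here is a minimal sketch on a toy three-state model. The states, the two-letter alphabet (H for hydrophobic, P for polar) and all probabilities are invented for illustration; OCTOPUS derives its parameters from the ANN outputs and uses a far larger state space.

```python
import math


def _log(probability):
    """Logarithm that maps a probability of zero to minus infinity."""
    return math.log(probability) if probability > 0 else float("-inf")


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observations (log space)."""
    score = [{s: _log(start_p.get(s, 0)) + _log(emit_p[s].get(observations[0], 0))
              for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        score.append({})
        back.append({})
        for s in states:
            # Missing entries in trans_p are forbidden transitions (probability 0).
            best_prev = max(states, key=lambda p: score[t - 1][p] + _log(trans_p[p].get(s, 0)))
            best = score[t - 1][best_prev] + _log(trans_p[best_prev].get(s, 0))
            score[t][s] = best + _log(emit_p[s].get(observations[t], 0))
            back[t][s] = best_prev
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy model: inside loop (i), membrane (M), outside loop (o).
states = ["i", "M", "o"]
start_p = {"i": 0.6, "o": 0.4}
trans_p = {
    "i": {"i": 0.8, "M": 0.2},             # no direct i -> o transition
    "M": {"M": 0.8, "i": 0.05, "o": 0.15},
    "o": {"o": 0.8, "M": 0.2},             # no direct o -> i transition
}
emit_p = {
    "i": {"P": 0.9, "H": 0.1},
    "M": {"H": 0.9, "P": 0.1},
    "o": {"P": 0.9, "H": 0.1},
}

path = viterbi("PPHHHHPP", states, start_p, trans_p, emit_p)
print("".join(path))
```

The forbidden transitions are what encode the grammar: a loop on one side can only reach the other side by passing through the membrane state.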

Predicted features

Predicted features are inside/outside (i/o), transmembrane (M), TM hairpin (H), reentrant (R) and membrane dip (D).

Required information for the prediction

Only the amino acid sequence of the user's protein is required.

Execution

For each protein we entered the amino acid sequence in FASTA format. The corresponding input form is shown in figure 10.

Figure 10: A screenshot of the input form of OCTOPUS for PAH.

Results and discussion

PAH

Figure 11: A screenshot of the output of OCTOPUS for PAH.

No transmembrane helix was predicted. This is correct according to the UniProt annotations.

BACR_HALSA

Figure 12: A screenshot of the output of OCTOPUS for BACR_HALSA.

Octopus predicted all transmembrane helices as well as the topology of BACR_HALSA correctly.

RET4_HUMAN

Figure 14: A screenshot of the output of OCTOPUS for RET4_HUMAN.

Octopus predicted the signal peptide to be a transmembrane helix which is incorrect. However, the topology (outside) was predicted correctly.

INSL5_HUMAN

Figure 15: A screenshot of the output of OCTOPUS for INSL5_HUMAN.

Octopus predicted the signal peptide to be a transmembrane helix which is incorrect. However, the topology (outside) was predicted correctly.

LAMP1_HUMAN

Figure 16: A screenshot of the output of OCTOPUS for LAMP1_HUMAN.

OCTOPUS predicted a transmembrane helix close to the N-terminal end, which is not correct. Only the second predicted transmembrane helix and the topology are correct.

A4_HUMAN

Figure 17: A screenshot of the output of OCTOPUS for A4_HUMAN.

OCTOPUS predicted a reentrant/dip region close to the N-terminal end, which is not annotated by UniProt. However, the rest is correct (second predicted transmembrane helix and overall topology).

SPOCTOPUS

Details of the method

Author: Viklund H, Bernsel A, Skwark M, Elofsson A.

Year: 2008

Reference: Bioinformatics

Description

SPOCTOPUS works the same way as OCTOPUS does. The only difference is that it includes a signal peptide prediction.

Predicted features

Predicted features are signal peptide, inside/outside (i/o), transmembrane (M), TM hairpin (H), reentrant (R) and membrane dip (D).

Required information for the prediction

Only the amino acid sequence of the query protein is required as input.

Execution

For each protein we entered the amino acid sequence in FASTA format. Figure 18 shows the procedure for PAH.

Figure 18: A screenshot of the input form of SPOCTOPUS for PAH.

Results and discussion

PAH

Figure 19: A screenshot of the output of SPOCTOPUS for PAH.

No transmembrane helix was predicted. This is correct according to the UniProt annotations.

BACR_HALSA

Figure 20: A screenshot of the output of SPOCTOPUS for BACR_HALSA.

Spoctopus predicted all transmembrane helices as well as the topology of BACR_HALSA correctly.

RET4_HUMAN

Figure 21: A screenshot of the output of SPOCTOPUS for RET4_HUMAN.

Spoctopus predicted the signal peptide as well as the topology correctly.

INSL5_HUMAN

Figure 22: A screenshot of the output of SPOCTOPUS for INSL5_HUMAN.

Spoctopus predicted the signal peptide as well as the topology correctly.

LAMP1_HUMAN

Figure 23: A screenshot of the output of SPOCTOPUS for LAMP1_HUMAN.

Spoctopus predicted the signal peptide, the transmembrane helix as well as the topology correctly.

A4_HUMAN

Figure 24: A screenshot of the output of SPOCTOPUS for A4_HUMAN.

Spoctopus predicted the signal peptide, the transmembrane helix as well as the topology correctly.

SignalP

Details of the method

Author: Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von Heijne.

Year: 1997

Reference: PubMed

Description

This predictor combines two methods: the first is a neural network, the second a hidden Markov model.

There are two neural networks: one predicts whether the first n amino acids belong to a signal peptide, and the other predicts the exact position of the cleavage site.

In a later version of SignalP, a hidden Markov model (HMM) was also built to predict signal peptides. This prediction is completely independent of the neural network prediction. The HMM models the N-terminal region of a signal peptide as well as the region surrounding the cleavage site.

Predicted features

Predicts the presence of signal peptidase I cleavage sites and whether the first n residues belong to a signal peptide.

Required information for the prediction

The amino acid sequence of the protein and whether this protein is from a eukaryote, gram-negative bacteria or gram-positive bacteria.

Execution

Before we could execute SignalP on our virtual machine we had to change the path of the signalp file to /apps/signalp-3.0

Then we executed for each protein the following commands:

signalp -format short -t euk PAH.fa > task_33/signalp_pah_out
signalp -format short -t euk A4_HUMAN.fa > task_33/signalp_a4_human_out
signalp -format short -t gram- BACR_HALSA.fa > task_33/signalp_bacr_halsa_out
signalp -format short -t euk LAMP1_HUMAN.fa > task_33/signalp_lamp1_human_out
signalp -format short -t euk RET4_HUMAN.fa > task_33/signalp_ret4_human_out
signalp -format short -t euk INSL5_HUMAN.fa > task_33/signalp_insl5_human_out
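The six calls above differ only in the organism type and the file names, so they can be generated from a small mapping. The following is a minimal sketch that only builds the command lines (it does not run SignalP); the helper name `signalp_command` is made up, and the flags and paths are exactly those used in the commands above.

```python
# Organism type passed to -t for each protein, as in the commands above.
ORGANISM_TYPE = {
    "PAH": "euk",
    "A4_HUMAN": "euk",
    "BACR_HALSA": "gram-",
    "LAMP1_HUMAN": "euk",
    "RET4_HUMAN": "euk",
    "INSL5_HUMAN": "euk",
}


def signalp_command(protein, out_dir="task_33"):
    """Return the argument list and the output path for one protein."""
    argv = ["signalp", "-format", "short", "-t", ORGANISM_TYPE[protein], f"{protein}.fa"]
    out_path = f"{out_dir}/signalp_{protein.lower()}_out"
    return argv, out_path


for protein in ORGANISM_TYPE:
    argv, out_path = signalp_command(protein)
    print(" ".join(argv), ">", out_path)
```

The argument lists could be handed to `subprocess.run` with the output file opened as `stdout` on a machine where SignalP is installed.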

Results and discussion

PAH

Both SignalP methods (HMM and NN) predicted that there is no cleavage site for a signal peptide, which is correct with respect to the annotations in UniProt.

BACR_HALSA

Both SignalP methods (HMM and NN) predicted that there is no cleavage site for a signal peptide, which is correct with respect to the annotations in UniProt.

RET4_HUMAN

Both SignalP methods (HMM and NN) predicted the cleavage site at position 19, which is correct with respect to the annotations in UniProt.

INSL5_HUMAN

Both SignalP methods (HMM and NN) predicted the cleavage site at position 23, which is correct with respect to the annotations in UniProt.

LAMP1_HUMAN

Both SignalP methods (HMM and NN) predicted the cleavage site at position 29, which is correct with respect to the annotations in UniProt.

A4_HUMAN

Both SignalP methods (HMM and NN) predicted the cleavage site at position 18, which is correct with respect to the annotations in UniProt.

TargetP

Details of the method

Author: Olof Emanuelsson, Henrik Nielsen, Søren Brunak and Gunnar von Heijne.

Year: 2000

Reference: PubMed

Description

Figure 25: Architecture of TargetP Disclaimer: This file is redistributed from [Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 2000 Jul 21;300(4):1005-16.] . All rights belong to the creator.

TargetP's predictions are based on trained neural networks arranged in two layers. The first layer consists of three neural networks, which predict whether the sequence contains a chloroplast targeting sequence, a mitochondrial targeting sequence or a signal peptide. The output of this first layer is then used as input for the second-layer neural network, which makes the final prediction. A decision unit then checks whether the cutoffs are met. The output is one of four classes (cTP, mTP, SP or other) and a reliability class (RC) value, which indicates the certainty of the prediction.

For non-plant proteins the cTP prediction is not applied, since these organisms have no chloroplasts.
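The final decision step can be sketched roughly as follows. This is an illustration, not TargetP's implementation: the function name `decide` and the example scores are invented, user-defined cutoffs are not modelled, and the RC thresholds (based on the difference between the two highest scores) are an assumption that should be verified against the TargetP output description.

```python
def decide(scores, is_plant):
    """Pick the class with the highest score and derive a reliability class.

    scores: dict with the network outputs for 'cTP', 'mTP', 'SP' and 'other'.
    RC 1 is the most reliable, RC 5 the least (assumed thresholds).
    """
    candidates = dict(scores)
    if not is_plant:
        candidates.pop("cTP", None)  # no chloroplast class for non-plant proteins
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    best_label, best_score = ranked[0]
    second_score = ranked[1][1]
    difference = best_score - second_score
    if difference > 0.8:
        reliability_class = 1
    elif difference > 0.6:
        reliability_class = 2
    elif difference > 0.4:
        reliability_class = 3
    elif difference > 0.2:
        reliability_class = 4
    else:
        reliability_class = 5
    return best_label, reliability_class


# Invented scores for a non-plant protein with a clear signal peptide.
print(decide({"mTP": 0.05, "SP": 0.9, "other": 0.05}, is_plant=False))
# Invented scores where the highest value (cTP) is ignored for a non-plant protein.
print(decide({"cTP": 0.9, "mTP": 0.3, "SP": 0.1, "other": 0.2}, is_plant=False))
```

A large gap between the best and the second-best score gives a low RC number, which is read as a more trustworthy prediction.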

Predicted features

Predicts the localization to the following targets: chloroplast, mitochondrion, ER/golgi/secreted, and "other".

Required information for the prediction

The amino acid sequence of the protein and whether this protein is from a plant or non-plant organism.

Execution

We used the default parameters of TargetP and submitted all sequences in one FASTA file to the prediction server.

Figure 26: A screenshot of the input form of TargetP for the submission of a collection of sequences (PAH, BACR_HALSA, RET4_HUMAN, INSL5_HUMAN, LAMP1_HUMAN, A4_HUMAN).

Results and discussion

Figure 27: A screenshot of the output of TargetP for a collection of sequences (PAH, BACR_HALSA, RET4_HUMAN, INSL5_HUMAN, LAMP1_HUMAN, A4_HUMAN).

PAH

TargetP predicted no targeting peptide. This is correct with respect to the annotations from UniProt.

BACR_HALSA

TargetP predicted a signal peptide of length 116 and that the protein will be secreted. This is incorrect with respect to the annotations from UniProt, which list no targeting peptide and state that the protein is integrated in the cytoplasmic membrane.

RET4_HUMAN

TargetP predicted a signal peptide of length 18 and that the protein will be secreted. This is correct with respect to the annotations from UniProt.

INSL5_HUMAN

TargetP predicted a signal peptide of length 22 and that the protein will be secreted. This is correct with respect to the annotations from UniProt.

LAMP1_HUMAN

TargetP predicted a signal peptide of length 28 and that the protein will be secreted. This is correct with respect to the annotations from UniProt.

A4_HUMAN

TargetP predicted a signal peptide of length 17 and that the protein will be secreted. This is correct with respect to the annotations from UniProt.

Task 3.4: Prediction of GO terms

Annotated sequence features

PAH

The phenylalanine-4-hydroxylase has the following annotated GO terms:

Class	GO Identifier	GO Name
Function	GO:0003824	catalytic activity
Function	GO:0004497	monooxygenase activity
Function	GO:0004505	phenylalanine 4-monooxygenase activity
Function	GO:0005506	iron ion binding
Component	GO:0005829	cytosol
Process	GO:0006558	L-phenylalanine metabolic process
Process	GO:0006559	L-phenylalanine catabolic process
Process	GO:0006571	tyrosine biosynthetic process
Process	GO:0008152	metabolic process
Process	GO:0008652	cellular amino acid biosynthetic process
Process	GO:0009072	aromatic amino acid family metabolic process
Function	GO:0016491	oxidoreductase activity
Function	GO:0016597	amino acid binding
Function	GO:0016714	oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, reduced pteridine as one donor, and incorporation of one atom of oxygen
Process	GO:0018126	protein hydroxylation
Process	GO:0034641	cellular nitrogen compound metabolic process
Process	GO:0042136	neurotransmitter biosynthetic process
Process	GO:0042423	catecholamine biosynthetic process
Process	GO:0042558	pteridine-containing compound metabolic process
Function	GO:0042803	protein homodimerization activity
Process	GO:0046146	tetrahydrobiopterin metabolic process
Function	GO:0046872	metal ion binding
Function	GO:0048037	cofactor binding
Process	GO:0055114	oxidation-reduction process

BACR_HALSA

The bacteriorhodopsin has the following annotated GO terms:

Class	GO Identifier	GO Name
Function	GO:0004872	receptor activity
Function	GO:0005216	ion channel activity
Component	GO:0005886	plasma membrane
Process	GO:0006810	transport
Process	GO:0006811	ion transport
Process	GO:0007602	phototransduction
Function	GO:0009881	photoreceptor activity
Process	GO:0015992	proton transport
Component	GO:0016020	membrane
Component	GO:0016021	integral to membrane
Process	GO:0018298	protein-chromophore linkage
Process	GO:0050896	response to stimulus

RET4_HUMAN

The retinol-binding protein 4 has the following annotated GO terms:

Class	GO Identifier	GO Name
Process	GO:0001654	eye development
Function	GO:0005215	transporter activity
Function	GO:0005488	binding
Function	GO:0005501	retinoid binding
Function	GO:0005515	protein binding
Component	GO:0005576	extracellular region
Component	GO:0005615	extracellular space
Process	GO:0006094	gluconeogenesis
Process	GO:0006810	transport
Process	GO:0007283	spermatogenesis
Process	GO:0007507	heart development
Process	GO:0007601	visual perception
Process	GO:0008584	male gonad development
Process	GO:0009790	embryo development
Function	GO:0016918	retinal binding
Function	GO:0019841	retinol binding
Process	GO:0030277	maintenance of gastrointestinal epithelium
Process	GO:0030324	lung development
Process	GO:0032024	positive regulation of insulin secretion
Process	GO:0032526	response to retinoic acid
Process	GO:0032868	response to insulin stimulus
Function	GO:0034632	retinol transporter activity
Process	GO:0034633	retinol transport
Process	GO:0042572	retinol metabolic process
Process	GO:0042574	retinal metabolic process
Process	GO:0042593	glucose homeostasis
Process	GO:0045471	response to ethanol
Process	GO:0048562	embryonic organ morphogenesis
Process	GO:0048706	embryonic skeletal system development
Process	GO:0048738	cardiac muscle tissue development
Process	GO:0048807	female genitalia morphogenesis
Process	GO:0050896	response to stimulus
Process	GO:0050908	detection of light stimulus involved in visual perception
Process	GO:0051024	positive regulation of immunoglobulin secretion
Process	GO:0060041	retina development in camera-type eye
Process	GO:0060044	negative regulation of cardiac muscle cell proliferation
Process	GO:0060059	embryonic retina morphogenesis in camera-type eye
Process	GO:0060065	uterus development
Process	GO:0060068	vagina development
Process	GO:0060157	urinary bladder development
Process	GO:0060347	heart trabecula formation

INSL5_HUMAN

The insulin-like peptide INSL5 has the following annotated GO terms:

Class	GO Identifier	GO Name
Function	GO:0005179	hormone activity
Component	GO:0005575	cellular_component
Component	GO:0005576	extracellular region
Process	GO:0008150	biological_process

LAMP1_HUMAN

The lysosome-associated membrane glycoprotein 1 has the following annotated GO terms:

Class	GO Identifier	GO Name
Component	GO:0005624	membrane fraction
Component	GO:0005764	lysosome
Component	GO:0005765	lysosomal membrane
Component	GO:0005768	endosome
Component	GO:0005770	late endosome
Component	GO:0005771	multivesicular body
Component	GO:0005886	plasma membrane
Component	GO:0005887	integral to plasma membrane
Process	GO:0006914	autophagy
Component	GO:0009897	external side of plasma membrane
Component	GO:0009986	cell surface
Component	GO:0010008	endosome membrane
Component	GO:0016020	membrane
Component	GO:0016021	integral to membrane
Component	GO:0031982	vesicle
Component	GO:0042383	sarcolemma
Component	GO:0042470	melanosome

A4_HUMAN

The amyloid beta A4 protein has the following annotated GO terms:

Class	GO Identifier	GO Name
Process	GO:0000085	G2 phase of mitotic cell cycle
Process	GO:0001967	suckling behavior
Process	GO:0002576	platelet degranulation
Function	GO:0003677	DNA binding
Function	GO:0004867	serine-type endopeptidase inhibitor activity
Function	GO:0005102	receptor binding
Function	GO:0005488	binding
Function	GO:0005515	protein binding
Component	GO:0005576	extracellular region
Component	GO:0005624	membrane fraction
Component	GO:0005737	cytoplasm
Component	GO:0005794	Golgi apparatus
Component	GO:0005886	plasma membrane
Component	GO:0005887	integral to plasma membrane
Component	GO:0005905	coated pit
Process	GO:0006378	mRNA polyadenylation
Process	GO:0006417	regulation of translation
Process	GO:0006468	protein phosphorylation
Process	GO:0006878	cellular copper ion homeostasis
Process	GO:0006897	endocytosis
Process	GO:0006915	apoptosis
Process	GO:0006917	induction of apoptosis
Process	GO:0007155	cell adhesion
Process	GO:0007176	regulation of epidermal growth factor receptor activity
Process	GO:0007219	Notch signaling pathway
Process	GO:0007409	axonogenesis
Process	GO:0007596	blood coagulation
Process	GO:0007617	mating behavior
Process	GO:0007626	locomotory behavior
Process	GO:0008088	axon cargo transport
Function	GO:0008201	heparin binding
Process	GO:0008219	cell death
Process	GO:0008344	adult locomotory behavior
Process	GO:0008542	visual learning
Component	GO:0009986	cell surface
Process	GO:0010466	negative regulation of peptidase activity
Process	GO:0010952	positive regulation of peptidase activity
Component	GO:0016020	membrane
Component	GO:0016021	integral to membrane
Process	GO:0016199	axon midline choice point recognition
Process	GO:0016322	neuron remodeling
Process	GO:0016358	dendrite development
Function	GO:0016504	peptidase activator activity
Component	GO:0019717	synaptosome
Process	GO:0030168	platelet activation
Process	GO:0030198	extracellular matrix organization
Function	GO:0030414	peptidase inhibitor activity
Component	GO:0030424	axon
Process	GO:0030900	forebrain development
Component	GO:0031093	platelet alpha granule lumen
Process	GO:0031175	neuron projection development
Component	GO:0031410	cytoplasmic vesicle
Component	GO:0031594	neuromuscular junction
Function	GO:0033130	acetylcholine receptor binding
Process	GO:0035235	ionotropic glutamate receptor signaling pathway
Component	GO:0035253	ciliary rootlet
Process	GO:0040014	regulation of multicellular organism growth
Function	GO:0042802	identical protein binding
Component	GO:0043005	neuron projection
Component	GO:0043197	dendritic spine
Component	GO:0043198	dendritic shaft
Component	GO:0043231	intracellular membrane-bounded organelle
Process	GO:0045087	innate immune response
Component	GO:0045177	apical part of cell
Component	GO:0045202	synapse
Process	GO:0045665	negative regulation of neuron differentiation
Process	GO:0045931	positive regulation of mitotic cell cycle
Process	GO:0045944	positive regulation of transcription from RNA polymerase II promoter
Function	GO:0046872	metal ion binding
Component	GO:0048471	perinuclear region of cytoplasm
Process	GO:0048669	collateral sprouting in absence of injury
Process	GO:0050803	regulation of synapse structure and activity
Process	GO:0050885	neuromuscular process controlling balance
Process	GO:0051124	synaptic growth at neuromuscular junction
Component	GO:0051233	spindle midzone
Process	GO:0051402	neuron apoptosis
Function	GO:0051425	PTB domain binding
Process	GO:0051563	smooth endoplasmic reticulum calcium ion homeostasis

GOPET

Details of the method

Author: Vinayagam A, König R, Moormann J, Schubert F, Eils R, Glatting KH, Suhai S

Year: 2004

Reference: PubMed

Description

Flowchart of GOPET Disclaimer: This file is redistributed from [Vinayagam A, del Val C, Schubert F, Eils R, Glatting KH, Suhai S, König R. GOPET: a tool for automated predictions of Gene Ontology terms. BMC Bioinformatics. 2006 Mar 20;7:161.] . All rights belong to the creator.

The prediction of GO terms is based on a support vector machine (SVM). The SVM was trained with 39,740 selected GO-annotated cDNA sequences. For each of these training sequences, all annotated GO terms are extracted. In a next step, homologous sequences are searched with BLAST using an e-value < 0.01. Sequences that fulfil this condition are used to extract attributes, including sequence similarity measures such as e-value, bit score, identity, coverage score and alignment length, as well as GO-term frequency, GO-term relationships between homologues, the level of annotation within the GO hierarchy and the annotation quality of the homologues.

These attributes are then assigned to each GO term found in the training sequence. The training of the SVM is then done by taking the GO term and its associated attributes to train the SVM.

After the training, the SVM is able to predict GO terms for unknown cDNA or protein sequences in the same fashion.

Predicted features

GOPET predicts the GO term together with a confidence value.

Required information for the prediction

The cDNA or amino acid sequence of the protein is required.

Execution

We used the default parameters of GOPET and submitted all requested protein sequences to the server at once.

Results and discussion

PAH

GOPET falsely predicted the following GO terms: GO:0004510, GO:0004511, GO:0008199, GO:0008198. These GO terms are not annotated by UniProt.

BACR_HALSA

GOPET falsely predicted the following GO terms: GO:0008020, GO:0015078. These GO terms are not annotated by UniProt.
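Finding such false positives amounts to a set difference between predicted and annotated terms. A minimal sketch for BACR_HALSA follows; the annotated set is taken from the table above, while the predicted set is only illustrative (the two terms reported above plus one annotated term) and is not GOPET's complete output. The function name `false_positives` is made up for this example.

```python
# GO terms annotated for BACR_HALSA (from the table above).
annotated = {
    "GO:0004872", "GO:0005216", "GO:0005886", "GO:0006810",
    "GO:0006811", "GO:0007602", "GO:0009881", "GO:0015992",
    "GO:0016020", "GO:0016021", "GO:0018298", "GO:0050896",
}

# Illustrative prediction set: not the full GOPET result.
predicted = {"GO:0008020", "GO:0015078", "GO:0016020"}


def false_positives(predicted_terms, annotated_terms):
    """Return the predicted GO terms that are missing from the annotation."""
    return sorted(set(predicted_terms) - set(annotated_terms))


print(false_positives(predicted, annotated))
```

A term counted as false here may still be a parent or child of an annotated term in the GO hierarchy, which a plain set comparison does not take into account.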

RET4_HUMAN

GOPET falsely predicted the following GO terms: GO:0008289, GO:0005319 and GO:0008035. These GO terms are not annotated by UniProt.

INSL5_HUMAN

GOPET predicted no false GO terms.

LAMP1_HUMAN

GOPET falsely predicted the following GO terms: GO:0004812, GO:0005524. These GO terms are not annotated in UniProt.

A4_HUMAN

GOPET falsely predicted the following GO terms: GO:003568, GO:0030304, GO:0030414, GO:0008270, GO:0005507 and GO:0005506. These GO terms are not annotated in UniProt.
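The comparison behind these results boils down to a set difference between the predicted GO terms and the GO terms annotated in UniProt. A minimal sketch, assuming both lists have already been collected by hand: the function name is hypothetical, and the annotated set below is only a placeholder, since the actual UniProt annotation is not listed in this section. Note that the check only shows that a term is missing from the annotation, which is what is counted as a false prediction here.

```python
def split_predictions(predicted, annotated):
    """Split predicted GO terms into annotated and non-annotated ones."""
    predicted = set(predicted)
    annotated = set(annotated)
    confirmed = sorted(predicted & annotated)    # also found in the annotation
    unconfirmed = sorted(predicted - annotated)  # predicted but not annotated
    return confirmed, unconfirmed


# GO terms predicted for PAH, as listed above.
pah_predicted = ["GO:0004510", "GO:0004511", "GO:0008199", "GO:0008198"]
# Placeholder only: the real set would come from the UniProt entry of PAH.
pah_annotated = {"GO:PLACEHOLDER"}

pah_confirmed, pah_unconfirmed = split_predictions(pah_predicted, pah_annotated)
```

With this placeholder annotation, all four predicted terms end up in `pah_unconfirmed`, matching the result reported for PAH.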

Pfam

Details of the method

Author: Wellcome Trust Sanger Institute and Howard Hughes Janelia Farm Research Campus

Year: latest release in March 2011

Reference: Oxford Journals

Description

Pfam is a protein family sequence database. To build a family, a seed alignment of homologous sequences that all belong to the same family is constructed. This alignment is then used to build a profile hidden Markov model (HMM), which represents the family. These profile HMMs can then be used to search a query sequence or a sequence database for significant family matches. The tool used for all of this is HMMER3.

Predicted features

Pfam predicts protein families.

Required information for the prediction

The amino acid sequence of the protein.

Execution

We used the standard parameters for Pfam and submitted all sequences in one FASTA file to the prediction server.

Results and discussion

PAH

Pfam predicted the ACT domain from position 36 to 88. This domain is also annotated in UniProt, although the predicted position is not correct: the annotated position is from 35 to 110. In addition, Pfam predicted a second domain, Biopterin_H, which is not annotated in UniProt.

BACR_HALSA

Pfam predicted the Bacteriorhodopsin domain from position 23 to 253. This domain is also annotated in UniProt, although the predicted position is not correct: the annotated position is from 14 to 262. In addition, Pfam predicted a second domain, DUF21, which is not annotated in UniProt.

RET4_HUMAN

Pfam predicted the domains DspF and Lipocalin from 12 to 60 and from 39 to 173, respectively. However, these domains are not annotated in UniProt. In addition, Pfam was not able to predict the domains Retinol-binding protein 4 and Plasma retinol-binding protein, which are annotated in UniProt.

INSL5_HUMAN

Pfam predicted the domain Insulin from 27 to 135. This seems to be a correct prediction, since INSL5_HUMAN is an insulin protein. However, the domains annotated in UniProt are Insulin-like peptide INSL5 B chain and Insulin-like peptide INSL5 A chain.

LAMP1_HUMAN

Pfam predicted the domain Lamp twice, from 29 to 109 and from 111 to 417, as two separate domains. In UniProt, these are annotated as a single domain. In addition, Pfam predicted the domain DUF1180, which is not annotated in UniProt.

A4_HUMAN

Pfam predicted the domains APP_N, Beta-App and APP_amyloid. These domains appear to be annotated in UniProt, although the predicted positions differ considerably from the annotated ones. In addition, Pfam predicted the following domains, which are not annotated in UniProt: APP_Cu_bd, Kunitz_BPTI, APP_E2, Exonuc_VII_L and Activator_TraM.
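How far a predicted domain position deviates from the annotated one can be quantified by the overlap of the two residue ranges. The following is a small sketch of such a check, not part of Pfam itself: the function names are hypothetical, positions are treated as 1-based and inclusive, and the example ranges are the PAH and BACR_HALSA values reported above.

```python
def overlap_length(a, b):
    """Number of residues shared by two inclusive (start, end) ranges."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def annotated_coverage(predicted, annotated):
    """Fraction of the annotated range that the predicted range covers."""
    annotated_length = annotated[1] - annotated[0] + 1
    return overlap_length(predicted, annotated) / annotated_length


# ACT domain of PAH: predicted 36-88, annotated 35-110.
act_coverage = annotated_coverage((36, 88), (35, 110))
# Bacteriorhodopsin domain of BACR_HALSA: predicted 23-253, annotated 14-262.
bacr_coverage = annotated_coverage((23, 253), (14, 262))
```

For the ACT domain the predicted range covers 53 of the 76 annotated residues. For the Bacteriorhodopsin domain it covers 231 of 249, so the second prediction is considerably closer to the annotation.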

ProtFun 2.2

Details of the method

Author: L. Juhl Jensen, R. Gupta, N. Blom, D. Devos, J. Tamames, C. Kesmir, H. Nielsen, H. H. Stærfeldt, K. Rapacki, C. Workman, C. A. F. Andersen, S. Knudsen, A. Krogh, A. Valencia and S. Brunak.

Year: 2002

Reference: PubMed

Description

The prediction of GO terms is based on a neural network. The training set was obtained by looking up protein families and their assigned GO terms in the InterPro database and then mapping these InterPro domain matches to SWISS-PROT and TrEMBL to obtain the actual sequences. To avoid over-fitting, a homology reduction was performed afterwards. Then a set of 16 features was derived for each sequence, including propeptide cleavage site predictions and subcellular compartment predictions from TargetP.

The neural network was then trained to find the best weight for each feature and GO term. However, after extensive training the authors found that the method gives reliable predictions for only 14 GO categories, and thus only these are predicted by the neural network.

Predicted features

ProtFun predicts the cellular role, whether the protein is an enzyme or not, the enzyme class and the Gene Ontology category. The predicted Gene Ontology categories are:

Signal transducer
Receptor
Hormone
Structural protein
Transporter
Ion channel
Voltage-gated ion channel
Cation channel
Transcription
Transcription regulation
Stress response
Immune response
Growth factor
Metal ion transport

Required information for the prediction

Only the amino acid sequence of the protein is required.

Execution

We used the standard parameters and submitted all requested sequences at once.

Results and discussion

PAH

ProtFun was not able to predict any of the 14 Gene Ontology categories with significance (no arrow). It did, however, predict the protein to be involved in amino acid biosynthesis and to be an enzyme, which is correct. On the other hand, it falsely predicted PAH to be an isomerase.

BACR_HALSA

ProtFun predicted BACR_HALSA to be involved in transport and binding, which is correct. The second prediction, that it is not an enzyme, is also correct. Lastly, the third prediction, ion channel, is also correct. Thus, all three significant predictions made by ProtFun are correct.

RET4_HUMAN

ProtFun predicted the functional category cell envelope, for which I did not find any evidence in UniProt. Second, it predicted the protein to be an enzyme, which is also not correct, since it is a transporter protein. Third, it predicted that this protein belongs to the enzyme class lyase, for which I also found no evidence. However, the Gene Ontology category immune response seems to be predicted correctly.

INSL5_HUMAN

ProtFun predicted the functional category cell envelope, for which I did not find any evidence in UniProt. The prediction that the protein is not an enzyme seems correct to me. The predicted Gene Ontology category was hormone, which is also correct.

LAMP1_HUMAN

ProtFun predicted the functional category cell envelope, for which I did not find any evidence in UniProt. The prediction that the protein is not an enzyme seems correct to me. The predicted Gene Ontology category was immune response, which is also correct.

A4_HUMAN

The predicted functional category was cell envelope, for which I did not find any evidence. Second, ProtFun predicted this protein to be an enzyme, which is not correct: UniProt describes this protein as a receptor. Third, it predicted the protein to be a structural protein, which is also not correct.

@@ Line 4,708: / Line 4,708: @@
 We used the standard parameter of TargetP and submitted all sequences via one fasta file to the prediction server.
-[[File:Targetp in.png|thumb|center| 1000 px | '''Figure 26:''' A screenshot of the input form of TargetP for PAH.]]
+[[File:Targetp in.png|thumb|center| 1000 px | '''Figure 26:''' A screenshot of the input form of TargetP for the submission of a collection of sequences (PAH, BACR_HALSA, RET4_HUMAN, INSL5_HUMAN, LAMP1_HUMAN, A4_HUMAN).]]
 ==== Results and discussion ====
-[[File:Targetp out.png|thumb|center| 1000 px | '''Figure 27:''' A screenshot of the output of TargetP for PAH.]]
+[[File:Targetp out.png|thumb|center| 1000 px | '''Figure 27:''' A screenshot of the output of TargetP for a collection of sequences (PAH, BACR_HALSA, RET4_HUMAN, INSL5_HUMAN, LAMP1_HUMAN, A4_HUMAN).]]
 ===== PAH =====

Reference	MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAK
1PHZ	------------------GQETSYIEDNSNQNGAISLIFSLKEEVGALAK
DSSP	--------------------------------------------------
PSIPRED	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEEEEEECCCCCHHHHH
JPred	--HHHH--HHHHHHHHHH---------------EEEEEEEE----HHHHH
Reference	VLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRH
1PHZ	VLRLFEENDINLTHIESRPSRLNKDEYEFFTYLDKRTKPVLGSIIKSLRN
DSSP	--------------------------------------------------
PSIPRED	HHHHHHHCCCCEEEEECCCCCCCCCCEEEEEECCCCCCHHHHHHHHHHCC
JPred	HHHHHHH---EEEEEE----------EEEEEEEE---HHHHHHHHHHHHH
Reference	DIGATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFK
1PHZ	DIGATVHELSRDKEKNTVPWFPRTIQELDRFANQI------LDADHPGFK
DSSP	-----------------.....SBGGGGGGTT.S.------..TTSTTTT
PSIPRED	CCEEEECCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHCCCCCCCCCCCCC
JPred	H-----EEE----------------HHHHHH---EEE-------------
Reference	DPVYRARRKQFADIAYNYRHGQPIPRVEYMEEEKKTWGTVFKTLKSLYKT
1PHZ	DPVYRARRKQFADIAYNYRHGQPIPRVEYTEEEKQTWGTVFRTLKALYKT
DSSP	.HHHHHHHHHHHHHHHH..TTS........HHHHHHHHHHHHHHHHHHHH
PSIPRED	CHHHHHHHHHHHHHHHCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHCC
JPred	-HHHHHHHHHHHH-----------------HHHHHHHHHHHHHHHHH---
Reference	HACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLS
1PHZ	HACYEHNHIFPLLEKYCGFREDNIPQLEDVSQFLQTCTGFRLRPVAGLLS
DSSP	HB.HHHHHHHHHHHHHS..BTTB...HHHHHHHHHHHT..EEEE.SS...
PSIPRED	CHHHHHHHHHHHHHHHCCCCCCCCCCHHHHHHHHHHHHCCEEEECCCCCC
JPred	--HHHHHHHHHHHHHH----------HHHHHHHHHHH---EEEE------
Reference	SRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
1PHZ	SRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
DSSP	HHHHHHHHTTTEEEE......TT.TT..SS..HHHHHTTTTTTTTSHHHH
PSIPRED	HHHHHHHCCCCEECCCEEEECCCCCCCCCCCCHHHHHHCCCCCCCCCHHH
JPred	HHHHHHHH----EEEEEEE-----------HHHHHHHH--------HHHH
Reference	QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSS
1PHZ	QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKEGDSIKAYGAGLLSS
DSSP	HHHHHHHHHHTT..HHHHHHHHHHHHTTTTT.EEEETTEEEE..HHHHT.
PSIPRED	HHHHHHHHHCCCCCHHHHHHHHHHEEEEEEEEEECCCCCEEEECCCCCCC
JPred	HHHHHHHHHHH---HHHHHHHHH-HHHEEEEEEEEE---EEEEE------
Reference	FGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVR
1PHZ	FGELQYCLSDKPKLLPLELEKTACQEYSVTEFQPLYYVAESFSDAKEKVR
DSSP	HHHHHHTTSSSS..EE..HHHHTT....SSS..S..EEES.HHHHHHHHH
PSIPRED	HHHHHHHHCCCCCCCCCCHHHHHCCCCCCCCCCEEEEEECCHHHHHHHHH
JPred	HHHHHHHH-----EE---HHHHH-----------EEEE---HHHHHHHHH
Reference	NFAATIPRPFSVRYDPYTQRIEVLDNTQQLKILADSINSEIGILCSALQK
1PHZ	TFAATIPRPFSVRYDPYTQRVEVLDNT-----------------------
DSSP	HHHHTS..SSEEEEETTTTEEEEE.HHHHHHHHHHHHHHHHHHHHHHHHH
PSIPRED	HHHHHCCCCCEEEECCCCCEEEECCCHHHHHHHHHHHHHHHHHHHHHHHH
JPred	HHHHHH------------EEEEE---HHHHHHHHHHHHHHHHHHHHHHHH
Reference	IK
1PHZ	--
DSSP	T.
PSIPRED	HC
JPred	--
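The rows of this alignment can be compared position by position to see how often a prediction agrees with the DSSP assignment. The sketch below is one possible way to do this and rests on several assumptions that are not stated in the text: DSSP states are reduced to three states with H, G, I as helix and E, B as strand, everything else (including `.`) counts as coil, `-` in the DSSP row marks positions without an assignment that are skipped, and `-` in the JPred row stands for coil.

```python
# Assumed 8-state to 3-state reduction; other reduction schemes exist.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp):
    """Map a DSSP row to H/E/C, keeping '-' for positions without assignment."""
    return "".join("-" if c == "-" else DSSP_TO_3.get(c, "C") for c in dssp)


def agreement(dssp, predicted, coil="C"):
    """Fraction of assigned positions where prediction and DSSP agree.

    `coil` is the character the predictor uses for coil
    ('C' for PSIPRED, '-' for JPred in the alignment above).
    Returns None if no position has a DSSP assignment.
    """
    matches = 0
    compared = 0
    for observed, pred in zip(reduce_dssp(dssp), predicted):
        if observed == "-":
            continue  # no DSSP assignment at this position
        if pred == coil:
            pred = "C"
        compared += 1
        if observed == pred:
            matches += 1
    return matches / compared if compared else None


# Last alignment block above: DSSP "T.", PSIPRED "HC", JPred "--".
psipred_agreement = agreement("T.", "HC")          # 0.5
jpred_agreement = agreement("T.", "--", coil="-")  # 1.0
```

Concatenating all blocks of one row before calling `agreement` would give the agreement over the whole sequence.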


Contents

Task description

Task 3.1: Secondary structure prediction

PSIPRED

JPred3

DSSP

Result

Discussion

Task 3.2: Prediction of disordered regions

DISOPRED

POODLE

IUPRED

META-Disorder

Result

Discussion

Task 3.3: Prediction of transmembrane alpha-helices and signal peptides

Annotated sequence features

PAH

BACR_HALSA

RET4_HUMAN

INSL5_HUMAN

LAMP1_HUMAN

A4_HUMAN

General Questions to prediction of transmembrane alpha-helices and signal peptides

Why is the prediction of transmembrane helices and signal peptides grouped together here?

Description of different signal peptides

Signalpeptides for the import to the endoplasmic reticulum (ER)

Signalpeptides for the import to the mitochondrion

Signalpeptides for the import to the chloroplast

Signalpeptides for the import to the peroxisome

Signalpeptides for the import to the nucleus and the export form the nucleus

TMHMM

Details of the method

Description

Predicted features

Required information for the prediction

Execution

Results and discussion

PAH

BACR_HALSA

RET4_HUMAN

INSL5_HUMAN

LAMP1_HUMAN

A4_HUMAN

Phobius

Details of the method

Description

Predicted features

Required information for the prediction

Execution

Results and discussion

PAH

BACR_HALSA

RET4_HUMAN

INSL5_HUMAN

LAMP1_HUMAN

A4_HUMAN

PolyPhobius

Details of the method

Description

Predicted features

Required information for the prediction

Execution

Results and discussion

PAH

BACR_HALSA

RET4_HUMAN

INSL5_HUMAN

LAMP1_HUMAN

A4_HUMAN

OCTOPUS

Details of the method

Description

Predicted features

Required information for the prediction

Execution

Results and discussion

PAH