Hemochromatosis: Sequence based predictions

From Bioinformatikpedia

Lab Journal

Secondary Structure

In this task, the secondary structure of proteins is predicted using the programs reProf and PsiPred. The results are then compared to the DSSP[1] secondary structure assignments of corresponding crystal structures from the PDB.

<figtable id="sequences">

Uniprot PDB
UID Name Length ID Resolution [A] Chain Length
Q30201 Hereditary hemochromatosis protein 348 1A6Z 2.60 A 275
P10775 Ribonuclease inhibitor 456 2BNH 2.30 A 457
Q9X0E6 Divalent-cation tolerance protein CutA 101 1VHF 1.54 A 113
Q08209 CAM-PRP catalytic subunit 521 1M63 2.80 A 372
Table 1: List of the four proteins for which predictions were performed in this task. The given PDB structures were used as reference.


For this purpose, the crystal structures listed in <xr id="sequences"/> were selected for comparison. The first priority for selection was to get the protein in its native state and not bound to another molecule. The other criteria were the quality of the structure and the alignment to the protein sequence (having one continuous segment).

In the next step, the output of the prediction programs and the DSSP assignments have to be made comparable. DSSP assigns 8 different classes of secondary structure, whereas reProf and PsiPred only predict helix (H), sheet (E) and loop (L or C). The exact mappings can be found in the lab journal.
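The mapping used in this task is documented in the lab journal and is not reproduced here. As a minimal sketch, assuming the commonly used reduction (H, G, I to helix; E, B to sheet; everything else to loop), the two alphabets could be made comparable as follows. The function names and the mapping itself are assumptions for illustration.

```python
# Assumed 8-to-3 state reduction; the mapping actually used is in the lab journal.
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",  # alpha, 3-10 and pi helix -> helix
    "E": "E", "B": "E",            # extended strand and isolated bridge -> sheet
}


def reduce_dssp(dssp_states: str) -> str:
    """Collapse a DSSP 8-state string to H/E/C; all other symbols become C."""
    return "".join(DSSP_TO_3.get(state, "C") for state in dssp_states)


def normalise_prediction(prediction: str) -> str:
    """Rewrite the loop symbol L as C so that all strings use H/E/C."""
    return prediction.replace("L", "C")


print(reduce_dssp("HGIEBTS -"))        # helix, sheet and loop symbols mixed
print(normalise_prediction("HHLLEE"))
```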

To assess the quality of the prediction, the class-specific accuracy, coverage and F1-measure were used along with the Q3 and SOV3 <ref>Zemla A, et al.; A Modified Definition of Sov, a Segment-Based Measure for Protein Secondary Structure Prediction Assessment; PROTEINS 34:220–223 (1999)</ref> measures. The Q3 is defined as follows:

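<math>Q_3 = \frac{N_H + N_E + N_C}{N}</math>

Here, N_H, N_E and N_C are the numbers of correctly predicted helix, sheet and loop residues, and N is the total number of residues.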
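The per-residue measures can be sketched in a few lines. This is an illustration under stated assumptions: Q3 is taken as the fraction of residues whose three-state prediction equals the reference, "accuracy" is interpreted as the fraction of residues predicted in a class that are correct, and "coverage" as the fraction of reference residues of a class that are recovered. The function names are hypothetical, and SOV3 is not covered.

```python
def q3(reference: str, predicted: str) -> float:
    """Fraction of residues with identical three-state assignment."""
    if len(reference) != len(predicted):
        raise ValueError("reference and prediction must have equal length")
    correct = sum(r == p for r, p in zip(reference, predicted))
    return correct / len(reference)


def class_scores(reference: str, predicted: str, state: str):
    """Return (accuracy, coverage, F1) for one state such as 'H', 'E' or 'C'."""
    true_pos = sum(r == state and p == state for r, p in zip(reference, predicted))
    n_predicted = predicted.count(state)
    n_reference = reference.count(state)
    accuracy = true_pos / n_predicted if n_predicted else 0.0
    coverage = true_pos / n_reference if n_reference else 0.0
    total = accuracy + coverage
    f1 = 2 * accuracy * coverage / total if total else 0.0
    return accuracy, coverage, f1


# Toy strings, not data from this task.
reference = "HHHHEEEECC"
predicted = "HHHCEEEECC"
print(q3(reference, predicted))
print(class_scores(reference, predicted, "H"))
```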

Reprof takes as input either sequences or PSSMs. Therefore, PSSMs were generated by querying the HFE (Q30201) sequence against the big_80, SwissProt and PDB databases. The tool of choice for this was PsiBlast with standard parameters (2 iterations and e-value cutoff of 0.002).
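The PSSM generation step could be scripted roughly as below. This is only a sketch: it assumes the BLAST+ `psiblast` executable (the task may have used a different PsiBlast build), it interprets the e-value cutoff as the PSSM inclusion threshold, and the file and database names are hypothetical placeholders. The function only builds the argument list and does not run anything.

```python
def psiblast_command(query_fasta, database, pssm_out,
                     iterations=2, inclusion_evalue=0.002):
    """Build a BLAST+ psiblast call that writes an ASCII PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_evalue),  # assumed meaning of the cutoff
        "-out_ascii_pssm", pssm_out,
    ]


# Hypothetical file and database names.
command = psiblast_command("Q30201.fasta", "swissprot", "Q30201_swissprot.pssm")
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where BLAST+ and the database are installed.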

<figtable id="prediction quality">

Prediction methods
Reprof + Sequence Reprof + Big80 Reprof + SwissProt Reprof + PDB Psipred
Q3 0.76 0.81 0.86 0.84 0.84
SOV3 0.66 0.75 0.84 0.84 0.73
Acc Cov F1 Acc Cov F1 Acc Cov F1 Acc Cov F1 Acc Cov F1
H 0.63 0.33 0.44 0.74 0.48 0.59 0.84 0.65 0.74 0.85 0.52 0.64 0.98 0.77 0.86
C 0.57 0.69 0.63 0.63 0.75 0.68 0.68 0.81 0.74 0.64 0.83 0.72 0.61 0.95 0.74
E 0.79 0.63 0.70 0.81 0.84 0.83 0.89 0.86 0.87 0.88 0.85 0.86 0.94 0.55 0.69
Table 2: Quality of the Reprof predictions based on different inputs and the Psipred prediction for HFE. The best Q3, SOV and F1 values are marked in bold.


The deciding criteria for the Reprof method choice were the Q3, SOV3 and F1-measures. For the Q3 and the SOV3, the SwissProt PSSM combination performed best and the F1-measures were also reasonable (see <xr id="prediction quality"/>). Therefore, this method was chosen for all future predictions.

The DSSP assignments together with the ReProf and PsiPred predictions for each of the four proteins can be found here.

<figtable id="comparison">

Protein Method Q3 SOV3
Q08209 ReProf 0.83 0.78
Psipred 0.87 0.79
Q30201 ReProf 0.86 0.84
Psipred 0.84 0.73
P10775 ReProf 0.91 0.93
Psipred 0.93 0.94
Q9X0E6 ReProf 0.75 0.65
Psipred 0.89 0.86
Table 3: Comparison of ReProf and PsiPred predictions for all four proteins.


<xr id="comparison"/> shows a comparison of the ReProf and PsiPred prediction quality. The predictions were compared to the DSSP assignments of the corresponding PDB structures. While SOV3 and Q3 do not always agree on which of the two methods performs better, they do so most of the time. Nevertheless, it was decided that SOV3 is to be trusted more than Q3, because it takes into account the per-segment accuracy rather than just the per-residue accuracy. By this measure, PsiPred outperforms ReProf in 3 out of 4 cases. It is also notable that the SOV3 values range from 0.65 for Q9X0E6 (101 residues) to 0.94 for P10775 (456 residues). So the performance of the methods appears to depend on the length of the protein and on the protein itself, but PsiPred performed best overall.
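The "3 out of 4" count can be reproduced directly from the values in Table 3. The snippet below is a small sketch; the dictionary layout and the function name are illustrative choices.

```python
from collections import Counter

# (Q3, SOV3) per protein and method, copied from Table 3.
TABLE3 = {
    "Q08209": {"ReProf": (0.83, 0.78), "Psipred": (0.87, 0.79)},
    "Q30201": {"ReProf": (0.86, 0.84), "Psipred": (0.84, 0.73)},
    "P10775": {"ReProf": (0.91, 0.93), "Psipred": (0.93, 0.94)},
    "Q9X0E6": {"ReProf": (0.75, 0.65), "Psipred": (0.89, 0.86)},
}


def count_wins(table, measure_index):
    """Count per method how often it has the higher value (0 = Q3, 1 = SOV3)."""
    wins = Counter()
    for scores in table.values():
        best = max(scores, key=lambda method: scores[method][measure_index])
        wins[best] += 1
    return wins


print(count_wins(TABLE3, 1))  # wins by SOV3
```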


We searched DisProt for the best matches to the four proteins, but there was only one direct match, for Q08209. We used PsiBlast and Smith-Waterman searches to find matches for the other three proteins. The best matches are listed in <xr id="disprot id"/>.



<figtable id="disprot id">

UID DisProt ID SeqID E-value Method
Q30201 DP00670 0.29 2e-30 psiblast
P10775 DP00554 0.4 7e-30 psiblast
Q9X0E6 DP00564 0.25 0.36 smith-waterman
Q08209 DP00092 - - direct match
Table 4: DisProt matches for Q30201, P10775, Q9X0E6 and Q08209. The 'method' column specifies how the match was found.


IUPred is a disorder prediction server that predicts three different types:

  • long disorder
  • short disorder
  • global (globular not disordered domains)

When predicting the long and short types of disorder, the output contains a likelihood of disorder for each residue and the globular domain prediction contains the start and end positions.

MetaDisorder (MD) is a predictor that is based on several prediction programs from PredictProtein. It combines those predictions into one value for each residue, which is the likelihood of being part of a disordered region.

<figure id="disorder pred1">

Disorder plot Q30201.png Hemo plot P10775.png
Q30201 glob.png P10775 globular.png


<figure id="disorder pred2">

Hemo plot Q9X0E6.png Hemo plot Q08209.png
Q9X0E6 globular.png Q08209 globular.png
Figure 2: Disprot and Metadisorder predictions. The upper plot for each protein contains the local and global predictions from IUPred, the MetaDisorder prediction and the DisProt annotation of the corresponding match in the database. The upper part of the top plot with the red and blue bars is the DisProt annotation. The lower plot contains the global IUPred prediction, where the blue bar indicates the globular domain. The x-axes of both plots show the protein sequence and the y-axes the tendency of each residue for being disordered. The scale ranges from 0 to 1.


All predictions are combined into plots in <xr id="disorder pred2"/>. A residue is predicted to be disordered if its likelihood is higher than 0.5, indicated by the gray horizontal line.
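The 0.5 cutoff translates into disordered segments in a straightforward way. The following sketch assumes the per-residue likelihoods are available as a plain list of floats; the function name and the toy scores are illustrative and not taken from the predictions in this task.

```python
def disordered_regions(scores, threshold=0.5):
    """Return 1-based (start, end) segments whose scores exceed the threshold."""
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = position          # a new disordered segment begins
        elif start is not None:
            regions.append((start, position - 1))
            start = None
    if start is not None:                 # segment runs to the C-terminus
        regions.append((start, len(scores)))
    return regions


toy_scores = [0.6, 0.7, 0.2, 0.5, 0.9]
print(disordered_regions(toy_scores))
```

A residue with a likelihood of exactly 0.5 is treated as ordered here, since the text requires the value to be higher than the cutoff.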

Unfortunately we could not find good matches in the DisProt database, apart from the direct match for Q08209. The matches for Q30201 and Q9X0E6 have a sequence identity below 30% and the e-value for Q9X0E6 is too high with 0.36. But the hit for P10775 with a sequence identity of 40% and an e-value of 7e-30 can possibly be used for an annotation transfer. Nevertheless, we included the annotations for the Q30201 and Q9X0E6 matches for the sake of completeness.
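The reasoning about which hits are usable for annotation transfer can be written as a simple filter over Table 4. The 30% identity limit follows the text; the e-value limit of 1e-3 is an assumed placeholder, since the text only states that 0.36 is too high. The function name is hypothetical.

```python
# (UID, DisProt ID, sequence identity, e-value); None marks the direct match.
TABLE4 = [
    ("Q30201", "DP00670", 0.29, 2e-30),
    ("P10775", "DP00554", 0.40, 7e-30),
    ("Q9X0E6", "DP00564", 0.25, 0.36),
    ("Q08209", "DP00092", None, None),
]


def is_transferable(identity, evalue, min_identity=0.30, max_evalue=1e-3):
    """Decide whether a DisProt hit is good enough for annotation transfer."""
    if identity is None:  # direct match, no alignment statistics needed
        return True
    return identity >= min_identity and evalue <= max_evalue


usable = [uid for uid, _, identity, evalue in TABLE4
          if is_transferable(identity, evalue)]
print(usable)
```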

All predictions for the HFE protein Q30201 show nearly no disordered regions. The MD tendencies are always below 0.5 and there is only a short region after residue 250 where IUPred predicted long and short disordered regions. The DisProt annotation contains a small disordered region after residue 150, but this region can be neglected due to the low sequence identity to the corresponding DisProt protein. IUPred predicted that the HFE protein consists of a single globular domain without disorder. In summary, the HFE protein is correctly predicted to be not disordered.

P10775 was also predicted to be not disordered by all methods. The region of the DisProt entry DP00465 that matched P10775 does not contain a disordered region, so the predictions are also right in this case.

For Q9X0E6, the MetaDisorder likelihoods are higher, always above 0.3, and the DisProt annotation is taken from the entry DP00564. But we trust IUPred the most, since all IUPred predictions, including the global one, agree that Q9X0E6 is not disordered and the DP00564 match has an e-value of 0.36.

In contrast to the first three proteins, Q08209 is clearly disordered. The DisProt annotation of the protein contains several disordered regions near the C-terminus. IUPred's global prediction agrees that the C-terminus is disordered, and also the long, short and MetaDisorder predictions are above 0.5.

The N- and C-termini of all proteins have short disorder tendencies with values above 0.5, but since the ends of all protein chains can be characterized as disordered to some extent, we would not count these as false predictions.

Transmembrane Helices


<figtable id="pdb structures">

Uniprot PDB
UID Name Length ID Resolution [A] Chain Length
Q30201 Hereditary hemochromatosis protein 348 1A6Z 2.60 A 275
P35462 D(3) dopamine receptor 400 3PBL 2.89 A 400
Q9YDF8 Voltage-gated potassium channel 295 1ORQ 3.20 C 223
1ORS 1.90 C 132
P47863 Aquaporin-4 323 2D57 3.20 A 301
Table 5: Description of the four proteins Q30201, Q9YDF8, P35462 and P47863 and the corresponding pdb structures with the highest resolution.



The HFE protein is a single-pass membrane protein and thus located at the membrane. Nevertheless, there are no entries for it in the OPM and PDBTM databases. A single-pass membrane protein spans the membrane exactly once; HFE has its N-terminus in the extracellular space and its C-terminus in the cytoplasm. The N-terminal signal sequence is cleaved off after the translocation.

<figtable id="TM hfe">

HFE (Q30201)
Memsat Polyphobius OPM PDBTM
Hemo memsat Q30201.png
Hemo poly Q30201.png
- -
Table 6: Visualisation of the Memsat and Polyphobius predictions for HFE.


Memsat predicts one transmembrane helix for the HFE protein and also a signal peptide consisting of the first 12 N-terminal residues (see <xr id="TM hfe"/>). The signal peptide is actually longer; it spans the first 22 residues (Uniprot). The transmembrane helix from residue 306 to 329, however, is correctly predicted (307 to 330 in Uniprot).

Polyphobius prediction for HFE:

ID   sp|Q30201|HFE_HUMAN
FT   SIGNAL        1     25
FT   REGION        1      7       N-REGION.
FT   REGION        8     19       H-REGION.
FT   REGION       20     25       C-REGION.
FT   TOPO_DOM     26    304       NON CYTOPLASMIC.
FT   TRANSMEM    305    329
FT   TOPO_DOM    330    348       CYTOPLASMIC.

Polyphobius is even more accurate in its prediction. The predicted signal sequence ranges from residue 1 to 25 and the transmembrane region from 305 to 329, which is only slightly shifted. The graphical prediction plot in <xr id="TM hfe"/> shows the transmembrane region in grey and the signal peptide in red. The cytoplasmic region is indicated by the green line and the non cytoplasmic region in blue.
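Feature tables in this `FT` format are easy to process programmatically. The following is a minimal sketch of a parser for the output shown above; the function name and the tuple layout are illustrative choices and not part of Polyphobius.

```python
HFE_OUTPUT = """\
ID   sp|Q30201|HFE_HUMAN
FT   SIGNAL        1     25
FT   REGION        1      7       N-REGION.
FT   REGION        8     19       H-REGION.
FT   REGION       20     25       C-REGION.
FT   TOPO_DOM     26    304       NON CYTOPLASMIC.
FT   TRANSMEM    305    329
FT   TOPO_DOM    330    348       CYTOPLASMIC.
"""


def parse_ft_lines(text):
    """Return (feature, start, end, note) tuples for every FT line."""
    features = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "FT":
            note = " ".join(parts[4:]).rstrip(".")
            features.append((parts[1], int(parts[2]), int(parts[3]), note))
    return features


features = parse_ft_lines(HFE_OUTPUT)
helices = [(start, end) for kind, start, end, _ in features if kind == "TRANSMEM"]
print(helices)
```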

Both prediction programs were able to predict the transmembrane helix topology of the HFE protein correctly.

D(3) dopamine receptor

The D(3) dopamine receptor is located in the cell membrane of neurons. It is a multi-pass membrane protein, i.e. it spans the membrane more than once.

<figtable id="TM P35462">

D(3) dopamine receptor (P35462)
Memsat Polyphobius OPM PDBTM
Hemo memsat P35462.png
Hemo poly P35462.png
3pbl opm.png
Hemo 3pbl lm.png
Table 6: Visualisation of the Memsat and Polyphobius predictions for the D(3) dopamine receptor (P35462) and the assignments in OPM and PDBTM.


The OPM and PDBTM databases contain membrane assignments for crystal structures of known transmembrane proteins. The reference assignments for 3PBL are shown in <xr id="TM P35462"/>. In the OPM visualisation, the extracellular side of the membrane is marked in red and the cytoplasmic side in blue. The N-terminus of the protein is assigned to be extracellular. PDBTM colors the cytosolic part of the protein in red and the extracellular part in blue. Transmembrane helices are colored in yellow. Both databases have 7 transmembrane helices assigned to the structure of the D(3) dopamine receptor.


<figtable id="helices assignment P35462">

# TM helix OPM PDBTM
1 34-52 35-52
2 67-91 68-84
3 101-126 109-123
4 150-170 152-166
5 187-209 191-206
6 330-351 334-347
7 363-386 368-382
Table 4: Location of the seven helices in 3PBL as assigned in OPM and PDBTM.


Memsat predicts a signal peptide from positions 1 to 29. However, the Uniprot entry does not contain a signal peptide. Apart from that, Memsat correctly locates the N-terminus of the sequence in the extracellular space and predicts 6 transmembrane helices in total. The positions of all predicted transmembrane helices largely overlap with those of the OPM and PDBTM assignments; only the 7th helix is not predicted.

The Polyphobius prediction is the following:

ID   sp|P35462|DRD3_HUMAN
FT   TOPO_DOM      1     29       NON CYTOPLASMIC.
FT   TRANSMEM     30     55
FT   TOPO_DOM     56     65       CYTOPLASMIC.
FT   TRANSMEM     66     88
FT   TOPO_DOM     89    104       NON CYTOPLASMIC.
FT   TRANSMEM    105    126
FT   TOPO_DOM    127    149       CYTOPLASMIC.
FT   TRANSMEM    150    170
FT   TOPO_DOM    171    187       NON CYTOPLASMIC.
FT   TRANSMEM    188    212
FT   TOPO_DOM    213    328       CYTOPLASMIC.
FT   TRANSMEM    329    352
FT   TOPO_DOM    353    366       NON CYTOPLASMIC.
FT   TRANSMEM    367    386
FT   TOPO_DOM    387    400       CYTOPLASMIC.

The location of the N-terminus is correctly predicted as non cytoplasmic and all 7 transmembrane helices are also predicted at the right positions. The graphical plot also shows that there is no signal peptide. Thus, Polyphobius was able to determine the correct topology of the D(3) dopamine receptor.
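Whether a predicted helix is "at the right position" can be checked by residue overlap with the reference assignment. The sketch below compares the Polyphobius helices with the OPM helices of 3PBL; the minimum overlap of 5 residues is an assumed cutoff chosen for illustration, not a criterion stated in this task.

```python
# Helix boundaries copied from the OPM column and the Polyphobius output.
OPM_3PBL = [(34, 52), (67, 91), (101, 126), (150, 170),
            (187, 209), (330, 351), (363, 386)]
POLYPHOBIUS_P35462 = [(30, 55), (66, 88), (105, 126), (150, 170),
                      (188, 212), (329, 352), (367, 386)]


def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def matched_helices(predicted, reference, min_overlap=5):
    """Count reference helices covered by at least one predicted helix."""
    return sum(
        any(overlap(pred, ref) >= min_overlap for pred in predicted)
        for ref in reference
    )


print(matched_helices(POLYPHOBIUS_P35462, OPM_3PBL), "of", len(OPM_3PBL))
```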

Voltage-gated potassium channel

There are two different PDB structures that both match the Q9YDF8 sequence. The annotations in OPM and PDBTM differ slightly, thus we combined the annotations from both structures, see <xr id="helices assignment Q9YDF8"/>.

<figtable id="helices assignment Q9YDF8">

# TM helix OPM PDBTM
1 37-58 - 39-62 33-64
2 67-90 - 67-87 69-92
3 98-109 - 100-119 -
4 112-121 - - -
5 129-160 - 130-154 -
6 - 165-184 - 163-183
7 - 195-207 - 196-212 (re-entrant helix)
8 - 219-237 - 221-248
Table 4: Transmembrane helix assignments for 1ORS and 1ORQ in OPM and PDBTM.


The OPM assignment contains eight transmembrane helices for both structures, but PDBTM assigns six transmembrane helices and one loop that is actually a re-entrant helix. Both databases have the N-termini of the two PDB structures annotated as located in the cytoplasm.

<figtable id="TM Q9YDF8">

Voltage-gated potassium channel (Q9YDF8)
Memsat Polyphobius OPM (1ORQ) PDBTM (1ORQ)
Hemo memsat Q9YDF8.png
Hemo poly Q9YDF8.png
1orq opm.png
Hemo 1orq lm.png
Table 6: Visualisation of the Memsat and Polyphobius predictions for the voltage-gated potassium channel (Q9YDF8) and the assignments in OPM and PDBTM.


Memsat predicts six transmembrane helices and one re-entrant helix (see <xr id="TM Q9YDF8"/>). It combines the third and fourth OPM helices into one, but all other helices can be counted as correct. Compared to the PDBTM assignment, all predicted helices are at the right positions, even the re-entrant helix. The location of the C-terminus in the cytoplasm is also correct, but that of the N-terminus is not: it is actually located in the extracellular space and not in the cytoplasm.

Polyphobius predicted the following topology:

  FT   TOPO_DOM      1     41       NON CYTOPLASMIC.
  FT   TRANSMEM     42     60
  FT   TOPO_DOM     61     67       CYTOPLASMIC.
  FT   TRANSMEM     68     88
  FT   TOPO_DOM     89    107       NON CYTOPLASMIC.
  FT   TRANSMEM    108    129
  FT   TOPO_DOM    130    136       CYTOPLASMIC.
  FT   TRANSMEM    137    157
  FT   TOPO_DOM    158    162       NON CYTOPLASMIC.
  FT   TRANSMEM    163    184
  FT   TOPO_DOM    185    195       CYTOPLASMIC.
  FT   TRANSMEM    196    213
  FT   TOPO_DOM    214    223       NON CYTOPLASMIC.
  FT   TRANSMEM    224    244
  FT   TOPO_DOM    245    295       CYTOPLASMIC.

Seven transmembrane helices are predicted, and the third and fourth OPM helices are combined into one single helix (108-129), as in the Memsat output. Polyphobius does not predict a signal peptide and correctly locates the N- and C-termini as non-cytoplasmic and cytoplasmic, respectively. So the Polyphobius prediction does not match the PDBTM helix assignments as well as the Memsat prediction does, but on the other hand, the termini are located on the right sides.


Aquaporin-4 is a protein that forms a channel through the cell membrane consisting of six transmembrane helices. The pore allows water molecules to pass the cell membrane faster than by diffusion. <xr id="helices assignment P47863"/> lists the six transmembrane helices and two re-entrant helix assignments from OPM and PDBTM. Both the N- and the C-terminus are located in the cytoplasm.

<figtable id="helices assignment P47863">

# TM helix OPM PDBTM
1 34-56 39-55
2 70-88 72-89
3 98-107 95-106 (re-entrant)
4 112-136 116-133
5 156-178 158-177
6 189-203 188-205
7 214-223 209-222 (re-entrant)
8 231-252 231-248
Table 4: Location of the eight helices (six transmembrane and two re-entrant) in 2D57 as assigned in OPM and PDBTM.


Both database annotations are very similar and contain the same number of helices, although OPM does not differentiate between transmembrane and re-entrant helices.

<figtable id="TM P47863">

Aquaporin-4 (P47863)
Memsat Polyphobius OPM PDBTM
Hemo memsat P47863.png
Hemo poly P47863.png
2d57 opm.png
Hemo 2d57 lm.png
Table 6: Visualisation of the Memsat and Polyphobius predictions for Aquaporin-4 (P47863) and the assignments in OPM and PDBTM.


Memsat wrongly predicts a signal peptide in the first 20 amino acids, but the six helices and even the re-entrant helices are predicted correctly (see <xr id="TM P47863"/>). Polyphobius also predicts all six transmembrane helices correctly, but it does not predict re-entrant helices:

ID   sp|P47863|AQP4_RAT
FT   TOPO_DOM      1     33       CYTOPLASMIC.
FT   TRANSMEM     34     58
FT   TOPO_DOM     59     69       NON CYTOPLASMIC.
FT   TRANSMEM     70     91
FT   TOPO_DOM     92    114       CYTOPLASMIC.
FT   TRANSMEM    115    136
FT   TOPO_DOM    137    155       NON CYTOPLASMIC.
FT   TRANSMEM    156    177
FT   TOPO_DOM    178    187       CYTOPLASMIC.
FT   TRANSMEM    188    208
FT   TOPO_DOM    209    230       NON CYTOPLASMIC.
FT   TRANSMEM    231    252
FT   TOPO_DOM    253    323       CYTOPLASMIC.

Both termini are also correctly predicted. In contrast to Memsat, Polyphobius does not predict a signal peptide.


Memsat is a very good prediction tool, but it predicted signal peptides for the D(3) dopamine receptor and Aquaporin-4 although there are none. Apart from that and one wrongly predicted N-terminus localisation, its predictions are very accurate, and even re-entrant helices are reported correctly. Polyphobius, on the other hand, always predicted the signal peptides and termini localisations correctly. However, it does not predict re-entrant helices.

Signal Peptides




UID Q30201 P02768 P11279 P47863
SP id 13435 22229 17551 -
SP annotation 1-22 1-18 1-18 -
signalP prediction 1-22 1-18 1-18 no signal peptide
Table 7: Signal peptide predictions from signalP compared with the corresponding annotations in the signal peptide database. The C score is the cleavage site score, the S score is the raw signal peptide score and the Y score is a combination of the C and S scores.

For all four proteins, the signalP predictions were completely correct, which is remarkable.

  • ALBU_HUMAN, the human serum albumin, is the most abundant protein in human blood plasma. Its main responsibility is the regulation of the osmotic blood pressure and since it is a globular protein, it has no transmembrane helices.
  • The lysosome-associated membrane glycoprotein 1 (LAMP1_HUMAN) is a single-pass membrane protein that presents carbohydrate ligands to selectins. The selectins are a family of transmembrane glycoproteins that are involved in the inflammatory process.
  • Aquaporin-4 (AQP4_RAT) is an osmoreceptor that regulates body water balance in rats. It has no signal peptide and thus, the translocation signal is probably only recognizable in the folded structure of the protein, i.e. it is fragmented across the sequence. Also, it is a membrane protein with 6 transmembrane alpha helices.

GO Terms


GO id Aspect Confidence GO term
GO:0004872 F 91% receptor activity
GO:0030106 F 88% MHC class I receptor activity
Table 5: GoPet results for the Q30201 sequence.

Although the prediction of GoPet is completely correct, it only recognises that the protein belongs to the MHC I receptor class. Any annotations hinting at the role of HFE in the iron metabolism are completely missing.

Unfortunately, ProtFun is currently not working and thus, results cannot be presented. For details please refer to the lab journal.

Pfam families

The Pfam search for the 1A6Z_A structure of HFE revealed that it consists of two domains. The first domain is the MHC I binding domain, which is responsible for the binding to ferritin. Most MHC I proteins are involved in the immune system, and this specific domain therefore needs to be flexible. Its sequence is consequently not very well conserved, as indicated by the low conservation values in the Pfam alignment. The second domain is the immunoglobulin C1-set domain, which corresponds to the constant domain of antibodies. This domain is better conserved than the MHC I domain, which reflects its function as a protein stabilizer. Beta-2-microglobulin, which is also part of an MHC I complex but not covalently bound to the protein, is also a member of the Ig C1-set domain family.


<references />