Difference between revisions of "Sequence-based predictions TSD"

From Bioinformatikpedia
m (Comparison of predictions and annotation)
 
(100 intermediate revisions by 2 users not shown)
Line 1: Line 1:
  +
<div style = "align:left;float: left;"> « Previous [[Sequence Alignments TSD]] </div> <div style = "align:right;float: right;"> [[Homology modelling TSD]] Next » </div>
'''Thor:''' He's my brother
 
  +
<br>
 
  +
<p>
'''Natasha Romanoff:''' He killed 80 people in 2 days
 
  +
If not noted otherwise, all predictions were conducted with the [[HEXA Reference sequence]] (Uniprot P06865). A protocol for this task can be found [[Sequence-based predictions Protocol TSD|here]].</p>
 
'''Thor:''' ...He's adopted
 
 
 
If not noted otherwise, the sequence for all predictions is the [[HEXA Reference sequence]] (Uniprot P06865). A protocol for this task can be found [[Sequence-based predictions Protocol TSD|here]].
 
   
 
== Secondary structure ==
 
== Secondary structure ==
Proteins: Ribonuclease inhibitor [http://www.uniprot.org/uniprot/P10775 P10775 ], CutA [http://www.uniprot.org/uniprot/Q9X0E6 Q9X0E6 ], CAM-PRP catalytic subunit [http://www.uniprot.org/uniprot/Q08209 Q08209]
+
Proteins: Ribonuclease inhibitor [http://www.uniprot.org/uniprot/P10775 P10775 ], Divalent-cation tolerance protein CutA (CutA) [http://www.uniprot.org/uniprot/Q9X0E6 Q9X0E6 ], Serine/threonine-protein phosphatase 2B catalytic subunit alpha isoform (CAM-PRP catalytic subunit) [http://www.uniprot.org/uniprot/Q08209 Q08209].
 
<br>
 
<br>
 
Ribonuclease inhibitor and CutA are located in the cytoplasm whereas the CAM-PRP catalytic subunit is located in the nucleus.
 
Ribonuclease inhibitor and CutA are located in the cytoplasm whereas the CAM-PRP catalytic subunit is located in the nucleus.
   
TODO maybe extend this a bit and change names, CutA and CAM-PRP tell me exactly nothing about what these are.
 
   
 
=== DSSP and handling of differing sequences ===
 
=== DSSP and handling of differing sequences ===
DSSP builds upon 3D structures, therefore a PDB entry has to be selected for every given Uniprot entry. The chosen mapping is 2bnh for P10775, 1kr4 for Q9X0E6, 1aui for Q08209 and 2gjx for P06865. This creates an additional problem. All other resources base their predictions on the Uniprot sequence. The sequence used by DSSP, inferred from the PDB file might be significantly different due to changes of the experimentalists solving the structure. There are automated ways to resolve this, using PDBs new mmCIF files, which provide a residue-level mapping between the atom recored inferred sequence and the SEQRES record sequence. From there one could use SIFTS which provides a residue-level mapping between SEQRES and Uniprot. However both these tools are automated and, while surely developed with great care, looking at the sequence might be considered favorable and will also directly point out interesting parts. Therefore manual alignments were performed and, if applicable, special cases noted in the text.
+
DSSP builds upon 3D structures, therefore a PDB entry has to be selected for every given Uniprot entry. The chosen mapping is 2bnh for P10775, 1kr4 for Q9X0E6, 1aui for Q08209 and 2gjx for P06865. This creates an additional problem: all other resources base their predictions on the Uniprot sequence, whereas the sequence used by DSSP, inferred from the PDB file, might differ significantly due to changes made by the experimentalists who solved the structure. There are automated ways to resolve this, using the PDB's new mmCIF files, which provide a residue-level mapping between the sequence inferred from the ATOM records and the SEQRES record sequence. From there one could use SIFTS, which provides a residue-level mapping between SEQRES and Uniprot. However, both resources, the mmCIF mapping and SIFTS, are automated and, while they were surely developed with great care, inspecting the sequences manually was considered preferable, as it also directly points out interesting parts. Therefore manual alignments were performed and, if applicable, special cases are noted in the text.
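As an illustration of how differences between a Uniprot sequence and a PDB-derived sequence can be spotted, the following minimal sketch lists all non-matching blocks between two sequences. It uses Python's generic <code>difflib</code> matcher, which is a plain text diff and not a biological alignment with a substitution matrix; the function name and the toy sequences are hypothetical and not taken from the protocol.

```python
from difflib import SequenceMatcher


def sequence_differences(uniprot_seq, pdb_seq):
    """List every block in which the two sequences differ.

    Returns tuples (tag, uniprot_start, uniprot_segment, pdb_segment),
    where tag is 'replace', 'delete' or 'insert' (seen from the Uniprot
    sequence) and uniprot_start is a 1-based position.
    """
    # autojunk=False keeps difflib from ignoring frequent residues
    matcher = SequenceMatcher(None, uniprot_seq, pdb_seq, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, i1 + 1, uniprot_seq[i1:i2], pdb_seq[j1:j2]))
    return differences


# Toy strings for illustration only, not real protein sequences
for difference in sequence_differences("ACDEFGHIKL", "ACDEWGHIKL"):
    print(difference)
```

Blocks reported as 'delete' at the termini typically correspond to residues missing from the structure, while 'replace' blocks point to substitutions that deserve a closer look.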
   
 
=== Comparison of predictions and annotation ===
 
=== Comparison of predictions and annotation ===
In the following the secondary structure prediction of PSIPred and reprof is compared to the 3D-structure based annotation of DSSP. Uniprot annotation was not included, since its source is DSSP <ref name="uniprotsecstruct">http://www.uniprot.org/manual/helix</ref>. To compare the methods a common alphabet had to be established. All outputs were normalized to only contain the residues types H (Helix), E (Beta-strand) or C (Coil). Details on the parsing can be found in the [[Sequence-based_predictions_Protocol_TSD#Secondary_Structure|protocol]].
+
In the following, the secondary structure predictions of PSIPred and reprof are compared to the 3D-structure-based annotation of DSSP. Uniprot annotation was not included, since its source is DSSP <ref name="uniprotsecstruct">http://www.uniprot.org/manual/helix</ref>. To compare the methods a common alphabet had to be established. All outputs were normalized to contain only the residue types H (Helix), E (Beta-strand) or C (Coil). Details on the parsing can be found in the [[Sequence-based_predictions_Protocol_TSD#Secondary_Structure|protocol]]. The figures were created using the LaTeX package cpssp <ref name="cpssp">http://www.ctan.org/pkg/cpssp</ref>; details on these can also be found in the [[Sequence-based_predictions_Protocol_TSD#Figures|protocol]].
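The normalization to a three-letter alphabet can be sketched as follows. The reduction used here (H, G, I to H; E, B to E; everything else to C) is a common convention and only an assumption; the mapping actually applied is the one described in the protocol.

```python
# Assumed reduction of the DSSP alphabet to three states:
# helices (H, G, I) -> H, strands and bridges (E, B) -> E, the rest -> C.
DSSP_TO_THREE_STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three_states(dssp_string):
    """Map a per-residue DSSP string onto the alphabet H, E, C."""
    return "".join(DSSP_TO_THREE_STATE.get(state, "C") for state in dssp_string)


print(reduce_to_three_states("HHGG EEB TS-"))
```

Any symbol that is not listed in the dictionary, including the blank that DSSP uses for unassigned residues, falls back to coil.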
 
====P06865====
 
====P06865====
<xr id="fig:ssP06865"/> shows the comparison of assigned secondary structure for the Hex A alpha subunit. Several things are noticable here. Firstly, there are several short helices in the DSSP annotation that neither PSIPred nor reprof predicted. These short regions are not, as one might first assume, less common helix types, but in fact, as a check in the original DSSP output confirms 'normal' (i+4->i) alpha-helices. This issue cannnot be observed for beta-sheets where the length does not shown an immediate impact und prediction performance. Comparing PSIPred and reprof it can be easily observed that PSIPred shows very good correlation with the DSSP annotation while reprof often differs from these two. The disagreement is noticably stronger in the catalytic domain of the protein (cf. <xr id="fig:pfam2gjx"/>) and mostly appears in the prediction of beta-sheets instead of alpha-helices. This is surprising but also means that both methods correctly predict the central TIM barrel in this domain.
+
<xr id="fig:ssP06865"/> shows the comparison of assigned secondary structure for the Hex A alpha subunit. Several things are noticeable here. Firstly, there are several short helices in the DSSP annotation that neither PSIPred nor reprof predicted. These short regions are not, as one might first assume, less common helix types; a check of the original DSSP output confirms that they are 'normal' (i+4->i) alpha-helices. This issue cannot be observed for beta-sheets, where the length does not show an immediate impact on prediction performance. Comparing PSIPred and reprof, it can easily be observed that PSIPred agrees very well with the DSSP annotation while reprof often differs from both. The disagreement is noticeably stronger in the catalytic domain of the protein (beginning at position 167, cf. <xr id="fig:pfam2gjx"/>) and mostly appears in the prediction of beta-sheets instead of alpha-helices. This is surprising but also means that both prediction methods at least correctly predict the central TIM barrel in this domain.
   
 
<figure id="fig:ssP06865">
 
<figure id="fig:ssP06865">
[[Image:TSD P06865 plot-1.png|300px|thumb|<caption>'''asdasd'''</caption> <br> asd.]]
+
[[Image:TSD P06865 plot-1.png|300px|thumb|<font size="1"><div align="justify">'''Figure 1:''' Comparison of 2D Structure annotation and prediction for P06865</div></font>]]
 
</figure>
 
</figure>
  +
  +
  +
 
==== P10775 ====
 
==== P10775 ====
<xr id="fig:ssP10775"/> shows the comparison of assigned secondary structure for the ribonuclease inhibitor. It is apparent that both prediction methods largely agree with the annotation in DSSP, which is not entirely surprising given the peculiar horseshoe like secondary structure of alternating alpha-helices and beta-sheets that hardly leaves place for exchange or removal of a secondary structure element. Nonetheless, there are differences present, most noticeably the behaviour of reprof to either elongate helices or shift them to the left or right compared to the DSSP and PSIPred annotations. This leads to the fact that some of the beta strand-turn-alpha helix motifs are mistaken by reprof for something that would be more accurately described as alpha helix-turn-alpha helix motif. Given that these inhibitor are known to very tightly interact with the ribonucleases <ref name="Dickson2010">Dickson,K. and Haigis,M. (2005) Ribonuclease inhibitor: structure and function. Progress in nucleic acid research and, 6603, 1-23.</ref> it would be surprising to see that this does not significantly impair function and is therefore a very important finding.
+
<xr id="fig:ssP10775"/> shows the comparison of assigned secondary structure for the ribonuclease inhibitor. It is apparent that both prediction methods largely agree with the annotation in DSSP, which is not entirely surprising given the peculiar horseshoe-like structure of alternating alpha-helices and beta-sheets that hardly leaves room for the exchange or removal of a secondary structure element (''cf.'' <xr id="fig:2bnh"/>). Nonetheless, there are differences, most noticeably the tendency of reprof to either elongate helices or shift them to the left or right compared to the DSSP and PSIPred annotations. As a result, some of the beta strand-turn-alpha helix motifs are mistaken by reprof for something that would be more accurately described as an alpha helix-turn-alpha helix motif. Given that these inhibitors are known to interact very tightly with the ribonucleases <ref name="Dickson2010">Dickson,K. and Haigis,M. (2005) Ribonuclease inhibitor: structure and function. Progress in nucleic acid research and, 6603, 1-23.</ref>, it would be surprising if such a change in secondary structure did not significantly impair function; the misprediction is therefore an important finding.
   
 
<figure id="fig:ssP10775">
 
<figure id="fig:ssP10775">
[[Image:TSD P10775 plot-1.png|300px|thumb|<caption>'''asdasd'''</caption> <br> asd.]]
+
[[Image:TSD P10775 plot-1.png|300px|thumb|<font size="1"><div align="justify">'''Figure 2:''' Comparison of 2D Structure annotation and prediction for P10775</div></font>]]
 
</figure>
 
</figure>
  +
  +
<figure id="fig:2bnh">
  +
[[Image:TSD 2bnh.png|center|350px|thumb|<font size="1"><div align="justify">'''Figure 3: Ribonuclease inhibitor 3D structure''' <br> 3D structure of the ribonuclease inhibitor, according to PDB entry 2bnh. Highlights are yellow for beta-sheets, red for alpha-helices and green for coiled regions.</div></font>]]
  +
</figure>
  +
 
==== Q9X0E6 ====
 
==== Q9X0E6 ====
 
<xr id="fig:ssQ9X0E6"/> shows the comparison of assigned secondary structure for 'cation tolerance protein' CutA which reinforces the behaviours observed so far. While PSIPred's prediction is very close to the DSSP annotation, reprof does not predict some of the beta-sheets and tends to extend helices. Again this is important to note, since the beta-sheet assembly is thought to be essential for assumption of the trimeric biological assembly <ref name="Savchenko2009">Savchenko,A. and Skarina,T. (2004) X-Ray Crystal Structure of CutA From Thermotoga maritima at 1.4 Å Resolution. Proteins: Structure,, 54, 162-165.</ref>.
 
<xr id="fig:ssQ9X0E6"/> shows the comparison of assigned secondary structure for 'cation tolerance protein' CutA which reinforces the behaviours observed so far. While PSIPred's prediction is very close to the DSSP annotation, reprof does not predict some of the beta-sheets and tends to extend helices. Again this is important to note, since the beta-sheet assembly is thought to be essential for assumption of the trimeric biological assembly <ref name="Savchenko2009">Savchenko,A. and Skarina,T. (2004) X-Ray Crystal Structure of CutA From Thermotoga maritima at 1.4 Å Resolution. Proteins: Structure,, 54, 162-165.</ref>.
   
 
<figure id="fig:ssQ9X0E6">
 
<figure id="fig:ssQ9X0E6">
[[Image:TSD Q9X0E6 plot-1.png|300px|thumb|<caption>'''asdasd'''</caption> <br> asd.]]
+
[[Image:TSD Q9X0E6 plot-1.png|300px|thumb|<font size="1"><div align="justify">'''Figure 4:''' Comparison of 2D Structure annotation and prediction for Q9X0E6.</div></font>]]
 
</figure>
 
</figure>
 
==== Q08209 ====
 
==== Q08209 ====
Line 42: Line 45:
   
 
<figure id="fig:ssQ08209">
 
<figure id="fig:ssQ08209">
[[Image:TSD Q08209 plot-1.png|300px|thumb|<caption>'''asdasd'''</caption> <br> asd.]]
+
[[Image:TSD Q08209 plot-1.png|300px|thumb|<font size="1"><div align="justify">'''Figure 5:''' Comparison of 2D Structure annotation and prediction for Q08209.</div></font>]]
 
</figure>
 
</figure>
   
 
==== Conclusion ====
 
==== Conclusion ====
   
In conclusion it can be seen that a very general agreement with the annotation by DSSP can always be achieved at least by PSIPred. Mostly this predictions methods errors remain low and could be considered minor. Reprof shows much more detrimental errors, namely mistaking alpha-helices for beta-sheets or elongating and shifting predicted alpha-helices. Most of these errors can be shown to be unacceptable in terms of the known function of the particular structural element. While the last entry hinted at the fact that on some proteins both methods have almost equally large problems, PSIPred was much more convincing overall.
+
In conclusion, it can be seen that a general agreement with the annotation by DSSP can always be achieved, at least by PSIPred. For the most part this prediction method's errors remain small and could be considered minor.<br>
  +
Reprof shows much more detrimental errors, namely mistaking alpha-helices for beta-sheets or elongating and shifting predicted alpha-helices which can lead to a failure of predicting beta sheets. Most of these errors can be shown to be severe in terms of the known function of the particular structural element.<br>
  +
While the last entry hinted at the fact that on some proteins both methods have almost equally large problems, PSIPred was much more convincing overall.
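The qualitative comparison above could be complemented by a per-residue agreement score between a prediction and the DSSP annotation, often referred to as Q3. The following is a minimal sketch under the assumption that both strings are already reduced to H/E/C and refer to the same residues; the function name is hypothetical and no such score was computed for this page.

```python
def three_state_agreement(reference, prediction):
    """Return (overall, per_state) agreement of two H/E/C strings.

    overall is the fraction of identical positions; per_state gives,
    for every state present in the reference, the fraction of its
    residues that were predicted with the same state.
    """
    if not reference or len(reference) != len(prediction):
        raise ValueError("both strings must be non-empty and of equal length")
    matches = sum(r == p for r, p in zip(reference, prediction))
    overall = matches / len(reference)
    per_state = {}
    for state in "HEC":
        positions = [i for i, r in enumerate(reference) if r == state]
        if positions:
            hits = sum(prediction[i] == state for i in positions)
            per_state[state] = hits / len(positions)
    return overall, per_state


# Toy strings for illustration only
print(three_state_agreement("HHHHEEEECC", "HHHCEEHHCC"))
```

The per-state values make it visible when a method, as described for reprof, loses beta-strands in favour of helices even though the overall agreement still looks acceptable.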
  +
<br style="clear:both;">
   
 
== Disorder ==
 
== Disorder ==
   
The protein disorder prediction was performed with IUPred. The option long was chosen as prediction type as this is most suitable to find any disordered regions in a protein, which are long enough (>30 aa) to have an impact on protein structure.
+
The protein disorder prediction was performed with IUPred for the same proteins as in [[Sequence-based_predictions_TSD#Secondary_structure| section 1]]. The option "long" was chosen as prediction type, since it is the most suitable for finding disordered regions that are long enough (>30 residues) to have an impact on protein structure.
   
 
<figtable id="tbl:iupred">
 
<figtable id="tbl:iupred">
{| class="wikitable" style="float: right; border: 2px solid darkgray; width:500px;" cellpadding="2"
+
{| class="wikitable" style="float: right; border: 2px solid darkgray; width:500px; margin-left:20px;" cellpadding="2"
 
! scope="row" align="left" |
 
! scope="row" align="left" |
 
| align="right" | [[File:Iupred P06865.png|thumb|200px]]
 
| align="right" | [[File:Iupred P06865.png|thumb|200px]]
Line 63: Line 69:
 
| align="right" | [[File:Iupred Q08209.png|thumb|200px]]
 
| align="right" | [[File:Iupred Q08209.png|thumb|200px]]
 
|-
 
|-
  +
|+ style="caption-side: bottom; text-align: left" |<font size="1"><div align="justify">'''Table 1:''' Protein disorder predictions for P06865, P10775, Q9X0E6 and Q08209 (top left to bottom right).</div></font>
! scope="row" align="left" |
 
| align="left" colspan="2" |'''Table ''': Protein disorder predictions.
 
|-
 
 
|}
 
|}
 
</figtable>
 
</figtable>
 
In the plots in <xr id="tbl:iupred"/> the calculated disorder tendency is displayed for every residue.
 
In the plots in <xr id="tbl:iupred"/> the calculated disorder tendency is displayed for every residue.
All predictions for the proteins named above express a fluctuating tendency which is overall lower than 0.5.<br>
+
All predictions for the proteins express a fluctuating tendency which is overall lower than 0.5.<br>
The prediction of CutA (<xr id="tbl:iupred"/>, bottom left) has the lowest and least fluctuating curve. Ribonuclease inhibitor (<xr id="tbl:iupred"/>, top right) and HexA (<xr id="tbl:iupred"/>, top left) have very similar profiles. None of these 3 proteins can be assigned a disordered region in accordance with the prediction. This complies with the known resolution of the structures of these proteins.
+
The prediction of CutA (<xr id="tbl:iupred"/>, bottom left) has the lowest and least fluctuating curve. Ribonuclease inhibitor (<xr id="tbl:iupred"/>, top right) and Hex A alpha subunit (<xr id="tbl:iupred"/>, top left) have very similar profiles. According to the prediction, none of these three proteins contains a disordered region. This agrees with the known structures of these proteins: for the ribonuclease inhibitor, the leucine-rich repeats leave very little room for disorder, and CutA is very small, with its only coiled regions apparently just connecting the ordered regions directly (''cf.'' <xr id="fig:1kr4"/>).
   
  +
<figure id="fig:1kr4">
The only protein breaking the 0.5 mark is CAM-PRP catalytic subunit (<xr id="tbl:iupred"/>, bottom right) that shows signs of disorder at the beginning of the sequence and at the end. The first region is about 10 residues long and the latter disorder prediction begins roughly at residue 425 and spans 100 amino acids.
 
  +
[[Image:TSD 1kr4.png|300px|thumb|<font size="1"><div align="justify">'''Figure 6: CutA 3D structure''' <br>3D structure of the cation tolerance protein CutA, according to PDB entry 1kr4. Highlights are yellow for beta-sheets, red for alpha-helices and green for coiled regions.</div></font>]]
  +
</figure>
  +
  +
The only protein exceeding the 0.5 cutoff is the CAM-PRP catalytic subunit (<xr id="tbl:iupred"/>, bottom right), which shows signs of disorder at the beginning of the sequence and towards the end. The first region is about 10 residues long and the latter begins roughly at residue 425 and spans 100 amino acids.
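Reading disordered regions off a disorder profile amounts to finding stretches of residues whose score exceeds the cutoff. A minimal sketch is given below; it assumes the per-residue scores are already available as a list of floats (parsing of the IUPred output is not shown), and the function name and the optional minimum length are hypothetical.

```python
def disordered_regions(scores, cutoff=0.5, min_length=1):
    """Return 1-based (start, end) stretches with score > cutoff.

    Stretches shorter than min_length residues are discarded.
    """
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score > cutoff:
            if start is None:
                start = position  # a new stretch begins here
        elif start is not None:
            if position - start >= min_length:
                regions.append((start, position - 1))
            start = None
    # close a stretch that runs to the end of the sequence
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


# Toy scores for illustration only
print(disordered_regions([0.6, 0.7, 0.2, 0.1, 0.8, 0.9, 0.95]))
```

Raising <code>min_length</code> restricts the output to longer stretches, which matches the intent of the long prediction type.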
 
This prediction can be validated by the
 
This prediction can be validated by the
[http://www.disprot.org/protein.php?id=DP00092 Disprot annotations of CAM-PRP]. The assigned disorder regions are 1 - 13, 390 - 414 (CaM-binding domain), 374 - 468, 469 - 486 (Autoinhibitory region), 487 - 521. All were detected by X-ray crystallography.<br>
+
[http://www.disprot.org/protein.php?id=DP00092 annotations of CAM-PRP] in Disprot<ref name="disprot">Sickmeier,M. et al. (2007) DisProt: the Database of Disordered Proteins. Nucleic acids research, 35, D786-93.</ref>. The assigned disordered regions are 1 - 13, 390 - 414 (CaM-binding domain), 374 - 468, 469 - 486 (Autoinhibitory region) and 487 - 521. All were detected by X-ray crystallography.
The first disordered region was well detected by IUPred. The region starting at position 374 is represented by a peak in the prediction which might not have been recognized as real disorder. Further on the IUPred results hint at the distinction between the disordered region till position 486 and the disorder starting at 487 as the curve expresses a steep rise in this region.<br>
 
In conclusion IUPred supplies a very accurate and reliable prediction for the given protein set.
 
   
  +
The first disordered region was well detected by IUPred. The region starting at position 374 is hinted at by a peak in the prediction; however, the signal is too strong to assume a disordered region of the size annotated in Disprot. Furthermore, the IUPred results hint at the distinction between the disordered region up to position 486 and the disorder starting at 487, as the curve rises steeply in this region.
<br style="clear:both;">
 
   
  +
In conclusion, IUPred supplies a very accurate and reliable prediction for the given protein set.
<!-- not surprising given, that esp. P10775 hardly leaves room for disorder -->
 
  +
  +
<br style="clear:both;">
   
 
== Transmembrane helices ==
 
== Transmembrane helices ==
Proteins: Dopamine D3 receptor [http://www.uniprot.org/uniprot/P35462 P35462 ], KvAP [http://www.uniprot.org/uniprot/Q9YDF8 Q9YDF8 ], AQP-4 [http://www.uniprot.org/uniprot/P47863 P47863]
+
Proteins: Dopamine D3 receptor [http://www.uniprot.org/uniprot/P35462 P35462 ], Voltage-gated potassium channel (KvAP) [http://www.uniprot.org/uniprot/Q9YDF8 Q9YDF8 ], Aquaporin-4 (AQP-4) [http://www.uniprot.org/uniprot/P47863 P47863]
 
<br>
 
<br>
 
Dopamine D3 receptor, KvAP and AQP-4 are multi-pass membrane proteins.
 
Dopamine D3 receptor, KvAP and AQP-4 are multi-pass membrane proteins.
  +
<figtable id="tab:gopetgo">
 
  +
<!-- <figtable id="tab:gopetgo">
 
{| class="wikitable", style="width:750px; border-collapse: collapse; border-style: solid; border-width:0px; border-color: #000"
 
{| class="wikitable", style="width:750px; border-collapse: collapse; border-style: solid; border-width:0px; border-color: #000"
 
|-
 
|-
Line 120: Line 129:
 
'''Table TODO''': Assigned transmembrane regions in Uniprot
 
'''Table TODO''': Assigned transmembrane regions in Uniprot
 
</figtable>
 
</figtable>
  +
-->
  +
<!-- check the stuff above -->
  +
  +
=== Differing sequences and mapping of gold standard ===
  +
Since OPM and PDBTM rely on 3D structures to provide a TMH annotation, every Uniprot entry that is to be evaluated needs to be assigned a PDB entry. The mapping chosen is 3pbl for P35462, 1orq for Q9YDF8 and 2d57 for P47863, as well as 2gjx for P06865. As for the [[Sequence-based_predictions_TSD#DSSP_and_handling_of_differing_sequences|secondary structure prediction]] a problem lies in the mapping needed between the PDB ATOM record and a Uniprot sequence. For this task an automated approach was used: The PDB offers mmCIF files which contain a per residue level mapping of the ATOM record and the SEQRES sequence. In addition SIFTS <ref name="SIFTS">Velankar,S. et al. (2005) E-MSD: an integrated data resource for bioinformatics. Nucleic acids research, 33, D262-5.</ref> provides a residue level mapping between a PDB SEQRES sequence and Uniprot sequences. This allows a transfer of the annotation in OPM and PDBTM onto the Uniprot sequences used.
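The two-step transfer described above can be pictured as the composition of two residue-level mappings. The sketch below assumes that both mappings have already been extracted into plain dictionaries of 1-based positions; the data structures and function names are hypothetical, and reading the actual mmCIF or SIFTS files is not shown.

```python
def compose_mappings(atom_to_seqres, seqres_to_uniprot):
    """Chain ATOM -> SEQRES and SEQRES -> Uniprot into ATOM -> Uniprot.

    Residues without a Uniprot counterpart (for example tags, or parts
    of a chimera mapped to another entry) are dropped.
    """
    return {
        atom_pos: seqres_to_uniprot[seqres_pos]
        for atom_pos, seqres_pos in atom_to_seqres.items()
        if seqres_pos in seqres_to_uniprot
    }


def transfer_annotation(annotation, atom_to_uniprot, uniprot_length, missing="-"):
    """Project a per-residue annotation of the structure onto Uniprot."""
    transferred = [missing] * uniprot_length
    for atom_pos, label in annotation.items():
        uniprot_pos = atom_to_uniprot.get(atom_pos)
        if uniprot_pos is not None and 1 <= uniprot_pos <= uniprot_length:
            transferred[uniprot_pos - 1] = label
    return "".join(transferred)


# Toy mappings for illustration only
atom_to_uniprot = compose_mappings({1: 3, 2: 4, 3: 5}, {3: 10, 4: 11})
print(transfer_annotation({1: "H", 2: "H", 3: "L"}, atom_to_uniprot, 12))
```

Positions that cannot be mapped keep the placeholder symbol, so gaps in the transferred annotation remain visible instead of being silently counted as non-membrane residues.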
  +
  +
=== Comparison of predictions and annotation ===
  +
Note that in the following PDBTM residue types 'H' and 'L' were considered as transmembrane helices. The figures were created using code from T. Nugent <ref name="drawtmh">http://www.cs.ucl.ac.uk/staff/T.Nugent/code.html</ref>. It should be noted that from the given data one cannot decide which side is extracellular and which intracellular, the distinction is simply due to the way the module works. The consensus only considers the OPM and PDBTM annotations.
  +
  +
Generally, it can be seen in all of the following comparisons that OPM and PDBTM usually agree on the presence of transmembrane helices, but the exact length and residue-level position of the helices differ. This is to be expected, given that even with a 3D structure at hand the annotation of a helix is not a trivial task. Lipids or solvents used are too small and agile to be part of the resolved structure, and even if they were present, it is up to the tool's author to decide where in the transition region between the membrane interior and the extracellular surface the cut in the 2D annotation should be made. Owing to this problem, the scores that have long been used to assess the performance of transmembrane helix prediction do not penalize a prediction that is not 100% aligned with the 3D annotation, but qualitatively count a helix as correctly predicted if there are at least three overlapping residues and no other helix shares an overlap <ref name="chen2002">Chen,C.P. et al. (2002) Transmembrane helix predictions revisited. Protein Science, 11, 2774–2791.</ref>.
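The qualitative criterion can be sketched in a few lines. Helices are given as inclusive (start, end) residue ranges; the reading of "no other helix shares an overlap" as a one-to-one pairing between annotated and predicted helices is an interpretation made for this sketch, and the function names are hypothetical.

```python
def overlap(helix_a, helix_b):
    """Number of residues shared by two inclusive (start, end) ranges."""
    return max(0, min(helix_a[1], helix_b[1]) - max(helix_a[0], helix_b[0]) + 1)


def correctly_predicted_helices(observed, predicted, min_overlap=3):
    """Return the observed helices that count as correctly predicted.

    An observed helix counts if exactly one predicted helix overlaps it
    by at least min_overlap residues and that predicted helix does not
    overlap any other observed helix to the same extent.
    """
    correct = []
    for obs in observed:
        partners = [p for p in predicted if overlap(obs, p) >= min_overlap]
        if len(partners) != 1:
            continue
        rivals = [o for o in observed if overlap(o, partners[0]) >= min_overlap]
        if len(rivals) == 1:
            correct.append(obs)
    return correct


# Toy ranges for illustration only
observed = [(10, 30), (40, 60), (70, 90)]
predicted = [(12, 33), (38, 92)]
print(correctly_predicted_helices(observed, predicted))
```

In the toy example the second predicted helix spans two annotated helices and is therefore not counted for either of them, whereas the slightly shifted first helix is accepted.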
  +
  +
==== P35462 ====
  +
<figure id="fig:tmhP35462">
  +
[[Image:TSD P35462.png|300px|thumb|<font size="1"><div align="justify">'''Figure 7: Comparison of OPM/PDBTM annotation and PolyPhobius prediction of transmembrane helices for P35462.''' <br>Consensus is built only on OPM and PDBTM annotations. Cytoplasmic and Extracellular cannot be distinguished and are present due to limitations during plotting. Kyte-Doolittle shows a hydrophobicity plot based on the homonymous hydrophobicity scale.</div></font>]]
  +
</figure>
  +
  +
<xr id="fig:tmhP35462"/> shows the comparison between OPM and PDBTM annotation and PolyPhobius prediction for the dopamine receptor. There is high agreement between OPM and PDBTM, and by the above-mentioned scoring system PolyPhobius correctly identifies all transmembrane helices. The figure suggests that two additional helices towards the end of the sequence are overpredicted; however, the hydrophobicity plot already hints that this might not be an error of PolyPhobius. Indeed, manually checking the entries in OPM and PDBTM reveals that these two helices do exist and that PolyPhobius correctly predicted them. They are not displayed because 3pbl is annotated as a chimera in SIFTS. The mapping to P35462 only extends up to residue 230 and then switches to P00720.
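A hydrophobicity profile of the kind shown in the figures can be obtained by averaging the Kyte-Doolittle values over a sliding window. The following is a minimal sketch; the window size of 19 residues is an assumption commonly used for membrane-spanning segments and not necessarily the setting of the plotting code used here, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy scale
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy for every full window along the sequence."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Toy sequence for illustration only
print(hydropathy_profile("LLLRRR", window=3))
```

Stretches of consecutive windows with clearly positive averages are candidates for membrane-spanning segments, which is why such a profile can hint at helices that are missing from a mapped annotation.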
  +
  +
==== Q9YDF8 ====
  +
<figure id="fig:tmhQ9YDF82">
  +
[[Image:TSD Q9YDF82.png|300px|thumb|<font size="1"><div align="justify">'''Figure 8: Comparison of OPM/PDBTM annotation and PolyPhobius prediction of transmembrane helices for Q9YDF8.''' <br>Consensus is built only on OPM and PDBTM annotations. Cytoplasmic and Extracellular cannot be distinguished and are present due to limitations during plotting. Kyte-Doolittle shows a hydrophobicity plot based on the homonymous hydrophobicity scale.</div></font>]]
  +
</figure>
  +
  +
  +
<xr id="fig:tmhQ9YDF82"/> shows the comparison between OPM and PDBTM annotation and PolyPhobius prediction for the potassium channel. While all three methods agree on three C-terminal transmembrane helices, there are two N-terminal ones that are predicted by PolyPhobius and present in PDBTM, but not annotated by OPM. Checking the OPM entry for 1orq reveals that three N-terminal helices were explicitly excluded during creation of the database due to possible misalignments. The reference given for this decision in OPM <ref name="voltgate">Mackinnon,R. (2004) Structural biology. Voltage sensor meets lipid membrane. Science, 306, 1304-5.</ref> discusses the different theories of how the sensor works and how the helices are arranged in the open and closed conformations. Indeed, the literature agrees that there are six transmembrane helices in total in each monomer. Manual inspection of a more recent structure, 2r9r<ref name="">Long,S.B. et al. (2007) Atomic structure of a voltage-dependent K+ channel in a lipid membrane-like environment. Nature, 450, 376-82.</ref>, supports this finding and also reveals that there is an additional re-entrant helix that does not cross the membrane and is oriented towards the inside of the tetramer.
  +
This is also recognized in the annotation by OPM and PDBTM (which explicitly mentions the re-entrant helix by residue type 'L'), both annotating a total of seven helices.<br>
  +
This suggests that the prediction performed by PolyPhobius could in fact be correct, and a logical next step would be to assess the prediction performance using 2r9r rather than 1orq as a reference structure. However, a simple pairwise sequence alignment between 1orq:C and 2r9r:H reveals that the sequences are fairly dissimilar (alignment length 201, similarity 49.8%), which is also the reason why there is no mapping available in SIFTS. This is because 1orq describes the first voltage-gated potassium channel found in ''Aeropyrum pernix''<ref name="1orq">Jiang,Y. et al. (2003) X-ray structure of a voltage-dependent K+ channel. Nature, 423, 33-41.</ref>, while 2r9r is a chimera of Kv1.2 and Kv2.1, both from mammals. In conclusion, a satisfying assessment would require more time than is available for now, but it should definitely be noted that PolyPhobius' performance might not be as bad as it seems at first glance.
  +
  +
==== P47863 ====
  +
  +
<figure id="fig:tmhP47863">
  +
[[Image:TSD P47863.png|300px|thumb|<font size="1"><div align="justify">'''Figure 9: Comparison of OPM/PDBTM annotation and PolyPhobius prediction of transmembrane helices for P47863.''' <br>Consensus is built only on OPM and PDBTM annotations. Cytoplasmic and Extracellular cannot be distinguished and are present due to limitations during plotting. Kyte-Doolittle shows a hydrophobicity plot based on the homonymous hydrophobicity scale.</div></font>]]
  +
</figure>
  +
  +
Finally, <xr id="fig:tmhP47863"/> shows the comparison between OPM and PDBTM annotation and PolyPhobius prediction for aquaporin. Aquaporin is one of the archetypal structures for re-entrant helices, where two opposing ones form part of the central pore's surface <ref name="reentrant">Viklund,H. et al. (2006) Structural classification and prediction of reentrant regions in alpha-helical transmembrane proteins: application to complete genomes. Journal of molecular biology, 361, 591-603.</ref>. Both are annotated in OPM and PDBTM; however, PolyPhobius misses them in the otherwise correct prediction. This could be because re-entrant helices are a comparatively recent discovery. While the re-entrant helix in the potassium channel was almost as long as a transmembrane helix, the two in aquaporin are much shorter, making it hard for a method like PolyPhobius, which was not created with these regions in mind, to identify them. Indeed, applying the newer MEMSAT-SVM <ref name="msvm">Nugent,T. and Jones,D.T. (2009) Transmembrane protein topology prediction using support vector machines. BMC bioinformatics, 10, 159.</ref>, which was specifically trained for re-entrant helices, shows that this method can identify all helices correctly.
  +
  +
==== P06865 ====
  +
PolyPhobius did not predict any transmembrane helices in the human Hex A alpha subunit, which is correct. As can be seen in <xr id="fig:hexapp"/>, there were no ambiguities and the soluble nature of the protein was clearly identified. PolyPhobius does, however, find a signal peptide, which will be further discussed in the next [[Sequence-based_predictions_TSD#SignalP|section]].
  +
  +
<figure id="fig:hexapp">
  +
[[Image:TSD P06865 POLYP.png|300px|thumb|<font size="1"><div align="justify">'''Figure 10: PolyPhobius posterior probabilities for P06865.'''</div></font>]]
  +
</figure>
  +
  +
==== Conclusion ====
  +
In conclusion, PolyPhobius showed very good performance on the set of four proteins. Residue-level accuracy is not achieved, but since it cannot be achieved even among the 'gold standards' OPM and PDBTM, this is not an issue. A problem, however, is the difficulty of recognizing the recently discovered structural elements of re-entrant helices.
  +
  +
<br style="clear:both;">
   
 
== Signal peptides ==
 
== Signal peptides ==
Proteins: Serum albumin [http://www.uniprot.org/uniprot/P02768 P02768], LAMP-1 [http://www.uniprot.org/uniprot/P11279 P11279], AQP-4 [http://www.uniprot.org/uniprot/P47863 P47863]
+
Proteins: Serum albumin [http://www.uniprot.org/uniprot/P02768 P02768], Lysosome-associated membrane glycoprotein 1 (LAMP-1) [http://www.uniprot.org/uniprot/P11279 P11279], Aquaporin-4 (AQP-4) [http://www.uniprot.org/uniprot/P47863 P47863].<br> According to Uniprot, HEXA, LAMP-1 and Serum albumin contain a signal peptide.
<br>
 
HEXA LAMP-1 and Serum albumin contain a signal peptide.
 
 
LAMP-1 is a membrane protein which passes the membrane with one helix. Serum albumin, the main protein of plasma, is a secreted extracellular protein.
 
LAMP-1 is a membrane protein which passes the membrane with one helix. Serum albumin, the main protein of plasma, is a secreted extracellular protein.
AQP-4 is a multi-pass membrane protein which forms a waterspecific channel and functions in transport.
+
AQP-4 is a multi-pass membrane protein which forms a water-specific channel.
 
===SignalP===
 
===SignalP===
 
The prediction of the displayed results was performed with SignalP version 4.0.
 
The prediction of the displayed results was performed with SignalP version 4.0.
 
<br>
 
<br>
SignalP employs 3 main scores for the prediction of signal peptides, C, S and Y. The S-score stands for the actual signal peptide prediction, with high scores indicating that the corresponding amino acid is part of a signal peptide, and low scores indicating that the amino acid is part of a mature protein.
+
SignalP employs 3 main scores for the prediction of signal peptides: C, S and Y. The S-score stands for the actual signal peptide prediction, with high scores indicating that the corresponding amino acid is part of a signal peptide, and low scores indicating that the amino acid is part of a mature protein.
The C-score is the cleavage score, which indicates the best cleavage cite when significantly high. (When a cleavage site position is referred to by a single number, the number indicates the first residue in the mature protein.)
+
The C-score is the cleavage score, which indicates the best cleavage site when significantly high. (When a cleavage site position is referred to by a single number, the number indicates the first residue in the mature protein.)
Y-max is a derivative of the C-score combined with the S-score calculated to give a better cleavage site prediction than the raw C-score alone.
+
Y-max is a derivative of the C-score combined with the S-score, calculated to give a better cleavage site prediction than the raw C-score alone <ref name="sigpref">
  +
Bendtsen, J. D. ''et al.'' (2004) Improved prediction of signal peptides: SignalP 3.0.
<br>
 
For non-secretory proteins all scores are supposed to be very low.
+
J. Mol. Biol., 340:783-795</ref>. For non-secretory proteins all scores are supposed to be very low.
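The interplay of the three scores can be illustrated with a small sketch. It assumes the description given in the SignalP publications, where the Y-score is the geometric mean of the C-score and a smoothed slope of the S-score. The function name <code>y_scores</code>, the window size <code>d</code> and the toy score vectors are assumptions made for illustration only; they are not SignalP 4.0 output and this is not the SignalP implementation.

```python
import math

def y_scores(c, s, d=3):
    """Combine per-residue C- and S-scores into a Y-score (illustrative sketch).

    Y_i = sqrt(C_i * dS_i), where dS_i is the mean S-score of the d residues
    before position i minus the mean S-score of the d residues starting at i.
    Negative slopes are clipped to 0 so that the square root is defined.
    """
    y = []
    for i in range(len(c)):
        before = s[max(0, i - d):i]
        after = s[i:i + d]
        if not before or not after:
            y.append(0.0)  # no window available at the sequence start
            continue
        slope = sum(before) / len(before) - sum(after) / len(after)
        y.append(math.sqrt(c[i] * max(slope, 0.0)))
    return y

# Toy data (hypothetical): the S-score drops at index 5 and the C-score peaks there.
s = [0.9, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]
c = [0.0, 0.0, 0.0, 0.0, 0.1, 0.8, 0.1, 0.0, 0.0, 0.0]
y = y_scores(c, s)

# Index of the maximal Y-score: the first residue of the mature protein (0-based).
cleavage_index = max(range(len(y)), key=y.__getitem__)
print(cleavage_index, round(y[cleavage_index], 3))
```

In this toy case the Y-score peaks where a high C-score coincides with the steepest drop of the S-score, which is why it tends to locate the cleavage site more reliably than the C-score alone.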
  +
  +
==== Prediction results ====
   
 
<figtable id="tbl:signalp">
 
<figtable id="tbl:signalp">
{| class="wikitable" style="float: right; border: 2px solid darkgray; width:500px;" cellpadding="2"
+
{| class="wikitable" style="float: right; border: 2px solid darkgray; width:500px;margin-left:20px" cellpadding="2"
 
! scope="row" align="left" |
 
! scope="row" align="left" |
 
| align="right" | [[File:Sp_P06865_HEXA_HUMAN.png|thumb|200px]]
 
| align="right" | [[File:Sp_P06865_HEXA_HUMAN.png|thumb|200px]]
Line 144: Line 200:
 
! scope="row" align="left" |
 
! scope="row" align="left" |
 
| align="right" | [[File:Sp_P11279_LAMP1_HUMAN_TSD.png|thumb|200px]]
 
| align="right" | [[File:Sp_P11279_LAMP1_HUMAN_TSD.png|thumb|200px]]
| align="right" | [[File:Sp_P02768_ALBU_HUMAN_TSD.png|thumb|200px]]
+
| align="right" | [[File:Sp_P02768_ALBU_HUMAN_TSD.png|thumb|200px]]
|-
 
! scope="row" align="left" |
 
| align="left" colspan="2" |'''Table ''': Signal peptide predictions.
 
 
|-
 
|-
  +
|+ style="caption-side: bottom; text-align: left" |<font size="1"><div align="justify">'''Table 2:''' Signal peptide predictions by SignalPv4 for P06865, P47863, P11279 and P02768 (top left to bottom right).</div></font>
 
|}
 
|}
 
</figtable>
 
</figtable>
   
The <xr id="tbl:signalp"/> displays the results of the SignalP predictions and <xr id="tab:signals"/> gives a comparison of the predicted signal peptide positions and the validation from the [http://www.signalpeptide.de/ Signal Peptide Website]. The additional scores can be viewed [[SignalP scores TSD | here]].
+
<xr id="tbl:signalp"/> shows the results of the SignalP predictions and <xr id="tab:signals"/> gives a comparison of the predicted signal peptide positions and the validation from the Signal Peptide Website <ref name="signalpwebsite">http://www.signalpeptide.de</ref>. Additional scores can be viewed [[SignalP scores TSD | here]].
   
   
Line 178: Line 232:
 
| style="border-style: solid; border-width: 0 0 0 0" | -
 
| style="border-style: solid; border-width: 0 0 0 0" | -
 
|-
 
|-
  +
|+ <font size="1"><div align="justify">'''Table 3''': Comparison of signal peptide prediction and the assignment from the Signal Peptide Website.</div></font>
 
|}
 
|}
'''Table TODO''': Comparison of signal peptide prediction and the assignment from the Signal Peptide Website.
 
 
</figtable>
 
</figtable>
   
   
HEXA, LAMP-1 and Serum albumin are correctly predicted one signal peptide at the beginning of the sequence and AQP-4 is identified as a mature protein. Even the exact positions of the peptides are predicted accurately and thus the performance of SignalP turns out exceptionally satisfactory.
+
HEXA, LAMP-1 and Serum albumin are correctly predicted as having one signal peptide at the beginning of the sequence, and AQP-4 is correctly identified as having none. Even the exact positions of the peptides are predicted accurately; the performance of SignalP on this set is therefore very satisfactory.
 
 
   
 
<br style="clear:both;">
 
<br style="clear:both;">
   
 
===Cross check with PolyPhobius===
 
===Cross check with PolyPhobius===
  +
PolyPhobius was specifically built to account for false predictions of signal peptides as transmembrane helices and should therefore be able to distinguish them. Indeed, PolyPhobius perfectly predicts the signal peptides in Serum Albumin and LAMP-1 (where it also finds the single transmembrane helix at the C-terminus). For the Hex A alpha subunit, only a signal peptide is predicted; however, the cleavage region already ends at position 19 instead of 22. For aquaporin only the transmembrane helices are predicted and no signal peptide is found.
  +
  +
In conclusion PolyPhobius performs very well. While the prediction of signal peptides is not on par with SignalPv4, this is not what PolyPhobius was designed to do. The actual discrimination between signal peptide and transmembrane helices worked flawlessly for the test proteins.
  +
<br style="clear:both;">
   
 
== GO terms ==
 
== GO terms ==
 
=== GOpet ===
 
=== GOpet ===
<xr id="tab:gopetgo"/> depicts the prediction results for the Hexa protein from GOpet. The predictions are all given with a very high confidence.
+
<xr id="tab:gopetgo"/> depicts the prediction results for the Hex A alpha subunit from GOpet. The predictions are all given with a high confidence of at least 61%. To validate the results, [http://www.ebi.ac.uk/QuickGO QuickGO] and [http://amigo.geneontology.org AmiGO] were employed. As the GO annotations from these tools were mostly inferred from electronic annotation (IEA), the predictions were additionally validated manually. <br>
  +
Hexosaminidase A is involved in the hydrolysis of terminal N-acetyl-D-hexosamine residues, where only the alpha subunit is able to hydrolyse GM2 gangliosides. In the presence of the cofactor GM2-activator protein (GM2AP), the alpha subunit of Hex A catalyses the removal of β-D-GalNAc from GM2. Thus the first 5 GO terms are considered correctly assigned. "Hydrolase activity hydrolyzing O-glycosyl compounds" is falsely predicted, but as it is not completely incorrect it should be viewed as only a slight deviation from the exact function.
Of the 35 GO terms which are associated with the HexA GOpet identified 6 correctly and 2 are falsely assigned.
 
  +
Hit number 6, the hydrolysis of N-glycosyl compounds, is a very convincing assignment because it is not only true but also fairly specific.
  +
The prediction with the lowest confidence again proves correct, as it is known that the Hex A alpha subunit forms a heterodimer with the Hex A beta subunit. Altogether, hexosaminidase activity and hydrolase activity, as part of more or less specific descriptions, dominate this GO term prediction. If the protein had been unknown, the GOpet prediction would have revealed many helpful functions which depict the protein accurately and in considerable detail.
  +
   
 
<figtable id="tab:gopetgo">
 
<figtable id="tab:gopetgo">
{| class="wikitable", style="width:700px; border-collapse: collapse; border-style: solid; border-width:0px; border-color: #000"
+
{| class="wikitable", style="width:950px; border-collapse: collapse; border-style: solid; border-width:0px; border-color: #000"
  +
|- align="center"
  +
! style="border-style: solid; border-width: 0 0 1px 0" | GO-Term ID
  +
! style="border-style: solid; border-width: 0 0 1px 0" | Type
  +
! style="border-style: solid; border-width: 0 0 1px 0" |Confidence
  +
! style="border-style: solid; border-width: 0 0 1px 0" |GO-Term description
  +
! style="border-style: solid; border-width: 0 0 1px 0" colspan="3" |Validation
 
|- align="center"
 
|- align="center"
! style="border-style: solid; border-width: 0 0 2px 0" | GO-Term ID
+
! style="border-style: solid; border-width: 0 0 2px 0" |
! style="border-style: solid; border-width: 0 0 2px 0" | Type
+
! style="border-style: solid; border-width: 0 0 2px 0" |
! style="border-style: solid; border-width: 0 0 2px 0" |Confidence
+
! style="border-style: solid; border-width: 0 0 2px 0" |
! style="border-style: solid; border-width: 0 0 2px 0" |GO-Term description
+
! style="border-style: solid; border-width: 0 0 2px 0" |
! style="border-style: solid; border-width: 0 0 2px 0" |Validation
+
! style="border-style: solid; border-width: 0 0 2px 0" |[http://www.ebi.ac.uk/QuickGO/GProtein?ac=P06865 QuickGO]
  +
! style="border-style: solid; border-width: 0 0 2px 0" |[http://amigo.geneontology.org/cgi-bin/amigo/gp-assoc.cgi?gp=UniProtKB:P06865&session_id=7846amigo1337613688 AmiGO]
  +
! style="border-style: solid; border-width: 0 0 2px 0" |Manual assessment
 
|- align="center"
 
|- align="center"
 
| style="border-style: solid; border-width: 0 0 0 0" | GO:0003824
 
| style="border-style: solid; border-width: 0 0 0 0" | GO:0003824
Line 209: Line 276:
 
| style="border-style: solid; border-width: 0 0 0 0" | 97%
 
| style="border-style: solid; border-width: 0 0 0 0" | 97%
 
| style="border-style: solid; border-width: 0 0 0 0" | catalytic activity
 
| style="border-style: solid; border-width: 0 0 0 0" | catalytic activity
  +
| style="border-style: solid; border-width: 0 0 0 0" | true
  +
| style="border-style: solid; border-width: 0 0 0 0" | false
 
| style="border-style: solid; border-width: 0 0 0 0" | true
 
| style="border-style: solid; border-width: 0 0 0 0" | true
 
|-align="center"
 
|-align="center"
Line 215: Line 284:
 
| style="border-style: solid; border-width: 0 0 0 0" | 96%
 
| style="border-style: solid; border-width: 0 0 0 0" | 96%
 
| style="border-style: solid; border-width: 0 0 0 0" | beta-N-acetylhexosaminidase activity
 
| style="border-style: solid; border-width: 0 0 0 0" | beta-N-acetylhexosaminidase activity
  +
| style="border-style: solid; border-width: 0 0 0 0" | true
  +
| style="border-style: solid; border-width: 0 0 0 0" | true
 
| style="border-style: solid; border-width: 0 0 0 0" | true
 
| style="border-style: solid; border-width: 0 0 0 0" | true
 
|-align="center"
 
|-align="center"
Line 222: Line 293:
 
| style="border-style: solid; border-width: 0 0 0 0" | hexosaminidase activity
 
| style="border-style: solid; border-width: 0 0 0 0" | hexosaminidase activity
 
| style="border-style: solid; border-width: 0 0 0 0" | false
 
| style="border-style: solid; border-width: 0 0 0 0" | false
  +
| style="border-style: solid; border-width: 0 0 0 0" | false
  +
| style="border-style: solid; border-width: 0 0 0 0" | true
 
|-align="center"
 
|-align="center"
 
| style="border-style: solid; border-width: 0 0 0 0" | GO:0016787
 
| style="border-style: solid; border-width: 0 0 0 0" | GO:0016787
Line 227: Line 300:
 
| style="border-style: solid; border-width: 0 0 0 0" | 96%
 
| style="border-style: solid; border-width: 0 0 0 0" | 96%
 
| style="border-style: solid; border-width: 0 0 0 0" | hydrolase activity
 
| style="border-style: solid; border-width: 0 0 0 0" | hydrolase activity
  +
| style="border-style: solid; border-width: 0 0 0 0" | true
  +
| style="border-style: solid; border-width: 0 0 0 0" | false
 
| style="border-style: solid; border-width: 0 0 0 0" | true
 
| style="border-style: solid; border-width: 0 0 0 0" | true
 
|-align="center"
 
|-align="center"
Line 233: Line 308:
 
| style="border-style: solid; border-width: 0 0 0 0" | 96%
 
| style="border-style: solid; border-width: 0 0 0 0" | 96%
 
| style="border-style: solid; border-width: 0 0 0 0" | hydrolase activity acting on glycosyl bonds
 
| style="border-style: solid; border-width: 0 0 0 0" | hydrolase activity acting on glycosyl bonds
  +
| style="border-style: solid; border-width: 0 0 0 0" | true
  +
| style="border-style: solid; border-width: 0 0 0 0" | false
 
| style="border-style: solid; border-width: 0 0 0 0" | true
 
| style="border-style: solid; border-width: 0 0 0 0" | true
 
|-align="center"
 
|-align="center"
Line 240: Line 317:
 
| style="border-style: solid; border-width: 0 0 0 0" | hydrolase activity hydrolyzing O-glycosyl compounds
 
| style="border-style: solid; border-width: 0 0 0 0" | hydrolase activity hydrolyzing O-glycosyl compounds
 
| style="border-style: solid; border-width: 0 0 0 0" | true
 
| style="border-style: solid; border-width: 0 0 0 0" | true
  +
| style="border-style: solid; border-width: 0 0 0 0" | false
  +
| style="border-style: solid; border-width: 0 0 0 0" | false
 
|-align="center"
 
|-align="center"
 
| style="border-style: solid; border-width: 0 0 0 0" | GO:0016799
 
| style="border-style: solid; border-width: 0 0 0 0" | GO:0016799
Line 246: Line 325:
 
| style="border-style: solid; border-width: 0 0 0 0" | hydrolase activity hydrolyzing N-glycosyl compounds
 
| style="border-style: solid; border-width: 0 0 0 0" | hydrolase activity hydrolyzing N-glycosyl compounds
 
| style="border-style: solid; border-width: 0 0 0 0" | false
 
| style="border-style: solid; border-width: 0 0 0 0" | false
  +
| style="border-style: solid; border-width: 0 0 0 0" | false
  +
| style="border-style: solid; border-width: 0 0 0 0" | true
 
|-align="center"
 
|-align="center"
 
| style="border-style: solid; border-width: 0 0 0 0" | GO:0046982
 
| style="border-style: solid; border-width: 0 0 0 0" | GO:0046982
Line 251: Line 332:
 
| style="border-style: solid; border-width: 0 0 0 0" | 61%
 
| style="border-style: solid; border-width: 0 0 0 0" | 61%
 
| style="border-style: solid; border-width: 0 0 0 0" | protein heterodimerization activity
 
| style="border-style: solid; border-width: 0 0 0 0" | protein heterodimerization activity
  +
| style="border-style: solid; border-width: 0 0 0 0" | true
  +
| style="border-style: solid; border-width: 0 0 0 0" | false
 
| style="border-style: solid; border-width: 0 0 0 0" | true
 
| style="border-style: solid; border-width: 0 0 0 0" | true
 
|-
 
|-
  +
|+ <font size="1"><div align="justify">'''Table 4''': GO term prediction from GOpet.</div></font>
 
|}
 
|}
'''Table TODO''': GO term prediction from GOpet.
 
 
</figtable>
 
</figtable>
   
===ProtFun2.0===
+
===ProtFun2.2===
ProtFun2.0 employes various tools for the protein function prediction. A large number of feature prediction servers are queried such as SignalP to obtain information, which are integrated into final predictions of the cellular role, enzyme class, and selected Gene Ontology categories of the submitted sequence.<br>
+
ProtFun2.2 employs various tools for the ab initio protein function prediction. A large number of feature prediction servers such as SignalP are queried to obtain information about the submitted protein sequence. These are integrated into final predictions of the cellular role, enzyme class, and selected Gene Ontology categories. The classifications are based on two scores: The first score (influenced by the prior probability of that class) is the estimated probability that the entry belongs to the class in question. The second number (independent of the prior probability) is the odds that the sequence belongs to that class. The class with the highest information gain is chosen as prediction with the exception that ProtFun refrains from marking GO categories if the score with the highest information content has odds lower than 1 <ref name="protfun"> http://www.cbs.dtu.dk/services/ProtFun-2.2/output.php</ref>. <br>
  +
In the following, the ProtFun2.2 predictions for the hexosaminidase alpha-subunit are analysed. As supplementary information, a detailed depiction of the [[ProtFun2.2 output ]] is available.<br>
The Gene Ontology categories are displayed in <xr id="tab:gopetgo"/>. There is no single prediction above 10%, thus the HexA is not attributed to any of these GO categories. With a closer examination of these classes it becomes clear that neither of them matches the function of our protein. <br>
 
Further on the HexA protein is predicted to be a enzyme more specifically Ligase (EC 6.-.-.-). <br>
 
"Cell_envelope" is chosen as the functional category with a probability of over 80%. This prediction seems to be the most accurate although it is not very apparent where this classification comes from and how it can be validated or further employed. The GO category prediction can be neglected for the HexA protein.
 
 
   
 
<figtable id="tab:protfun">
 
<figtable id="tab:protfun">
{| class="wikitable", style="width:300px; border-collapse: collapse; border-style: solid; border-width:0px; border-color: #000"
+
{| class="wikitable" style="float: right; border: 2px solid darkgray; width:500px;margin-left:20px" cellpadding="2"
|- align="left"
+
! scope="row" align="left" |
  +
| align="right" | [[File:ProtfunGO.png|thumb|200px]]
! style="border-style: solid; border-width: 0 0 2px 0" |Gene Ontology category
 
  +
| align="right" | [[File:ProtfunFuncCat.png|thumb|200px]]
! style="border-style: solid; border-width: 0 0 2px 0" | Probability
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" | Signal_transducer
 
| style="border-style: solid; border-width: 0 0 0 0" |8.3%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" |Receptor
 
| style="border-style: solid; border-width: 0 0 0 0" |10.5%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" | Hormone
 
| style="border-style: solid; border-width: 0 0 0 0" | 0.1%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" | Structural_protein
 
| style="border-style: solid; border-width: 0 0 0 0" | 1.0%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" |Transporter
 
| style="border-style: solid; border-width: 0 0 0 0" |2.4%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" |Ion_channel
 
| style="border-style: solid; border-width: 0 0 0 0" |1.8%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" |Voltage-gated_ion_channel
 
| style="border-style: solid; border-width: 0 0 0 0" |0.2%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" |Cation_channel
 
| style="border-style: solid; border-width: 0 0 0 0" |1.0%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" |Transcription
 
| style="border-style: solid; border-width: 0 0 0 0" |5.8%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" |Transcription_regulation
 
| style="border-style: solid; border-width: 0 0 0 0" |2.6%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" |Stress_response
 
| style="border-style: solid; border-width: 0 0 0 0" |4.4%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" |Immune_response
 
| style="border-style: solid; border-width: 0 0 0 0" |1.4%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" |Growth_factor
 
| style="border-style: solid; border-width: 0 0 0 0" |0.5%
 
|-
 
| style="border-style: solid; border-width: 0 0 0 0" |Metal_ion_transport
 
| style="border-style: solid; border-width: 0 0 0 0" |0.9%
 
 
|-
 
|-
  +
|+ style="caption-side: bottom; text-align: left" |<font size="1"><div align="justify">'''Table 5''': ProtFun2.2 predictions.</div></font>
 
|}
 
|}
'''Table TODO''': GO term prediction from from ProtFun2.0.
 
 
</figtable>
 
</figtable>
  +
  +
The Gene Ontology categories are displayed in <xr id="tab:protfun"/> (left). For the GO categories there is no single prediction above 10% and no entry receives an odds ratio greater than 1, thus Hex A is not attributed to any of the categories. A closer examination of these classes makes clear that none of them matches the function of our protein.
  +
  +
Furthermore, a prediction of the cellular role is provided, which classifies the protein into 12 different functional categories based on a scheme developed by Monica Riley for E. coli in 1993 <ref name="protfunn">Riley, M. (1993) Functions of the gene products of Escherichia coli: Microbiol Rev. </ref>.
  +
Here the Hex A alpha subunit is assigned to "Cell_envelope" with a probability of over 80% (see <xr id="tab:protfun"/>, right). This prediction seems to be the most accurate.<br>
  +
In addition, the Hex A subunit is classified as an enzyme, more specifically a ligase (EC 6.-.-.-), with a probability of 8.5%. This assignment is surprising, as another enzyme class (EC 3.-.-.-) receives a much higher probability of 32.9% but is disregarded due to its lower odds ratio. Here the selection according to the highest information content has clearly failed, as the hexosaminidase indeed belongs to the hydrolases, enzyme class 3. The exact EC classification is [http://enzyme.expasy.org/EC/3.2.1.52 3.2.1.52].
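How a class with a lower probability can still win on odds can be shown with a small sketch. It assumes that "odds independent of the prior" means the posterior odds divided by the prior odds; whether ProtFun computes its second score exactly this way is not stated here. The two probabilities are taken from the text above, while the prior values are invented purely for illustration and are not ProtFun's actual priors.

```python
def odds_ratio(posterior, prior):
    """Posterior odds divided by prior odds (assumed definition, see lead-in)."""
    return (posterior / (1 - posterior)) / (prior / (1 - prior))

# Probabilities from the text; the priors are HYPOTHETICAL illustration values.
classes = {
    "Hydrolase (EC 3.-.-.-)": {"prob": 0.329, "prior": 0.40},
    "Ligase (EC 6.-.-.-)": {"prob": 0.085, "prior": 0.05},
}

odds = {name: odds_ratio(v["prob"], v["prior"]) for name, v in classes.items()}

# Ranking by raw probability versus ranking by prior-corrected odds.
by_prob = max(classes, key=lambda name: classes[name]["prob"])
by_odds = max(odds, key=odds.get)

for name in classes:
    print(f"{name}: prob={classes[name]['prob']:.3f}, odds={odds[name]:.2f}")
print("highest probability:", by_prob)
print("highest odds:", by_odds)
```

With a frequent class such as the hydrolases, a probability of 32.9% can lie below the prior expectation, whereas 8.5% for a rare class lies above it. Under these assumed priors, ranking by odds therefore selects the ligase although the hydrolase is the more probable class.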
  +
  +
The classification of the cellular role could serve as a hint in the right direction of the Hex A subunit function. In addition, the GO classification is not actually false; the categories apparently just do not cover the whole range of protein functions and therefore the Hex A alpha subunit cannot be assigned.
  +
Apart from that, the GO category and the enzyme number predictions can be disregarded for Hex A and therefore the performance of ProtFun2.2 can be assessed as rather unsatisfying in this particular case.
  +
  +
<br style="clear:both;">
   
 
==Pfam==
 
==Pfam==
Line 320: Line 369:
   
 
<figure id="fig:pfam">
 
<figure id="fig:pfam">
[[Image:PfamHexa.png|300px|thumb|<caption>''' HEXA Pfam domains.</caption>]]
+
[[Image:PfamHexa.png|300px|thumb|<font size="1"><div align="justify">'''Figure 11:''' HEXA Pfam domains.</div></font>]]
 
</figure>
 
</figure>
   
 
<figure id="fig:pfam2gjx">
 
<figure id="fig:pfam2gjx">
[[Image:TSD 2gjx a pfam.png|300px|thumb|<caption>''' Visualization of Pfam domains.'''</caption> <br> The Glyco hydro 20b domain is coloured green, the Glyco hydro 20 catalytic domain red and the predicted active site residue is shown as sticks and highlighted in purple. Regions that are not mapped to a Pfam domain are grey.]]
+
[[Image:TSD 2gjx a pfam.png|300px|thumb|<font size="1"><div align="justify">'''Figure 12: Visualization of Pfam domains.''' <br> The Glyco hydro 20b domain is coloured green, the Glyco hydro 20 catalytic domain red and the predicted active site residue is shown as sticks and highlighted in purple. Regions that are not mapped to a Pfam domain are grey.</div></font>]]
 
</figure>
 
</figure>
   
 
HEXA is almost completely spanned by these two domains. The Glyco hydro 20b domain (green) reaches from position 35 to position 165 and the Glyco hydro 20 domain (red) directly follows up, occupying a region from position 167 to 488. Left unmapped are a mostly coiled region at the beginning of the subunit and a helix followed by another coiled region at the end of the sequence. <xr id="fig:pfam2gjx"/> visualizes the Pfam annotation in the 3D structure of the Hex A alpha subunit.
 
HEXA is almost completely spanned by these two domains. The Glyco hydro 20b domain (green) reaches from position 35 to position 165 and the Glyco hydro 20 domain (red) directly follows up, occupying a region from position 167 to 488. Left unmapped are a mostly coiled region at the beginning of the subunit and a helix followed by another coiled region at the end of the sequence. <xr id="fig:pfam2gjx"/> visualizes the Pfam annotation in the 3D structure of the Hex A alpha subunit.
   
Both domain annotations are correct, albeit they don't seem very specific. The catalytic domain also belongs to the Pfam clan ''Glyco_hydro_tim'', consisting of glycosyl hydrolases that contain a TIM barrel fold. The TIM barrel can be seen in the middle of catalytic domain in <xr id="fig:pfam2gjx"/>. The clan is very large <ref name="pfamclan">[[Sequence-based_predictions_Protocol_TSD#Clan_statistics|Clan_statistics]]</ref> with 41 members and amongst others also contains a domain associated with [[Fabry:Sequence-based_analyses#Pfam|Fabry disease]].
+
Both domain annotations are correct, albeit they don't seem very specific. The catalytic domain also belongs to the Pfam clan ''Glyco_hydro_tim'', consisting of glycosyl hydrolases that contain a TIM barrel fold. The TIM barrel can be seen in the middle of the catalytic domain in <xr id="fig:pfam2gjx"/>. The clan is very large <ref name="pfamclan">[[Sequence-based_predictions_Protocol_TSD#Clan_statistics|PFAM clan statistics]]</ref> with 41 members and amongst others also contains a domain associated with [[Fabry:Sequence-based_analyses#Pfam|Fabry disease]].
   
Interestingly Pfam also infers an active site residue E323 which is indeed thought to be important for catalytic activity as already outlined in the [[Tay-Sachs_Disease#Catalytic_activity|introduction]].
+
Interestingly, Pfam also infers an active site residue E323 which is indeed thought to be important for catalytic activity as already outlined in the [[Tay-Sachs_Disease#Catalytic_activity|introduction]].
   
 
In conclusion Pfam of course cannot provide the wealth of information the prediction methods claim to deliver, however its manual curation and high quality data in combination with the recently introduced step towards crowd sourcing Wikipedia <ref name="Punta2012a">Punta,M. et al. (2012) The Pfam protein families database. Nucleic acids research, 40, D290-301.</ref> make it an at least equally valuable resource.
 
In conclusion Pfam of course cannot provide the wealth of information the prediction methods claim to deliver, however its manual curation and high quality data in combination with the recently introduced step towards crowd sourcing Wikipedia <ref name="Punta2012a">Punta,M. et al. (2012) The Pfam protein families database. Nucleic acids research, 40, D290-301.</ref> make it an at least equally valuable resource.
  +
  +
<br style="clear:both;">
   
 
== References ==
 
== References ==

Latest revision as of 12:29, 31 August 2012


If not noted otherwise, all predictions were conducted with the HEXA Reference sequence (Uniprot P06865). A protocol for this task can be found here.

Secondary structure

Proteins: Ribonuclease inhibitor P10775 , Divalent-cation tolerance protein CutA (CutA) Q9X0E6 , Serine/threonine-protein phosphatase 2B catalytic subunit alpha isoform (CAM-PRP catalytic subunit) Q08209.
Ribonuclease inhibitor and CutA are located in the cytoplasm whereas the CAM-PRP catalytic subunit is located in the nucleus.


DSSP and handling of differing sequences

DSSP builds upon 3D structures, therefore a PDB entry has to be selected for every given Uniprot entry. The chosen mapping is 2bnh for P10775, 1kr4 for Q9X0E6, 1aui for Q08209 and 2gjx for P06865. This creates an additional problem. All other resources base their predictions on the Uniprot sequence. The sequence used by DSSP, inferred from the PDB file, might be significantly different due to changes made by the experimentalists solving the structure. There are automated ways to resolve this, using PDB's new mmCIF files, which provide a residue-level mapping between the sequence inferred from the ATOM records and the SEQRES record sequence. From there one could use SIFTS, which provides a residue-level mapping between SEQRES and Uniprot. However, both resources, mmCIF mapping and SIFTS, are automated and, while surely developed with great care, looking at the sequence might be considered favorable and will also directly point out interesting parts. Therefore manual alignments were performed and, if applicable, special cases noted in the text.

Comparison of predictions and annotation

In the following, the secondary structure prediction of PSIPred and reprof is compared to the 3D-structure based annotation of DSSP. Uniprot annotation was not included, since its source is DSSP <ref name="uniprotsecstruct">http://www.uniprot.org/manual/helix</ref>. To compare the methods a common alphabet had to be established. All outputs were normalized to only contain the residue types H (Helix), E (Beta-strand) or C (Coil). Details on the parsing can be found in the protocol. The figures were created using the Latex package cpssp <ref name="cpssp">http://www.ctan.org/pkg/cpssp</ref>, details on these can also be found in the protocol.
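The normalization to a common alphabet can be sketched as follows. The reduction of DSSP's eight states shown here (H, G, I to H; E, B to E; everything else to C) is one widely used convention and only an assumption; the mapping actually applied is documented in the protocol. The function names, the Q3 agreement measure and the example strings are hypothetical and serve illustration only.

```python
# Assumed reduction of DSSP states to the three-letter alphabet H/E/C.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}

def to_three_state(dssp):
    """Map a DSSP state string to H/E/C; unknown states and blanks become C."""
    return "".join(DSSP_TO_3.get(state, "C") for state in dssp)

def q3(predicted, observed):
    """Fraction of residues with identical three-state assignment."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Hypothetical 20-residue example, not taken from any of the four proteins.
observed = to_three_state("  HHHHGGG TT EEEEB S")
predicted = "CCHHHHHHCCCCEEEECCCC"

print(observed)
print(predicted)
print(q3(predicted, observed))
```

A per-residue measure such as Q3 requires both strings to refer to the same residues, which is why the manual alignment between the PDB-derived and the Uniprot sequence described above matters.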

P06865

<xr id="fig:ssP06865"/> shows the comparison of assigned secondary structure for the Hex A alpha subunit. Several things are noticeable here. Firstly, there are several short helices in the DSSP annotation that neither PSIPred nor reprof predicted. These short regions are not, as one might first assume, less common helix types but, as a check of the original DSSP output confirms, 'normal' (i+4->i) alpha-helices. This issue cannot be observed for beta-sheets, where the length does not show an immediate impact on prediction performance. Comparing PSIPred and reprof, it can easily be observed that PSIPred shows very good agreement with the DSSP annotation while reprof often differs from these two. The disagreement is noticeably stronger in the catalytic domain of the protein (beginning at position 167, cf. <xr id="fig:pfam2gjx"/>) and mostly appears in the prediction of beta-sheets instead of alpha-helices. This is surprising but also means that both prediction methods at least correctly predict the central TIM barrel in this domain.

<figure id="fig:ssP06865">

Figure 1: Comparison of 2D Structure annotation and prediction for P06865

</figure>


P10775

<xr id="fig:ssP10775"/> shows the comparison of assigned secondary structure for the ribonuclease inhibitor. It is apparent that both prediction methods largely agree with the annotation in DSSP, which is not entirely surprising given the peculiar horseshoe-like secondary structure of alternating alpha-helices and beta-sheets that hardly leaves room for exchange or removal of a secondary structure element (cf. <xr id="fig:2bnh"/>). Nonetheless, there are differences present, most noticeably the behaviour of reprof to either elongate helices or shift them to the left or right compared to the DSSP and PSIPred annotations. As a result, some of the beta strand-turn-alpha helix motifs are mistaken by reprof for something that would be more accurately described as an alpha helix-turn-alpha helix motif. Given that these inhibitors are known to interact very tightly with the ribonucleases <ref name="Dickson2010">Dickson,K. and Haigis,M. (2005) Ribonuclease inhibitor: structure and function. Progress in nucleic acid research and, 6603, 1-23.</ref>, such a change would be expected to significantly impair function, which makes this an important finding.

<figure id="fig:ssP10775">

Figure 2: Comparison of 2D Structure annotation and prediction for P10775

</figure>

<figure id="fig:2bnh">

Figure 3: Ribonuclease inhibitor 3D structure
3D structure of the ribonuclease inhibitor, according to PDB entry 2bnh. Highlights are yellow for beta-sheets, red for alpha-helices and green for coiled regions.

</figure>

Q9X0E6

<xr id="fig:ssQ9X0E6"/> shows the comparison of assigned secondary structure for 'cation tolerance protein' CutA which reinforces the behaviours observed so far. While PSIPred's prediction is very close to the DSSP annotation, reprof does not predict some of the beta-sheets and tends to extend helices. Again this is important to note, since the beta-sheet assembly is thought to be essential for assumption of the trimeric biological assembly <ref name="Savchenko2009">Savchenko,A. and Skarina,T. (2004) X-Ray Crystal Structure of CutA From Thermotoga maritima at 1.4 Å Resolution. Proteins: Structure,, 54, 162-165.</ref>.

<figure id="fig:ssQ9X0E6">

Figure 4: Comparison of 2D Structure annotation and prediction for Q9X0E6.

</figure>

Q08209

Finally, <xr id="fig:ssQ08209"/> shows the comparison of assigned secondary structure for the phosphatase. This protein seems comparatively hard to predict: both PSIPred and reprof make several errors. PSIPred does not predict any structural elements where none are annotated by DSSP; however, some elements are mistaken for the opposite type and several are simply missed. Reprof, on the other hand, correctly predicts some of these but misses others and, in addition, again mistakes several long alpha-helices for beta-sheets.

<figure id="fig:ssQ08209">

Figure 5: Comparison of 2D Structure annotation and prediction for Q08209.

</figure>

Conclusion

In conclusion, at least PSIPred always achieves general agreement with the DSSP annotation. This prediction method's errors mostly remain low and could be considered minor.
Reprof shows much more detrimental errors, namely mistaking alpha-helices for beta-sheets or elongating and shifting predicted alpha-helices, which can cause beta-sheets to be missed. Most of these errors can be shown to be severe in terms of the known function of the particular structural element.
While the last entry hinted that both methods have almost equally large problems on some proteins, PSIPred was much more convincing overall.
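The visual comparison above can be complemented by a simple per-residue score. The following is a minimal sketch, assuming the DSSP annotation and the prediction are available as equal-length strings for the same sequence; the reduction of DSSP's eight states to three (H, E, C) and the function names are illustrative assumptions, not the procedure used for the figures.

```python
from collections import Counter


def three_state(dssp: str) -> str:
    """Reduce DSSP codes to helix (H), strand (E) and coil (C).

    One common reduction (an assumption here): H, G, I -> H; E, B -> E;
    everything else -> C.
    """
    mapping = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}
    return "".join(mapping.get(state, "C") for state in dssp)


def q3(observed: str, predicted: str) -> float:
    """Fraction of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)


def confusion(observed: str, predicted: str) -> Counter:
    """Count (observed, predicted) state pairs, e.g. ('E', 'H') = strand called helix."""
    return Counter(zip(observed, predicted))


if __name__ == "__main__":
    # Toy strings for illustration only, not taken from the proteins above.
    obs = three_state("HHHHTTEEEE-")
    pred = "HHHHHCCEEEC"
    print(q3(obs, pred))
    print(confusion(obs, pred))
```

The confusion counts make the kinds of error discussed above explicit: an entry such as ('E', 'H') corresponds to a strand that was predicted as a helix.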

Disorder

The protein disorder prediction was performed with IUPred for the same proteins as in section 1. The option long was chosen as prediction type, as it is the most suitable for finding disordered regions that are long enough (>30 residues) to have an impact on protein structure.

<figtable id="tbl:iupred">

Iupred P06865.png
Iupred P10775.png
Iupred Q9X0E6.png
Iupred Q08209.png
Table 1: Protein disorder predictions for P06865, P10775, Q9X0E6 and Q08209 (top left to bottom right).

</figtable> In the plots in <xr id="tbl:iupred"/> the calculated disorder tendency is displayed for every residue. All predictions for the proteins express a fluctuating tendency which is overall lower than 0.5.
The prediction for CutA (<xr id="tbl:iupred"/>, bottom left) has the lowest and least fluctuating curve. Ribonuclease inhibitor (<xr id="tbl:iupred"/>, top right) and Hex A alpha subunit (<xr id="tbl:iupred"/>, top left) have very similar profiles. According to the prediction, none of these 3 proteins can be assigned a disordered region. This is consistent with the resolved structures of these proteins: in the ribonuclease inhibitor, the leucine-rich repeats leave very little room for disorder, and CutA is very small, with its only coiled regions apparently serving just to connect the ordered regions directly (c.f. <xr id="fig:1kr4"/>).

<figure id="fig:1kr4">

Figure 6: CutA 3D structure
3D structure of the cation tolerance protein CutA, according to PDB entry 1kr4. Highlights are yellow for beta-sheets, red for alpha-helices and green for coiled regions.

</figure>

The only protein exceeding the 0.5 cutoff is the CAM-PRP catalytic subunit (<xr id="tbl:iupred"/>, bottom right), which shows signs of disorder at the beginning of the sequence and towards the end. The first region is about 10 residues long, and the second begins roughly at residue 425 and spans 100 amino acids. This prediction can be validated against the annotations of CAM-PRP in Disprot<ref name="disprot">Sickmeier,M. et al. (2007) DisProt: the Database of Disordered Proteins. Nucleic acids research, 35, D786-93.</ref>. The annotated disordered regions are 1 - 13, 390 - 414 (CaM-binding domain), 374 - 468, 469 - 486 (Autoinhibitory region) and 487 - 521. All were detected by X-ray crystallography.

The first disordered region was well detected by IUPred. The region starting at position 374 is hinted at by a peak in the prediction, however the signal is too strong to assume a disordered region of the size as annotated in Disprot. Furthermore, the IUPred results hint at the distinction between the disordered region till position 486 and the disorder starting at 487 as the curve expresses a steep rise in this region.
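Calling regions from a per-residue disorder profile and comparing them with annotated regions can be sketched as follows. This is an illustrative sketch only: it assumes the IUPred output has already been parsed into a list of scores (one per residue), and the function names and the optional minimum length are assumptions rather than part of IUPred.

```python
def disordered_regions(scores, cutoff=0.5, min_length=1):
    """Return 1-based inclusive (start, end) runs of residues scoring above cutoff."""
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score > cutoff:
            if start is None:
                start = position  # a new run begins here
        elif start is not None:
            if position - start >= min_length:
                regions.append((start, position - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


if __name__ == "__main__":
    # Toy profile for illustration only.
    profile = [0.6, 0.7, 0.2, 0.3, 0.8, 0.9, 0.95, 0.1]
    print(disordered_regions(profile))
    # Approximate N-terminal prediction from the text versus the annotated region 1 - 13.
    print(overlap((1, 10), (1, 13)))
```

Counting the overlap between each predicted run and each annotated region gives a simple way of stating how much of an annotation was recovered.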

In conclusion, IUPred supplies a very accurate and reliable prediction for the given protein set.


Transmembrane helices

Proteins: Dopamine D3 receptor P35462 , Voltage-gated potassium channel (KvAP) Q9YDF8 , Aquaporin-4 (AQP-4) P47863
Dopamine D3 receptor, KvAP and AQP-4 are multi-pass membrane proteins.


Differing sequences and mapping of gold standard

Since OPM and PDBTM rely on 3D structures to provide a TMH annotation, every Uniprot entry that is to be evaluated needs to be assigned a PDB entry. The mapping chosen is 3pbl for P35462, 1orq for Q9YDF8 and 2d57 for P47863, as well as 2gjx for P06865. As with the secondary structure prediction, the difficulty lies in the mapping needed between the PDB ATOM record and a Uniprot sequence. For this task an automated approach was used: the PDB offers mmCIF files which contain a per-residue mapping between the ATOM record and the SEQRES sequence. In addition, SIFTS <ref name="SIFTS">Velankar,S. et al. (2005) E-MSD: an integrated data resource for bioinformatics. Nucleic acids research, 33, D262-5.</ref> provides a residue-level mapping between a PDB SEQRES sequence and Uniprot sequences. Together, these allow the annotations in OPM and PDBTM to be transferred onto the Uniprot sequences used.
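The two-step mapping described above amounts to composing two lookups. The sketch below assumes both mappings have already been parsed into dictionaries (ATOM residue number to SEQRES index from the mmCIF file, SEQRES index to Uniprot position from SIFTS); the parsing itself, the function names and the toy numbers are hypothetical and only illustrate the idea.

```python
def compose(atom_to_seqres: dict, seqres_to_uniprot: dict) -> dict:
    """Map ATOM residue numbers to Uniprot positions.

    Residues whose SEQRES index has no Uniprot counterpart (for example
    tags or chimeric inserts) are dropped.
    """
    return {
        atom: seqres_to_uniprot[seqres]
        for atom, seqres in atom_to_seqres.items()
        if seqres in seqres_to_uniprot
    }


def transfer_annotation(annotation: dict, atom_to_uniprot: dict,
                        uniprot_length: int, gap: str = "-") -> str:
    """Project per-residue labels (keyed by ATOM number) onto the Uniprot sequence."""
    projected = [gap] * uniprot_length
    for atom, label in annotation.items():
        position = atom_to_uniprot.get(atom)
        if position is not None and 1 <= position <= uniprot_length:
            projected[position - 1] = label  # Uniprot positions are 1-based
    return "".join(projected)


if __name__ == "__main__":
    # Toy data: SEQRES index 5 has no Uniprot counterpart.
    atom_to_seqres = {10: 1, 11: 2, 12: 3, 14: 5}
    seqres_to_uniprot = {1: 31, 2: 32, 3: 33, 4: 34}
    mapping = compose(atom_to_seqres, seqres_to_uniprot)
    labels = {10: "H", 11: "H", 12: "L", 14: "H"}
    print(transfer_annotation(labels, mapping, uniprot_length=35))
```

Positions without a mapped residue keep the gap character, so regions that are unresolved or that belong to another protein remain visibly unannotated instead of being silently treated as non-membrane.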

Comparison of predictions and annotation

Note that in the following PDBTM residue types 'H' and 'L' were considered as transmembrane helices. The figures were created using code from T. Nugent <ref name="drawtmh">http://www.cs.ucl.ac.uk/staff/T.Nugent/code.html</ref>. It should be noted that from the given data one cannot decide which side is extracellular and which intracellular, the distinction is simply due to the way the module works. The consensus only considers the OPM and PDBTM annotations.

In all of the following comparisons, OPM and PDBTM usually agree on the presence of transmembrane helices, but the exact length and residue-level position of the helices differ. This is to be expected, given that annotating a helix is not a trivial task even when a 3D structure is available. The lipids or solvents used are too small and mobile to be part of the resolved structure, and even if they were present, it would be up to the tool's author to decide where, within the region forming the transition between the membrane interior and the extracellular surface, the cut in the 2D annotation should be made. Owing to this problem, the scores that have long been used to assess transmembrane helix prediction do not penalize a prediction that is not 100% aligned with the 3D annotation, but qualitatively count a helix as correctly predicted if there are at least three overlapping residues and no other helix shares an overlap <ref name="chen2002">Chen,C.P. et al. (2002) Transmembrane helix predictions revisited. Protein Science, 11, 2774–2791.</ref>.
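The overlap criterion can be written down compactly. The following sketch is one possible reading of the rule stated above, not a reimplementation of the cited scoring: an observed helix counts as correctly predicted if exactly one predicted helix overlaps it by at least three residues and that predicted helix does not overlap any other observed helix to the same extent. Function names and the toy coordinates are illustrative.

```python
def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def correctly_predicted(observed, predicted, min_overlap=3):
    """Return the observed helices that have a unique, unshared predicted partner."""
    correct = []
    for obs in observed:
        partners = [p for p in predicted if overlap(obs, p) >= min_overlap]
        if len(partners) != 1:
            continue  # missed, or split across several predicted helices
        rivals = [o for o in observed
                  if o != obs and overlap(o, partners[0]) >= min_overlap]
        if not rivals:  # the predicted helix is not shared with another observed helix
            correct.append(obs)
    return correct


if __name__ == "__main__":
    # Toy coordinates for illustration only.
    observed = [(10, 30), (40, 60), (70, 90), (100, 120)]
    predicted = [(12, 33), (38, 92), (118, 125)]
    print(correctly_predicted(observed, predicted))
```

In the toy example the long predicted helix spanning two observed helices counts for neither, which is how the rule penalizes merged helices while tolerating shifted boundaries.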

P35462

<figure id="fig:tmhP35462">

Figure 7: Comparison of OPM/PDBTM annotation and PolyPhobius prediction of transmembrane helices for P35462.
Consensus is built only on OPM and PDBTM annotations. Cytoplasmic and Extracellular cannot be distinguished and are present due to limitations during plotting. Kyte-Doolittle shows a hydrophobicity plot based on the homonymous hydrophobicity scale.

</figure>

<xr id="fig:tmhP35462"/> shows the comparison between OPM and PDBTM annotation and PolyPhobius prediction for the dopamine receptor. There is high agreement between OPM and PDBTM, and by the above-mentioned scoring system PolyPhobius correctly identifies all transmembrane helices. From the figure it seems that two additional helices towards the end of the sequence are overpredicted; however, the hydrophobicity plot already hints that this might not be an error of PolyPhobius. Indeed, manually checking the entries in OPM and PDBTM reveals that these two helices do exist and that PolyPhobius predicted them correctly. They are not displayed because 3pbl is annotated as a chimera in SIFTS: the mapping to P35462 only extends up to residue 230 and then switches to P00720.
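The Kyte-Doolittle track used in these figures is a sliding-window average over a fixed hydropathy scale. The sketch below shows the calculation under stated assumptions: the window length of 19 residues is a common choice for detecting membrane-spanning segments and is not necessarily the one used by the plotting code, and the function name is illustrative.

```python
# Kyte-Doolittle hydropathy values per amino acid (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 19) -> list:
    """Mean hydropathy for every full window along the sequence.

    Entry i covers residues i .. i + window - 1 (0-based), so the result
    has len(sequence) - window + 1 entries.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    total = sum(values[:window])
    profile = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide the window by one residue
        profile.append(total / window)
    return profile


if __name__ == "__main__":
    # Toy sequence for illustration only: hydrophobic stretch followed by lysines.
    print(hydropathy_profile("LLLLKKKK", window=4))
```

Sustained high values in such a profile are what hint at a membrane-spanning segment even where no structural annotation is displayed.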

Q9YDF8

<figure id="fig:tmhQ9YDF82">

Figure 8: Comparison of OPM/PDBTM annotation and PolyPhobius prediction of transmembrane helices for Q9YDF8.
Consensus is built only on OPM and PDBTM annotations. Cytoplasmic and Extracellular cannot be distinguished and are present due to limitations during plotting. Kyte-Doolittle shows a hydrophobicity plot based on the homonymous hydrophobicity scale.

</figure>


<xr id="fig:tmhQ9YDF82"/> shows the comparison between OPM and PDBTM annotation and PolyPhobius prediction for the potassium channel. While all three methods agree on three C-terminal transmembrane helices, there are two N-terminal ones that are predicted by PolyPhobius and present in PDBTM, but not annotated by OPM. Checking the OPM entry for 1orq reveals that three N-terminal helices were explicitly excluded during creation of the database due to possible misalignments. The reference given for this decision in OPM <ref name="voltgate">Mackinnon,R. (2004) Structural biology. Voltage sensor meets lipid membrane. Science, 306, 1304-5.</ref> discusses the different theories of how the sensor works and how the helices are arranged in the open and closed conformations. Indeed, the literature agrees that there are six transmembrane helices in total in each monomer. Manual inspection of a more recent structure, 2r9r<ref name="">Long,S.B. et al. (2007) Atomic structure of a voltage-dependent K+ channel in a lipid membrane-like environment. Nature, 450, 376-82.</ref>, supports this finding and also reveals an additional re-entrant helix that does not cross the membrane and is oriented towards the inside of the tetramer. This is also recognized in the annotations by OPM and PDBTM (which explicitly marks the re-entrant helix by residue type 'L'), both of which annotate a total of seven helices.
This suggests that the prediction by PolyPhobius could in fact be correct, and a logical next step would be to assess the prediction performance using 2r9r rather than 1orq as the reference structure. However, a simple pairwise sequence alignment between 1orq:C and 2r9r:H reveals that the sequences are fairly dissimilar (alignment length 201, similarity 49.8%), which is also the reason why no mapping is available in SIFTS. This is because 1orq describes the first voltage-gated potassium channel found in Aeropyrum pernix<ref name="1orq">Jiang,Y. et al. (2003) X-ray structure of a voltage-dependent K+ channel. Nature, 423, 33-41.</ref>, while 2r9r is a chimera of Kv1.2 and Kv2.1, both from mammals. A satisfying assessment would require more time than is available for now, but it should definitely be noted that PolyPhobius' performance might not be as bad as it seems at first glance.

P47863

<figure id="fig:tmhP47863">

Figure 9: Comparison of OPM/PDBTM annotation and PolyPhobius prediction of transmembrane helices for P47863.
Consensus is built only on OPM and PDBTM annotations. Cytoplasmic and Extracellular cannot be distinguished and are present due to limitations during plotting. Kyte-Doolittle shows a hydrophobicity plot based on the homonymous hydrophobicity scale.

</figure>

Finally, <xr id="fig:tmhP47863"/> shows the comparison between OPM and PDBTM annotation and PolyPhobius prediction for aquaporin. Aquaporin is one of the archetypal structures for re-entrant helices, where two opposing ones form part of the central pore's surface <ref name="reentrant">Viklund,H. et al. (2006) Structural classification and prediction of reentrant regions in alpha-helical transmembrane proteins: application to complete genomes. Journal of molecular biology, 361, 591-603.</ref>. Both are annotated in OPM and PDBTM; however, PolyPhobius misses them in its otherwise correct prediction. This could be because these helices are a comparatively recent discovery. While the re-entrant helix in the potassium channel was almost as long as a transmembrane helix, the two in aquaporin are much shorter, which makes them hard to identify for a method like PolyPhobius that was not created with these regions in mind. Indeed, applying the newer MEMSAT-SVM <ref name="msvm">Nugent,T. and Jones,D.T. (2009) Transmembrane protein topology prediction using support vector machines. BMC bioinformatics, 10, 159.</ref>, which was specifically trained for re-entrant helices, shows that this method can identify all helices correctly.

P06865

PolyPhobius did not predict any transmembrane helices in the human Hex A alpha subunit, which is correct. As can be seen in <xr id="fig:hexapp"/>, there were no ambiguities and the soluble nature of the protein was clearly identified. PolyPhobius does, however, find a signal peptide, which is discussed further in the next section.

<figure id="fig:hexapp">

Figure 10: PolyPhobius posterior probabilities for P06865.

</figure>

Conclusion

In conclusion, PolyPhobius showed very good performance on the set of four proteins. Residue-level accuracy is not achieved, but since even the 'gold standards' OPM and PDBTM do not agree at this level, this is not an issue. What does pose a problem is the difficulty of recognizing the recently discovered structural elements of re-entrant helices.


Signal peptides

Proteins: Serum albumin P02768, Lysosome-associated membrane glycoprotein 1 (LAMP-1) P11279, Aquaporin-4 (AQP-4) P47863.
According to Uniprot, HEXA, LAMP-1 and Serum albumin contain a signal peptide. LAMP-1 is a membrane protein which passes the membrane with one helix. Serum albumin, the main protein of plasma, is a secreted extracellular protein. AQP-4 is a multi-pass membrane protein which forms a water-specific channel.

SignalP

The predictions shown here were made with SignalP version 4.0.
SignalP employs 3 main scores for the prediction of signal peptides: C, S and Y. The S-score stands for the actual signal peptide prediction, with high scores indicating that the corresponding amino acid is part of a signal peptide, and low scores indicating that the amino acid is part of a mature protein. The C-score is the cleavage score, which indicates the best cleavage site when significantly high. (When a cleavage site position is referred to by a single number, the number indicates the first residue in the mature protein.) Y-max is a derivative of the C-score, combined with the S-score calculated to give a better cleavage site prediction than the raw C-score alone <ref name="sigpref">Bendtsen, J. D. et al. (2004) Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol., 340:783-795</ref>. For non-secretory proteins all scores are supposed to be very low.
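How a combined score can sharpen the cleavage site call may be illustrated with a small sketch. It assumes the combined score at a position is the geometric mean of the C-score and the drop in average S-score across that position, which is how the Y-score is commonly described for SignalP; the window size d, the clamping of negative slopes to zero and the function name are assumptions made here, not SignalP's actual implementation.

```python
import math


def y_scores(c_scores, s_scores, d=3):
    """Combine cleavage (C) and signal peptide (S) scores per position.

    For 0-based position i the slope is the mean S-score of the d residues
    before i minus the mean S-score of the d residues starting at i.
    Positions where a full window does not fit are left at 0.
    """
    if len(c_scores) != len(s_scores):
        raise ValueError("C and S scores must have the same length")
    n = len(c_scores)
    combined = [0.0] * n
    for i in range(d, n - d + 1):
        before = sum(s_scores[i - d:i]) / d
        after = sum(s_scores[i:i + d]) / d
        slope = max(0.0, before - after)  # only a falling S-score supports a cleavage site
        combined[i] = math.sqrt(c_scores[i] * slope)
    return combined


if __name__ == "__main__":
    # Toy scores for illustration only: S drops after five residues, C peaks there.
    s = [0.9] * 5 + [0.1] * 5
    c = [0.0] * 10
    c[5] = 0.8
    y = y_scores(c, s)
    best = max(range(len(y)), key=y.__getitem__)
    print(best + 1)  # 1-based position of the first mature residue
```

A high C-score only yields a high combined score where the S-score falls at the same time, which suppresses isolated C-score peaks inside the mature protein.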

Prediction results

<figtable id="tbl:signalp">

Sp P06865 HEXA HUMAN.png
Sp P47863 AQP4 RAT TSD.png
Sp P11279 LAMP1 HUMAN TSD.png
Sp P02768 ALBU HUMAN TSD.png
Table 2: Signal peptide predictions by SignalPv4 for P06865, P47863, P11279 and P02768 (top left to bottom right).

</figtable>

<xr id="tbl:signalp"/> shows the results of the SignalP predictions and <xr id="tab:signals"/> gives a comparison of the predicted signal peptide positions and the validation from the Signal Peptide Website <ref name="signalpwebsite">http://www.signalpeptide.de</ref>. Additional scores can be viewed here.


<figtable id="tab:signals">

{| class="wikitable"
! Peptide positions !! Prediction !! Validation
|-
| HexA || 1-22 || 1-22
|-
| Serum Albumin || 1-18 || 1-18
|-
| LAMP-1 || 1-28 || 1-28
|-
| AQP-4 || - || -
|}
Table 3: Comparison of signal peptide prediction and the assignment from the Signal Peptide Website.

</figtable>


HEXA, LAMP-1 and Serum albumin are correctly predicted as having one signal peptide at the beginning of the sequence, and AQP-4 is identified as a mature protein. Even the exact positions of the peptides are predicted accurately, so SignalP performs exceptionally well on this set.


Cross check with PolyPhobius

PolyPhobius was specifically built to account for false predictions of signal peptides as transmembrane helices and should therefore be able to distinguish them. Indeed, PolyPhobius perfectly predicts the signal peptides in Serum Albumin and LAMP-1 (where it also finds the single transmembrane helix at the C-terminus). For the Hex A alpha subunit, only a signal peptide is predicted; however, the cleavage region already ends at position 19 instead of 22. For aquaporin, only the transmembrane helices are predicted and no signal peptide is found.

In conclusion PolyPhobius performs very well. While the prediction of signal peptides is not on par with SignalPv4, this is not what PolyPhobius was designed to do. The actual discrimination between signal peptide and transmembrane helices worked flawlessly for the test proteins.

GO terms

GOpet

<xr id="tab:gopetgo"/> depicts the prediction results for the Hex A alpha subunit from GOpet. The predictions are all given with a high confidence of at least 61%. To validate the results, QuickGO and AmiGO were employed. As the GO annotations from these tools were mostly inferred from electronic annotation (IEA), the predictions were additionally validated manually.
Hexosaminidase A is involved in the hydrolysis of terminal N-acetyl-D-hexosamine residues, where only the alpha subunit is able to hydrolyse GM2 gangliosides. In the presence of the cofactor GM2-activator protein (GM2AP), the alpha subunit of Hex A catalyses the removal of β-D-GalNAc from GM2. Thus the first 5 GO terms are considered correctly assigned. "Hydrolase activity hydrolyzing O-glycosyl compounds" is falsely predicted, but as it is not completely incorrect it should be viewed as merely a small shift away from the exact function. The next hit, the hydrolysis of N-glycosyl compounds, is a very convincing assignment because it is not only true but also fairly specific. The prediction with the lowest confidence again proves correct, as the Hex A alpha subunit is known to form a heterodimer with the Hex A beta subunit. Altogether, hexosaminidase activity and hydrolase activity, as part of more or less specific descriptions, dominate this GO term prediction. Had the protein been unknown, the GOpet prediction would have revealed many helpful functions that describe the protein accurately and in considerable detail.


<figtable id="tab:gopetgo">

{| class="wikitable"
! rowspan="2" | GO-Term ID !! rowspan="2" | Type !! rowspan="2" | Confidence !! rowspan="2" | GO-Term description !! colspan="3" | Validation
|-
! QuickGO !! AmiGO !! Manual assessment
|-
| GO:0003824 || Molecular function || 97% || catalytic activity || true || false || true
|-
| GO:0004563 || Molecular function || 96% || beta-N-acetylhexosaminidase activity || true || true || true
|-
| GO:0015929 || Molecular function || 96% || hexosaminidase activity || false || false || true
|-
| GO:0016787 || Molecular function || 96% || hydrolase activity || true || false || true
|-
| GO:0016798 || Molecular function || 96% || hydrolase activity acting on glycosyl bonds || true || false || true
|-
| GO:0004553 || Molecular function || 96% || hydrolase activity hydrolyzing O-glycosyl compounds || true || false || false
|-
| GO:0016799 || Molecular function || 77% || hydrolase activity hydrolyzing N-glycosyl compounds || false || false || true
|-
| GO:0046982 || Molecular function || 61% || protein heterodimerization activity || true || false || true
|}
Table 4: GO term prediction from GOpet.

</figtable>

ProtFun2.2

ProtFun2.2 employs various tools for ab initio protein function prediction. A large number of feature prediction servers such as SignalP are queried to obtain information about the submitted protein sequence. These are integrated into final predictions of the cellular role, enzyme class, and selected Gene Ontology categories. The classifications are based on two scores: the first score (influenced by the prior probability of that class) is the estimated probability that the entry belongs to the class in question. The second number (independent of the prior probability) is the odds that the sequence belongs to that class. The class with the highest information gain is chosen as the prediction, with the exception that ProtFun refrains from marking GO categories if the score with the highest information content has odds lower than 1 <ref name="protfun"> http://www.cbs.dtu.dk/services/ProtFun-2.2/output.php</ref>.
In the following, the ProtFun2.2 predictions for the hexosaminidase alpha subunit are analysed. As supplementary information, a detailed depiction of the ProtFun2.2 output is available.

<figtable id="tab:protfun">

ProtfunGO.png
ProtfunFuncCat.png
Table 5: ProtFun2.2 predictions.

</figtable>

The Gene Ontology categories are displayed in <xr id="tab:protfun"/> (left). For the GO categories there is no single prediction above 10% and no entry receives an odds ratio greater than 1, thus Hex A is not attributed to any of the categories. A closer examination of these classes makes clear that none of them matches the function of our protein.

Furthermore, a prediction of the cellular role is provided, which classifies the protein into 12 different functional categories based on a scheme developed by Monica Riley for E. coli in 1993 <ref name="protfunn">Riley, M. (1993) Functions of the gene products of Escherichia coli: Microbiol Rev. </ref>. Here the Hex A alpha subunit is assigned to "Cell_envelope" with a probability of over 80% (see <xr id="tab:protfun"/>, right). This prediction seems to be the most accurate.
In addition, the Hex A subunit is classified as an enzyme, more specifically a ligase (EC 6.-.-.-), with a probability of 8.5%. This assignment is surprising, as another enzyme class (EC 3.-.-.-) receives a much higher probability of 32.9% but is disregarded due to its lower odds ratio. Here the selection according to the highest information content has clearly failed, as the hexosaminidase indeed belongs to the hydrolases, enzyme class 3. The exact EC classification is 3.2.1.52.
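Why a class with a low probability can still be selected becomes clearer with a small calculation. The sketch below assumes that the odds compare the predicted probability with the prior of the class as a ratio of odds; ProtFun's exact formula is not given here, and the prior values used are purely hypothetical and chosen only to illustrate the effect.

```python
def odds_ratio(probability: float, prior: float) -> float:
    """Posterior odds divided by prior odds (an assumed definition, see lead-in)."""
    for value in (probability, prior):
        if not 0.0 < value < 1.0:
            raise ValueError("probabilities must lie strictly between 0 and 1")
    posterior_odds = probability / (1.0 - probability)
    prior_odds = prior / (1.0 - prior)
    return posterior_odds / prior_odds


if __name__ == "__main__":
    # Probabilities from the text; the priors are HYPOTHETICAL placeholders.
    classes = {
        "ligase (EC 6.-.-.-)": {"probability": 0.085, "prior": 0.02},
        "hydrolase (EC 3.-.-.-)": {"probability": 0.329, "prior": 0.30},
    }
    for name, values in classes.items():
        print(name, round(odds_ratio(values["probability"], values["prior"]), 2))
```

Under these assumed priors the rare class gains far more relative to its prior than the common one, which is the kind of effect that can let a class with a low probability win on odds.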

The classification of the cellular role could serve as a hint in the right direction regarding the function of the Hex A subunit. In addition, the GO classification is not actually false; the categories apparently just do not cover the whole range of protein functions, and therefore the Hex A alpha subunit cannot be assigned. Apart from that, the GO category and the enzyme number predictions can be disregarded for Hex A, and the performance of ProtFun2.2 must therefore be assessed as rather unsatisfying in this particular case.


Pfam

The Pfam-A sequence search reveals two significant Pfam-A domains within the Hex A alpha subunit: The Glycosyl hydrolase family 20, domain 2 and the Glycosyl hydrolase family 20, catalytic domain (see <xr id="fig:pfam"/>).

<figure id="fig:pfam">

Figure 11: HEXA Pfam domains.

</figure>

<figure id="fig:pfam2gjx">

Figure 12: Visualization of Pfam domains.
The Glyco hydro 20b domain is coloured green, the Glyco hydro 20 catalytic domain red and the predicted active site residue is shown as sticks and highlighted in purple. Regions that are not mapped to a Pfam domain are grey.

</figure>

HEXA is almost completely spanned by these two domains. The Glyco hydro 20b domain (green) reaches from position 35 to position 165 and the Glyco hydro 20 domain (red) directly follows up, occupying a region from position 167 to 488. Left unmapped are a mostly coiled region at the beginning of the subunit and a helix followed by another coiled region at the end of the sequence. <xr id="fig:pfam2gjx"/> visualizes the Pfam annotation in the 3D structure of the Hex A alpha subunit.
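The extent to which the two domains span the sequence can be checked with a few lines of code. The sketch below uses the domain boundaries given above and assumes a length of 529 residues for the Hex A alpha subunit precursor (the length of Uniprot entry P06865, which is not stated in the text); the function names are illustrative.

```python
def unmapped_regions(domains, length):
    """Return 1-based inclusive (start, end) stretches not covered by any domain."""
    gaps = []
    position = 1  # first residue not yet accounted for
    for start, end in sorted(domains):
        if start > position:
            gaps.append((position, start - 1))
        position = max(position, end + 1)
    if position <= length:
        gaps.append((position, length))
    return gaps


def coverage(domains, length):
    """Fraction of the sequence covered by at least one domain."""
    uncovered = sum(end - start + 1 for start, end in unmapped_regions(domains, length))
    return (length - uncovered) / length


if __name__ == "__main__":
    pfam_domains = [(35, 165), (167, 488)]  # Glyco hydro 20b, Glyco hydro 20
    sequence_length = 529                   # assumption: length of P06865
    print(unmapped_regions(pfam_domains, sequence_length))
    print(round(coverage(pfam_domains, sequence_length), 3))
```

Under this assumption the unmapped stretches are the N-terminal region, the single residue between the two domains and the C-terminal region.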

Both domain annotations are correct, although they do not seem very specific. The catalytic domain also belongs to the Pfam clan Glyco_hydro_tim, consisting of glycosyl hydrolases that contain a TIM barrel fold. The TIM barrel can be seen in the middle of the catalytic domain in <xr id="fig:pfam2gjx"/>. The clan is very large <ref name="pfamclan">PFAM clan statistics</ref>, with 41 members, and among others also contains a domain associated with Fabry disease.

Interestingly, Pfam also infers an active site residue E323 which is indeed thought to be important for catalytic activity as already outlined in the introduction.

In conclusion, Pfam of course cannot provide the wealth of information the prediction methods claim to deliver; however, its manual curation and high-quality data, in combination with the recently introduced step towards crowdsourcing via Wikipedia <ref name="Punta2012a">Punta,M. et al. (2012) The Pfam protein families database. Nucleic acids research, 40, D290-301.</ref>, make it an at least equally valuable resource.


References

<references/>