Difference between revisions of "Sequence-based analyses Gaucher Disease"
Revision as of 01:42, 22 May 2012
- 1 Secondary structure
- 2 Disorder
- 3 Transmembrane helices
- 4 Signal peptides
- 5 GO terms
- 6 References
Knowing the secondary structure of a protein can shed light on its function, since structure implies function. If the structure of a protein is known, secondary structure elements (helix, sheet, coil) can be assigned to its residues based on their hydrogen-bonding patterns and geometry. DSSP <ref name="DSSP">Kabsch W, Sander C (1983). "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features". Biopolymers</ref> is the most common method to perform such secondary structure assignments. If the structure of a protein is unknown, secondary structure elements can be predicted by tools like PSIPRED <ref name="PSIPRED">Liam J. McGuffin, Kevin Bryson, and David T. Jones (2000). "The PSIPRED protein structure prediction server". Bioinformatics</ref> or Reprof<ref name="Reprof">B Rost, C Sander (1993). "Prediction of protein secondary structure at better than 70% accuracy". J. Mol. Biol.</ref>. The aim of this task was to analyse the secondary structure of different proteins and to compare the secondary structure predictions of PSIPRED and Reprof with the DSSP secondary structure assignments. The following sequences were taken into account: <figtable id="tab:ss_sequences">
|Divalent-cation tolerance protein CutA||Q9X0E6||1KR4|
</figtable> Information about program calls and implementation details can be found in our protocol.
To compare the different output formats more easily, we mapped the secondary structure definitions of all three methods onto the three letters H (helix), E (sheet), and C (coil) according to table <xr id="tab:ss_mapping"/>. Regions of the UniProt sequences which were not present in the PDB file, as well as regions where no DSSP assignment was possible, were ignored. <figtable id="tab:ss_mapping">
Glucosylceramidase (P04062) is located at the membrane of lysosomes. It has two domains, which belong to (1) the glycosyl hydrolase domain fold and (2) the TIM beta/alpha-barrel fold. Both domains have hydrophobic beta sheets which anchor the protein in the membrane. <xr id="fig:ss_P04062"/> depicts the secondary structure elements of the corresponding crystal structure, which coincide with the DSSP assignments. The following section shows the secondary structure annotations of the different methods. The PSIPRED predictions agree better with the DSSP assignments than the Reprof predictions do: Reprof predicts sheets instead of helices in several regions. The residues of the beta-barrel sheets (the tube in the middle of <xr id="fig:ss_P04062"/>) are marked by asterisks.
40 ARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKG Reprof CCCCCCCCCCCEEEEEEECCEECCCCCCCCCCCCCEEEEEEECCCCCEEEEECCCEECCCCCCEEEEEECCCCEEEEEEC PSIPRED CCCCCCCCCCCCCEEEEECCHHCCCCCCCCCCCCCEEEEEEECCCCCCHHCCCCCCCCCCCCCCCEEEECCCCCCEEEEE DSSP CECCCEEECCCCCEEEEEECCCCCECCCCCCCCCCEEEEEEEECCCCCCEEEEEECECCCCCCCEEEEEEEEEEEEECCE 120 FGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPL Reprof CCCCCCHHHHHEEEEECCCCCCEEEEEEECCCCCEEEEEEECCCCCCEEEEEEEECCCCCCCEEEEECCCCCCCCCEEEE PSIPRED EEEEHHHHHHHHHHHCCHHHHHHHHHHHCCCCCCEEEEEEEEECCCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHH DSSP EEEECCHHHHHHHCCCCHHHHHHHHHHHHCCCCCCCCEEEEEECCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHCCHHHH 200 IHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGL Reprof EEHHHHHCCCCCEEEECCCCCCCEEEECCCECCCEECCCCCCCCCCHHHHHHHHHHHHHCCCCEEEEEEEEECCCCCCCE PSIPRED HHHHHHHHCCCEEEEEEECCCCHHEEECCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHCCCEEEEEEECCCCCCCC DSSP HHHHHHHCCCCCEEEEEECCCCHHHECCCCCCCCCEECCCCCCHHHHHHHHHHHHHHHHHHHCCCCCCEEECCCCCCHHH ****** *** 280 LSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPA Reprof ECCCCEEEECCCCCCCCCEEEECCCCCCCCCCCCEEEEEEECCCCEECCCEEEEEECCCCCCEEEEEEEEEEEEEECCCC PSIPRED CCCCCCCCCCCCHHHHHHHHHHHHHHHHHHCCCCCEEEEEECCCCCCHHHHHHHHHCCHHHHHHCCEEEEECCCCCCCCH DSSP CCCCCCCCCECCHHHHHHHHHHCHHHHHHCCCCCCCEEEEEEEEHHHCCHHHHHHHCCHHHHCCCCEEEEEEECCCCCCH ******** ******* 360 KATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHSIITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDS Reprof CCECCCCCECCCCCEEEECCCCCCCEEEEEEEEECCCCCCCEEECEHHHHEEEEEEECCCCEEEECCCCCCEEEEECCCC PSIPRED HHHHHHHHHHCCCCEEEEECCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHEEEEEEEEECCCCCCCCCCCCCCC DSSP HHHHHHHHHHCCCCEEEEEEEECCCCCCCCCCCCCCHHHHHHHHHHHHHHHHCCEEEEEEEECCECCCCCCCCCCCCCCC ******** ******** 440 PIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFL Reprof CEEEEECCCCCCCCCEEEECCCEEEECCCCCEEEEEEEECCCCCEEEEEECCCCCEEEEEEECCCCCCEEEECCCCEEEE PSIPRED 
CEEEECCCCEEEECHHHHHHHHHHHHCCCCCEEEEEECCCCCCEEEEEEECCCCCEEEEEEECCCCCEEEEEEECCCEEE DSSP CEEEEHHHCEEEECHHHHHHHHHHCCCCCCCEEEEEEECCCCCEEEEEEECCCCCEEEEEEECCCCCEEEEEEECCCEEE 520 ETISPGYSIHTYLWRRQ Reprof EEECCCCEEEEEEEECC PSIPRED EEECCCCEEEEEEEECC DSSP EEEECCCEEEEEEECCC
<xr id="fig:ss_P10775"/> depicts the crystal structure 1DFJ, which corresponds to P10775. It has two domains: d1dfji_ is a repeat domain consisting of alternating alpha-helices and parallel beta-sheets, while d1dfje_ contains long curved antiparallel beta-sheets and three alpha-helices. The alternating HHH and EEE regions in the following secondary structure annotations agree well with the repetitive structure shown in <xr id="fig:ss_P10775"/>. Again, the PSIPRED predictions match the DSSP assignments better than the Reprof predictions do.
1 MNLDIHCEQLSDARWTELLPLLQQYEVVRLDDCGLTEEHCKDIGSALRANPSLTELCLRTNELGDAGVHLVLQGLQSPTC Reprof CCCCECHHCCCCCHHHHHHHHHHHCCEEEECCCCCCHHHHHHHHHHHHCCCCHHHHHHHHCCCCCCCHEEEHCCCCCCCC PSIPRED CEEECCCCCCCHHHHHHHHHHHCCCCEEECCCCCCCHHHHHHHHHHHHHCCCCCEEECCCCCCCHHHHHHHHHHHHHCCC DSSP CECCCECCCCCHHHHHHHHHHHCCCCEEECECCCCCHHHHHHHHHHHHCCCCCCEEECCCCCCHHHHHHHHHHHHCCCCC 81 KIQKLSLQNCSLTEAGCGVLPSTLRSLPTLRELHLSDNPLGDAGLRLLCEGLLDPQCHLEKLQLEYCRLTAASCEPLASV Reprof EEEEECCCCCCCCHCCCCCCHHHHCHCHHHHHHCCCCCCCCHHHHHHHHHCCCCCHCCHHHHHHHHHHCCCCCCHHHHHH PSIPRED CCCEEECCCCCCCHHHHHHHHHHHHCCCCCCEEECCCCCCCHHHHHHHHHHHHCCCCCCCEEECCCCCCCHHHHHHHHHH DSSP CCCEEECCCCCCCCCHHHHHHHHHHHCCCCCEEECCCCCCHHHHHHHHHHHHHCCCCCCCEEECCCCCCCHHHHHHHHHH 161 LRATRALKELTVSNNDIGEAGARVLGQGLADSACQLETLRLENCGLTPANCKDLCGIVASQASLRELDLGSNGLGDAGIA Reprof HHHHHHHHHHCCCCCCHHHHHHHHHCCCCCCHHHHHHHHHHCCCCCCCCCHHHHHHHHHCHCCHHHCCCCCCCCCHHHHH PSIPRED HHHCCCCCEEECCCCCCCHHHHHHHHHHHHCCCCCCCEEECCCCCCCHHHHHHHHHHHHCCCCCCEEECCCCCCCHHHHH DSSP HHHCCCCCEEECCCCCCHHHHHHHHHHHHHCCCCCCCEEECCCCCCCHHHHHHHHHHHHHCCCCCEEECCCCCCHHHHHH 241 ELCPGLLSPASRLKTLWLWECDITASGCRDLCRVLQAKETLKELSLAGNKLGDEGARLLCESLLQPGCQLESLWVKSCSL Reprof HHCCCCCCCHHHHCHHEEEHCCCCCHHHHHHHHHHHHHHHHHHHHHHCCCCCCHHHHHHHHHHCCCCCCHHHHHHHHCHH PSIPRED HHHHHHHHCCCCCCEEECCCCCCCHHHHHHHHHHHHCCCCCCEEECCCCCCCHHHHHHHHHHHHCCCCCCCEEECCCCCC DSSP HHHHHHCCCCCCCCEEECCCCCCCHHHHHHHHHHHHHCCCCCEEECCCCCCHHHHHHHHHHHHHCCCCCCCEEECCCCCC 321 TAACCQHVSLMLTQNKHLLELQLSSNKLGDSGIQELCQALSQPGTTLRVLCLGDCEVTNSGCSSLASLLLANRSLRELDL Reprof HHHHHHHHHHHHHCCHHHHHHHCCCCCCCCHHHHHHHHHHCCCCCEEEEEEECCCCCCCCCHHHHHHHHHHHCCHHHHCC PSIPRED CHHHHHHHHHHHHHCCCCCEEECCCCCCCHHHHHHHHHHHCCCCCCCCEEECCCCCCCHHHHHHHHHHHHHCCCCCEEEC DSSP EHHHHHHHHHHHHHCCCCCEEECCCCECHHHHHHHHHHHHHCCCCCCCEEECCCCCCEHHHHHHHHHHHHHCCCCCEEEC 401 SNNCVGDPGVLQLLGSLEQPGCALEQLVLYDTYWTEEVEDRLQALEGSKPGLRVIS Reprof CCCCCCCHHHHHHHCCCCCCCCHHHHHHHCCCCCCHHHHHHHHHHHCCCCCCCECC PSIPRED CCCCCCHHHHHHHHHHHHCCCCCCCEEECCCCCCCHHHHHHHHHHHHHCCCCEECC DSSP CCCECCHHHHHHHHHHHCCCCCCCCEEECCCCCCCHHHHHHHHHHHHHCCCCEEEC
<xr id="fig:ss_Q9X0E6"/> depicts the d1kr4a_ domain of 1KR4, which is made of three alpha-helices interrupted by beta-sheets. Reprof predicts helices that are too long.
2 ILVYSTFPNEEKALEIGRKLLEKRLIACFNAFEIRSGYWWKGEIVQDKEWAAIFKTTEEKEKELYEELRKLHPYETPAIF Reprof EEEEECCCCHHHHHHHHHHHHHHHHHHHHCHCHHHCCCEEECEEECCHHHHHHHCCCHHHHHHHHHHHHHCCCCCCCHHE PSIPRED EEEEEECCCHHHHHHHHHHHHHCCCEEEEEEEEEEEEEEECCCEEEEEEEEEEECCCHHHHHHHHHHHHHHCCCCCCEEE DSSP EEEEEEECCHHHHHHHHHHHHHCCCCCEEEEEEEEEEEEECCEEEEEEEEEEEEEEEHHHHHHHHHHHHHHCCCCCCCEE 82 TLKVENVLTEYMNWLRESVL Reprof HHHHHHHHHHHHHHHHHHCC PSIPRED EEECCCCCHHHHHHHHHHCC DSSP EECCCCEEHHHHHHHHHHCC
Q08209 contains the domains d1auib_ and d1auia_, which are mainly composed of alpha-helices (<xr id="fig:ss_Q08209"/>). PSIPRED predicts these alpha-helices considerably better than Reprof, which suggests beta-sheets in some regions.
14 TDRVVKAVPFPPSHRLTAKEVFDNDGKPRVDILKAHLMKEGRLEESVALRIITEGASILRQEKNLLDIDAPVTVCGDIHG Reprof CCCEEEEECCCCCCCEEEEEEECCCCCCEEEEEHHHECCCCCCCEEEEEEEEECCCCEEECCCCCCCCCCCEEEEECCCC PSIPRED CCCCCCCCCCCCCCCCCHHHCCCCCCCCCHHHHHHHHHCCCCCCHHHHHHHHHHHHHHHHHCCCEEEECCCEEEECCCCC DSSP CCCCCCCCCCCCCCCECHHHHECCCCCECHHHHHHHHHCCCCECHHHHHHHHHHHHHHHHCCCCEEEECCCEEEECCCCC 94 QFFDLMKLFEVGGSPANTRYLFLGDYVDRGYFSIECVLYLWALKILYPKTLFLLRGNHECRHLTEYFTFKQECKIKYSER Reprof HHHHHHEEEEECCCCCCCEEEEEEEECCCCEEEEEEEHHHHHHHCCCCCEEEEEECCCCCCEEEEEEEEEEEEEEEEECH PSIPRED HHHHHHHHHHHCCCCCCCCEEECCCCCCCCCCCHHHHHHHHHHHHCCCCCEEEECCCCHHHHHHCCCCHHHHHHHHCCHH DSSP CHHHHHHHHHHHCCCCCCCEEECCCCCCCCCCHHHHHHHHHHHHHHCCCCEEECCCCCCCHHHHHHCCHHHHHHHHCCHH 174 VYDACMDAFDCLPLAALMNQQFLCVHGGLSPEINTLDDIRKLDRFKEPPAYGPMCDILWSDPLEDFGNEKTQEHFTHNTV Reprof HHHHHHHHCCCCCHHHHHCCCEEEEECCCCCCCCCHHHHHHHHCCCCCCCCCCCEEEEECCCCCCCCCCCCCCECCCCCE PSIPRED HHHHHHHHHCCCHHHHHCCCCEEEECCCCCCCCCCHHHHHHCCCCCCCCCCCHHHHHHCCCCCCCCCCCCCCCCCCCCCC DSSP HHHHHHHHHCCCCCEEEECCCEEEECCCCCCCCCCHHHHHHCCCCCCCCCCCHHHHHHHCEECCCCCCCCCCCCEEECCC 254 RGCSYFYSYPAVCEFLQHNNLLSILRAHEAQDAGYRMYRKSQTTGFPSLITIFSAPNYLDVYNNKAAVLKYENNVMNIRQ Reprof CCEEEEECCCCEEEEHCCCCHHHHEHHHCCCCCCEEEEEECCCCCCCEEEEEEECCCEEEEECCCEEEEEECCCEEEEEE PSIPRED CCCCCCCCHHHHHHHHHHCCCCEEEEHHHHHHHHHHHHHCCCCCCCCCEEEEEECCCCCCCCCCEEEEEEECCCCCEEEE DSSP CCCCEEECHHHHHHHHHHCCCCEEEECCCCCCCCEEECCECCCCCCECEEEECCCCCHHHCCCCCEEEEEEECCEEEEEE 334 FNCSPHPYWLPNFMDVFTWSLPFVGEKVTEMLVNVLNICSSFEEAKGLDRINERMPPR Reprof ECCCCCCCCCCCCCEEEEEECCCCCHHHHHHHHHHHEECCEHHHHCCCCHHCCCCCCC PSIPRED ECCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHCCCCHHHHHHHHHHHHCCCCC DSSP ECCCCCCCCCHHHCCHHHHHHHHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHCCCCC
We compared the prediction performance of PSIPRED and Reprof via the Q3 score and the precision of the three secondary structure states H, E, and C. The Q3 score is identical to the accuracy, i.e. the number of correctly predicted states divided by the length of the protein. The precision of state X is the fraction of predictions of X that are correct, formally precision(X)=TP(X)/(TP(X)+FP(X)). <xr id="ss_acc"/> shows the results: PSIPRED clearly outperforms Reprof in all four cases. PSIPRED achieves an average accuracy of 87%, considerably higher than the 58% achieved by Reprof. <figtable id="ss_acc">
|Method||Q3||Precision H||Precision E||Precision C|
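The Q3 score and the per-state precision defined above can be sketched in a few lines of Python. This is only an illustration, not the script used for the results table: the function names are ours, and the three strings are the last alignment row of Q9X0E6 (residues 82 onwards) copied from the annotation shown earlier.

```python
def q3(predicted, observed):
    """Fraction of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and assignment must have equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def precision(predicted, observed, state):
    """TP / (TP + FP) for one state; None if the state is never predicted."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and assignment must have equal length")
    # Observed states at every position where `state` was predicted.
    called = [o for p, o in zip(predicted, observed) if p == state]
    if not called:
        return None
    return called.count(state) / len(called)


# Last alignment row of Q9X0E6 (illustrative fragment, not the full protein).
dssp = "EECCCCEEHHHHHHHHHHCC"
psipred = "EEECCCCCHHHHHHHHHHCC"
reprof = "HHHHHHHHHHHHHHHHHHCC"

for name, pred in (("PSIPRED", psipred), ("Reprof", reprof)):
    per_state = {s: precision(pred, dssp, s) for s in "HEC"}
    print(name, "Q3:", q3(pred, dssp), "precision:", per_state)
```

On this 20-residue fragment PSIPRED agrees with DSSP at 17 positions and Reprof at 12. Returning `None` when a state is never predicted avoids a division by zero and keeps "undefined" distinct from a precision of 0.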
Disordered regions are regions with a varying three-dimensional structure. Nevertheless, such regions can be functionally highly important: regulation, signalling, and flexible ligand binding are only some examples. DisProt<ref name="DisProt">Vucetic S, Obradovic Z, Vacic V, et al. (2005). "DisProt: a database of protein disorder". Bioinformatics</ref> is a curated database of proteins with experimentally determined disordered regions. IUPred<ref name="IUPred">Zsuzsanna Dosztányi, Veronika Csizmók, Péter Tompa and István Simon (2005). "The Pairwise Energy Content Estimated from Amino Acid Composition Discriminates between Folded and Intrinsically Unstructured Proteins". J. Mol. Biol.</ref> is a method for predicting disordered regions ab initio, i.e. based solely on the protein sequence. We compared the predictions of IUPred with the annotations in the DisProt database for all four example proteins. IUPred was called to predict long, global disordered regions (see the protocol for details). Residues involved in disordered regions were defined as those with a probability of at least 50%. These residues were compared to the DisProt annotations, taken either from the entry for the UniProt sequence itself if available, or from a significant homolog (e-value < 1e-3) for which a DisProt entry existed. We measured the performance of IUPred via the precision (TP/(TP+FP)), sensitivity (TP/(TP+FN)), and specificity (TN/(TN+FP)).
Neither Glucosylceramidase (P04062) nor a homologous protein is annotated in DisProt. This might be due to a lack of experimental data or, more likely, to the absence of disordered regions. The latter assumption is supported by the highly structured protein complex 1OGS (<xr id="fig:ss_P04062"/>). However, IUPred predicts some disordered residues with a probability >= 50%.
|IUPred||2, 3, 6, 90-93, 229-231, 235, 236|
|Precision: 0%||Sensitivity: undef||Specificity: 98%|
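The three measures can be computed from sets of residue positions, as in the minimal sketch below. The helper names are hypothetical and this is not the script from the protocol. The sequence length of 536 is an assumption taken from the PolyPhobius result for P04062 further down, and the calculation assumes that every residue of the sequence is evaluated.

```python
def call_disorder(scores, cutoff=0.5):
    """1-based positions whose per-residue score reaches the cutoff."""
    return {i for i, s in enumerate(scores, start=1) if s >= cutoff}


def expand(ranges):
    """Turn a string like '2, 3, 90-93' into {2, 3, 90, 91, 92, 93}."""
    residues = set()
    for part in ranges.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        residues.update(range(int(start), int(end or start) + 1))
    return residues


def ratio(numerator, denominator):
    """Return None instead of dividing by zero (an 'undef' measure)."""
    return numerator / denominator if denominator else None


def disorder_metrics(predicted, annotated, length):
    tp = len(predicted & annotated)
    fp = len(predicted - annotated)
    fn = len(annotated - predicted)
    tn = length - tp - fp - fn
    return {
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# IUPred calls for P04062 from the table above; DisProt has no annotation.
predicted = expand("2, 3, 6, 90-93, 229-231, 235, 236")
metrics = disorder_metrics(predicted, annotated=set(), length=536)
print(metrics)
```

With no annotated disordered residues, TP+FN is zero, so the sensitivity is undefined, which is why the table reports "undef" there.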
P10775 is not annotated in DisProt itself, but there is a significant homolog (DP00554) with a disordered region from 31 to 50. This region, however, is not covered by the pairwise alignment of P10775 and DP00554. Hence, IUPred correctly did not predict any disordered region (specificity 100%).
DisProt contains no entry suggesting a disordered region in Q9X0E6, nor does IUPred predict one.
Five disordered regions are annotated in DisProt for Q08209. These regions mainly cover the C-terminal end, which exhibits several rather arbitrarily arranged alpha-helices (<xr id="fig:ss_Q08209"/>). All residues predicted by IUPred with a probability >= 50% are covered by DisProt annotations (precision 100%), but IUPred recovers only about half of the annotated disorder (sensitivity 52%).
PolyPhobius was used to predict transmembrane helices for our protein P04062 and three other proteins: P35462, Q9YDF8 and P47863. The scripts used for the prediction can be found here [protocol]. The PolyPhobius predictions were then compared with the membrane assignments of the structures of these proteins in OPM and/or PDBTM. For that purpose, a corresponding PDB structure was needed for each protein and was taken from UniProt (<xr id="tab:ss_sequences_trans"/>):
|D(3) dopamine receptor||P35462||3PBL|
|Voltage-gated potassium channel||Q9YDF8||1ORQ 1ORS|
Table 6: The proteins used to predict transmembrane helices.
Our protein Glucosylceramidase is located in the lysosome and therefore contains no transmembrane regions. As expected, PolyPhobius did not report any transmembrane regions. Instead, most of the sequence was predicted to be non-cytoplasmic, and a signal peptide was found. Compared with the annotation in UniProt, the PolyPhobius prediction is correct (<xr id="tab:P04062_trans"/>):
|PolyPhobius||1-39||40-536 NON CYTOPLASMIC|
Table 7: Transmembrane helix prediction results for P04062.
For the D(3) dopamine receptor (P35462), only one PDB structure (3PBL) was found in UniProt. PolyPhobius, OPM and PDBTM all returned 7 transmembrane helices, which were confirmed by UniProt. In <xr id="tab:P35462_trans"/> we can see that even the start and stop positions of these regions deviate very little (at most 4 residues). Figures 6 and 7 visualize the transmembrane helices in OPM and PDBTM.
|P35462||TRANSMEM 1||TRANSMEM 2||TRANSMEM 3||TRANSMEM 4||TRANSMEM 5||TRANSMEM 6||TRANSMEM 7|
Table 8: Transmembrane helix prediction results for P35462.
For the protein KvAP (Voltage-gated potassium channel, Q9YDF8), no single PDB structure covers the whole sequence, so two PDB structures (1ORQ, 1ORS) were used. Unlike for the previous protein, the results varied between the different tools (<xr id="tab:Q9YDF8_trans"/>).
PolyPhobius came closest to the UniProt annotation, although it reported one transmembrane region fewer. Using both PDB structures, OPM recovered all transmembrane regions listed in UniProt, but most of them deviated considerably in their start and stop positions. The regions that PolyPhobius did report were positioned much more accurately. PDBTM missed two regions even though both PDB structures were used.
|Q9YDF8||TRANSMEM 1||TRANSMEM 2||TRANSMEM 3||TRANSMEM 4||TRANSMEM 5||TRANSMEM 6||TRANSMEM 7||TRANSMEM 8|
|Uniprot||39–63||68–92||97–105(Intramembrane)||109–125||129–145||160–184||196–208(Intramembrane)||222 – 253|
Table 10: Transmembrane helix prediction results for Q9YDF8.
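The boundary comparison described above can be automated by matching each reference segment to the predicted segment with the largest overlap and reporting the start and stop offsets. The sketch below is only an illustration under stated assumptions: the reference row is the UniProt row of the table above, whereas the `example_prediction` list is made up for demonstration and is not the output of PolyPhobius, OPM or PDBTM.

```python
import re


def parse_segments(row):
    """Extract (start, end) pairs from cells such as '39–63' or '222 – 253'."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\s*[–-]\s*(\d+)", row)]


def overlap(a, b):
    """Number of positions shared by two inclusive segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def compare(reference, predicted):
    """For each reference segment: best match and start/stop deviation."""
    report = []
    for ref in reference:
        best = max(predicted, key=lambda seg: overlap(ref, seg), default=None)
        if best is None or overlap(ref, best) == 0:
            report.append((ref, None, None, None))  # segment was missed
        else:
            report.append((ref, best, best[0] - ref[0], best[1] - ref[1]))
    return report


uniprot = parse_segments(
    "39–63||68–92||97–105(Intramembrane)||109–125||129–145||"
    "160–184||196–208(Intramembrane)||222 – 253"
)
# Hypothetical prediction, invented for this example only.
example_prediction = [(42, 60), (70, 90), (110, 127), (160, 182), (225, 250)]

report = compare(uniprot, example_prediction)
for ref, hit, d_start, d_stop in report:
    print(ref, hit, d_start, d_stop)
```

A negative start offset means the predicted segment begins earlier (further towards the N-terminus) than the reference; rows with `None` mark reference segments without any overlapping prediction.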
For the protein Aquaporin-4 (Mercurial-insensitive water channel, P47863), the X-ray PDB structure 2D57 was chosen because of its relatively high resolution. The results did not vary much between the different tools (<xr id="tab:P47863_trans"/>).
Unlike PolyPhobius and UniProt, OPM reported two additional transmembrane regions. PDBTM also returned two similar regions, which it classified as "Membrane loop". At transmembrane region 2, all three tools showed a leftward shift of at least 5 residues.
|PDBTM||39-55||72-89||95-106 (Membrane loop)||116-133||158-177||188-205||209-222 (Membrane loop)||231-248|
Table 12: Transmembrane helix prediction results for P47863.
The online server of SignalP (version 4.0) was used to predict signal peptides for our protein P04062 and three other proteins: P02768, P47863 and P11279 (<xr id="tab:ss_sequences_signa"/>). We also used PolyPhobius and the information from UniProt and the Signal Peptide Website to compare and validate the prediction results.
SignalP uses three different scores: C, S and Y. Two additional scores, S-mean and the D-score, are reported in the output. The SignalP algorithm employs two different neural networks, one for predicting the actual signal peptide and one for predicting the position of the signal peptidase cleavage site.
- The S-score is reported for every single amino acid position. A high S-score suggests that the corresponding amino acid is part of a signal peptide, whereas a low score indicates that the amino acid is part of the mature protein.
- The C-score is the so-called cleavage site score, which should only be significantly high at the cleavage site.
- The Y-max is calculated by combining the C-score with the S-score and produces a better cleavage site prediction than the raw C-score alone.
- The S-mean is the average of the S-score for the length of the predicted signal peptide.
- The D-score is calculated as a weighted average of the S-mean and the Y-max scores.
Observing the above scores will help us to locate the signal peptides.
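How the derived scores relate to each other can be sketched as follows. This is not SignalP's implementation: the function names are ours, and the weight of 0.6 on Y-max is an assumption, chosen only because it reproduces the D values reported on this page for the two SignalP-TM runs (P04062 and P47863). It is not taken from the SignalP documentation, and the SignalP-noTM runs evidently use a different weighting.

```python
def mean_s(s_scores, cleavage_pos):
    """Average S-score over positions 1 .. cleavage_pos - 1."""
    peptide = s_scores[: cleavage_pos - 1]
    return sum(peptide) / len(peptide)


def d_score(s_mean, y_max, y_weight):
    """Weighted average of S-mean and Y-max (y_weight is an assumed value)."""
    return y_weight * y_max + (1 - y_weight) * s_mean


def has_signal_peptide(d, cutoff):
    """A signal peptide is reported when D exceeds the cutoff."""
    return d > cutoff


Y_WEIGHT_TM = 0.6  # assumption fitted to the outputs on this page

# P04062: mean S = 0.323, max. Y = 0.396 (values from the SignalP output)
d_p04062 = d_score(s_mean=0.323, y_max=0.396, y_weight=Y_WEIGHT_TM)
print(round(d_p04062, 3))
print(has_signal_peptide(d_p04062, cutoff=0.50))  # D-cutoff shown for SignalP-TM
print(has_signal_peptide(d_p04062, cutoff=0.35))  # lowered cutoff
```

The last two lines illustrate why the call for P04062 depends on the chosen cutoff: the D-score lies between 0.35 and 0.50.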
|Lysosome-associated membrane glycoprotein 1||P11279||-|
Table 11: The proteins used to predict signal peptides.
With the default options, SignalP did not return a positive result for our protein Glucosylceramidase (P04062), even though the S-score and C-score were fairly high. After lowering the D-score cutoff to 0.35, SignalP reported a signal peptide, which was confirmed by other tools and databases (<xr id="tab:ss_sinal_p04062"/>).
 # Measure  Position  Value  Cutoff  signal peptide?
 max. C     40        0.305
 max. Y     40        0.396
 max. S     36        0.684
 mean S     1-39      0.323
 D          1-39      0.367  0.350   YES
 Name=sp_P04062_GLCM_HUMAN SP='YES' Cleavage site between pos. 39 and 40: ASG-AR D=0.367 D-cutoff=0.350 Networks=SignalP-TM
|Signal Peptide Website||1-39|
|signal peptide sequence||MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASG|
Table 12: Signal peptide prediction for P04062.
Figure 15: Signal peptide of Glucosylceramidase (P04062) as shown on the Signal Peptide Website.
For the protein Serum albumin (P02768), SignalP returned a positive prediction, which was validated using the other tools.
 # Measure  Position  Value  Cutoff  signal peptide?
 max. C     19        0.710
 max. Y     19        0.798
 max. S     2         0.929
 mean S     1-18      0.890
 D          1-18      0.848  0.450   YES
 Name=sp_P02768_ALBU_HUMAN SP='YES' Cleavage site between pos. 18 and 19: AYS-RG D=0.848 D-cutoff=0.450 Networks=SignalP-noTM
|Signal Peptide Website||1-18|
|signal peptide sequence||MKWVTFISLLFLFSSAYS|
Table 13: Signal peptide prediction for P02768.
Figure 17: Signal peptide of Serum albumin (P02768) as shown on the Signal Peptide Website.
For the protein Aquaporin-4 (P47863), SignalP returned scores so low that lowering the cutoff value would not have been meaningful. None of the other tools reported a signal peptide either (<xr id="tab:ss_sinal_P47863"/>).
 # Measure  Position  Value  Cutoff  signal peptide?
 max. C     41        0.209
 max. Y     41        0.164
 max. S     56        0.183
 mean S     1-40      0.139
 D          1-40      0.154  0.500   NO
 Name=sp_P47863_AQP4_RAT SP='NO' D=0.154 D-cutoff=0.500 Networks=SignalP-TM
|Signal Peptide Website||-|
|signal peptide sequence||-|
Table 14: Signal peptide prediction for P47863.
For the protein Lysosome-associated membrane glycoprotein 1 (P11279), as expected from its name, a long signal peptide was found with a very high score:
 # Measure  Position  Value  Cutoff  signal peptide?
 max. C     29        0.910
 max. Y     29        0.939
 max. S     21        0.996
 mean S     1-28      0.962
 D          1-28      0.952  0.450   YES
 Name=sp_P11279_LAMP1_HUMAN SP='YES' Cleavage site between pos. 28 and 29: ASA-AM D=0.952 D-cutoff=0.450 Networks=SignalP-noTM
|Signal Peptide Website||1-28|
|signal peptide sequence||MAAPGSARRPLLLLLLLLLLGLMHCASA|
Table 15: Signal peptide prediction for P11279.
Figure 20: Signal peptide of Lysosome-associated membrane glycoprotein 1 (P11279) as shown on the Signal Peptide Website.
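When several proteins are processed, the summary line at the end of each SignalP output shown above can be parsed automatically. The following is a minimal sketch that assumes the summary line has exactly the layout of the four outputs on this page; the regular expression and the function name are ours and are not part of SignalP.

```python
import re

SUMMARY = re.compile(
    r"Name=(?P<name>\S+)\s+SP='(?P<sp>YES|NO)'"
    r"(?:\s+Cleavage site between pos\. (?P<left>\d+) and (?P<right>\d+):"
    r" (?P<motif>\S+))?"
    r"\s+D=(?P<d>[\d.]+)\s+D-cutoff=(?P<cutoff>[\d.]+)"
    r"\s+Networks=(?P<networks>\S+)"
)


def parse_summary(line):
    """Parse one SignalP summary line into a dictionary."""
    m = SUMMARY.search(line)
    if m is None:
        raise ValueError("not a SignalP summary line: " + line)
    cleavage = (int(m["left"]), int(m["right"])) if m["left"] else None
    return {
        "name": m["name"],
        "signal_peptide": m["sp"] == "YES",
        "cleavage": cleavage,  # None when no signal peptide is reported
        "motif": m["motif"],
        "d": float(m["d"]),
        "cutoff": float(m["cutoff"]),
        "networks": m["networks"],
    }


lines = [
    "Name=sp_P04062_GLCM_HUMAN SP='YES' Cleavage site between pos. 39 and 40: "
    "ASG-AR D=0.367 D-cutoff=0.350 Networks=SignalP-TM",
    "Name=sp_P47863_AQP4_RAT SP='NO' D=0.154 D-cutoff=0.500 Networks=SignalP-TM",
]
results = [parse_summary(line) for line in lines]
for r in results:
    print(r["name"], r["signal_peptide"], r["cleavage"], r["d"], r["cutoff"])
```

The cleavage-site part is optional in the pattern because it is absent from negative predictions such as the one for P47863.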
GOPET was used to predict GO terms for our protein Glucosylceramidase (P04062). The results are shown in <xr id="tab:ss_gopet"/>; only three GO terms were returned. Changing the search options, such as lowering the confidence threshold, did not change the prediction results.
All three GO terms belong to the "Molecular Function" ontology and involve "hydrolase" activity or "glycosyl bonds". Since we already know that the function of our protein Glucosylceramidase (P04062) is to cleave the glucosidic bonds of glucosylceramide and synthetic beta-glucosides, all three predicted GO terms are correct. However, compared with the list of GO terms from UniProt (QuickGO) or the ProtFun prediction shown below, GOPET returned considerably fewer results. This suggests that GOPET prefers to report only GO terms with high confidence.
|GO ID||Aspect||Confidence||GO term||Validated in QuickGO|
|GO:0016798||F||97%||hydrolase activity acting on glycosyl bonds||yes|
Table 16: GO terms prediction results for P04062 in GOPET.
ProtFun2.0 was also used to predict GO terms for our protein Glucosylceramidase (P04062). The results are listed in <xr id="tab:ss_protfun"/>. In the table, there are two scores for each predicted term. The first is the estimated probability that the entry belongs to the class. It is influenced by the prior probability of that class. The second number represents the odds that the sequence belongs to that class/category. It is independent of the prior probability.
Compared with GOPET, which reported only GO identifiers, ProtFun returned much more information about the input protein:
First, the input protein was classified into 12 different functional categories based on the scheme developed by Monica Riley for E. coli in 1993<ref name="protfun_scheme">Monica Riley (1993). "Functions of the gene products of Escherichia coli.". Microbiol Rev</ref>. The "Functional category" section of the table lists all functional categories with their probabilities and odds. For our protein Glucosylceramidase (P04062), ProtFun considered "Cell_envelope" (in green) the most likely functional category.
In the second part, the protein was classified as enzyme or non-enzyme; our protein was correctly predicted to be an enzyme.
In the third part, the protein was further classified into 6 different enzyme classes, and possible EC numbers were also given. In that section, ProtFun did not return a significant prediction: no class had a significantly higher probability and odds than the others. Even so, Hydrolase (EC 3.-.-.-), with the highest probability of 0.272, is a good suggestion, since we already know that our protein has EC 3.2.1.45.
In the last part, we can see different Gene Ontology categories with their prediction scores. The categories listed here appear to differ from those of the Gene Ontology itself. For our protein, ProtFun returned "Immune_response" as the most likely Gene Ontology category, which remains questionable. Since ProtFun provides no further information about this, the prediction is difficult to judge.
|Oxidoreductase (EC 1.-.-.-)||0.083||0.399|
|Transferase (EC 2.-.-.-)||0.228||0.660|
|Hydrolase (EC 3.-.-.-)||0.272||0.859|
|Lyase (EC 4.-.-.-)||0.045||0.961|
|Isomerase (EC 5.-.-.-)||0.011||0.345|
|Ligase (EC 6.-.-.-)||0.017||0.332|
|Gene Ontology category||Prob||Odds|
Table 17: GO term prediction results for P04062 in ProtFun2.0.
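To make the reading of the enzyme class section concrete, the rows of the table can be ranked by either of the two scores. The sketch below simply re-enters the six values from the table by hand; it does not call ProtFun, and the variable names are ours.

```python
# (probability, odds) per enzyme class, copied from the table above.
enzyme_classes = {
    "Oxidoreductase (EC 1.-.-.-)": (0.083, 0.399),
    "Transferase (EC 2.-.-.-)": (0.228, 0.660),
    "Hydrolase (EC 3.-.-.-)": (0.272, 0.859),
    "Lyase (EC 4.-.-.-)": (0.045, 0.961),
    "Isomerase (EC 5.-.-.-)": (0.011, 0.345),
    "Ligase (EC 6.-.-.-)": (0.017, 0.332),
}

# Rank the classes by probability (index 0) and by odds (index 1).
by_prob = sorted(enzyme_classes, key=lambda c: enzyme_classes[c][0], reverse=True)
by_odds = sorted(enzyme_classes, key=lambda c: enzyme_classes[c][1], reverse=True)

top, runner_up = by_prob[0], by_prob[1]
margin = enzyme_classes[top][0] - enzyme_classes[runner_up][0]

print("highest probability:", top)
print("margin over", runner_up, "=", round(margin, 3))
print("highest odds:", by_odds[0])
```

Ranked by probability, Hydrolase comes first but leads Transferase by only a small margin, which matches the observation that no class stands out. Ranked by odds, a different class (Lyase) comes first in these values, so the two scores should be read together.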
Finally, Pfam was used to identify the Pfam family of our protein Glucosylceramidase (P04062). As we can see in the following figure, Pfam found one domain: Glycoside hydrolase family 30.
Glycoside hydrolases are a group of enzymes that hydrolyse the glycosidic bond between two or more carbohydrates, or between a carbohydrate and a non-carbohydrate moiety. A classification system for glycoside hydrolases, based on sequence similarity, has led to the definition of >100 different families. This system is available on the CAZy web site.
One well-known enzyme in this group is the mammalian Glucosylceramidase (P04062), the protein we are working with. It cleaves the glucosidic bonds of glucosylceramide and synthetic beta-glucosides. Any one of over 50 different mutations in the glucocerebrosidase gene has been found to affect the activity of this hydrolase, producing variants of Gaucher disease, the most prevalent lysosomal storage disease.