Sequence-based analyses Gaucher Disease
Secondary structure
Knowing the secondary structure of a protein can shed light on its function, since structure largely determines function. If the structure of a protein is known, secondary structure elements (helix, sheet, coil) can be assigned to its residues based on their hydrogen-bonding patterns. DSSP <ref name="DSSP">Kabsch W, Sander C (1983). "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features". Biopolymers</ref> is the most common method to perform such secondary structure assignments. If the structure of a protein is unknown, secondary structure elements can be predicted by tools like PSIPRED <ref name="PSIPRED">Liam J. McGuffin, Kevin Bryson, and David T. Jones (2000). "The PSIPRED protein structure prediction server". Bioinformatics</ref> or Reprof<ref name="Reprof">B Rost, C Sander (1993). "Prediction of protein secondary structure at better than 70% accuracy". J. Mol. Biol.</ref>. The aim of this task was to analyse the secondary structure of different proteins and to compare the secondary structure predictions of PSIPRED and Reprof with the DSSP secondary structure assignments. The following sequences were taken into account: <figtable id="tab:ss_sequences">
NAME | UniProtKB | PDB |
---|---|---|
Glucosylceramidase | P04062 | 1OGS |
Ribonuclease inhibitor | P10775 | 1DFJ |
Divalent-cation tolerance protein CutA | Q9X0E6 | 1KR4 |
Serine/threonine-protein phosphatase | Q08209 | 1AUI |
</figtable> Information about program calls and implementation details can be found in our protocol.
Predictions
To make the different output formats easier to compare, we mapped the secondary structure definitions of all three methods onto the three letters H (helix), E (sheet), and C (coil) according to <xr id="tab:ss_mapping"/>. Regions of the UniProt sequences that were not present in the PDB file, as well as regions where no DSSP assignment was possible, were ignored. <figtable id="tab:ss_mapping">
Method | H | E | C |
---|---|---|---|
DSSP | H,G,I | E,B | T,S,' ' |
PSIPRED | H | E | C |
Reprof | H | E | L |
</figtable>
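As a minimal sketch, the mapping in the table above could be applied as follows. The names `THREE_STATE` and `to_three_state` are our own illustration and are not taken from the protocol; the sketch assumes each annotation is given as a plain string with one symbol per residue.

```python
# Three-state reduction following the mapping table above.
# THREE_STATE and to_three_state are illustrative names (assumptions),
# not part of DSSP, PSIPRED, or Reprof.
THREE_STATE = {
    "DSSP": {"H": "H", "G": "H", "I": "H",
             "E": "E", "B": "E",
             "T": "C", "S": "C", " ": "C"},
    "PSIPRED": {"H": "H", "E": "E", "C": "C"},
    "Reprof": {"H": "H", "E": "E", "L": "C"},
}


def to_three_state(annotation: str, method: str) -> str:
    """Map a per-residue annotation string onto the letters H, E, and C."""
    table = THREE_STATE[method]
    mapped = []
    for symbol in annotation:
        if symbol not in table:
            # Fail loudly instead of silently guessing a state.
            raise ValueError(f"unexpected {method} symbol: {symbol!r}")
        mapped.append(table[symbol])
    return "".join(mapped)


# A DSSP string using every symbol of the table, and a Reprof string.
dssp_example = to_three_state("HGI EBTS", "DSSP")
reprof_example = to_three_state("HELL", "Reprof")
```

Raising an error on unknown symbols is a design choice of this sketch: a symbol outside the table usually points to a parsing problem rather than to a coil residue.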
P04062
Glucosylceramidase (P04062) is located in the membrane of lysosomes. It exhibits two domains which belong to (1) the glycosyl hydrolase domain fold and (2) the TIM beta/alpha-barrel fold. Both domains have hydrophobic beta sheets which anchor the protein in the membrane. <xr id="fig:ss_P04062"/> depicts the secondary structure elements of the corresponding crystal structure, which coincide with the DSSP assignments. The following section shows the secondary structure annotations of the different methods. The PSIPRED predictions agree better with the DSSP assignments than the Reprof predictions do. Reprof predicts sheets instead of helices in several regions. The residues of the beta-barrel sheets (the tube in the middle of <xr id="fig:ss_P04062"/>) are marked by asterisks.
</figure>
40 ARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKG Reprof CCCCCCCCCCCEEEEEEECCEECCCCCCCCCCCCCEEEEEEECCCCCEEEEECCCEECCCCCCEEEEEECCCCEEEEEEC PSIPRED CCCCCCCCCCCCCEEEEECCHHCCCCCCCCCCCCCEEEEEEECCCCCCHHCCCCCCCCCCCCCCCEEEECCCCCCEEEEE DSSP CECCCEEECCCCCEEEEEECCCCCECCCCCCCCCCEEEEEEEECCCCCCEEEEEECECCCCCCCEEEEEEEEEEEEECCE 120 FGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPL Reprof CCCCCCHHHHHEEEEECCCCCCEEEEEEECCCCCEEEEEEECCCCCCEEEEEEEECCCCCCCEEEEECCCCCCCCCEEEE PSIPRED EEEEHHHHHHHHHHHCCHHHHHHHHHHHCCCCCCEEEEEEEEECCCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHH DSSP EEEECCHHHHHHHCCCCHHHHHHHHHHHHCCCCCCCCEEEEEECCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHCCHHHH 200 IHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGL Reprof EEHHHHHCCCCCEEEECCCCCCCEEEECCCECCCEECCCCCCCCCCHHHHHHHHHHHHHCCCCEEEEEEEEECCCCCCCE PSIPRED HHHHHHHHCCCEEEEEEECCCCHHEEECCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHCCCEEEEEEECCCCCCCC DSSP HHHHHHHCCCCCEEEEEECCCCHHHECCCCCCCCCEECCCCCCHHHHHHHHHHHHHHHHHHHCCCCCCEEECCCCCCHHH ****** *** 280 LSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPA Reprof ECCCCEEEECCCCCCCCCEEEECCCCCCCCCCCCEEEEEEECCCCEECCCEEEEEECCCCCCEEEEEEEEEEEEEECCCC PSIPRED CCCCCCCCCCCCHHHHHHHHHHHHHHHHHHCCCCCEEEEEECCCCCCHHHHHHHHHCCHHHHHHCCEEEEECCCCCCCCH DSSP CCCCCCCCCECCHHHHHHHHHHCHHHHHHCCCCCCCEEEEEEEEHHHCCHHHHHHHCCHHHHCCCCEEEEEEECCCCCCH ******** ******* 360 KATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHSIITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDS Reprof CCECCCCCECCCCCEEEECCCCCCCEEEEEEEEECCCCCCCEEECEHHHHEEEEEEECCCCEEEECCCCCCEEEEECCCC PSIPRED HHHHHHHHHHCCCCEEEEECCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHEEEEEEEEECCCCCCCCCCCCCCC DSSP HHHHHHHHHHCCCCEEEEEEEECCCCCCCCCCCCCCHHHHHHHHHHHHHHHHCCEEEEEEEECCECCCCCCCCCCCCCCC ******** ******** 440 PIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFL Reprof CEEEEECCCCCCCCCEEEECCCEEEECCCCCEEEEEEEECCCCCEEEEEECCCCCEEEEEEECCCCCCEEEECCCCEEEE PSIPRED 
CEEEECCCCEEEECHHHHHHHHHHHHCCCCCEEEEEECCCCCCEEEEEEECCCCCEEEEEEECCCCCEEEEEEECCCEEE DSSP CEEEEHHHCEEEECHHHHHHHHHHCCCCCCCEEEEEEECCCCCEEEEEEECCCCCEEEEEEECCCCCEEEEEEECCCEEE 520 ETISPGYSIHTYLWRRQ Reprof EEECCCCEEEEEEEECC PSIPRED EEECCCCEEEEEEEECC DSSP EEEECCCEEEEEEECCC |
<figure id="fig:ss_P04062"> |
P10775
<xr id="fig:ss_P10775"/> depicts the crystal structure 1DFJ, which corresponds to P10775. It has two domains: d1dfji_ is a repeat domain consisting of alternating alpha-helices and parallel beta-sheets. d1dfje_ contains long curved antiparallel beta-sheets and three alpha-helices. The alternating HHH and EEE regions in the following secondary structure annotations agree well with the repetitive structure shown in <xr id="fig:ss_P10775"/>. Again, the PSIPRED predictions match the DSSP assignments better than the Reprof predictions do.
</figure>
1 MNLDIHCEQLSDARWTELLPLLQQYEVVRLDDCGLTEEHCKDIGSALRANPSLTELCLRTNELGDAGVHLVLQGLQSPTC Reprof CCCCECHHCCCCCHHHHHHHHHHHCCEEEECCCCCCHHHHHHHHHHHHCCCCHHHHHHHHCCCCCCCHEEEHCCCCCCCC PSIPRED CEEECCCCCCCHHHHHHHHHHHCCCCEEECCCCCCCHHHHHHHHHHHHHCCCCCEEECCCCCCCHHHHHHHHHHHHHCCC DSSP CECCCECCCCCHHHHHHHHHHHCCCCEEECECCCCCHHHHHHHHHHHHCCCCCCEEECCCCCCHHHHHHHHHHHHCCCCC 81 KIQKLSLQNCSLTEAGCGVLPSTLRSLPTLRELHLSDNPLGDAGLRLLCEGLLDPQCHLEKLQLEYCRLTAASCEPLASV Reprof EEEEECCCCCCCCHCCCCCCHHHHCHCHHHHHHCCCCCCCCHHHHHHHHHCCCCCHCCHHHHHHHHHHCCCCCCHHHHHH PSIPRED CCCEEECCCCCCCHHHHHHHHHHHHCCCCCCEEECCCCCCCHHHHHHHHHHHHCCCCCCCEEECCCCCCCHHHHHHHHHH DSSP CCCEEECCCCCCCCCHHHHHHHHHHHCCCCCEEECCCCCCHHHHHHHHHHHHHCCCCCCCEEECCCCCCCHHHHHHHHHH 161 LRATRALKELTVSNNDIGEAGARVLGQGLADSACQLETLRLENCGLTPANCKDLCGIVASQASLRELDLGSNGLGDAGIA Reprof HHHHHHHHHHCCCCCCHHHHHHHHHCCCCCCHHHHHHHHHHCCCCCCCCCHHHHHHHHHCHCCHHHCCCCCCCCCHHHHH PSIPRED HHHCCCCCEEECCCCCCCHHHHHHHHHHHHCCCCCCCEEECCCCCCCHHHHHHHHHHHHCCCCCCEEECCCCCCCHHHHH DSSP HHHCCCCCEEECCCCCCHHHHHHHHHHHHHCCCCCCCEEECCCCCCCHHHHHHHHHHHHHCCCCCEEECCCCCCHHHHHH 241 ELCPGLLSPASRLKTLWLWECDITASGCRDLCRVLQAKETLKELSLAGNKLGDEGARLLCESLLQPGCQLESLWVKSCSL Reprof HHCCCCCCCHHHHCHHEEEHCCCCCHHHHHHHHHHHHHHHHHHHHHHCCCCCCHHHHHHHHHHCCCCCCHHHHHHHHCHH PSIPRED HHHHHHHHCCCCCCEEECCCCCCCHHHHHHHHHHHHCCCCCCEEECCCCCCCHHHHHHHHHHHHCCCCCCCEEECCCCCC DSSP HHHHHHCCCCCCCCEEECCCCCCCHHHHHHHHHHHHHCCCCCEEECCCCCCHHHHHHHHHHHHHCCCCCCCEEECCCCCC 321 TAACCQHVSLMLTQNKHLLELQLSSNKLGDSGIQELCQALSQPGTTLRVLCLGDCEVTNSGCSSLASLLLANRSLRELDL Reprof HHHHHHHHHHHHHCCHHHHHHHCCCCCCCCHHHHHHHHHHCCCCCEEEEEEECCCCCCCCCHHHHHHHHHHHCCHHHHCC PSIPRED CHHHHHHHHHHHHHCCCCCEEECCCCCCCHHHHHHHHHHHCCCCCCCCEEECCCCCCCHHHHHHHHHHHHHCCCCCEEEC DSSP EHHHHHHHHHHHHHCCCCCEEECCCCECHHHHHHHHHHHHHCCCCCCCEEECCCCCCEHHHHHHHHHHHHHCCCCCEEEC 401 SNNCVGDPGVLQLLGSLEQPGCALEQLVLYDTYWTEEVEDRLQALEGSKPGLRVIS Reprof CCCCCCCHHHHHHHCCCCCCCCHHHHHHHCCCCCCHHHHHHHHHHHCCCCCCCECC PSIPRED CCCCCCHHHHHHHHHHHHCCCCCCCEEECCCCCCCHHHHHHHHHHHHHCCCCEECC DSSP CCCECCHHHHHHHHHHHCCCCCCCCEEECCCCCCCHHHHHHHHHHHHHCCCCEEEC |
<figure id="fig:ss_P10775"> |
Q9X0E6
<xr id="fig:ss_Q9X0E6"/> depicts the d1kr4a_ domain of 1KR4, which consists of three alpha-helices interrupted by beta-sheets. Reprof predicts helices that are too long.
</figure>
2 ILVYSTFPNEEKALEIGRKLLEKRLIACFNAFEIRSGYWWKGEIVQDKEWAAIFKTTEEKEKELYEELRKLHPYETPAIF Reprof EEEEECCCCHHHHHHHHHHHHHHHHHHHHCHCHHHCCCEEECEEECCHHHHHHHCCCHHHHHHHHHHHHHCCCCCCCHHE PSIPRED EEEEEECCCHHHHHHHHHHHHHCCCEEEEEEEEEEEEEEECCCEEEEEEEEEEECCCHHHHHHHHHHHHHHCCCCCCEEE DSSP EEEEEEECCHHHHHHHHHHHHHCCCCCEEEEEEEEEEEEECCEEEEEEEEEEEEEEEHHHHHHHHHHHHHHCCCCCCCEE 82 TLKVENVLTEYMNWLRESVL Reprof HHHHHHHHHHHHHHHHHHCC PSIPRED EEECCCCCHHHHHHHHHHCC DSSP EECCCCEEHHHHHHHHHHCC |
<figure id="fig:ss_Q9X0E6"> |
Q08209
Q08209 contains the domains d1auib_ and d1auia_, which are composed mainly of alpha-helices (<xr id="fig:ss_Q08209"/>). PSIPRED predicts these alpha-helices considerably better than Reprof, which suggests beta-sheets in some regions.
</figure>
14 TDRVVKAVPFPPSHRLTAKEVFDNDGKPRVDILKAHLMKEGRLEESVALRIITEGASILRQEKNLLDIDAPVTVCGDIHG Reprof CCCEEEEECCCCCCCEEEEEEECCCCCCEEEEEHHHECCCCCCCEEEEEEEEECCCCEEECCCCCCCCCCCEEEEECCCC PSIPRED CCCCCCCCCCCCCCCCCHHHCCCCCCCCCHHHHHHHHHCCCCCCHHHHHHHHHHHHHHHHHCCCEEEECCCEEEECCCCC DSSP CCCCCCCCCCCCCCCECHHHHECCCCCECHHHHHHHHHCCCCECHHHHHHHHHHHHHHHHCCCCEEEECCCEEEECCCCC 94 QFFDLMKLFEVGGSPANTRYLFLGDYVDRGYFSIECVLYLWALKILYPKTLFLLRGNHECRHLTEYFTFKQECKIKYSER Reprof HHHHHHEEEEECCCCCCCEEEEEEEECCCCEEEEEEEHHHHHHHCCCCCEEEEEECCCCCCEEEEEEEEEEEEEEEEECH PSIPRED HHHHHHHHHHHCCCCCCCCEEECCCCCCCCCCCHHHHHHHHHHHHCCCCCEEEECCCCHHHHHHCCCCHHHHHHHHCCHH DSSP CHHHHHHHHHHHCCCCCCCEEECCCCCCCCCCHHHHHHHHHHHHHHCCCCEEECCCCCCCHHHHHHCCHHHHHHHHCCHH 174 VYDACMDAFDCLPLAALMNQQFLCVHGGLSPEINTLDDIRKLDRFKEPPAYGPMCDILWSDPLEDFGNEKTQEHFTHNTV Reprof HHHHHHHHCCCCCHHHHHCCCEEEEECCCCCCCCCHHHHHHHHCCCCCCCCCCCEEEEECCCCCCCCCCCCCCECCCCCE PSIPRED HHHHHHHHHCCCHHHHHCCCCEEEECCCCCCCCCCHHHHHHCCCCCCCCCCCHHHHHHCCCCCCCCCCCCCCCCCCCCCC DSSP HHHHHHHHHCCCCCEEEECCCEEEECCCCCCCCCCHHHHHHCCCCCCCCCCCHHHHHHHCEECCCCCCCCCCCCEEECCC 254 RGCSYFYSYPAVCEFLQHNNLLSILRAHEAQDAGYRMYRKSQTTGFPSLITIFSAPNYLDVYNNKAAVLKYENNVMNIRQ Reprof CCEEEEECCCCEEEEHCCCCHHHHEHHHCCCCCCEEEEEECCCCCCCEEEEEEECCCEEEEECCCEEEEEECCCEEEEEE PSIPRED CCCCCCCCHHHHHHHHHHCCCCEEEEHHHHHHHHHHHHHCCCCCCCCCEEEEEECCCCCCCCCCEEEEEEECCCCCEEEE DSSP CCCCEEECHHHHHHHHHHCCCCEEEECCCCCCCCEEECCECCCCCCECEEEECCCCCHHHCCCCCEEEEEEECCEEEEEE 334 FNCSPHPYWLPNFMDVFTWSLPFVGEKVTEMLVNVLNICSSFEEAKGLDRINERMPPR Reprof ECCCCCCCCCCCCCEEEEEECCCCCHHHHHHHHHHHEECCEHHHHCCCCHHCCCCCCC PSIPRED ECCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHCCCCHHHHHHHHHHHHCCCCC DSSP ECCCCCCCCCHHHCCHHHHHHHHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHCCCCC |
<figure id="fig:ss_Q08209"> |
Prediction accuracy/precision
We compared the prediction performance of PSIPRED and Reprof via the Q3 score and the precision of the three secondary structure states H, E, and C. The Q3 score is identical to the accuracy, i.e. the number of correctly predicted states divided by the number of residues considered. The precision of state X is the fraction of predictions of X that are correct, formally precision(X)=TP(X)/(TP(X)+FP(X)). <xr id="ss_acc"/> shows the results: PSIPRED clearly outperforms Reprof in terms of Q3 for all four proteins, although Reprof reaches a higher precision for a single state in two instances (H for P04062, E for Q9X0E6). PSIPRED achieves an average accuracy of 87%, which is considerably higher than the 58% achieved by Reprof. <figtable id="ss_acc">
Method | Q3 | Precision H | Precision E | Precision C |
---|---|---|---|---|
P04062 | ||||
PSIPRED | 0.831 | 0.830 | 0.872 | 0.810 |
Reprof | 0.553 | 1.000 | 0.455 | 0.592 |
P10775 | ||||
PSIPRED | 0.941 | 0.959 | 0.960 | 0.919 |
Reprof | 0.603 | 0.589 | 0.417 | 0.644 |
Q9X0E6 | ||||
PSIPRED | 0.890 | 1.000 | 0.895 | 0.720 |
Reprof | 0.580 | 0.562 | 0.917 | 0.458 |
Q08209 | ||||
PSIPRED | 0.833 | 0.842 | 0.902 | 0.812 |
Reprof | 0.579 | 0.762 | 0.293 | 0.743 |
</figtable>
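The two measures can be sketched in a few lines, assuming that the prediction and the DSSP assignment are already reduced to H, E, and C and aligned residue by residue. The function names are our own illustration, not the implementation from the protocol. The example strings are the final 20 residues of the Q9X0E6 annotation shown above, so the resulting values describe only that short segment and not the whole protein.

```python
from typing import Dict, Optional

STATES = ("H", "E", "C")


def q3(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state equals the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


def precision_per_state(predicted: str, observed: str) -> Dict[str, Optional[float]]:
    """precision(X) = TP(X) / (TP(X) + FP(X)); None if X is never predicted."""
    if len(predicted) != len(observed):
        raise ValueError("need two strings of equal length")
    result: Dict[str, Optional[float]] = {}
    for state in STATES:
        tp = sum(p == state and o == state for p, o in zip(predicted, observed))
        fp = sum(p == state and o != state for p, o in zip(predicted, observed))
        result[state] = tp / (tp + fp) if (tp + fp) > 0 else None
    return result


# Last 20 residues of the Q9X0E6 annotation (illustrative segment only).
dssp = "EE" + "CCCC" + "EE" + "H" * 10 + "CC"
psipred = "EEE" + "CCCCC" + "H" * 10 + "CC"
reprof = "H" * 18 + "CC"

q3_psipred = q3(psipred, dssp)
q3_reprof = q3(reprof, dssp)
precision_reprof = precision_per_state(reprof, dssp)
```

Returning `None` when a state is never predicted keeps an undefined precision distinguishable from a precision of zero.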
Disorder
Disordered regions are regions without a fixed three-dimensional structure. Nevertheless, such regions can be functionally highly important: regulation, signalling, and flexible ligand binding are only some examples. DisProt<ref name="DisProt">Vucetic S, Obradovic Z, Vacic V, et al. (2005). "DisProt: a database of protein disorder". Bioinformatics</ref> is a curated database of proteins with experimentally determined disordered regions. IUPred<ref name="IUPred">Zsuzsanna Dosztányi, Veronika Csizmók, Péter Tompa and István Simon (2005). "The Pairwise Energy Content Estimated from Amino Acid Composition Discriminates between Folded and Intrinsically Unstructured Proteins". J. Mol. Biol.</ref> is a method for predicting disordered regions ab initio, i.e. based solely on the protein sequence. We compared the predictions of IUPred with the annotations in the DisProt database for all four example proteins. IUPred was called to predict long, global disorder regions (see the protocol for details). Residues involved in disordered regions were defined as those with a probability of at least 50%. These residues were compared to the DisProt annotations, taken either from the entry for the UniProt accession itself, if available, or from a significant homolog (e-value < 1e-3) for which a DisProt entry existed. We measured the performance of IUPred via the precision (TP/(TP+FP)), sensitivity (TP/(TP+FN)), and specificity (TN/(TN+FP)).
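The evaluation described above can be sketched as follows. All function names are our own illustration and do not come from the protocol or from IUPred. The example uses the IUPred residues reported for P04062 below and assumes a sequence length of 536, which we infer from the residue numbering of the P04062 annotation; a measure whose denominator is zero is returned as `None`.

```python
from typing import Dict, Optional, Sequence, Set


def disordered_residues(scores: Sequence[float], threshold: float = 0.5) -> Set[int]:
    """1-based positions whose per-residue score reaches the threshold."""
    return {i for i, score in enumerate(scores, start=1) if score >= threshold}


def parse_regions(text: str) -> Set[int]:
    """Turn a listing such as '2, 3, 90-93' into a set of residue numbers."""
    residues: Set[int] = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            residues.update(range(int(start), int(end) + 1))
        else:
            residues.add(int(part))
    return residues


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return None instead of dividing by zero (undefined measure)."""
    return numerator / denominator if denominator else None


def disorder_metrics(predicted: Set[int], annotated: Set[int],
                     length: int) -> Dict[str, Optional[float]]:
    """Per-residue precision, sensitivity, and specificity."""
    all_residues = set(range(1, length + 1))
    tp = len(predicted & annotated)
    fp = len(predicted - annotated)
    fn = len(annotated - predicted)
    tn = len(all_residues - predicted - annotated)
    return {
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# IUPred residues listed for P04062; no DisProt annotation exists.
# The length of 536 is an assumption inferred from the alignment numbering.
predicted = parse_regions("2, 3, 6, 90-93, 229-231, 235, 236")
metrics = disorder_metrics(predicted, set(), length=536)
```

Under these assumptions the sketch yields a precision of 0, an undefined sensitivity, and a specificity of roughly 98%.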
P04062
Neither Glucosylceramidase (P04062) nor a homologous protein is annotated in DisProt. This might be due to a lack of experimental data or, more likely, to the absence of disordered regions. The latter assumption is supported by the highly structured protein complex 1OGS (<xr id="fig:ss_P04062"/>). However, IUPred predicts some disordered residues with a probability >= 50%.
<figtable id="disorder_P04062">
Method | Disorder regions | |
---|---|---|
IUPred | 2, 3, 6, 90-93, 229-231, 235, 236 | |
DisProt | ||
Precision: 0% | Sensitivity: undef | Specificity: 98% |
</figtable>
P10775
P10775 is not annotated in DisProt itself, but there is a significant homolog (DP00554) with a disordered region (<xr id="fig:disorder_P10775"/>) from residue 31 to 50. This region is, however, not predicted by IUPred (probability < 30%). In fact, it is debatable whether the subsequence 31-50 (the red region in <xr id="fig:disorder_P10775_pdb"/>) is actually disordered.
</figure>
</figure>
<figtable id="disorder_P10775">
</figtable> |
<figure id="fig:disorder_P10775">
<figure id="fig:disorder_P10775_pdb"> |
Q9X0E6
Neither does DisProt contain an entry suggesting a disordered region in Q9X0E6, nor does IUPred predict such a region.
Q08209
Five disordered regions are annotated in DisProt for Q08209. These regions mainly cover the C-terminal end, which exhibits several rather arbitrarily arranged alpha-helices (<xr id="fig:ss_Q08209"/>). All residues predicted by IUPred with a probability >= 50% are covered by DisProt annotations (precision 100%), but IUPred recovers only about half of the annotated disordered residues (sensitivity 52%).
</figure>
<figtable id="disorder_Q08209">
</figtable> |
<figure id="fig:disorder_Q08209"> |
References
<references/>