Gaucher Disease: Task 03 - Sequence-based predictions
- 1 Secondary Structure
- 1.1 Evaluation results
- 1.2 Human Glucosylceramidase (P04062)
- 1.3 Ribonuclease inhibitor (P10775)
- 1.4 Divalent-cation tolerance protein CutA (Q9X0E6)
- 1.5 Serine/threonine-protein phosphatase 2B catalytic subunit alpha isoform (Q08209)
- 1.6 Conclusion
- 2 Disorder
- 3 Transmembrane Helices
- 4 Signal Peptides
- 5 GO Terms
- 6 Discussion
In this task, the secondary structure of a protein is predicted with ReProf and compared to the PsiPred prediction and the DSSP structure assignment.
The evaluation results of ReProf against PsiPred and DSSP are summarized in <xr id="secondary structure results"/>. The ReProf run started from the Psi-BLAST PSSM obtained from a search against big_80 with 3 iterations and an E-value cutoff of 10E-10 (as described in the lab journal in the link above).
<figtable id="secondary structure results">
|Query||Precision PsiPred||Precision DSSP|
Human Glucosylceramidase (P04062)
The aligned view of the secondary structure predictions by ReProf and PsiPred, the DSSP assignment and the UniProt annotation for the Gaucher's disease protein, P04062, is shown below.
Sequence: MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHT ReProf: LLLLLLHHHHLLHHLLHHHHHHHHHHHHHHHHHHHHHHLLLLLLEEELLLLLEEEEEELLLLLLLLLLLLLLLLEEEEEEELLLLLEEEEELLLLLLLLL PsiPred: LLLLLLLLLLLLLLLLHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLEEEEEELLLLLLLLLLLLLLLLLLEEEEEELLLLLLLLLLLLLLLLLLL DSSP: ----------------------------------------E---EEE-LLLLEEEEEELL---E--------LLEEEEEEEELLL--LEEEEEE-ELL-- UniProt: LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLEEEELEEEELLLLLLLLLLLLLLLLLEEEEEEEELLLLLEEEEEEELEEELL
Sequence: GTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLI ReProf: LLLEEEEEELLLLEEEEEEELEEHHHHHHHHHHLLLHHHHHHHHHHHLLLLLLLEEEEEEELLLLLLLLLLLEEELLLLLLLLLLLLLLHHHHHHHHHHH PsiPred: LLLEEEEEELLLLLLEEEEEELLLLHHHHHHHHLLLHHHHHHHHHHLLLLLLLLLEEEEEELLLLLLLLLLLLLLLLLLLLLLLLLLLLHHLLLLLHHHH DSSP: --LEEEEEEEEEEEEE--EEEEE--HHHHHHHLLL-HHHHHHHHHHHHLLLLL---EEEEEEL--LLLLL---L--LLL-LL-LL----HHHHHHHHHHH UniProt: LLEEEEEEEEEEEEEELLEEEEELLHHHHHHHHLLLHHHHHHHHHHHHLLLLLLLLEEEEEEELLEEEEELLLLLLEEELLLLLLLLLLHHHHLLHHHHH
Sequence: HRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIA ReProf: HHHHHHLLLLEEEEEELLLLLLEEEELLLLLLLLLLLLLLLHHHHHHHHHHHHHHHHHHHHLLLLEEEEELLLLLLLLLLLLLLLLLELLLHHHHHHHHH PsiPred: HHHHHHLLLLLEEEELLLLLLLLLLLLLLLLLLLLLLLLLLLHHHHHHHHHHHHHHHHHHHLLLLLLEEEELLLLLLLLLLLLLLLLLLLLHHHHHHHHH DSSP: HHHHHH-LL--EEEEEEL---HHHELL-LLLLL-EELL-LLLHHHHHHHHHHHHHHHHHHHLL---LEEEL-LLLLHHHLLL--L---E--HHHHHHHHH UniProt: HHHHHHLLLLLEEEEEEELLLHHHEEELEEEEELEEEELLLLHHHHHHHHHHHHHHHHHHHLLLLLEEEEELLLHHHHHLLLLEEELLLLLHHHHHHHHH
Sequence: RDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGM ReProf: HHHHHHHHHLLLLLEEEEEEELLLLLLHHHHHHHLLLHHHHHHHHHLEEEELLLLLLLLHHHHHHHHHHLLLLLEEEEEEEELLLLLLLLLLLLLHHHHH PsiPred: HHHHHHHHLLLLLLEEEEEELLLLLLLLLLLHHHLLLHHHHHHLLEEEEELLLLLLLHHHHHHHHHHHHLLLLEEEEEELLLLLLLLLLLLLLLLHHHHH DSSP: HHHHHHHHLLLLLLLEEEEEEEEHHHLLHHHHHHHLLHHHHLL--EEEEEEELLL---HHHHHHHHHHH-LLLEEEEEEEE----LLL-L--LL-HHHHH UniProt: HLHHHHHHLLLLLLEEEEEEEEEHHHLLHHHHHHHLLHHHHLLLLEEEEEEELHHHLLHHHHHHHHHHHLLLEEEEEEEEELLLLEEELLLLLLLHHHHH
Sequence: QYSHSIITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAVALMHPDGSAVVVVL ReProf: HHHHHHHHHHHHHLHHEEEEEEEELLLLLLLLLLLLEEEEEEEELLLLEEEEELLEHHHHEEEELLLLLLEEEEEELLLLLLEEEEEEELLLLLEEEEEE PsiPred: HHHHHHHHHHHLLLEEEEEELLLLLLLLLLLLLLLLLLLLEEEELLLLEEEELLLEEEEEHHLLLLLLLLEEEEEELLLLLLLEEEEEELLLLLEEEEEE DSSP: HHHHHHHHHHHLLEEEEEEEEL-E-LLL---LL------LEEEEHHHLEEEE-HHHHHHHHHHLL--LL-EEEEEEELL--LEEEEEEE-LLL-EEEEEE UniProt: HHHHHHHHHHHLLEEEEEEEEEEELEEELLLLLLLLLLLEEEEEHHHLEEEELHHHHHHHHHHLLLLLLLEEEEEEEEELLEEEEEEEELLLLLEEEEEE
Sequence: NRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ ReProf: LLLLLLEEEEEEELLLEEEEEELLLLEEEEEEEELL PsiPred: ELLLLLEEEEEELLLLLEEEEELLLLEEEEEEEELL DSSP: E-LLL-EEEEEEELLLEEEEEEE-LLEEEEEEE--- UniProt: ELEEELEEEEEEELLLEEEEEEELLLEEEEEEELLL
Comparison to available knowledge
Here we compare the secondary structure predictions and DSSP assignment for the protein sequence P04062 to the available knowledge in UniProt and PDB.
The UniProt secondary structure annotation assigns residues to one of three states: helix, strand or turn. The annotation might be unreliable if no experimental evidence is available for the protein. However, the existence of our protein, P04062, was verified at the protein level, so we can rely on the annotation to some extent. The UniProt secondary structure annotation for P04062 is shown in the image above. It is also included in the alignment in the previous section, where both turns and positions not in one of the three states (helix, strand or turn) are treated as loops. As the alignment shows, the main difference lies near the beginning of the sequence: ReProf and PsiPred both predict one long helix, and ReProf additionally predicts two short helices before it (4 and 2 residues long), whereas UniProt annotates only loops there and DSSP has no assignment. Overall, however, the secondary structures look very similar, apart from small disagreements in the exact position and length of segments and a few short segments that are not present in every source. The latter may be falsely predicted or assigned.
The PDB structure of our protein P04062, 1OGS, consists of two identical chains, A and B. In the cartoon representation colored by secondary structure, one can see that each chain contains many alternating helices and sheets connected by loops. A beta-barrel fold can be recognized, together with an extra beta-sheet ring on the side of each chain. This supports our predictions, the DSSP assignment and the UniProt annotation of the secondary structure of the protein P04062.
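The agreement between two rows of the aligned view can also be expressed as a number. The following is a minimal sketch, not the evaluation behind the results table: it counts per-residue matches between the ReProf and DSSP rows of the last alignment block of P04062. The function name `agreement` is hypothetical, and skipping positions where DSSP shows `-` is an assumption about how unassigned residues should be treated.

```python
def agreement(pred: str, ref: str, skip: str = "-") -> tuple[int, int]:
    """Return (matches, compared) for two equal-length state strings."""
    if len(pred) != len(ref):
        raise ValueError("strings must have equal length")
    compared = 0
    matches = 0
    for p, r in zip(pred, ref):
        if r in skip:  # assumption: '-' means no DSSP assignment, so skip it
            continue
        compared += 1
        if p == r:
            matches += 1
    return matches, compared


# Last alignment block of P04062 (36 residues), copied from the aligned view.
reprof = "LLLLLLEEEEEEELLLEEEEEELLLLEEEEEEEELL"
dssp = "E-LLL-EEEEEEELLLEEEEEEE-LLEEEEEEE---"

matches, compared = agreement(reprof, dssp)
print(f"{matches}/{compared} positions agree ({matches / compared:.2f})")
```

For this block, 28 of the 30 positions with a DSSP assignment agree. Treating `-` as a loop instead would change the count, so the handling of unassigned positions should be stated with any such number.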
Ribonuclease inhibitor (P10775)
This is the aligned view of the secondary structure predictions with ReProf and PsiPred, the DSSP assignment and the UniProt annotation for Ribonuclease inhibitor (P10775).
Sequence: MNLDIHCEQLSDARWTELLPLLQQYEVVRLDDCGLTEEHCKDIGSALRANPSLTELCLRTNELGDAGVHLVLQGLQSPTCKIQKLSLQNCSLTEAGCGVL ReProf: LEELLLLLLLLHHHHHHHHHHHHLLLEEEELLLLLLHHHHHHHHHHHHLLLLLEEEELLLLLLLHHHHHHHHHHHHLLLHHEEEEELLLLLLLHHHHHHH PsiPred: LEEELLLLLLLHHHHHHHHHHHHHLLEEELLLLLLLHHHHHHHHHHHLLLLLLLEEELLLLLLLHHHHHHHHHHLLLLLLLLLEEEEELLLLLHHHHLHH DSSP: -E--EEL----HHHHHHHHHHHLL-LEEEEEL----HHHHHHHHHHHLL-LL--EEE--L---HHHHHHHHHHHHLLLL----EEE-LLL---HHHHHLH UniProt: LLLLEEELLLLHHHHHHHHHHHLLLEEEEEELLLLLHHHHHHHHHHHLLLLLLLEEELLLLLLHHHHHHHHHHHHLLLLLLLLEEELLLLLLLHHHHHLH
Sequence: PSTLRSLPTLRELHLSDNPLGDAGLRLLCEGLLDPQCHLEKLQLEYCRLTAASCEPLASVLRATRALKELTVSNNDIGEAGARVLGQGLADSACQLETLR ReProf: HHHHHHLLLLLHHHHHHLLLLLHHHHHHHHHHHHHHHHHHHHHHHLLLLLHHHHHHHHHHHHLLLLLLLLHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH PsiPred: HHHHLLLLLLLEEELLLLLLLLHHHHHHHHHHLLLLLLLLEEEELLLLLLHHLHHHHHHHHLLLLLLLEEELLLLLLLLHHHHHHHHHLLLLLLLLLEEE DSSP: HHHHHH-LL--EEE--L---HHHHHHHHHHHHHLLL----EEE-LL---EHHHHHHHHHHHHH-L---EEE-LLLE-HHHHHHHHHHHHHL--L---EEE UniProt: HHHHHHLLLLLEEELLLLLLHHHHHHHHHHHHHLLLLLLLEEELLLLLLLHHHHHHHHHHHHHLLLLLEEELLLLLLHHHHHHHHHHHHHLLLLLLLEEE
Sequence: LENCGLTPANCKDLCGIVASQASLRELDLGSNGLGDAGIAELCPGLLSPASRLKTLWLWECDITASGCRDLCRVLQAKETLKELSLAGNKLGDEGARLLC ReProf: HHLLLLLHHHHHHHHHHHHHLLLLLHHHHLLLLLLLHHHHHHHHHHHHHLHHHHHHLLLLLLLLHHHHHHHHHHHHHLLHHHHHHHHHHLLLHHHHHHHH PsiPred: LLLLLLLHHHHHHHHHHHLLLLLLLEEELLLLLLLLHHHHHHHHHHLLLLLLLLEEELLLLLLLHHHHHHHHHHHLLLLLLLEEELLLLLLLLHHHHHHH DSSP: -LLL---HHHHHHHHHHHHH-LL--EEE--LL--HHHHHHHHHHHHL-LL----EEE-LLL---HHHHHHHHHHHHH-LL--EEE-LLL--HHHHHHHHH UniProt: LLLLLLLHHHHHHHHHHHHHLLLLLEEELLLLLLHHHHHHHHHHHHLLLLLLLLEEELLLLLLLHHHHHHHHHHHHHLLLLLEEELLLLLLHHHHHHHHH
Sequence: ESLLQPGCQLESLWVKSCSLTAACCQHVSLMLTQNKHLLELQLSSNKLGDSGIQELCQALSQPGTTLRVLCLGDCEVTNSGCSSLASLLLANRSLRELDL ReProf: HHHHHHLLLEEEEEEELLLLLHHHHHHHHHHHHHHHHHHHHHHLLLLLLHHHHHHHHHHHHLLLLLEEEEEELLLLLLHHHHHHHHHHHHHLLLLEEEEL PsiPred: HHLLLLLLLLLEEEELLLLLLHHHHHHHHHHHLLLLLLLEEELLLLLLLLHHHHHHHHLLLLLLLLLEEEELLLLLLLHHHHHHHHHHHHLLLLLLEEEL DSSP: HHHLLLL----EEE-LLL--EHHHHHHHHHHHHH-LL--EEE--LLE-HHHHHHHHHHHLLLLL----EEE-LLL---HHHHHHHHHHHHH--L--EEE- UniProt: HHHLLLLLLLLEEELLLLLLLHHHHHHHHHHHHHLLLLLEEELLEEELHHHHHHHHHHHLLEEELLLLEEELLLLLLLHHHHHHHHHHHHHLLLLLEEEL
Sequence: SNNCVGDPGVLQLLGSLEQPGCALEQLVLYDTYWTEEVEDRLQALEGSKPGLRVIS ReProf: LLLLLLHHHHHHHHHHHHHLLLLLEEEEELLLLLLHHHHHHHHHHHHHLLLLLELL PsiPred: LLLLLLLHHHHHHHHHLLLLLLLLLEEELLLLLLLHHHHHHHHHHHHLLLLLEELL DSSP: LLLL--HHHHHHHHHHHLLLL----EEE-LL----HHHHHHHHHHHHH-LL-EEE- UniProt: LEEELLHHHHHHHHHHHLEEELLLLEEELLLLLLLHHHHHHHHHHHHHLLLLEEE
Comparison to available knowledge
The following is the comparison of the secondary structure predictions and DSSP assignment for the protein sequence P10775 to the available knowledge in UniProt and PDB.
The existence of the protein P10775 was also verified at the protein level, so we can rely on the annotation to some extent. The UniProt secondary structure annotation for P10775 is shown in the image above. As before, it is also included in the secondary structure alignment in the previous section. The main differences occur in the ReProf prediction in the first half of the sequence: a helix is predicted where the other sources have a 3-residue strand, followed by a longer helix where the others have a shorter helix and a short strand. The first case occurs once more and the latter three more times. Overall, ReProf predicts more helices. The PsiPred prediction, the DSSP assignment and the UniProt annotation look very similar overall: a series of alternating helices and strands separated by loops, sometimes with two short consecutive strands.
From the visualization of the PDB structure of the protein P10775, 2BNH, one can see that it has the typical horseshoe fold, with helices on the outer side and sheets on the inner side, connected by loops on both sides. This fold supports our predictions, the DSSP assignment and the UniProt annotation of the secondary structure of the protein P10775.
Divalent-cation tolerance protein CutA (Q9X0E6)
The following is the comparison of the secondary structure predictions and DSSP assignment for the protein sequence Q9X0E6 to the available knowledge in UniProt and PDB.
Sequence: MILVYSTFPNEEKALEIGRKLLEKRLIACFNAFEIRSGYWWKGEIVQDKEWAAIFKTTEEKEKELYEELRKLHPYETPAIFTLKVENVLTEYMNWLRESV ReProf: LEEEEEELLLHHHHHHHHHHHHHHLLEEEEELLLEEEEEEELLLLEEEEEEEEEEEELHHHHHHHHHHHHHLLLLLLLLEEEEELHLLLHHHHHHHHHHL PsiPred: LEEEEELLLLHHHHHHHHHHHHHLLLLLEEEEEEEEEEEEELLLEEEEEEEEEEEELLHHLHHHHHHHHHHHLLLLLLEEEEEELLLLLHHHHHHHHHHL DSSP: -EEEEEEELLHHHHHHHHHHHHHLLL-LEEEEEEEEEEEEELLEEEEEEEEEEEEEEEHHHHHHHHHHHHHH-LLLL--EEEEE-L---HHHHHHHHHH- UniProt: EEEEEEEEEEHHHHHHHHHHHHHLLLLEEEEEEEEEEEEEELLEEEEEEEEEEEEEEEHHHHHHHHHHHHHHLEEEELLEEEELLLLLLHHHHHHHHHH
Sequence: L ReProf: L PsiPred: L DSSP: - UniProt:
Comparison to available knowledge
The existence of the protein Q9X0E6 was also verified at the protein level, so we can rely on its UniProt annotation. The UniProt secondary structure annotation for Q9X0E6 is shown in the image above. As before, it is also included in the secondary structure alignment in the previous section. It is a short protein of only 101 amino acids, and all four sources agree almost entirely in the secondary structure prediction or assignment. According to PsiPred and DSSP, the protein contains three helices (apart from a single loop residue that PsiPred predicts within the second helix) and four strands. ReProf splits the second strand with a 3-residue loop, whereas UniProt splits the last strand with 2 loop residues.
From the visualization of the PDB structure of the protein Q9X0E6, 1VHF, in PyMOL we can see the number and order of its helices, loops and strands: lELHLELEHLELh (here a lower-case letter represents a single residue in that state and an upper-case letter multiple residues). This is the same as the DSSP assignment of the protein Q9X0E6.
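Segment counts and the compact order notation used above can be derived from a per-residue state string. The sketch below is illustrative only; the helper names `segments`, `count_segments` and `compact_order` are hypothetical. It is applied to the PsiPred row of the first 100 residues of Q9X0E6 and to a short made-up string.

```python
from itertools import groupby


def segments(ss: str) -> list[tuple[str, int]]:
    """Collapse a state string into (state, run length) pairs."""
    return [(state, len(list(run))) for state, run in groupby(ss)]


def count_segments(ss: str) -> dict[str, int]:
    """Count how many separate segments each state forms."""
    counts: dict[str, int] = {}
    for state, _ in segments(ss):
        counts[state] = counts.get(state, 0) + 1
    return counts


def compact_order(ss: str) -> str:
    """Lower case for a single-residue segment, upper case for longer ones."""
    return "".join(s.upper() if n > 1 else s.lower() for s, n in segments(ss))


# PsiPred row for residues 1-100 of Q9X0E6, copied from the aligned view.
psipred = (
    "LEEEEELLLLHHHHHHHHHHHHHLLLLLEEEEEEEEEEEEELLLEEEEEEEEEEEELLHH"
    "LHHHHHHHHHHHHLLLLLLEEEEEELLLLLHHHHHHHHHHL"
)

counts = count_segments(psipred)
print(counts["H"], counts["E"])      # helix and strand segment counts
print(compact_order("LEEEELLHHHHL"))  # made-up example string
```

PsiPred yields four helix segments here because the single loop residue splits the second helix in two, which matches the three helices described above once that loop is ignored.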
Serine/threonine-protein phosphatase 2B catalytic subunit alpha isoform (Q08209)
The following is the comparison of the secondary structure predictions and DSSP assignment for the protein sequence Q08209 to the available knowledge in UniProt and PDB.
Sequence: MSEPKAIDPKLSTTDRVVKAVPFPPSHRLTAKEVFDNDGKPRVDILKAHLMKEGRLEESVALRIITEGASILRQEKNLLDIDAPVTVCGDIHGQFFDLMK ReProf: LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLHLHHHHHHHHHLLLLLLHHHHHHHHHHHHHHHHLLLLEEEELLLLLLLLLLLLEHHHHHH PsiPred: LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLHHHHHHHHHHLLLLLHHHHHHHHHHHHHHHHHLLLLEEELLLEEEELLLLLHHHHHHH DSSP: ----------------LLLLL-------E-HHHHE-LLL-E-HHHHHHHHHLL--E-HHHHHHHHHHHHHHHHLL-LEEEE-LLEEEE---LL-HHHHHH UniProt: LLLLLLLLLLEEELLLLLLLLLLLLLLLLLHHHHLLLLLLLLHHHHHHHHHLLLLLLHHHHHHHHHHHHHHHHLLLEEEEELEEEEEELLLLLLHHHHHH
Sequence: LFEVGGSPANTRYLFLGDYVDRGYFSIECVLYLWALKILYPKTLFLLRGNHECRHLTEYFTFKQECKIKYSERVYDACMDAFDCLPLAALMNQQFLCVHG ReProf: HHHHLLLLLLLEEEEELEELLLLLLLHHHHHHHHHHHHHLLLHEEEEELLLLLLLLLLLLLLHHHHHHHLHHHHHHHHHHHHHHHLLHHEELLEEEEEEL PsiPred: HHHHLLLLLLLLLEELLLLLLLLLLHHHHHHHHHHHHHLLLLLEEEELLLLLLLLLLLLLLHHHHHHHHHLHHHHHHHHHHLLLLHHHHHHLLLEEEELL DSSP: HHHHH--LLL--EEE-L--LLLLL-HHHHHHHHHHHHHHLLLLEEE---LLLLHHHHHHLLHHHHHHHHL-HHHHHHHHHHHLLL--EEEELLLEEEELL UniProt: HHHHHLLLLLLLEEELLLLEEEELLHHHHHHHHHHHHHHLLLLEEELLLLLLLHHHHHHLLHHHHHHHHLLHHHHHHHHHHHHLLLLEEEELLLEEEEEE
Sequence: GLSPEINTLDDIRKLDRFKEPPAYGPMCDILWSDPLEDFGNEKTQEHFTHNTVRGCSYFYSYPAVCEFLQHNNLLSILRAHEAQDAGYRMYRKSQTTGFP ReProf: LLLLLLLLLLLLLLLLLLLLLLLLLLLLEEEEELLLLLLLLLLLLLLLELLLLLLLEEEELHHHHHHHHHHLLLEEEEEELLELLLLEEEEEELLLLLLL PsiPred: LLLLLLLLHHHHHLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLEEELLHHHHHHHHHHLLLLHHHHHLLLLLLLLLLEELLLLLLLL DSSP: ---LL--LHHHHHHL--LLL--LLLHHHHHHH-EE-LLLLL-LL---EEE-LLLLLLEEE-HHHHHHHHHHLL-LEEEE--L--LLLEEE--E-LLLLLE UniProt: LLLLLEEELHHHHLLLLEEELLEEEHHHHHHHLLLLLLLLLLLLLLEEEELLLLEEEEEELHHHHHHHHHHLLLLEEEELLLLLLLEEEELLLLLLLEEE
Sequence: SLITIFSAPNYLDVYNNKAAVLKYENNVMNIRQFNCSPHPYWLPNFMDVFTWSLPFVGEKVTEMLVNVLNICSDDELGSEEDGFDGATAAARKEVIRNKI ReProf: EEEEEEELLLLLLLLLLEEEEEEELLLLLEEEEEELLLLLLLLLLLLLLHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLLLLHHHHHHHHHHHH PsiPred: LEEEEELLLLLLLLLLLLEEEEEEELLLLEEEEEELLLLLLLLLLLLLLLLLLLHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLHHLHHHHHHHHHHHH DSSP: LEEEE---LLHHHLL---EEEEEEELLEEEEEEE---------HHH--HHHHHHHHHHHHHHHHHHHHHLL----------------------------- UniProt: EEEEEELLLLHHHLLLLLEEEEEEELLEEEEEEELLLLLLLLLHHHLLHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLLLLLLLLLLLLHHHHH
Sequence: RAIGKMARVFSVLREESESVLTLKGLTPTGMLPSGVLSGGKQTLQSATVEAIEADEAIKGFSPQHKITSFEEAKGLDRINERMPPRRDAMPSDANLNSIN ReProf: HHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLLLLLLLLLLLHHHLLLLLLLLLLLLLLLLLLLLLLLLHHHHHHHHHHHLLLLLLLLLLLLLLLLLLLL PsiPred: HHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLLHHHHHHHHHHHHHHHHHHHLLLLLLLLLLHHHHHHLLLLLLLLLLLLLLLLLLLLLLLLL DSSP: --------------------------------------------------------------------HHHHHHHHHHHHL------------------- UniProt: HHHHHHHHHLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLHHHHHHHHHHHH
Sequence: KALTSETNGTDSNGSNSSNIQ ReProf: LLLLLLLLLLLLLLLLLLLLL PsiPred: LLLLLLLLLLLLLLLLLLLLL DSSP: -------------------- UniProt:
Comparison to available knowledge
The last protein we explore was also verified at the protein level, so we can trust the UniProt annotation to some extent. The UniProt secondary structure annotation for Q08209 is shown in the image above and is also included in the secondary structure alignment in the previous section. It is a long protein of over 500 residues, and some disagreements about its secondary structure assignment can be seen. According to all four sources, the protein contains many loops and helices and some strands. As in P04062, there are some disagreements in the exact position and length of segments, and some short segments are not present in every source. The main differences are:
- A 6-residue helix is assigned by DSSP and UniProt, but not by the predictors.
- ReProf does not find a helix of 4-6 residues that is present in the three other sources.
- ReProf predicts a 5-residue strand instead of a 7-residue helix assigned by DSSP and UniProt.
- PsiPred predicts only a loop instead of the latter helix.
- PsiPred predicts a helix where a strand of 4-6 residues is assigned by the other sources.
- Near the end of the protein (before the last "conserved" helix), PsiPred predicts a long helix and ReProf only a 3-residue helix, whereas only loops are assigned by DSSP and UniProt in that region.
From the visualization of the PDB structure of the protein Q08209, 1AUI, one can see that it contains many helices connected by loop regions and sometimes by short strands. There is also a beta-sheet region consisting of ten strands. This supports our predictions, the DSSP assignment and the UniProt annotation of the secondary structure of the protein Q08209.
Using the secondary structure predictions of ReProf and PsiPred, the DSSP assignments and the UniProt annotations, we learned that the secondary structure of our protein mainly consists of helices and strands. It is a dimer, and each chain folds into a beta-barrel domain. We also learned about the secondary structures of the example proteins, as discussed in the respective sections.
To sum up, ReProf and PsiPred are good secondary structure prediction tools. In most cases their predictions agree with the UniProt annotation of the secondary structure and with the DSSP assignment, which is typically used as a reference because it is derived from the actually resolved PDB structure of a protein. The same could be seen in the visualization of the structures.
In this task we predict disordered and globular protein regions using IUpred and MetaDisorder.
IUpred is a protein disorder predictor. The user can choose one of three options:
- long for prediction of long disorders
- short for prediction of short disorders
- glob for prediction of structured, globular domains
IUpred prediction results for each protein are presented and described in the plots below. The disorder tendency ranges from 0 to 1 and is plotted for each residue in a protein sequence. Residues with a tendency above 0.5 are seen as disordered.
IUpred results for protein P04062. There is almost no difference between the "long" and "short" predictions; the latter only predicts more disorder at the beginning and the end of the protein. According to "glob", almost the whole protein sequence, from position 4 to the end (536), lies in a globular domain. Only a short region comprising the first three residues is predicted to be disordered.
IUpred results for protein P10775. There is almost no difference between the "long" and "short" predictions; the latter only predicts more disorder at the beginning and the end of the protein. According to "glob", the whole protein sequence (456 residues long) is predicted to be in a globular domain.
IUpred results for protein Q9X0E6. There is almost no difference between the "long" and "short" predictions; the latter only predicts more disorder at the beginning and the end of the protein. According to "glob", the whole protein sequence (101 residues long) is predicted to be in a globular domain.
IUpred results for protein Q08209. There are only small deviations between the "long" and "short" predictions: "long" predicts more disorder at the end of the protein, whereas "short" predicts more disorder at the beginning and the very end of the protein. According to "glob", a major part of the protein sequence, from position 5 to 446 out of a total of 521 residues, is in a globular domain. Therefore, two disordered regions are predicted: one consisting of only four residues at the beginning and one containing 75 amino acids at the end of the protein.
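The 0.5 cutoff described above turns a per-residue score profile into disordered regions. The following is a minimal sketch under stated assumptions: the function name `disordered_regions` is hypothetical, the scores are made-up values rather than IUpred output, and "above 0.5" is read as strictly greater than the threshold.

```python
def disordered_regions(scores, threshold=0.5):
    """Return 1-based inclusive (start, end) runs with score > threshold."""
    regions = []
    start = None
    for pos, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = pos  # a new disordered run begins here
        elif start is not None:
            regions.append((start, pos - 1))  # the run ended one residue earlier
            start = None
    if start is not None:  # run extends to the last residue
        regions.append((start, len(scores)))
    return regions


# Made-up disorder tendencies for an 8-residue toy sequence.
toy_scores = [0.9, 0.8, 0.6, 0.2, 0.1, 0.3, 0.7, 0.55]
print(disordered_regions(toy_scores))
```

For the toy profile this reports two regions, residues 1-3 and 7-8, in the same style as the terminal regions described for Q08209.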
MetaDisorder (MD) is a meta-predictor that combines several prediction methods:
- NORSnet: prediction of unstructured loops
- PROFbval: prediction of residue flexibility from sequence
- Ucon: prediction of protein disorder using predicted internal contacts
In addition to the prediction scores of the three predictors, MD gives the final decision on disorder as well as MDrel, the reliability of the final prediction, whose values range from 0 to 9 (9 is the strongest prediction). The raw prediction scores and the final MD score for each of the four proteins are visualized in the following plots.
MetaDisorder results for protein P04062. The MD prediction (red line) looks very similar to the IUpred prediction: the very beginning of the protein is predicted to be disordered. The predictions of the individual programs look very different and fluctuate strongly. PROFbval (blue line) outputs higher scores (frequently over 0.5) than NORSnet (green line) and Ucon (purple line); however, Ucon has some high peaks over 0.5. Overall, the MD prediction seems more reliable than the predictions of the stand-alone methods.
MetaDisorder results for protein P10775. The MD results are very similar to the IUpred "short" results: the very beginning and end of the protein seem to be slightly disordered. However, the score rises only a little above 0.5, which is too little for this prediction to be reliable. The PROFbval prediction is the most similar to MD because of the higher scores at the ends, although its scores are higher overall (often over 0.5). NORSnet and Ucon output lower scores, even lower at the ends, and Ucon sometimes has high peaks. Again, MD seems to predict disordered regions better than the individual programs.
MetaDisorder results for protein Q9X0E6. Compared to the IUpred prediction, the MD, PROFbval and Ucon predictions look very different and fluctuate, predicting several disordered regions that are not present in the IUpred results. Only NORSnet, like IUpred, predicts the whole protein as lacking unstructured loops. The poorer MD prediction may be caused by the short length of this protein.
MetaDisorder results for protein Q08209. Like IUpred, MD predicts disorder at the beginning and the end of the protein. The NORSnet prediction is also similar; however, it predicts less disorder at the very beginning but more disorder after it (approx. positions 5-40). Interestingly, both MD and NORSnet predict slight peaks around positions 240 and 375. The two other methods, Ucon and PROFbval, give strongly fluctuating predictions with higher scores and many peaks. Here MD and NORSnet seem to make more reliable predictions.
Human Glucosylceramidase (P04062)
For our protein there is no entry in DisProt. DisProt offers a search for similar sequences, either with the Smith-Waterman alignment method or with Psi-BLAST. The Psi-BLAST search found only one sequence producing a significant alignment: DP00159. As only a short region was aligned (47 aligned columns) and the sequence identity of the alignment is only 27%, we cannot transfer the DisProt annotation of DP00159 to the protein P04062.
Ribonuclease inhibitor (P10775)
No DisProt entry is found for protein P10775 either. The Psi-BLAST search yielded seven matches; the one with the best score and E-value is the sequence DP00554. As the alignment is fairly good (40% identity and 196 aligned columns), we checked the annotation of the similar protein DP00554. According to the annotation, there is only one disordered region, 20 residues long, near the beginning of the sequence (residues 31 - 50). However, this region does not fall into the alignment (aligned positions of the query: 144 - 339; of the target: 787 - 982). Therefore, this DisProt annotation cannot be used for our protein of interest, P10775, either.
Divalent-cation tolerance protein CutA (Q9X0E6)
The protein Q9X0E6 is not found in DisProt either. The Psi-BLAST search found only insignificant and short matches, which cannot be considered for DisProt annotation transfer.
Serine/threonine-protein phosphatase 2B catalytic subunit alpha isoform (Q08209)
It is the only one of the four proteins that could be found in DisProt with the ID.
According to DisProt, there are five disordered regions in the protein, two of which overlap (see the figure on the right and <xr id="Q08209_disprot"/>). These regions are at the ends of the sequence: one at the N-terminus (positions 1 - 13) and four at the C-terminus (altogether positions 374 - 521). The sixth region is ordered, lies in the core of the sequence and is also the longest region (positions 14 - 373). (All regions map to the PDB structure 1AUI:A.)
The predictions we made with IUpred and MD for the protein Q08209 yielded similar results for the disordered regions.
|Region||Type||Name||Location||Length||Structural/functional type||Functional classes||Functional subclasses|
|1||Disordered - Extended|| ||1 - 13||13||Relationship to function unknown||Unknown||Unknown|
|2||Disordered - Extended|| ||374 - 468||95||Function arises via a disorder to order transition||Molecular assembly||Autoregulatory, Protein-protein binding|
|3||Disordered - Extended||CaM-binding domain||390 - 414||25||Function arises via a disorder to order transition||Molecular assembly||Protein-protein binding, Autoregulatory|
|4||Disordered - Extended||Autoinhibitory region||469 - 486||18||Function arises via an order to disorder transition and vice versa||Molecular assembly||Protein-protein binding, Autoregulatory|
|5||Disordered - Extended|| ||487 - 521||35||Relationship to function unknown||Unknown||Unknown|
|6||Ordered|| ||14 - 373||360|
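The region table can be checked with simple interval arithmetic. The sketch below is only an illustration: the variable and function names are hypothetical, and the IUpred intervals are derived from the "glob" result reported above (globular from 5 to 446 of 521 residues), not taken from raw IUpred output.

```python
from itertools import combinations

# Disordered DisProt regions of Q08209 (region number: 1-based inclusive range).
disprot = {
    1: (1, 13),
    2: (374, 468),
    3: (390, 414),
    4: (469, 486),
    5: (487, 521),
}
# Disordered stretches implied by the IUpred "glob" result (assumption, see above).
iupred = [(1, 4), (447, 521)]


def length(region):
    return region[1] - region[0] + 1


def overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


def residues(regions):
    """Set of all residue positions covered by the given ranges."""
    return {pos for start, end in regions for pos in range(start, end + 1)}


lengths = {number: length(region) for number, region in disprot.items()}
overlapping = [(i, j) for i, j in combinations(disprot, 2)
               if overlap(disprot[i], disprot[j])]

print(lengths)
print(overlapping)
print(residues(iupred) <= residues(disprot.values()))
```

The computed lengths reproduce the Length column, regions 2 and 3 are the only overlapping pair, and every residue in the IUpred-derived disordered stretches also lies in a DisProt disordered region.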
Four proteins, including the protein that causes Gaucher's disease, were analysed with respect to transmembrane (TM) helices. The prediction tools used differ in the features they analyse. While Polyphobius only distinguishes between residues that are part of a transmembrane helix (TMH) and residues that lie inside or outside the cytoplasm, Memsat-SVM also predicts re-entrant helices and pore-lining helices. Since pore-lining helices are also transmembrane helices, this kind of helix is detected by both prediction tools. For re-entrant helices, the two programs differ. In general, a membrane helix crosses the membrane, so that the two ends of the helix lie on different sides of the membrane. In contrast, both ends of a re-entrant helix lie on the same side of the membrane. Memsat-SVM can predict re-entrant helices, but Polyphobius either treats such a helix as a general membrane helix that crosses the membrane (seen for Q9YDF8) or ignores it (seen for P47863). When a re-entrant helix is predicted, the C-terminus or the N-terminus may therefore be placed on a different side of the membrane, and some helices may be predicted to run in a different direction within the membrane.
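The effect of helix type on the predicted topology follows from a parity argument: each transmembrane helix moves the chain to the other side of the membrane, while a re-entrant helix leaves it on the same side. The sketch below illustrates this reasoning only; the function name and the labels "TM", "RE", "inside" and "outside" are hypothetical and do not correspond to the output format of either predictor.

```python
def c_terminus_side(n_terminus_side: str, helices: list[str]) -> str:
    """Follow the chain through the membrane and return the C-terminus side."""
    if n_terminus_side not in ("inside", "outside"):
        raise ValueError("side must be 'inside' or 'outside'")
    side = n_terminus_side
    for kind in helices:
        if kind == "TM":  # crosses the membrane: switch sides
            side = "outside" if side == "inside" else "inside"
        elif kind != "RE":  # re-entrant helices keep the current side
            raise ValueError(f"unknown helix type: {kind}")
    return side


# Seven crossing helices versus six: the C-terminus ends up on opposite sides.
print(c_terminus_side("outside", ["TM"] * 7))
print(c_terminus_side("outside", ["TM"] * 6))

# A re-entrant helix read as a crossing helix changes the result.
print(c_terminus_side("inside", ["TM", "RE", "TM"]))
print(c_terminus_side("inside", ["TM", "TM", "TM"]))
```

This is why a single missed helix, or a re-entrant helix classified as a crossing one, is enough to place a terminus on the wrong side of the membrane.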
The OPM database does not directly give the localization of the N- and C-terminus or the type of the helices. Instead of distinguishing between transmembrane and re-entrant helices, OPM classifies all identified membrane helices (MH) as transmembrane. The localizations and the helix type (transmembrane or re-entrant) can be read from the visualisation of the protein, which is provided by OPM and also shown in the tables below.
The second database, PDBTM, contains only transmembrane proteins. For non-transmembrane proteins, no information is available.
This protein is not a membrane protein and is located on the extracellular side of the membrane, as documented in OPM. For the same reason there is no entry in PDBTM, as this database contains only membrane proteins. The Polyphobius prediction leads to the same result. Additionally, Polyphobius predicted the signal peptide (including the N/H/C-regions). Memsat-SVM detected a false positive transmembrane helix. As glucosylceramidase cleaves lipids of cell membranes, the active site of the enzyme may have been mistaken for a transmembrane helix.
|Comparison of membrane helices (MH) for Glucosylceramidase (P04062, human)|
|# of MH||1||-||-||-|
|more information||P04062||1OGS||1OGS is not in the PDBTM|
Aeropyrum pernix Voltage-gated potassium channel
For the protein of the archaeon Aeropyrum pernix, four different PDB IDs were found:
- 1ORQ: chain C
- 1ORS: chain C
- 2A0L: chain A/B
- 2KYH: chain A
As the PDB IDs represent structures of different chains, which are not identical, it was difficult to choose one of them. In the end, 1ORS was chosen for two reasons. Its X-ray structure has the highest resolution of the four. In addition, this structure represents a sensor domain and must therefore be important for the protein. The predictions give completely different results from the assignments. As the predictions are more similar to each other, they were compared with each other; the same was done for the two assignments. Both predictions have the same number of helices. Nevertheless, some helices deviate considerably in their position. Memsat-SVM predicted a re-entrant helix where Polyphobius detected a transmembrane helix. That is why the N-terminus is predicted differently by the two programs.
The OPM assignment actually has one more helix, but only because it delimits its helices differently from PDBTM. The third helix in PDBTM consists of two shorter consecutive helices. Together they form one larger helix that crosses the membrane once, and they are therefore regarded as one helix in PDBTM. In OPM, these two mini-helices, each of which would be too short to cross the membrane alone, are counted separately. Apart from a slight deviation of a few residues at the ends of the helices, the structure is the same in both databases.
|Comparison of membrane helices (MH) for Voltage-gated potassium channel (Q9YDF8, Aeropyrum pernix)|
|# of MHs||6||7||5||4|
|MH Topology||1. 43-59
Rat Aquaporin 4
Both predictions give results similar to the assignments in OPM and PDBTM. The predicted transmembrane helices differ in their positions by only a few residues. The protein consists of 6 transmembrane helices and 2 re-entrant helices. Polyphobius skips the re-entrant helices but predicts the remaining membrane helices well. Memsat-SVM predicts re-entrant helices similar to those in the database entries. Unfortunately, Memsat-SVM predicts the orientation in the membrane incorrectly: instead of placing the C- and N-terminus in the cytoplasm, it places both ends in the extracellular region.
The two assignments are very similar. OPM does not explicitly mark two of its helices as re-entrant, but both can be recognized as re-entrant in the OPM visualisation. In PDBTM, the re-entrant helices are colored gold and stand out only slightly against the yellow transmembrane helices. All pictures can be found in the table below.
|Comparison of membrane helices (MH) for Aquaporin 4 (P47863, rat)|
|# of MH||8||6||8 (per chain)||8 (per chain)|
|MH Topology||1. 35-56
Human D3 dopamine receptor
The dopamine receptor is a transmembrane protein. The transmembrane helices from the predictors and the databases mostly agree, with only slight shifts of a few residues. While Polyphobius predicts all 7 transmembrane helices that are also documented in OPM and PDBTM, Memsat-SVM identifies only 6 transmembrane helices. As the missing helix is the last one, the C-terminus of the protein is placed in the extracellular region instead of the cytoplasm. The program classifies the 3rd helix as a pore-lining helix.
|Comparison of membrane helices (MH) for D3 dopamine receptor (P35462, human)|
|# of MH||6||7||7||7|
|MH Topology||1. 32-55
For the following proteins, the signal peptides and their cleavage sites were predicted with SignalP:
- Glucosylceramidase (P04062, human)
- Serum albumin (P02768, human)
- Lysosome-associated membrane glycoprotein 1 (P11279, human)
- Aquaporin 4 (P47863, rat)
The four eukaryotic proteins were also looked up in the Signal Peptide Database to compare the entry with the results of the prediction.
For glucosylceramidase, the SignalP prediction differs from the database entry.
In the database, the protein has a signal peptide of 39 residues. A signal peptide is characterized by high hydrophobicity in its core region, followed by the cleavage site. In particular, the higher hydrophobicity of residues 18-23 and 27-34 points to a signal peptide (green area in the hydrophobicity image).
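The hydrophobic character of these two stretches can be illustrated with a mean hydropathy value. The sketch below uses the Kyte-Doolittle scale, which is an assumption: the scale behind the database's hydrophobicity image is not stated here. The function name `mean_hydropathy` is hypothetical, and the sequence is the first 39 residues of P04062 as given in the aligned view.

```python
# Kyte-Doolittle hydropathy values (assumed scale, see above).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Residues 1-39 of P04062, copied from the aligned view.
signal_region = "MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASG"


def mean_hydropathy(seq: str, start: int, end: int) -> float:
    """Mean hydropathy of residues start..end (1-based, inclusive)."""
    window = seq[start - 1:end]
    return sum(KD[aa] for aa in window) / len(window)


for start, end in [(1, 17), (18, 23), (27, 34)]:
    print(start, end, round(mean_hydropathy(signal_region, start, end), 2))
```

On this scale both stretches have a clearly positive mean, while residues 1-17 have a negative one, which is consistent with a hydrophobic core preceded by a more polar N-terminal part.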
However, SignalP predicts no signal peptide. In the plot of the different scores below, the green signal peptide score (S-score) indicates where a signal peptide is most likely. The green line is higher for the first 39 residues than for the later residues, but the calculated D-score of the detected peptide is 0.37, which is below the threshold of 0.5, so the peptide is rejected as a signal peptide. These residues are not only annotated as a signal peptide in the database, but were also detected, with slight deviations, by the transmembrane helix predictors MemsatSVM (residues 1-34) and Polyphobius (residues 1-40).
SignalP result for P04062: The green line represents the signal peptide score (S-score). The higher the score, the higher the probability that a residue is part of a signal peptide. A high raw cleavage site score (C-score) marks the residue directly after the cleavage site. The blue line shows a combination of the C- and S-scores.
Serum albumin (P02768)
The signal peptide consists of residues 1-18. It is predicted by SignalP and also documented in the Signal Peptide Database.
The images below show a clear prediction of the signal peptide: the S-score is high across the signal peptide region, and the D-score of 0.85 is far above the threshold. The cleavage site is predicted between residues 18 and 19. The database shows high hydrophobicity for residues 6-14, which also marks the region as a signal peptide.
SignalP result for P02768: The green line represents the signal peptide score (S-score). The higher the score, the higher the probability that a residue is part of a signal peptide. A high raw cleavage site score (C-score) marks the residue directly after the cleavage site. The blue line shows a combination of the C- and S-scores.
Lysosome-associated membrane glycoprotein 1 (P11279)
For LAMP1 the scores are even higher than for serum albumin. The signal peptide consists of 28 residues as follows:
The database shows a large hydrophobic region of 17 residues. At the end of the protein, a transmembrane helix of 23 residues ending in the cytoplasm is documented in the LAMP1 entry. The SignalP prediction gives a D-score of 0.95 for the detected signal peptide. The cleavage site is predicted between residues 28 and 29 (ASA-AM).
SignalP result for P11279: The green line represents the signal peptide score (S-score). The higher the score, the higher the probability that a residue is part of a signal peptide. A high raw cleavage site score (C-score) marks the residue directly after the cleavage site. The blue line shows a combination of the C- and S-scores.
Aquaporin 4 (P47863)
The rat protein has no entry in the Signal Peptide Database, because no signal peptide is known for it. The plotted prediction results show at first glance that the protein does not have a signal peptide. All scores are lower than 0.21, which is far below the threshold for signal peptides.
SignalP result for P47863: The green line represents the signal peptide score (S-score). The higher the score, the higher the probability that a residue is part of a signal peptide. A high raw cleavage site score (C-score) marks the residue directly after the cleavage site. The blue line shows a combination of the C- and S-scores. The threshold is marked by a red dotted line.
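The decision rule used in the four cases above can be summarised in a minimal Python sketch. It only uses the D-scores and the 0.5 threshold quoted in the text; the function name `has_signal_peptide` is hypothetical, and P47863 is left out of the table because the text gives only an upper bound (all scores below 0.21) rather than an exact D-score.

```python
D_SCORE_CUTOFF = 0.5  # threshold quoted in the text

# D-scores as reported in the text for the three proteins with an explicit value.
d_scores = {
    "P04062": 0.37,  # glucosylceramidase
    "P02768": 0.85,  # serum albumin
    "P11279": 0.95,
}


def has_signal_peptide(d_score: float, cutoff: float = D_SCORE_CUTOFF) -> bool:
    """A signal peptide is called only if the D-score exceeds the cutoff."""
    return d_score > cutoff


calls = {acc: has_signal_peptide(d) for acc, d in d_scores.items()}
for acc, called in calls.items():
    print(acc, "signal peptide" if called else "no signal peptide")
```

Since every score for P47863 stays below 0.21, the same rule rejects it as well, for example `has_signal_peptide(0.21)` is false.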
GO term prediction from sequence alone is not very reliable and depends heavily on what is already known.
|GOPET: GO Terms for Glucocerebrosidase|
|GO ID||Aspect||Confidence in %||GO Term|
|GO:0016798||F||97||hydrolase activity acting on glycosyl bonds|
GOPET mainly focuses on the activity of glucocerebrosidase. It predicts only a few GO terms, but these agree with the disease description in Task 1 as well as with the description in Pfam.
|Predict Protein: GO Terms for Glucocerebrosidase|
|GO ID||GO Term||Reliability in %|
|Biological Process||GO:0005515||protein binding||70|
sphingolipid metabolic process
carbohydrate metabolic process
polysaccharide metabolic process
The first three predicted terms of the Molecular Function aspect are confirmed by our knowledge about the process. The protein has no direct influence on cell death (GO:0008219), but is indirectly involved, as it processes the cell membranes of dead blood cells. An involvement in polysaccharide metabolic processes is not known, as glucosylceramidase acts on lipids (glucosylceramide), not on polysaccharides. This GO term is presumably wrong.
|Functional category||Prob||Odds|
|Amino_acid_biosynthesis||0.035||1.593|
|Biosynthesis_of_cofactors||0.182||2.528|
|Cell_envelope =>||0.504||8.262|
|Cellular_processes||0.032||0.438|
|Central_intermediary_metabolism||0.382||6.063|
|Energy_metabolism||0.067||0.740|
|Fatty_acid_metabolism||0.027||2.088|
|Purines_and_pyrimidines||0.538||2.213|
|Regulatory_functions||0.031||0.191|
|Replication_and_transcription||0.126||0.471|
|Translation||0.082||1.863|
|Transport_and_binding||0.560||1.365|
|Enzyme/nonenzyme||Prob||Odds|
|Enzyme =>||0.773||2.698|
|Nonenzyme||0.227||0.318|
|Enzyme class||Prob||Odds|
|Oxidoreductase (EC 1.-.-.-)||0.083||0.399|
|Transferase (EC 2.-.-.-)||0.228||0.660|
|Hydrolase (EC 3.-.-.-)||0.272||0.859|
|Lyase (EC 4.-.-.-)||0.045||0.961|
|Isomerase (EC 5.-.-.-)||0.011||0.345|
|Ligase (EC 6.-.-.-)||0.017||0.332|
|Gene Ontology category||Prob||Odds|
|Signal_transducer||0.054||0.251|
|Receptor||0.027||0.158|
|Hormone||0.001||0.206|
|Structural_protein||0.002||0.087|
|Transporter||0.024||0.222|
|Ion_channel||0.018||0.307|
|Voltage-gated_ion_channel||0.004||0.195|
|Cation_channel||0.012||0.268|
|Transcription||0.070||0.550|
|Transcription_regulation||0.030||0.237|
|Stress_response||0.085||0.962|
|Immune_response =>||0.153||1.804|
|Growth_factor||0.005||0.376|
|Metal_ion_transport||0.009||0.020|
The ProtFun server classifies P04062 as an enzyme of the class hydrolase. The predicted Gene Ontology category (immune response) points to the location of the protein, which acts in macrophages (immune cells). The functional category "cell envelope" suggests that the protein interacts with the cell membrane. This is correct, as the protein not only interacts with membranes but also processes their lipids.
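How these calls follow from the numbers can be illustrated with a small Python sketch over two of the ProtFun tables above. The helper names `top_by_prob` and `top_by_odds` are made up for this example, and the reading that the `=>` marker follows the highest odds is an assumption: it fits the marked rows in this output but is not stated in the text.

```python
# (probability, odds) pairs copied from the ProtFun output above.
functional_category = {
    "Amino_acid_biosynthesis": (0.035, 1.593),
    "Biosynthesis_of_cofactors": (0.182, 2.528),
    "Cell_envelope": (0.504, 8.262),
    "Cellular_processes": (0.032, 0.438),
    "Central_intermediary_metabolism": (0.382, 6.063),
    "Energy_metabolism": (0.067, 0.740),
    "Fatty_acid_metabolism": (0.027, 2.088),
    "Purines_and_pyrimidines": (0.538, 2.213),
    "Regulatory_functions": (0.031, 0.191),
    "Replication_and_transcription": (0.126, 0.471),
    "Translation": (0.082, 1.863),
    "Transport_and_binding": (0.560, 1.365),
}
enzyme_class = {
    "Oxidoreductase": (0.083, 0.399),
    "Transferase": (0.228, 0.660),
    "Hydrolase": (0.272, 0.859),
    "Lyase": (0.045, 0.961),
    "Isomerase": (0.011, 0.345),
    "Ligase": (0.017, 0.332),
}


def top_by_prob(table: dict) -> str:
    """Category with the highest probability."""
    return max(table, key=lambda name: table[name][0])


def top_by_odds(table: dict) -> str:
    """Category with the highest odds."""
    return max(table, key=lambda name: table[name][1])


print("functional category by odds:", top_by_odds(functional_category))
print("functional category by prob:", top_by_prob(functional_category))
print("enzyme class by prob:", top_by_prob(enzyme_class))
print("enzyme class by odds:", top_by_odds(enzyme_class))
```

Ranking by odds reproduces the marked "cell envelope" row, while the hydrolase call corresponds to the highest probability among the enzyme classes. All enzyme class odds are below 1, so that call should be treated as weak.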
For the protein P04062, the Pfam family O-Glycosyl hydrolase family 30 was found at positions 40-533.
O-Glycosyl hydrolase family 30
This family is part of the glycoside hydrolases, classified under EC 3.2.1. Glycoside hydrolases comprise a large number of enzymes that hydrolyse glycosidic bonds between carbohydrates, or between a carbohydrate and a non-carbohydrate moiety.
This family belongs to the clan CL0058 (Tim barrel glycosyl hydrolase superfamily).
The Pfam entry for the family PF02055 also mentions that defects in human glucosylceramidase cause Gaucher disease.
Other available methods
|secondary structure||APSSP: Advanced Protein Secondary Structure Prediction Server|
|CFSSP: Chou & Fasman Secondary Structure Prediction Server|
|Signal Find Server|
What else can be predicted from protein sequence alone
- Fold recognition (profile based pGenTHREADER and rapid GenTHREADER)
- Fold domain recognition (pDomTHREADER)
- Protein domain prediction (DomPred)
- Homology modelling (BioSerf v2.0)
- Function prediction (eukaryotic function: FFPred v2.0)
- Prediction of TM topology and helix packing (SVM-based MEMPACK)
- Cleavage site prediction
- Ab initio structure prediction (not very successful: it is a combinatorial problem, computationally intensive and worse for longer sequences. Moreover, biological molecules are not necessarily in the lowest-energy conformation.)
- Solvent accessibility
- Metal binding sites, active sites
- Protein protein interactions
- SNPs effect prediction
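The combinatorial problem mentioned for ab initio structure prediction in the list above can be made concrete with a toy calculation. The sketch assumes, purely for illustration, that every residue can adopt 3 independent backbone states; neither this number nor the function name `conformations` comes from the text.

```python
def conformations(n_residues: int, states_per_residue: int = 3) -> int:
    """Size of the search space if every residue independently picks one state."""
    return states_per_residue ** n_residues


# The search space grows exponentially with sequence length.
for n in (10, 50, 100):
    print(f"{n:>3} residues: {conformations(n):.2e} conformations")
```

Even with this crude assumption the number of conformations grows exponentially with length, which is why exhaustive search becomes infeasible for longer sequences.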
Which predictions can be improved considerably by structure-based approaches
- Solvent accessibility