Canavan Disease: Task 03 - Sequence-based Predictions

From Bioinformatikpedia

Sequence-based predictions are needed because structural or functional information is in many cases not available. One way to overcome this problem is to predict secondary structure, disordered regions, transmembrane helices and signal peptides from the sequence. Here these predictions are performed for aspartoacylase (ASPA, as the representative of Canavan Disease) as well as for other proteins (to get acquainted with the methods).

Secondary Structure

To decide which approach to follow, the proposed run combinations for ReProf were examined. Only prediction from the FASTA sequence versus prediction from a PSSM generated by PsiBlast was considered. The PSSM-based ReProf prediction was further split into PSSMs generated using big_80 and using SwissProt. For further comparison, a secondary structure prediction via Psi-Pred was run, as well as a secondary structure assignment by DSSP. DSSP assigns the secondary structure from the atom coordinates stored in the PDB. The DSSP assignment was therefore treated as the "true secondary structure", and the prediction methods were evaluated against DSSP as reference. The evaluation raised some problems, however. First, the PDB entry of ASPA presents the protein as a homodimer, a form in which it only exists when crystallized. To compare the prediction methods with DSSP, the DSSP output therefore had to be double-checked and only one part of the assignment (the monomer) could be used. Additionally, the beginning and the end of the DSSP assignment had to be padded with "no secondary structure assigned" symbols to stretch the assignment to the full length of the protein. The final statistics for the secondary structure prediction of aspartoacylase (P45381|ASPA) are displayed in <xr id="ACY2_statistics"></xr>.

<figtable id="ACY2_statistics">

Secondary Structure Prediction Statistics for ASPA
Method               Precision            Recall               F-Measure
Class                H     E     L        H     E     L        H     E     L
ReProf (FASTA)       0.773 0.822 0.562    0.829 0.446 0.808    0.800 0.578 0.663
ReProf (big_80)      0.878 0.889 0.644    0.793 0.675 0.890    0.833 0.767 0.747
ReProf (SwissProt)   0.853 0.937 0.620    0.780 0.711 0.849    0.815 0.809 0.717
Psi-Pred             0.914 0.970 0.647    0.780 0.771 0.904    0.842 0.859 0.754
Statistical overview of Precision, Recall and F-Measure for the prediction tools used,
with DSSP as reference. H = Helix, E = Beta-Strand, L = Loop. Psi-Pred shows the best performance
for ASPA. ReProf with a PSSM created by PsiBlast using big_80 as database performs second best
but greatly outperforms Psi-Pred in terms of speed (timings not shown; ReProf was run locally,
Psi-Pred on the official webserver).
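The per-class statistics in the table above can be computed along the following lines. This is a minimal sketch, not the script that was actually used: the function name `per_class_scores` and the toy strings are made up for illustration, and it assumes that prediction and DSSP reference have already been reduced to the three states H, E and L and brought to the same length.

```python
from typing import Dict, Tuple


def per_class_scores(predicted: str, reference: str,
                     classes: str = "HEL") -> Dict[str, Tuple[float, float, float]]:
    """Return (precision, recall, F-measure) for each secondary structure class."""
    if len(predicted) != len(reference):
        raise ValueError("prediction and reference must have the same length")
    scores = {}
    for c in classes:
        pairs = list(zip(predicted, reference))
        tp = sum(1 for p, r in pairs if p == c and r == c)  # correctly predicted as c
        fp = sum(1 for p, r in pairs if p == c and r != c)  # predicted c, reference differs
        fn = sum(1 for p, r in pairs if p != c and r == c)  # reference c, missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        scores[c] = (precision, recall, f_measure)
    return scores


# Toy example (hypothetical strings, not ASPA data)
reference = "LLHHHHLLEEEELL"
predicted = "LHHHHHLLLEEELL"
scores = per_class_scores(predicted, reference)
for state, (p, r, f) in scores.items():
    print(f"{state}: precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Whether residues without a DSSP assignment are counted as loop or excluded changes the L column, so that choice should be fixed before comparing methods.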


The Psi-Pred predictions were run via the official webserver, which takes much more time than running ReProf locally in the student lab. ReProf was therefore chosen as the preferred method, more precisely ReProf with a position-specific scoring matrix derived from big_80 (PSSM created with PsiBlast, cut-off e-10 and 3 iterations). Out of curiosity, Psi-Pred predictions were nevertheless run for the remaining proteins in addition to the ReProf predictions. Besides ASPA, other proteins were used to become acquainted with the methods. They are listed in <xr id="proteins"></xr>.

<figtable id="proteins">

Overview of other Proteins used for Secondary Structure and Disorder Prediction
Protein Description Organism
P10775 Ribonuclease Inhibitor Sus scrofa (Pig)
Q9X0E6 Divalent-cation tolerance protein CutA Thermotoga maritima (hyperthermophilic organism)
Q08209 Serine/threonine-protein phosphatase 2B catalytic subunit alpha isoform Homo sapiens (Human)
Overview of other proteins used for secondary structure predictions.


During the mapping of Uniprot identifiers to PDB IDs some complications arose, as not all structures that were found contained the full sequence of the translated gene. The structures used for the DSSP assignment were chosen manually to ensure that the whole sequence is contained in the structure, at least as part of the whole PDB entry. Additionally, some modifications had to be made to ensure that the DSSP assignment has the same length as the predictions by ReProf and Psi-Pred. For example, Q08209 mapped to 1AUI chain A, which covers most of the translated gene; however, parts of 1AUI could not be resolved in the crystal structure and their atom coordinates are missing from the PDB file (374 - 468). As a result those positions are absent from the DSSP assignment as well and had to be filled with "no structure assigned". After dealing with these complications, Precision, Recall and F-measure were calculated in the same manner as for deciding on the preferred prediction method. An overview of the prediction statistics with the DSSP assignment as reference can be seen in <xr id="additional_statistics"></xr>.
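The padding step described above can be sketched as follows. The helper `pad_assignment`, the gap symbol `-` and the toy input are assumptions made for illustration. The sketch also assumes that the residue numbers of the structure already match the positions in the Uniprot sequence, which has to be checked for each PDB entry.

```python
from typing import Dict


def pad_assignment(observed: Dict[int, str], length: int, gap: str = "-") -> str:
    """Stretch a partial assignment to the full sequence length.

    observed maps 1-based sequence positions to a secondary structure state;
    positions without coordinates (and thus without assignment) get `gap`.
    """
    return "".join(observed.get(pos, gap) for pos in range(1, length + 1))


# Toy example: positions 4, 5, 7 and 8 are missing from the structure
observed = {1: "L", 2: "H", 3: "H", 6: "E"}
padded = pad_assignment(observed, length=8)
print(padded)
```

The padded string then has one symbol per residue and can be compared position by position with the ReProf and Psi-Pred output.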

<figtable id="additional_statistics">

Secondary Structure Prediction Statistics for P10775, Q08209, Q9X0E6
Protein           Method     Precision            Recall               F-Measure
                  Class      H     E     L        H     E     L        H     E     L
P10775 (1DFJ_I)   ReProf     0.974 0.959 0.793    0.945 0.855 0.912    0.959 0.904 0.848
P10775 (1DFJ_I)   Psi-Pred   0.976 0.980 0.630    0.814 0.873 0.938    0.888 0.923 0.754
Q08209 (1AUI_A)   ReProf     0.957 0.842 0.658    0.780 0.787 0.878    0.859 0.814 0.752
Q08209 (1AUI_A)   Psi-Pred   0.895 0.971 0.594    0.723 0.557 0.944    0.800 0.708 0.729
Q9X0E6 (1O5J)     ReProf     0.973 0.971 0.526    0.947 0.829 0.833    0.960 0.894 0.645
Q9X0E6 (1O5J)     Psi-Pred   1.000 1.000 0.600    0.947 0.854 1.000    0.973 0.921 0.750
Statistical overview of Precision, Recall and F-Measure for the prediction tools used, with DSSP as reference.
H = Helix, E = Beta-Strand, L = Loop. For P10775 (1DFJ chain I) and Q08209 (1AUI chain A) ReProf
shows the better overall performance, although Psi-Pred reaches a slightly higher F-Measure for strands
in P10775. Psi-Pred shows a better performance for Q9X0E6.



Using the same sequences as for the secondary structure prediction, the disordered regions of the proteins were predicted. For this purpose IUPred and MetaDisorder were used, and the results were validated against Uniprot, PDB and Disprot. The results are presented separately for each protein in the following.
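Disorder predictors of this kind report a per-residue disorder tendency, from which regions are derived by thresholding. The following is a minimal sketch of that step. The function `disordered_regions`, the toy scores and the cut-off of 0.5 are assumptions for illustration; the cut-off used by each tool should be taken from its own documentation.

```python
from typing import List, Sequence, Tuple


def disordered_regions(scores: Sequence[float], threshold: float = 0.5,
                       min_length: int = 1) -> List[Tuple[int, int]]:
    """Return 1-based (start, end) stretches whose score exceeds the threshold."""
    regions = []
    start = None
    for pos, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = pos              # a disordered stretch begins
        elif start is not None:
            if pos - start >= min_length:
                regions.append((start, pos - 1))
            start = None                 # the stretch ended at pos - 1
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))  # stretch runs to the C-terminus
    return regions


# Toy scores (hypothetical, not taken from any of the proteins discussed here)
toy_scores = [0.8, 0.7, 0.2, 0.1, 0.3, 0.6, 0.9, 0.95]
regions = disordered_regions(toy_scores)
print(regions)
```

Raising `min_length` suppresses very short stretches, which is one way to make the region lists of two predictors easier to compare.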


Both IUPred and MetaDisorder predict ASPA (P45381) to be completely globular. Information about disorder could not be found in Uniprot, PDB or Disprot. As Disprot in particular has no entry for ASPA, a sequence search was initiated. However, the sequence search did not return reasonable results either. Both Smith-Waterman and PsiBlast returned hits with e-Values so high that it is fairly safe to assume that the results are not relevant. A closer look confirms this assumption: the best hits of both search methods are associated with cAMP-related chemical reactions, whereas the enzymatic reaction that P45381 catalyzes takes place without any involvement of cAMP. Furthermore, the overlap between the aligned sequences is rather short. Taken together, using one of the best hits to represent the disorder information for the desired protein would most probably result in false assumptions.


Both IUPred and MetaDisorder show the same result for P10775, as can be seen in <xr id="P10775_disorder_plot"> Figure </xr>. They predict no disorder anywhere in the protein. Searching for P10775 in Disprot to validate the findings does not directly yield an entry. However, when a sequence search is performed in the database via Smith-Waterman and PsiBlast, the best hit of both searches is DP00554 (Uniprot ID: Q9C000). Both searches indicate that P10775 and Q9C000 share a region of about 200 amino acids with high similarity. For at least that part it can be assumed that the prediction of IUPred and MetaDisorder is correct, as the Disprot entry for Q9C000 shows that the protein has only a short disordered region (approximately 20 residues long) and the overlapping region is located far from it. <figure id="P10775_disorder_plot">

Predicted disorder tendency for each residue of P10775. Predictions are done via IUPred and MetaDisorder. The horizontal bars represent the ordered regions as predicted by the prediction methods.



Running the disorder predictions for Q9X0E6 via IUPred and MetaDisorder gave two different results (see <xr id="Q9X0E6_disorder_plot"></xr>). IUPred predicts the protein to be globular with no detectable disorder. MetaDisorder, however, predicts four small regions of disorder within the protein, two spanning approximately 10 amino acids and the other two being even shorter.

Searching Disprot with the two different sequence search approaches resulted in different hits. The hits delivered by PsiBlast can quickly be discarded. Firstly, all proteins found have a length of 500 to approximately 1500 residues, while Q9X0E6 has a length of only 100 amino acids. Secondly, the three best hits all originate from viruses (Example: best hit via PsiBlast). And finally, all hits have an e-Value above 3.7 and the alignment itself only spans a region of 20 amino acids.

Using Smith-Waterman, the best hit (e-Value 0.36) was an uncategorized protein (Q57696 - Y246_METJA) that is disordered over its complete length. This protein originates from Methanocaldococcus jannaschii, and a comparison of the secondary structure information shows that the two proteins have completely different secondary structures. Q57696 is an assumed all-alpha protein, whereas Q9X0E6 is a mixed alpha and beta protein. Additionally, a comparison of the Pfam associations shows that the two proteins belong to distinct protein families (Q57696 -> PF01817 vs. Q9X0E6 -> PF03091) with completely different functions. With this information in mind it can safely be assumed that the hit found in Disprot via Smith-Waterman is a false hit, and therefore no related protein for Q9X0E6 can be found in the Disprot database. <figure id="Q9X0E6_disorder_plot">

Predicted disorder tendency for each residue of Q9X0E6. Predictions are done via IUPred and MetaDisorder. The horizontal bars represent the ordered regions as predicted by the prediction methods.



The predictions for Q08209 by IUPred and MetaDisorder both show that the protein has disordered regions at the start and the end of the sequence. In between, no disordered regions are predicted. The disordered regions are predicted to be roughly 15 residues long at the start and approximately 150 residues long at the end (see <xr id="Q08209_disorder_plot"> Figure</xr>). <figure id="Q08209_disorder_plot">

Predicted disorder tendency for each residue of Q08209. Predictions are done via IUPred and MetaDisorder. The horizontal bars represent the ordered regions as predicted by the prediction methods.

</figure> As Q08209 is the only one of the chosen proteins that can be found directly in Disprot, the predictions can be compared directly to the Disprot annotation. This comparison shows that the predictions of both IUPred and MetaDisorder reflect the actual state of disorder for Q08209 quite well, as becomes apparent when the predicted regions of disorder and order are compared to the Disprot entry of Q08209 (see <xr id="Q08209_disprot"> Figure</xr>). The disordered regions at the C-terminus of the protein are calmodulin binding regions and therefore possibly need this flexibility. <figure id="Q08209_disprot">

State and regions of disorder in Q08209.


Transmembrane Helices

Following the task, the transmembrane helices and topology of the three given proteins plus ASPA were predicted via PolyPhobius and MEMSAT-SVM. As running MEMSAT-SVM automatically returned the MEMSAT-3 results too, these were incorporated into the comparison as well.


ASPA (P45381) is located in the cytoplasm and is not bound to the cell membrane, so it should be safe to expect that none of the prediction methods predicts a transmembrane helix. However, PolyPhobius was the only method that indeed predicted none. MEMSAT-3 predicted a helix from amino acid position 60 to 78, although with a negative score. MEMSAT-SVM predicted a helix from amino acid 114 to 129, again with a negative score. As MEMSAT seems to test all possible helix counts from 1 to n, but not 0, it could be hypothesized that MEMSAT always returns a transmembrane helix prediction, even if the score is negative.


P35462 (PDB:3PBL), a human dopamine receptor, is a 7-helical transmembrane protein. Prediction of the transmembrane helices was done with MEMSAT (SVM and 3) and PolyPhobius. Interestingly, MEMSAT-SVM did not predict the correct number of helices, stopping after the sixth one. MEMSAT-3 correctly predicted seven helices despite being claimed to have less predictive power. PolyPhobius achieved the best prediction for this protein, having correctly predicted all 7 helices and having predicted the helix borders more precisely than MEMSAT. The exact numbers can be found in <xr id="P35462_tmhs"></xr>.

<figtable id="P35462_tmhs">

Predicted Transmembrane Helices for P35462
Helix Positions
Method #1 #2 #3 #4 #5 #6 #7
OPM 34-52 67-91 101-126 150-170 187-209 330-351 363-386
PDBTM 35-52 68-84 109-123 152-166 191-206 224-247 368-382
PolyPhobius 30-55 66-88 105-126 150-170 188-212 329-352 367-386
MEMSAT-SVM 32-55 65-88 101-129 151-169 188-209 331-354 no prediction
MEMSAT-3 31-55 67-91 102-126 148-167 189-213 327-350 365-383
Overview of the predicted transmembrane helices for P35462 compared to the annotation
in OPM and PDBTM.




Q9YDF8 (PDB: 1ORQ/2A0L/1ORS/2KYH) is a 4-helical transmembrane protein and a crucial part in the formation of potassium channels. All four PDB entries for Q9YDF8 contain a subunit of the potassium channel that holds the transmembrane parts of the protein. 1ORS was chosen to represent the group for validating the transmembrane helix prediction for Q9YDF8 against OPM and PDBTM. The annotations of 1ORS in OPM and PDBTM differ only in that OPM splits the third transmembrane helix into two parts separated by a short gap between residues 97 and 100; the two parts can therefore be considered a single transmembrane helix. Prediction of the transmembrane helices was done with MEMSAT (SVM and 3) and PolyPhobius. As the sequence used for prediction (Q9YDF8) contains the annotated sequence (1ORS) as a subsequence, an index correction has to be applied to the residue positions in Q9YDF8. After subtracting 15 from the residue indices reported by PolyPhobius and MEMSAT, the first four helices predicted by each method are approximately correct. However, relative to the 1ORS annotation PolyPhobius predicts three surplus helices towards the end of the amino acid chain, whereas MEMSAT predicts two. It can be assumed that these helices are predicted on a part of Q9YDF8 that is missing from 1ORS. Consequently, if the index correction is performed and the surplus helices are disregarded, both PolyPhobius and MEMSAT perform quite well. The exact numbers can be found in <xr id="Q9YDF8_tmhs"></xr>.

<figtable id="Q9YDF8_tmhs">

Predicted Transmembrane Helices for Q9YDF8
Helix Positions
Method #1 #2 #3 #4 (#5) (#6) (#7)
OPM 25-46 55-78 86-97 & 100-107 117-148 not existent not existent not existent
PDBTM 27-50 55-75 88-107 118-142 not existent not existent not existent
PolyPhobius 42-60 68-88 108-129 137-157 163-184 196-213 224-244
MEMSAT-SVM 43-59 72-90 101-118 128-143 163-184 not predicted 221-245
MEMSAT-3 38-60 66-90 100-119 122-141 161-184 not predicted 218-242
Overview of the predicted transmembrane helices for Q9YDF8 compared to the annotation of 1ORS in
OPM and PDBTM. The PolyPhobius and MEMSAT index values are raw output and not corrected (should be shifted by
approximately 15 residues to reflect a proper comparison between the annotated sequence and the sequence used
for prediction).
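The index correction mentioned in the caption can be sketched as follows, using the PDBTM and PolyPhobius rows of the table for the first four helices. The helper names `shift` and `overlap` are made up for illustration, and the offset of 15 is the approximate value stated in the text, not an exact alignment result.

```python
from typing import List, Tuple

Segment = Tuple[int, int]

OFFSET = 15  # approximate shift between Q9YDF8 and 1ORS numbering (from the text)

# First four helices, taken from the table above
pdbtm: List[Segment] = [(27, 50), (55, 75), (88, 107), (118, 142)]
polyphobius: List[Segment] = [(42, 60), (68, 88), (108, 129), (137, 157)]


def shift(segments: List[Segment], offset: int) -> List[Segment]:
    """Move predicted segments into the numbering of the annotated sequence."""
    return [(start - offset, end - offset) for start, end in segments]


def overlap(a: Segment, b: Segment) -> int:
    """Number of residues shared by two inclusive segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


shifted = shift(polyphobius, OFFSET)
overlaps = [overlap(pred, ref) for pred, ref in zip(shifted, pdbtm)]
for pred, ref, shared in zip(shifted, pdbtm, overlaps):
    print(f"predicted {pred} vs PDBTM {ref}: {shared} shared residues")
```

A proper correction would derive the offset from a pairwise alignment of the two sequences instead of a fixed constant.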




P47863 (PDB:2D57), a rat aquaporin, is a 6-helical transmembrane protein. Prediction of the transmembrane helices was done with MEMSAT (SVM and 3) and PolyPhobius. The annotation of P47863 in OPM states that there are 8 transmembrane subunits. However, two of them (the 3rd one ranging from residue 98 to 107 and the 7th ranging from 214 to 223) are almost certainly too short to be considered helices. PDBTM confirms this: exactly those two segments are annotated there as loops and are not considered transmembrane regions. In this case every prediction tool correctly predicted the number of helices. PolyPhobius and MEMSAT-SVM were slightly off in predicting the helix borders, whereas the claimed inferiority of MEMSAT-3 compared to MEMSAT-SVM can clearly be seen here in its less precise border prediction. The exact numbers can be found in <xr id="P47863_tmhs"></xr>.

<figtable id="P47863_tmhs">

Predicted Transmembrane Helices for P47863
Helix Positions
Method #1 #2 #3 #4 #5 #6
OPM 34-56 70-88 112-136 156-178 189-203 231-252
PDBTM 39-55 72-89 116-133 158-177 188-205 231-248
PolyPhobius 34-58 70-91 115-136 156-177 188-208 231-252
MEMSAT-SVM 35-56 71-89 113-136 157-178 190-205 232-252
MEMSAT-3 35-59 71-95 117-141 157-180 187-206 240-264
Overview of the predicted transmembrane helices for P47863
compared to the annotation in OPM and PDBTM.
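One way to quantify how precisely the helix borders are predicted is the mean absolute deviation of the predicted start and end positions from a reference annotation. The sketch below does this for the values in the table above with OPM as reference. The measure itself and the function name `mean_border_deviation` are assumptions chosen for illustration, not part of the original evaluation.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]

# Helix positions for P47863, taken from the table above
opm: List[Segment] = [(34, 56), (70, 88), (112, 136), (156, 178), (189, 203), (231, 252)]
predictions: Dict[str, List[Segment]] = {
    "PolyPhobius": [(34, 58), (70, 91), (115, 136), (156, 177), (188, 208), (231, 252)],
    "MEMSAT-SVM": [(35, 56), (71, 89), (113, 136), (157, 178), (190, 205), (232, 252)],
    "MEMSAT-3": [(35, 59), (71, 95), (117, 141), (157, 180), (187, 206), (240, 264)],
}


def mean_border_deviation(predicted: List[Segment], reference: List[Segment]) -> float:
    """Mean absolute difference over all start and end positions."""
    if len(predicted) != len(reference):
        raise ValueError("helix counts differ; pair the helices manually first")
    total = sum(abs(ps - rs) + abs(pe - re)
                for (ps, pe), (rs, re) in zip(predicted, reference))
    return total / (2 * len(reference))


deviations = {name: mean_border_deviation(segs, opm)
              for name, segs in predictions.items()}
for name, value in deviations.items():
    print(f"{name}: {value:.2f} residues")
```

With these numbers MEMSAT-3 deviates by several residues per border on average, while PolyPhobius and MEMSAT-SVM stay close to one residue, which matches the qualitative statement in the text.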



Signal Peptides

For the prediction of signal peptides SignalP version 4.1 (webserver) was used.


Serum albumin (P02768) is one of the main components of blood plasma. As it has to be secreted into the blood vessels, it can be expected that P02768 has motifs that are crucial for delivery down the secretory pathway and therefore contains a signal peptide sequence. This is exactly what the signal peptide prediction by SignalP shows. SignalP predicts that P02768 has a signal peptide sequence and that a cleavage site exists between amino acid positions 18 and 19. In the plot created by SignalP v4.1 (see <xr id="P02768_signalp">Figure</xr>) a clear signal at position 19 (0.710) can be observed.

<figure id="P02768_signalp">

Plot displaying the scores (C = cleavage, S = signal peptide, Y = combined) predicted for each amino acid by SignalP v4.1 for P02768. A clear spike for the cleavage site at position 19 can be seen, as well as high scores for signal peptide for the first 18 amino acids. (Source: Maple Sirup Urine Disease Group to prevent file duplicates in the wiki)




From the transmembrane helix prediction task it is known that P47863 is an aquaporin that is located within the membrane. The prediction by SignalP shows that neither a signal peptide sequence nor a cleavage site can be detected. Detailed graphical output can be seen in <xr id="P47863_signalp">Figure</xr>.

<figure id="P47863_signalp">

Plot displaying the scores (C = cleavage, S = signal peptide, Y = combined) predicted for each amino acid by SignalP v4.1 for P47863. Neither a spike in the C-score nor high S-scores can be seen, therefore no signal peptide sequence and no cleavage site are predicted by SignalP. (Source: Maple Sirup Urine Disease Group to prevent file duplicates in the wiki)




LAMP-1 (Lysosome-associated membrane glycoprotein 1 | P11279) is a membrane protein. It plays an important role in the autophagy process and is associated with tumor metastasis. It has one transmembrane helix, which could be some sort of protein anchor. The signal peptide prediction by SignalP reveals that LAMP-1 has an assumed signal peptide sequence and a cleavage site between amino acids 28 and 29. This agrees with the information stored in the Signal Peptide Database [1]. A detailed graphical output of the SignalP prediction is displayed in <xr id="P11279_signalp"> Figure</xr>.

<figure id="P11279_signalp">

Plot displaying the scores (C = cleavage, S = signal peptide, Y = combined) predicted for each amino acid by SignalP v4.1 for P11279. A clear spike for the cleavage site at position 29 can be seen, as well as high scores for signal peptide for the first 28 amino acids. (Source: Maple Sirup Urine Disease Group to prevent file duplicates in the wiki)




GO-Pet & Prot-Fun

The GO-Term prediction for aspartoacylase by GO-Pet (see <xr id="P45381_gopet"></xr>) is very accurate. Compared with the known enzymatic activity of ASPA, the predicted GO-Terms exactly reflect the chemical reaction it catalyzes.

<figtable id="P45381_gopet">

Predicted GO-Terms for P45381 (ASPA) by GO-Pet
GO-ID GO-Term / Description Confidence
GO:0016787 hydrolase activity 96%
GO:0004046 aminoacylase activity 82%
GO:0019807 aspartoacylase activity 82%
GO:0016788 hydrolase activity acting on ester bonds 81%
Overview of the predicted GO-Terms for P45381 (ASPA).


Prot-Fun correctly predicts that ASPA is an enzyme but misclassifies it as an isomerase (Prob:Odds 0.084:2637 vs 0.115:0.363 for hydrolase). Additionally, Prot-Fun cannot decide on a Gene Ontology category and sorts ASPA into the functional category of "central intermediary metabolism".


The Pfam sequence search with P45381 (ASPA) directed us to the succinylglutamate desuccinylase / aspartoacylase family PF04952. The InterPro information referenced in Pfam further states that the family has the molecular function "hydrolase activity, acting on ester bonds" (GO:0016788) and that its biological process is "metabolic process" (GO:0008152). In addition, the family belongs to the clan Peptidase_MH (CL0035). The family contains 2822 sequences, 1568 species and 43 known structures.

Further Investigation

Additional properties that can be predicted from sequence:

  • tertiary structure (e.g. COILS to predict coiled coils)
  • protein-protein interaction (e.g. ISIS from PredictProtein)
  • binding site (e.g. PredictProtein or BSPred)
  • post-translational modification (e.g. FindMod)

Predictions that can be improved by structure based approaches: