Glucocerebrosidase homology modelling

From Bioinformatikpedia
Revision as of 19:54, 22 August 2011 by Braunt (talk | contribs) (General)


In this section, different homology modelling approaches (MODELLER, I-TASSER, SWISS-MODEL) are applied to predict the structure of glucocerebrosidase. Instructions on how to create models with the different tools are given here. All illustrations of the different models were made with PyMOL.

Homology Modelling

Homology modelling refers to predicting the structure of a protein based on the known three-dimensional structures of homologous proteins (referred to as 'templates'). It rests on two observations: the structure of a protein is uniquely determined by its amino acid sequence, and the structure is better conserved and evolves more slowly than the corresponding sequence.
Homology modelling can be divided into seven steps: <ref>Bourne P., Weissig H. (2003) Structural Bioinformatics. Wiley-Liss, Inc., Hoboken, New Jersey.</ref>

  1. Template recognition and initial alignment
  2. Alignment correction
  3. Backbone generation
  4. Loop modelling
  5. Side-chain modelling
  6. Model optimization
  7. Model validation

Template Selection

The different homology modelling approaches are carried out with the protein sequence of 1OGS (PDB) instead of the sequence of P04062 (UniProt), as the latter contains the signal peptide, which is not present in the mature protein and therefore does not need to be modelled (the signal peptide was identified in Task 3 and could therefore be excluded even if the protein structure were still unknown). To retrieve homologous structures, HHSearch<ref></ref> was used to search against the pdb70 database as of 26 May 2011. The 10 best results of this search are listed in the table below. Interestingly, apart from the self-hit, only homologous structures from bacteria were found. The structures used as templates in the different modelling approaches are marked with an x. As no sequences with an identity of more than 40 percent were found (apart from 2NT0, a self-hit), only sequences with an identity below 40% could be used.

PDB-ID  Name                         Organism                               Identity  Template
> 60% sequence identity
2nt0    Glucosylceramidase           Homo sapiens                           99%
> 40% sequence identity
(no hits)
> 0% sequence identity
2wnw    SrfJ                         Salmonella enterica subsp. enterica    28%       x
3clw    Conserved exported protein   Bacteroides fragilis                   13%
3kl0    Glucuronoxylan xylanohydrolase  Bacillus subtilis                   19%       x
1nof    Xylanase                     Erwinia chrysanthemi                   18%
1qw9    Arabinosidase                Geobacillus stearothermophilus         14%
3ii1    Cellulase                    Uncultured bacterium                   13%
1uhv    Beta-xylosidase              Thermoanaerobacterium saccharolyticum  13%
2e4t    Endoglucanase                Clostridium thermocellum               11%       x
2c7f    alpha-L-Arabinofuranosidase  Clostridium thermocellum               16%
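
The grouping of hits into identity classes can be reproduced with a few lines of Python. This is only a minimal sketch: the hit list is copied from the table above, while the names `HITS`, `identity_bin` and `group_hits` are hypothetical helpers introduced for illustration and do not parse real HHSearch output.

```python
# Hits copied from the table above: (PDB ID, sequence identity in percent).
HITS = [
    ("2nt0", 99), ("2wnw", 28), ("3clw", 13), ("3kl0", 19), ("1nof", 18),
    ("1qw9", 14), ("3ii1", 13), ("1uhv", 13), ("2e4t", 11), ("2c7f", 16),
]


def identity_bin(identity):
    """Return the identity class used in the table (hypothetical helper)."""
    if identity > 60:
        return "> 60%"
    if identity > 40:
        return "> 40%"
    return "> 0%"


def group_hits(hits):
    """Group PDB IDs by identity class, highest identity first within a class."""
    groups = {"> 60%": [], "> 40%": [], "> 0%": []}
    for pdb_id, identity in sorted(hits, key=lambda hit: -hit[1]):
        groups[identity_bin(identity)].append(pdb_id)
    return groups


if __name__ == "__main__":
    for label, pdb_ids in group_hits(HITS).items():
        print(label, pdb_ids)
```

Sorting by identity puts 2wnw at the top of the lowest class, which matches its role as the most promising non-self template.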

2WNW belongs to the same family as 1OGS, the O-Glycosyl hydrolase family 30, and they share high structure similarities: a (β/α)₈-barrel catalytic domain and a β-sandwich domain. Furthermore it is assumed that it has a glucosylceramidase activity as well <ref>Kim et al. Crystal Structure of the Salmonella enterica Serovar Typhimurium Virulence Factor SrfJ, a Glycoside Hydrolase Family Enzyme. 2009. Journal of Bacteriology, November 2009, p. 6550-6554, Vol. 191, No. 21</ref>. Therefore 2WNW should be a good template despite the low sequence identity. 3KL0 is a member of O-Glycosyl hydrolase family 30 as well, but does not carry out glucosylceramidase activity. 2E4T belongs to the glycoside hydrolase family. 2NT0 was not chosen as template, as it is a self-hit.

Evaluation Criteria

Figure 1: Superposition of 1OGS (red) and 2NT0 (blue).

The models obtained with the different modelling tools are evaluated in several ways:

  • Visual interpretation: superposition of 1OGS with the different models, obtained with PyMOL.
  • Numeric evaluations given by the modelling tools.
  • Comparison to the reference structure in apo-form, 1OGS, and complexed with glycerol, 2NT0. The superposition of both structures, illustrated in Figure 1, shows that both reference structures are very similar (TM-Score: 0.9995 and Cα RMSD: 0.2 Å).
  • Cα RMSD:
The Cα root mean square deviation describes the average distance between the Cα atoms of two superimposed structures and is therefore a good measure of how close the predicted and the reference structure are. DaliLite <ref></ref>, a tool that performs a rigid body superposition and calculates the Cα RMSD for two given PDB files, is used in this analysis.
  • TM score:
The Template Modeling Score is a measure of similarity between two protein structures that is considered more accurate and sensitive than the RMSD. The similarity of two structures is indicated by a score between zero and one, where one describes a perfect match. In this analysis, the TM score is calculated with TM-score<ref></ref>, an online version from the Zhang Lab of the University of Michigan.
  • All Atom RMSD:
For some models (the ones based on the template 2WNW of each modelling method), the all atom RMSD of the area (6 Å) around the active site was calculated with the steps listed in the detailed workflow. The results of this evaluation are given in the discussion section.
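
To make the two distance measures concrete, the following sketch computes both from paired Cα coordinates. It assumes that the two structures are already superimposed and that residue i of one list corresponds to residue i of the other; it does not perform the superposition search that DaliLite or the TM-score program carry out, so it is an illustration of the formulas rather than a replacement for those tools. The function names are hypothetical.

```python
import math


def ca_rmsd(ref, model):
    """C-alpha RMSD of two equally long, already superimposed coordinate lists."""
    if len(ref) != len(model) or not ref:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(ref, model))
    return math.sqrt(squared / len(ref))


def tm_score(ref, model, target_length=None):
    """TM-score for one fixed superposition (no optimisation over rotations).

    Uses the usual length-dependent scale d0 = 1.24 * (L - 15)^(1/3) - 1.8.
    """
    length = target_length or len(ref)
    if length < 20:
        raise ValueError("the d0 formula used here assumes chains of 20+ residues")
    d0 = 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8
    total = sum(1.0 / (1.0 + (math.dist(a, b) / d0) ** 2)
                for a, b in zip(ref, model))
    return total / length


if __name__ == "__main__":
    # Toy coordinates (hypothetical): a straight chain and a copy shifted by 1 Å.
    reference = [(3.8 * i, 0.0, 0.0) for i in range(100)]
    shifted = [(x, y + 1.0, z) for x, y, z in reference]
    print(ca_rmsd(reference, shifted), tm_score(reference, shifted))
```

Because every distance is scaled by d0 and squashed into the range 0 to 1, a few badly placed residues lower the TM-score far less than they raise the RMSD.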


MODELLER is a method for comparative protein structure modelling by satisfaction of spatial restraints. In the simplest case, the most probable structure for a given sequence is found based on its alignment with related structures. In addition to model building, MODELLER can perform several other tasks, including fold assignment, pairwise and multiple alignments of protein sequences, calculation of phylogenetic trees, and de novo modelling of loops in protein structures. The method was published by Sali and Blundell in 1993. <ref>A. Sali & T.L. Blundell. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815, 1993.</ref>



The results presented and discussed in this section were retrieved according to this workflow. At first, MODELLER was used to build models based on one template structure. To do so, pairwise alignments of 1OGS with the different template structures (2WNW, 3KL0 and 2E4T) were created and used as input for MODELLER. Figure 2 shows the resulting models aligned to the structure of 1OGS, visualized with PyMOL. The results differ greatly: the models based on the templates 3KL0 and 2E4T deviate strongly from the reference structure, whereas the model based on 2WNW seems to be quite good in large parts of the protein. These visual interpretations are further examined and validated by different measurements, as described in the section below.

Figure 2: Results of MODELLER for pairwise sequence alignments.

As MODELLER can build models based on multiple sequence alignments as well, it was investigated whether this might improve the structure prediction. The templates 2WNW and 2E4T were chosen for this analysis, as the models based on pairwise alignments with them seemed to be better (concerning secondary structure elements) than the model obtained with the template 3KL0. The result of the modelling procedure with MODELLER and this multiple sequence alignment is shown below in Figure 3. The model consists mostly of loops, and defined secondary structure is rare: overall there are 5 small helices and 2 small sheets. In this case, the multiple sequence alignment did not help at all to predict the structure of glucocerebrosidase: the results obtained with pairwise sequence alignments are significantly better. This may not be true in general, however.

Figure 3: Results of MODELLER for multiple sequence alignment with 2WNW and 2E4T.


MODELLER assesses the quality of a model by calculating the so-called DOPE score (Discrete Optimized Protein Energy), a statistical potential optimized for model assessment. The model with the lowest DOPE score is considered the best one. As this score is not normalized with respect to protein size and has an arbitrary scale, the scores of different proteins cannot be compared. <ref></ref> The DOPE score was calculated with the Python script shown in the workflow.
The DOPE scores of the models described in the section above are listed in the table below. The model based on 2WNW has the lowest score and should therefore be the best model, whereas the model based on the multiple sequence alignment seems to be the worst one.

The MODELLER results described above are further validated by calculating the TM-Score and Cα RMSD with respect to the reference structures 1OGS (apo-form) and 2NT0 (complexed with glycerol). The resulting values are listed in the table below. The values obtained with either reference are very similar; those calculated with 2NT0 as reference are slightly better. As already suggested by the visual interpretation, the model built on template 2WNW is by far the best one: its TM-Score of 0.8 is close to 1 and its RMSD is below 2 Å. The RMSD of the 3KL0 model is better than that of the 2E4T model, whereas the latter has a higher TM-Score. This indicates that the secondary structure elements are better predicted in the model based on 2E4T, whereas the backbone of the model based on 3KL0 is closer to that of 1OGS. This agrees with the observation made before: comparing the secondary structure elements, the 2E4T model seems to fit the reference structures better than the 3KL0 model. The model obtained via the multiple sequence alignment is the worst one: it has an RMSD of 5 Å to 1OGS and a quite low TM-Score. Interestingly, its TM-Score is nevertheless higher than those of the 3KL0 and 2E4T models, although the structures of these models look much better.

                     Template 2WNW   Template 3KL0   Template 2E4T   Templates 2E4T & 2WNW
Numeric Evaluation
DOPE Score           -53982.191406   -44973.074219   -46178.175781   -25948.031250
Comparison to 1OGS
TM-Score             0.8094          0.1886          0.2171          0.2322
Cα RMSD (Å)          1.9             2.6             3.6             5.0
Comparison to 2NT0
TM-Score             0.8090          0.1899          0.2159          0.2323
Cα RMSD (Å)          1.9             2.3             3.6             3.2
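
The ranking implied by the table above can be checked programmatically. The sketch below is an illustration only: the numbers are copied from the table (comparison to 1OGS), and the names `MODELS` and `rank_by` are hypothetical.

```python
# Values copied from the table above (comparison to 1OGS).
MODELS = {
    "2WNW": {"dope": -53982.191406, "tm": 0.8094, "rmsd": 1.9},
    "3KL0": {"dope": -44973.074219, "tm": 0.1886, "rmsd": 2.6},
    "2E4T": {"dope": -46178.175781, "tm": 0.2171, "rmsd": 3.6},
    "2E4T & 2WNW": {"dope": -25948.031250, "tm": 0.2322, "rmsd": 5.0},
}


def rank_by(models, key, reverse=False):
    """Return model names sorted by one measure (ascending unless reverse)."""
    return sorted(models, key=lambda name: models[name][key], reverse=reverse)


dope_ranking = rank_by(MODELS, "dope")            # lower DOPE is better
tm_ranking = rank_by(MODELS, "tm", reverse=True)  # higher TM-Score is better
rmsd_ranking = rank_by(MODELS, "rmsd")            # lower RMSD is better

if __name__ == "__main__":
    print("DOPE:", dope_ranking)
    print("TM:  ", tm_ranking)
    print("RMSD:", rmsd_ranking)
```

All three measures agree on the 2WNW model as the best one, while DOPE and RMSD place the multi-template model last and the TM-Score places it second.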


I-TASSER (iterative threading assembly refinement) is a unified platform for automated protein structure and function prediction. A model is generated in two steps: first, possible template proteins are identified in the PDB library by multiple threading alignments (LOMETS); second, a three-dimensional atomic model is built by iterative structural assembly simulations. In addition to creating a structural model, I-TASSER assigns EC numbers, GO terms and binding sites by structurally matching the model to known proteins. The method was established by Zhang et al. in 2007. <ref>Ambrish Roy, Alper Kucukural, Yang Zhang. I-TASSER: a unified platform for automated protein structure and function prediction. Nature Protocols, vol 5, 725-738 (2010)</ref><ref>Yang Zhang. Template-based modeling and free modeling by I-TASSER in CASP7. Proteins, vol 69 (Suppl 8), 108-117 (2007)</ref>



I-TASSER was applied twice: the first run (run 1) was started without any template restrictions, and for the second run (run 2), homologous templates with a sequence identity of more than 80 percent were excluded to avoid self-hits. The results of I-TASSER are shown in Figure 4. To build Model 1 of run 1, I-TASSER used 3 different structures as templates: 1OGS Chain A, 2NT0 Chain A and 2V3E Chain A. All three structures are derived from human glucocerebrosidase and are therefore self-hits (sequence identity of 100%). As the model is based on self-hits, it is very accurate and very similar to the reference structure of 1OGS.
The second run, with the template restriction, returned 5 different models. I-TASSER chose two different template structures in this case: 2WNW Chain A and 3CLW Chain A, of which 2WNW was used most. The resulting structures are not as accurate as the one based on the self-hits, but are quite good as well (especially considering that the sequence identity between 1OGS and the two chosen templates is very low).

Figure 4: Results of I-TASSER for 1OGS. Run 1 = without template restriction, based on 1OGS, 2NT0 and 2V3E. Run 2 = with exclusion of sequences with more than 80% sequence identity, based on 2WNW and 3CLW.


I-TASSER estimates the accuracy of its predictions with the so-called C-Score, a scoring function based on the relative clustering density and the consensus significance score of multiple threading templates. The C-Score typically ranges from -5 to 2, where 2 signifies a model with high confidence. <ref></ref>
The values for the different models obtained with I-TASSER are given in the table below. The model obtained without template restriction reaches very good, almost the best possible values in each measure. This is not surprising, as the model is based on self-hits. The models obtained with exclusion of templates with a high sequence identity are very good as well: the TM-Scores are around 0.8 and the RMSD values are about 2 Å. The best of the five models of this run is Model 1.

Run 1: no template restriction (templates 1OGS, 2NT0, 2V3E)
Run 2: no templates with a sequence identity > 80%

                             Run 1     Run 2     Run 2     Run 2     Run 2     Run 2
                             Model 1   Model 1   Model 2   Model 3   Model 4   Model 5
Numeric Evaluation
C-Score                      2.050     -0.265    -1.379    -2.284    -2.236    -1.925
Comparison to 1OGS Chain A
TM-Score                     0.9987    0.8832    0.8797    0.8795    0.7819    0.8832
Cα RMSD (Å)                  0.3       2.1       2.1       2.1       3.1       2.4
Comparison to 2NT0 Chain A
TM-Score                     0.9990    0.8826    0.8791    0.8790    0.7810    0.8821
Cα RMSD (Å)                  0.3       2.0       2.0       2.1       3.1       2.3


The SWISS-MODEL workspace is a web-based protein structure homology modelling service, which was published by Arnold et al. in 2006. It offers three different modelling modes: automated mode, alignment mode and project mode. The template structure database used by SWISS-MODEL, called SMTL, is derived from the Protein Data Bank entries, which are split into individual chains. <ref> Arnold K., Bordoli L., Kopp J., and Schwede T. (2006). The SWISS-MODEL Workspace: A web-based environment for protein structure homology modelling. Bioinformatics, 22,195-201.</ref><ref>Schwede T, Kopp J, Guex N, and Peitsch MC (2003) SWISS-MODEL: an automated protein homology-modeling server. Nucleic Acids Research 31: 3381-3385</ref><ref>Guex, N. and Peitsch, M. C. (1997) SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modelling. Electrophoresis 18: 2714-2723</ref>



Alignment Mode
The modelling process in Alignment Mode worked for the templates 2E4T and 3KL0. With the standard output alignment of ClustalW2, however, the SWISS-MODEL work unit was aborted when trying to create the model with the template 2WNW: too many unsuccessful attempts were made to rebuild a loop. This indicates that SWISS-MODEL cannot use the alignment as it is and that it has to be adjusted. An analysis of the alignment (with the tool JalView) showed that the active sites and the residues forming hydrogen bonds with the active sites (listed in the table below <ref>Kim et al., Crystal Structure of the Salmonella enterica Serovar Typhimurium Virulence Factor SrfJ, a Glycoside Hydrolase Family Enzyme. Journal of Bacteriology, 2009, p. 6550-6554, Vol. 191, No. 21 </ref>) were already correctly aligned, so the alignment itself should be biologically sensible. SWISS-MODEL seemed to have a problem with one gap of length one. After removing this gap, SWISS-MODEL was able to perform the modelling process. The initial and the modified alignment can be seen in Figure 5. The alignments differ in the area from position 65 to position 158. Although the modified alignment has a lower quality and conservation in this area and Arg87 of 2WNW is no longer aligned to Arg120 of 1OGS, the modelling process works.

Residues forming the Active Site
Residues forming hydrogen bonds with active site

Figure 5: Initial Alignment (created with MODELLER) and Modified Alignment of 1OGS and 2WNW
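
The page does not state how the problematic gap was located. One possible way to list isolated gaps of length one in an aligned sequence is sketched below; the function name `single_residue_gaps` and the toy alignment string are hypothetical and are not taken from the 1OGS/2WNW alignment.

```python
import re


def single_residue_gaps(aligned_seq, gap="-"):
    """Return 0-based alignment columns of gaps that are exactly one column long."""
    g = re.escape(gap)
    # A gap character that is neither preceded nor followed by another gap.
    pattern = re.compile("(?<!{0}){0}(?!{0})".format(g))
    return [match.start() for match in pattern.finditer(aligned_seq)]


if __name__ == "__main__":
    toy_row = "AC-DE--FG-"  # hypothetical aligned sequence
    print(single_residue_gaps(toy_row))
```

Columns reported this way are only candidates for manual inspection, for example in JalView; whether shifting or removing such a gap is acceptable depends on which residues lose their alignment partner.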

The results of the SWISS-MODEL modelling process in Alignment Mode are shown in Figure 6. As one can see in the image, the model based on the template 2WNW seems to fit best, despite the modelling problems with the initial alignment. The model obtained with 3KL0 as template seems to fit the correct structure well, too. The model based on 2E4T adopts similar folds, but seems to be shifted slightly.

Figure 6: Results of SWISS-MODEL (Alignment Mode) for 1OGS

Automated Mode with specified templates
Although the Automated Mode of SWISS-MODEL is recommended only for templates with a high sequence similarity (> 50%), this mode was tried out as well. The model based on 2WNW yields a structure very similar to that of 1OGS, as one can see in Figure 7. The model obtained with 2E4T as template seems to miss some helices in comparison to the reference structure. The modelling process with 3KL0 as template was aborted: no model could be created. Figure 8 shows the resulting models of the Alignment and the Automated Mode in comparison. The models based on 2WNW are very similar, whereas the ones based on 2E4T differ a lot.

Figure 7: Results of SWISS-MODEL (Automated Mode) for 1OGS
Figure 8: Comparison of SWISS-MODEL results with Automated and Alignment Mode

Automated Mode without specified template
The Automated Mode of SWISS-MODEL also works if no template is specified: SWISS-MODEL then searches for templates itself. For the sequence of 1OGS, it chose 2NT0 as template for the modelling process. As 2NT0 is a self-hit, the resulting structure (displayed in Figure 9) is almost identical to the reference structure.

Figure 9: Result of SWISS-MODEL (Automated Mode without specified template). 2NT0 was chosen as template.


Figure 10: Normalized QMEAN4 Scores of the different models for glucocerebrosidase created with SWISS-MODEL

SWISS-MODEL calculates a global score for the whole model, the so-called QMEAN4 score. This score is a linear combination of 4 statistical potential terms (Cβ interaction energy, all-atom pairwise energy, solvation energy and torsion angle energy) and reflects the predicted model reliability on a scale from 0 to 1. Furthermore, a Z-Score (standard score) is given with respect to the scores obtained for high-resolution experimental structures of similar size. The Z-Score is a measure of the absolute quality of the model.
The Z-Scores and QMEAN4 scores of the different models are listed in the table below, and the Z-Score is additionally illustrated in Figure 10 to the right. A comparison of the values shows that the Automated Mode yields better results than the Alignment Mode. This indicates that the alignment created with ClustalW for 2E4T and the modified MODELLER alignment for 2WNW are not as good as the ones created by SWISS-MODEL itself. The model obtained with 2WNW as template yields the best results. For the Automated Mode without specified template, no scores were calculated.
Additional output graphics of SWISS-MODEL are provided in this file: File:Additional swiss model output.pdf
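
As a reminder of what a standard score expresses, the sketch below computes a Z-Score for a model score against a set of reference scores. The reference values are invented for illustration and are not the QMEAN reference set; whether SWISS-MODEL uses the population or the sample standard deviation is not stated on this page, so the choice of `pstdev` is an assumption.

```python
from statistics import mean, pstdev


def z_score(model_score, reference_scores):
    """Standard score of model_score relative to a set of reference scores."""
    mu = mean(reference_scores)
    sigma = pstdev(reference_scores)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("reference scores must not all be identical")
    return (model_score - mu) / sigma


if __name__ == "__main__":
    # Hypothetical scores of experimental structures of similar size.
    reference = [0.70, 0.80, 0.90]
    print(z_score(0.412, reference))
```

A strongly negative value, as reported for all models in this section, means that the model scores several standard deviations below what is expected for experimental structures of similar size.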

The TM-Scores and Cα RMSD values agree with the numeric evaluation: the models created with the Automated Mode are slightly better than the ones created with the Alignment Mode, and 2WNW as template yields the best results. The two values furthermore show that the model created with the Automated Mode without specified template is very close to the reference structure: the RMSD is less than 1 Å and the TM-Score is almost 1.

Columns 1-3: Alignment Mode. Columns 4-6: Automated Mode with specified template. Column 7: Automated Mode without specified template.

Template                     2WNW      3KL0      2E4T      2WNW      3KL0      2E4T      2NT0
Numeric Evaluation
QMEAN4 Score                 0.412     0.255     0.109     0.531     -         0.281     0
Z-Score                      -5.865    -8.587    -11.175   -3.977    -         -7.546    0
Comparison to 1OGS Chain A
TM-Score                     0.7766    0.4839    0.1860    0.8590    -         0.4296    0.9961
Cα RMSD (Å)                  2.3       2.5       3.6       2.1       -         3.5       0.5
Comparison to 2NT0 Chain A
TM-Score                     0.7763    0.4841    0.1870    0.8584    -         0.4290    0.9965
Cα RMSD (Å)                  2.0       2.5       3.7       1.9       -         3.5       0.5


All Atom RMSD of Active Site

Figure 11: 6 Å area around the active site of 1OGS compared to different models.

For one model of each homology modelling tool, the all atom RMSD of the 6 Å area around the active site with respect to the reference structure (1OGS) was calculated according to the following workflow. For MODELLER, the model based on the template 2WNW was chosen, as it is the most similar one to the reference structure. For SWISS-MODEL, the model based on 2WNW and calculated with the Alignment Mode was chosen, although its RMSD and TM-Score values are slightly worse than the ones retrieved with the Automated Mode. The Automated Mode model misses some residues at the beginning of the protein and is therefore not used in this analysis. For I-TASSER, the model obtained with exclusion of homologous structures was chosen. The 6 Å area around the active site (E235 and E340) of glucocerebrosidase (1OGS and 2NT0) consists of 30 different residues. An illustration of this area for 1OGS is shown in Figure 11 to the right, and the resulting all atom RMSD values are listed in the table below. The model obtained with I-TASSER has the lowest RMSD and therefore performs best, whereas SWISS-MODEL returns the worst model. This agrees with the results obtained with the TM-Score: the I-TASSER model has the highest TM-Score of 0.8832 and the chosen SWISS-MODEL model has the lowest TM-Score of 0.7766.

All Atom RMSD (6 Å around active site)
                   MODELLER   SWISS-MODEL              I-TASSER
                   2WNW       2WNW - Alignment Mode    2WNW, 3CLW - Model 1
to 1OGS Chain A    2.07       2.93                     1.18
to 2NT0 Chain A    2.19       2.93                     1.19
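
The linked workflow is not reproduced on this page. As a rough illustration of the idea, the sketch below selects all residues that have at least one atom within 6 Å of the active-site residues and computes an all atom RMSD over them. It assumes that reference and model are already superimposed, share the same residue numbering and are given as PDB-format text; all function names are hypothetical and this is not the workflow actually used for the table above.

```python
import math


def parse_atoms(pdb_text, chain="A"):
    """Return (residue number, atom name, (x, y, z)) for ATOM records of one chain."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and len(line) >= 54 and line[21] == chain:
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((int(line[22:26]), line[12:16].strip(), xyz))
    return atoms


def residues_near(atoms, site_residues, cutoff=6.0):
    """Residue numbers with at least one atom within cutoff Å of a site atom."""
    site_xyz = [xyz for res, _, xyz in atoms if res in site_residues]
    return sorted({res for res, _, xyz in atoms
                   if any(math.dist(xyz, site) <= cutoff for site in site_xyz)})


def all_atom_rmsd(ref_atoms, model_atoms, residues):
    """RMSD over atoms of the given residues that occur in both structures."""
    model = {(res, name): xyz for res, name, xyz in model_atoms}
    pairs = [(xyz, model[(res, name)]) for res, name, xyz in ref_atoms
             if res in residues and (res, name) in model]
    if not pairs:
        raise ValueError("no common atoms in the selected residues")
    squared = sum(math.dist(a, b) ** 2 for a, b in pairs)
    return math.sqrt(squared / len(pairs))
```

With real files, one would read the 1OGS and model PDB texts, call `residues_near(ref_atoms, {235, 340})` to obtain the pocket residues, and pass that list to `all_atom_rmsd`. Atoms missing in the model are skipped silently here, which is a design choice that keeps the sketch short but can hide incomplete side chains.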


3D-JIGSAW is a server that can build an optimised model based on already calculated ones. To do so, the tool splits the given models into different domains and recombines them.<ref></ref> This tool was used in an attempt to optimize and improve the models obtained in the sections above.



Figure 12: 5 optimized models obtained with 3D-Jigsaw based on 3 input models

In this analysis, one of the best models of each modelling tool was chosen as input model (the model based on 2WNW obtained with MODELLER, the model based on 2WNW obtained with SWISS-MODEL (Alignment Mode), and the model obtained with I-TASSER Run 2 / Model 1). The resulting optimized models of 3D-JIGSAW are shown in Figure 12 to the right, and the corresponding Ramachandran plots and energy plot are provided in the following File:Glucocerebrosidase ramachandran plots 3d jigsaw best models.pdf
The table below lists the energy values of the different models and the comparison to the reference structures 1OGS and 2NT0. Models 1 to 4 have quite similar energy values and identical TM-Scores and RMSD values. Model 5, which has a slightly higher (less negative) energy than the other four models, has slightly worse RMSD values and slightly better TM-Scores. Overall, the models obtained with 3D-JIGSAW have slightly worse RMSD values and TM-Scores than the MODELLER input model, better RMSD values than the chosen input model of I-TASSER, and better values for both measures than the input model of SWISS-MODEL. This indicates that 3D-JIGSAW was not able to build a model that is better than all of the input models.

                             Model 1   Model 2   Model 3   Model 4   Model 5
Energy                       -636.79   -635.93   -635.87   -635.86   -627.53
Coverage                     1.00      1.00      1.00      1.00      1.00
Comparison to 1OGS Chain A
RMSD (Å)                     2.0       2.0       2.0       2.0       2.1
TM-Score                     0.7931    0.7931    0.7931    0.7931    0.7955
Comparison to 2NT0 Chain A
RMSD (Å)                     1.9       1.9       1.9       1.9       2.0
TM-Score                     0.7929    0.7929    0.7929    0.7929    0.7953

Comparison of different Models

Models based on self-hits like 1OGS and 2NT0 are not taken into account in this comparison, as such templates are only available if the structure of the target protein is already known. Overall, models based on 2WNW yield the best results compared to the models based on 3KL0 and 2E4T. 2WNW is a bacterial protein and shares only 28% sequence identity with 1OGS, but as it is assumed to have glucosylceramidase activity as well, it seems to be a good template for human glucocerebrosidase. This is consistent with the observation that the structure of a protein is better conserved and evolves more slowly than its sequence.

I-TASSER created the model with the highest TM-Score, whereas MODELLER built the one with the best RMSD value. The area around the active site was modelled best by I-TASSER as well. Overall, all three modelling tools yield good and similar results when based on the template 2WNW.


<references />