Homology Modelling GLA
by Benjamin Drexler and Fabian Grandke
Introduction
In this task, we performed homology modelling of the protein α-galactosidase A with the programs MODELLER, SWISS-MODEL, I-TASSER and 3D-JIGSAW. Homology modelling relies on two assumptions. First, the structure of a protein is determined by its amino acid sequence. Second, the structure of a protein is more conserved than its amino acid sequence. Usually, homology modelling is performed for a protein whose structure is not known. In this case, several PDB structures of α-galactosidase A are available, so we are able to evaluate the models produced by the programs afterwards.
General
Template Selection
The following table lists the ten best hits of the HHpred search of Task 1. We used 3HG3 (97% identity), 1KTB (53%) and 3CC1 (34%) as templates for the modelling process. This selection covers a wide range of sequence identities, so we are able to evaluate how the sequence identity influences the quality of the models.
PDB-ID | Name | Probability | E-value | P-value | Identity | Template |
---|---|---|---|---|---|---|
> 60% sequence identity | ||||||
3hg3_A | Alpha-galactosidase A | 1.0 | 0 | 0 | 97% | x |
> 40% sequence identity | ||||||
1ktb_A | Alpha-N-acetylgalactosaminidase | 1.0 | 0 | 0 | 53% | x |
< 40% sequence identity | ||||||
1uas_A | Alpha-galactosidase | 1.0 | 0 | 0 | 39% | |
3lrk_A | Alpha-galactosidase 1 | 1.0 | 0 | 0 | 32% | |
3a5v_A | Alpha-galactosidase | 1.0 | 0 | 0 | 35% | |
1szn_A | Alpha-galactosidase | 1.0 | 0 | 0 | 34% | |
3a21_A | Putative secreted alpha-galactosidase | 1.0 | 0 | 0 | 34% | |
3cc1_A | BH1870 protein | 1.0 | 0 | 0 | 26% | x |
3a24_A | Alpha-galactosidase | 1.0 | 0 | 0 | 14% | |
1zy9_A | Alpha-galactosidase | 1.0 | 2.2E-37 | 8.8E-42 | 14% | |
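To illustrate what an identity column of this kind measures, the following is a minimal sketch of computing a percentage identity from a pairwise alignment. It is not how HHpred derives its values. The function name `percent_identity` and the choice to count only columns in which both sequences have a residue are assumptions made for this example, and other definitions of the denominator exist.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over all columns without a gap.

    Both arguments are rows of the same pairwise alignment, so they must
    have equal length. Columns with a gap in either row are skipped
    (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue
        compared += 1
        if res_a.upper() == res_b.upper():
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Toy alignment, not taken from the GLA alignments.
print(percent_identity("ACD-EF", "ACDWEY"))
```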
Evaluation
The evaluation of the models consists of two parts, i.e. a visual comparison with an experimental structure and a numeric evaluation. The PDB structures 1R46 and 1R47 were used for the evaluation. 1R46 is a structure of human α-galactosidase A without galactose (apo), whereas the structure 1R47 contains galactose (complexed). Both the visual comparison and the numeric evaluation were only done with chain A of these structures, because the modelling programs also modelled a single chain.
The differences between 1R46 and 1R47 are marginal (see figure 1), so we did the visual comparison with one structure only, i.e. 1R47.
The numeric evaluation involves the calculation of several scores.
RMSD
The root mean square deviation (RMSD) value between the model and the reference structure was calculated by the webserver of TM-align.
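For reference, the RMSD over N paired atoms is the square root of the mean squared distance between corresponding atoms. The following minimal sketch assumes that the two coordinate lists are already superimposed and paired one to one. TM-align additionally searches for the alignment and superposition itself, which is not shown here. The function name `rmsd` is chosen for this example.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples.

    The structures are assumed to be superimposed already; no fitting
    is performed in this sketch.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy coordinates: the second atom is displaced by 2 units along z.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)]))
```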
The RMSD of the catalytic site was calculated with PyMOL. We used the annotation of the UniProt entry to determine the active-site residues, which are Asp170 and Asp231. We applied the following workflow:
- Import the reference structure, e.g. 1R47
- Select the residues of the active site, i.e. Asp170 and Asp231
- Expand the selection with modify -> expand -> by 6A, residues
- Rename this selection to "selection_ref"
- Import the model
- Align the model to the reference structure (align -> to molecule -> 1R47)
- Select the residues of the active site of 1R47
- Expand the selection once again, but exclude residues of 1R47 (modify -> exclude -> object -> 1R47) afterwards
- Rename this selection to "selection_model"
- Align "selection_model" to "selection_ref" with align -> to selection and retrieve the RMSD
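The selection step of this workflow (all residues with an atom within 6 Å of Asp170 or Asp231) can be sketched without PyMOL as follows. This is only an illustration of the idea, not the PyMOL implementation. The function names `parse_atoms` and `residues_near` are made up for this example, and the parser assumes standard fixed-column PDB ATOM records.

```python
import math


def parse_atoms(pdb_lines, chain="A"):
    """Return (residue number, atom name, (x, y, z)) for ATOM records of one chain."""
    atoms = []
    for line in pdb_lines:
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        if line[21] != chain:
            continue
        resnum = int(line[22:26])
        name = line[12:16].strip()
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((resnum, name, xyz))
    return atoms


def residues_near(atoms, centre_residues, cutoff=6.0):
    """Residue numbers that have at least one atom within `cutoff` of a centre residue."""
    centre_xyz = [xyz for resnum, _, xyz in atoms if resnum in centre_residues]
    selected = set()
    for resnum, _, xyz in atoms:
        if any(math.dist(xyz, centre) <= cutoff for centre in centre_xyz):
            selected.add(resnum)
    return selected
```

With the atoms of chain A of a structure loaded through `parse_atoms`, a call such as `residues_near(atoms, {170, 231})` would return the residue numbers of the expanded active-site selection, including the two centre residues themselves.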
TM-Score
We also used the webserver of TM-align for the calculation of the TM-Score. The model was always used as 'Structure 1' and the reference as 'Structure 2'. We did this because 'Structure 2' is used for the normalization of the TM-Score, so we always normalize by the same number of residues.
At first, we used the command line TMS to calculate the TM-Score, but the values seemed to be wrong, i.e. far too low.
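The effect of the normalization can be seen from the TM-Score formula, which sums 1 / (1 + (d_i / d0)^2) over the aligned residue pairs and divides by the length of the reference structure, with d0 = 1.24 * (L - 15)^(1/3) - 1.8. The sketch below evaluates this formula for a fixed set of distances. It does not perform the search over superpositions that TM-align carries out. The function names and the lower bound of 0.5 for d0 on very short chains are assumptions of this example.

```python
def tm_d0(l_ref):
    """Length-dependent distance scale d0 of the TM-Score."""
    if l_ref <= 15:
        # Assumption of this sketch: fall back to a small positive d0.
        return 0.5
    return max(0.5, 1.24 * (l_ref - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, l_ref):
    """TM-Score for given distances between aligned residue pairs.

    `distances` holds one distance (in Angstrom) per aligned pair and
    `l_ref` is the number of residues of the structure used for
    normalization.
    """
    d0 = tm_d0(l_ref)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_ref


# The same 100 perfectly matching pairs, normalized by two different lengths.
print(tm_score([0.0] * 100, 100))
print(tm_score([0.0] * 100, 200))
```

Normalizing by a longer structure lowers the score even for identical aligned pairs, which is why keeping the reference in the same position gives comparable values across models.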
QMEAN Score
SWISS-MODEL also provides two QMEAN scores (QMEANscore4 and QMEAN Z-score). QMEAN is a score that describes the model quality and consists of five structural descriptors. For a more detailed explanation, please read the help of SWISS-MODEL.
QMEANscore4 ranges from 0 to 1 and indicates the reliability of the model. The Z-score describes the absolute quality of the model and is calculated by comparison to reference structures.
C-Score
The C-Score is calculated by I-TASSER and is a confidence score that is based on the significance of the alignments and the convergence parameters of the structure assembly simulations. The C-Score typically ranges from -5 to 2, and a high score indicates a model with a high confidence.<ref name=itasser_cscore>C-Score explanation</ref>
Calculation of Models
MODELLER
MODELLER is a program that produces three-dimensional protein structures by homology (comparative) modelling. The user has to provide the sequence of the protein to be modelled and the structure and sequence of at least one related protein that is used as a template. MODELLER uses all atoms of the template protein except the hydrogen atoms. We used MODELLER as described in the tutorial Using Modeller for TASK 4. For this, we had to align both sequences and convert the alignment into PIR format. This alignment is given as input together with the PDB file of the template. The run itself has to be specified in a Python script. <ref name=modeller>http://salilab.org/modeller/</ref>
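As an illustration of the conversion step, the following sketch writes one entry of a PIR alignment file in the layout MODELLER reads: a `>P1;code` line, a header line with ten colon-separated fields, and the aligned sequence terminated by `*`. The function name `pir_entry`, the codes and the toy sequences are hypothetical. The residue range and chain have to match the actual template file.

```python
def pir_entry(code, aligned_seq, *, is_template, chain="", first="", last="", width=75):
    """Format one sequence of an alignment as a PIR entry.

    Templates get the type 'structureX' (X-ray structure) and need the
    residue range and chain; the target gets the type 'sequence' and
    leaves these fields empty.
    """
    kind = "structureX" if is_template else "sequence"
    # Ten fields: type, code, first residue, first chain, last residue,
    # last chain, protein name, source, resolution, R-factor.
    fields = [kind, code, str(first), chain, str(last), chain, "", "", "", ""]
    header = ":".join(fields)
    body = aligned_seq + "*"
    wrapped = [body[i:i + width] for i in range(0, len(body), width)]
    return "\n".join([">P1;" + code, header] + wrapped) + "\n"


# Hypothetical codes and toy sequences, not the real GLA alignment.
alignment = (
    pir_entry("tmpl", "ACD-EF", is_template=True, chain="A", first=1, last=5)
    + pir_entry("target", "ACDWEF", is_template=False)
)
print(alignment)
```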
Pairwise Alignments
In this section, we used a pairwise alignment between the template (i.e. 3HG3, 1KTB or 3CC1) and the target as the input for MODELLER. All three models match the structure of 1R47 fairly well (see figure 2). The model based on 3HG3 seems to be the best (see figure 2A), closely followed by the model based on 1KTB (see figure 2B). The largest deviations from the reference structure can be observed in the model based on 3CC1 (see figure 2C), especially in the coil regions of the protein.
The numeric evaluation confirms these observations. The RMSD values and TM-Scores of the models based on 3HG3 and 1KTB are close in all columns. The results for 3CC1 suggest that the quality of this model is worse, but still decent considering that a TM-Score above 0.5 indicates two structures with the same fold. In this case, the difference of about 40% sequence identity between 3HG3 and 1KTB hardly affects the quality of the model, which is interesting. In contrast, the difference of 20% between 1KTB and 3CC1 leads to a clearly observable loss in model quality. It is also remarkable that the RMSD values of the catalytic site are lower than the overall RMSD values for 3HG3 and 1KTB, which suggests that these models are more accurate in the active site than on average.
Template | TM-Score (apo, 1R46) | RMSD [Å] (apo, 1R46) | RMSD catalytic site [Å] (apo, 1R46) | TM-Score (complexed, 1R47) | RMSD [Å] (complexed, 1R47) | RMSD catalytic site [Å] (complexed, 1R47) |
---|---|---|---|---|---|---|
3HG3 | 0.99473 | 0.53 | 0.326 | 0.99459 | 0.53 | 0.366 |
1KTB | 0.96933 | 1.44 | 0.439 | 0.96977 | 1.43 | 0.437 |
3CC1 | 0.78243 | 3.26 | 3.436 | 0.78291 | 3.26 | 3.405 |
Multiple Sequence Alignments
In addition to the pairwise approach, we used a multiple sequence alignment (MSA) as the input for the model. For this, we created an alignment of the sequences listed in the table below. Then we added the target sequence to the alignment and inspected it manually. The inspection showed that the sequences aligned well in general, except for the sequences 3LRK_A and 3CC1_A. Thus, these two were removed and the remaining sequences were realigned. Both the supervised and the unsupervised alignment were used as input for MODELLER. The table below shows the sequences used. An x indicates that the sequence was contained in the respective MSA; an empty cell indicates the opposite.
PDB-ID | Unsupervised | Supervised | Identity | Comment |
---|---|---|---|---|
3LX9_A | x | x | 99% | |
3GXP_A | x | x | 99% | |
3H53_A | x | x | 99% | |
3HG3_A | x | x | 97% | |
3IGU_A | x | x | 54% | |
1KTB_A | x | x | 53% | |
1UAS_A | x | x | 39% | |
3LRK_A | x | | 34% | Was removed due to low sequence identity. Caused huge gaps in the alignment. |
3CC1_A | x | | 28% | Was removed due to low sequence identity. Caused huge gaps in the alignment. |
Both models, from the unsupervised and the supervised MSA, perform very well. They fit the reference structure closely and there are almost no deviations (see figure 3). According to the numeric evaluation, the MSA models are even slightly better than the model based on the 3HG3 template alone. It is noteworthy that the exclusion of the two sequences with a low sequence identity slightly decreased the quality of the model. This suggests that there may be a benefit in including sequences with a low identity. Another explanation could be that the unsupervised MSA simply contained two more sequences. Overall, the results indicate that modelling with an MSA is a viable option.
Type | TM-Score (apo, 1R46) | RMSD [Å] (apo, 1R46) | RMSD catalytic site [Å] (apo, 1R46) | TM-Score (complexed, 1R47) | RMSD [Å] (complexed, 1R47) | RMSD catalytic site [Å] (complexed, 1R47) |
---|---|---|---|---|---|---|
unsupervised | 0.99563 | 0.47 | 0.330 | 0.99570 | 0.47 | 0.314 |
supervised | 0.99521 | 0.50 | 0.335 | 0.99537 | 0.49 | 0.329 |
I-TASSER
Figure 4 shows that I-TASSER takes an amino acid sequence as input and tries to retrieve template proteins from the PDB. In the next step, fragments from the templates are reassembled into a complete model. In the last step, the model is reassembled by taking energy calculations into account. Additionally, a biological function prediction is done, but that was not of interest for this task<ref name=itasser1>http://zhanglab.ccmb.med.umich.edu/I-TASSER/about.html</ref>. I-TASSER was developed by Zhang et al. in 2007<ref name=itasser_zhang>Yang Zhang. Template-based modeling and free modeling by I-TASSER in CASP7. Proteins, vol 69 (Suppl 8), 108-117 (2007)</ref><ref name=itasser_roy>Ambrish Roy, Alper Kucukural, Yang Zhang. I-TASSER: a unified platform for automated protein structure and function prediction. Nature Protocols, vol 5, 725-738 (2010).</ref>.
We used the I-TASSER server in two different ways:
- Standard parameters: the protein sequence is given as input and the program searches the PDB for templates. The proteins found are used as templates to predict the structure. No further arguments are given as input.
- PDB-ID as input: a template PDB-ID is given as input together with the amino acid sequence. For this, the second input field in the "Option 1" drop-down menu was filled with the respective PDB-ID, as explained here. The program takes all available information into account and uses it to calculate the structure.
We used 3HG3, 1KTB and 3CC1 as the PDB-ID input in runs 1, 2 and 3, respectively. Run 4 was performed with the standard parameters. The chosen options did not seem to influence the outcome in our case, which could be due to an error in our usage of I-TASSER.
In every run, we received two models as a result. According to the visual (see figure 5) and numeric evaluation, the first result (referred to as model 1) of every run is almost identical, and the same applies to the second result (referred to as model 2). Model 2 is always worse than model 1, but is still decent. A closer look at the visual representation reveals that these models always have a problem with four beta-sheets (on the left side) that are instead modelled as a coil region. It can be observed that 3 out of 4 "model 1" results have the same C-Score, whereas the TM-Score and RMSD values allow a further distinction.
Overall, the results are very good, but this is probably due to the fact that we were not able to exclude the self-hit, even though we restricted the modelling to a certain template.
Run | Model | TM-Score (apo, 1R46) | RMSD [Å] (apo, 1R46) | RMSD catalytic site [Å] (apo, 1R46) | TM-Score (complexed, 1R47) | RMSD [Å] (complexed, 1R47) | RMSD catalytic site [Å] (complexed, 1R47) | C-Score |
---|---|---|---|---|---|---|---|---|
1 | 1 | 0.99209 | 0.65 | 0.309 | 0.99236 | 0.64 | 0.353 | 1.883 |
1 | 2 | 0.87082 | 2.36 | 1.022 | 0.87069 | 2.35 | 0.617 | -0.314 |
2 | 1 | 0.99138 | 0.68 | 0.334 | 0.99173 | 0.67 | 0.327 | 1.870 |
2 | 2 | 0.87165 | 2.34 | 0.640 | 0.87137 | 2.33 | 0.634 | -0.327 |
3 | 1 | 0.99077 | 0.70 | 0.328 | 0.99101 | 0.69 | 0.331 | 1.883 |
3 | 2 | 0.87192 | 2.33 | 0.616 | 0.87173 | 2.32 | 0.600 | -0.315 |
4 | 1 | 0.99072 | 0.70 | 0.333 | 0.99099 | 0.69 | 0.334 | 1.883 |
4 | 2 | 0.87137 | 2.34 | 0.625 | 0.87104 | 2.33 | 0.645 | -0.314 |
SWISS-MODEL
We used the SWISS-MODEL server with two different options:
- Automated Mode: The target sequence is given as input together with the PDB ID and chain name of the template protein, as described in the help page. The information about the template is given in the advanced options area below the input field of the target sequence. This method should only be used if the sequence identity between target and template is greater than 50%.
- Aligned Mode: A pairwise alignment of template and target sequence is given as input. We created our alignments using the online version of ClustalW2 from EBI. There are no advanced options to specify the prediction, but several input formats are valid.
Template 3HG3
In this section, we deal with the models based on 3HG3 as the template. Both models, the aligned and the automated one, match the reference structure almost perfectly (see figure 6). The numeric evaluation confirms these observations. The TM-Score is close to 1, which indicates a very good model. Additionally, the values suggest that the two models are almost identical. Hence, in this case it does not make a difference whether the aligned or the automated mode is used.
Mode | TM-Score (apo, 1R46) | RMSD [Å] (apo, 1R46) | RMSD catalytic site [Å] (apo, 1R46) | TM-Score (complexed, 1R47) | RMSD [Å] (complexed, 1R47) | RMSD catalytic site [Å] (complexed, 1R47) | QMEAN Z-Score | QMEANscore4 |
---|---|---|---|---|---|---|---|---|
Aligned | 0.99504 | 0.51 | 0.278 | 0.99497 | 0.51 | 0.290 | -0.415 | 0.74 |
Automated | 0.99504 | 0.51 | 0.277 | 0.99497 | 0.51 | 0.291 | -0.415 | 0.74 |
The additional graphics of SWISS-MODEL are provided on this page.
Template 1KTB
In contrast to the template 3HG3, there is an observable difference between the models of the automated and the aligned mode. The model of the aligned mode shows huge deviations from the reference structure (see figure 7A), whereas the model of the automated mode matches the reference structure almost perfectly (see figure 7B). Despite these deviations, the TM-Score of the aligned-mode model is still good. Nonetheless, the TM-Score of the automated mode is even higher and its RMSD values are also better. The most remarkable finding is the extraordinarily high RMSD of the catalytic site of the aligned-mode model in comparison to its overall RMSD, which indicates a poor model quality near the catalytic site. So in this case, the automated mode is preferable.
Mode | TM-Score (apo, 1R46) | RMSD [Å] (apo, 1R46) | RMSD catalytic site [Å] (apo, 1R46) | TM-Score (complexed, 1R47) | RMSD [Å] (complexed, 1R47) | RMSD catalytic site [Å] (complexed, 1R47) | QMEAN Z-Score | QMEANscore4 |
---|---|---|---|---|---|---|---|---|
Aligned | 0.84410 | 2.17 | 5.073 | 0.84472 | 2.15 | 6.409 | -12.996 | -0.022 |
Automated | 0.96507 | 1.22 | 0.417 | 0.96559 | 1.21 | 0.404 | -2.742 | 0.599 |
The additional graphics of SWISS-MODEL are provided on this page.
Template 3CC1
Unfortunately, SWISS-MODEL was not able to perform the automated modelling with 3CC1 as template due to the low sequence identity. We received the following message: "building model based on 3cc1A (1-390) was not successfull go to next best template"
Hence, we are only able to evaluate the result of the aligned mode. In comparison to the results of the other two templates, the model based on the 3CC1 template shows the greatest differences to the reference structure (see figure 8). Nevertheless, the TM-Score is still decent. The RMSD values are the highest overall and, just like in the aligned-mode model of the 1KTB template, the RMSD of the catalytic site is higher than the overall RMSD. It is clearly recognizable that the quality of the model suffers from the low sequence identity.
Mode | TM-Score (apo, 1R46) | RMSD [Å] (apo, 1R46) | RMSD catalytic site [Å] (apo, 1R46) | TM-Score (complexed, 1R47) | RMSD [Å] (complexed, 1R47) | RMSD catalytic site [Å] (complexed, 1R47) | QMEAN Z-Score | QMEANscore4 |
---|---|---|---|---|---|---|---|---|
Aligned | 0.60954 | 3.47 | 7.107 | 0.61051 | 3.53 | 7.357 | -14.046 | -0.067 |
Automated | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
The additional graphics of SWISS-MODEL are provided on this page.
3D-JIGSAW
3D-JIGSAW is a program that takes already calculated models as input and tries to improve them. For this, we had to create a single PDB file that contains all the models which should be incorporated. We used the webserver of 3D-JIGSAW for our calculations.
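One way to build such a combined file is sketched below. It wraps the coordinate records of each model in MODEL/ENDMDL records, as the PDB format defines for multi-model files. Whether the 3D-JIGSAW server expects exactly this layout is an assumption of this example, and the function name `combine_models` is made up.

```python
def combine_models(pdb_texts):
    """Merge several single-model PDB files (given as strings) into one text.

    Only ATOM, HETATM and TER records are kept; each input becomes one
    MODEL ... ENDMDL block, numbered from 1.
    """
    out = []
    for number, text in enumerate(pdb_texts, start=1):
        out.append("MODEL     %4d" % number)
        for line in text.splitlines():
            if line.startswith(("ATOM", "HETATM", "TER")):
                out.append(line)
        out.append("ENDMDL")
    out.append("END")
    return "\n".join(out) + "\n"
```

In practice, the strings would be read from the model files of the individual programs, for example with `open(path).read()` for each file, and the result written to a new file for upload.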
Because we also used a template with a high sequence identity to calculate the models with the other three programs (MODELLER, I-TASSER, SWISS-MODEL), we received very accurate models from all programs. Since their TM-Scores were already very close to 1 and their RMSD values close to 0, 3D-JIGSAW is probably not able to improve these models any further. Therefore, we also compiled a second set of mediocre models in order to evaluate whether 3D-JIGSAW is able to improve a set of models. This set is referred to as the "mediocre models" set and the set of the best models as the "good models" set.
Good Models
This set contains the following models:
Program | Model/Run | TM-Score (1R47) | RMSD (1R47) |
---|---|---|---|
MODELLER | 3HG3 template | 0.99459 | 0.53 |
MODELLER | unsupervised MSA | 0.99570 | 0.47 |
I-TASSER | run 1, model 1 | 0.99236 | 0.64 |
SWISS-MODEL | 3HG3, aligned mode | 0.99497 | 0.51 |
SWISS-MODEL | 1KTB, automated mode | 0.96559 | 1.21 |
The visual comparison (see figure 9A) and the numeric evaluation suggest that the resulting models of 3D-JIGSAW are almost perfect. This is not surprising, since the input was a set of very accurate models. As assumed, there was no improvement over the input set.
Model | TM-Score (apo, 1R46) | RMSD [Å] (apo, 1R46) | RMSD catalytic site [Å] (apo, 1R46) | TM-Score (complexed, 1R47) | RMSD [Å] (complexed, 1R47) | RMSD catalytic site [Å] (complexed, 1R47) | Energy |
---|---|---|---|---|---|---|---|
1 | 0.99472 | 0.53 | 0.377 | 0.99458 | 0.53 | 0.329 | -609.82 |
2 | 0.99503 | 0.51 | 0.416 | 0.99496 | 0.51 | 0.289 | -608.81 |
3 | 0.99503 | 0.51 | 0.416 | 0.99496 | 0.51 | 0.289 | -608.78 |
4 | 0.99469 | 0.53 | 0.376 | 0.99456 | 0.53 | 0.328 | -607.96 |
5 | 0.99492 | 0.52 | 0.277 | 0.99487 | 0.52 | 0.289 | -606.74 |
Mediocre Models
This set contains the following models:
Program | Model/Run | TM-Score (1R47) | RMSD (1R47) |
---|---|---|---|
MODELLER | 3CC1 template | 0.78291 | 3.26 |
I-TASSER | run 1, model 2 | 0.87069 | 2.35 |
SWISS-MODEL | 1KTB, aligned mode | 0.84472 | 2.15 |
SWISS-MODEL | 3CC1, aligned mode | 0.61051 | 3.53 |
3D-JIGSAW provides five models as output. Model 1 is superior according to the TM-Score, the RMSD and the energy calculation, so we conclude that this model is the best overall. The visual representation also suggests a decent result, but the model seems to fail especially with some beta-sheets at the C-terminus (see figure 9B). Surprisingly, it is outperformed by some of the other models in the RMSD of the catalytic site. Taking model 1 as the reference for this run, 3D-JIGSAW was not able to provide an improvement over the set which was given as input.
Model | TM-Score (apo, 1R46) | RMSD [Å] (apo, 1R46) | RMSD catalytic site [Å] (apo, 1R46) | TM-Score (complexed, 1R47) | RMSD [Å] (complexed, 1R47) | RMSD catalytic site [Å] (complexed, 1R47) | Energy |
---|---|---|---|---|---|---|---|
1 | 0.84599 | 2.59 | 1.626 | 0.84568 | 2.58 | 1.163 | -501.82 |
2 | 0.74083 | 3.68 | 2.108 | 0.74099 | 3.77 | 0.851 | -488.67 |
3 | 0.73951 | 3.61 | 1.531 | 0.74037 | 3.74 | 0.764 | -486.56 |
4 | 0.74083 | 3.68 | 2.108 | 0.74099 | 3.77 | 0.851 | -486.52 |
5 | 0.73951 | 3.61 | 0.743 | 0.74037 | 3.74 | 1.182 | -485.56 |
Discussion
In this task, we used four different programs (i.e. MODELLER, I-TASSER, SWISS-MODEL and 3D-JIGSAW) to model the structure of the human α-galactosidase. Three templates that vary in sequence identity from 97% to 34% and two multiple sequence alignments were used. The runs with the 97% sequence identity template (PDB ID 3HG3) produced almost perfect models with respect to the reference structure and hence suggest that homology modelling is able to generate correct models in general. In real-world applications, a template with such a high sequence identity will rarely be available. Therefore, we discuss the results of the templates with a lower sequence identity.
The template with a sequence identity of 53% (PDB ID 1KTB) was also able to provide a basis for the homology modelling process, so that the resulting models had a very good quality in general. Some of the models were only slightly worse than the models with 3HG3 as template, which emphasises the capabilities of homology modelling, i.e. that a template with a lower sequence identity is also able to generate good models. In contrast, the models based on the sequence with the lowest identity (34%, PDB ID 3CC1) show a noticeable loss of quality. In this case, the difference of about 40% sequence identity between 3HG3 and 1KTB does not affect the quality of the model much, whereas the difference of 20% between 1KTB and 3CC1 leads to a clearly observable loss in model quality. It needs to be mentioned that these observations could be biased by the selection of the templates. Since a sequence alignment is involved most of the time, the sequence identity is not the only factor to take into account.
With respect to the numeric evaluation, we observed that the RMSD value of the catalytic site was, most of the time, considerably lower than the overall RMSD. This could be due to the fact that the conservation of the catalytic site is generally higher than that of other regions. We have to remark that the two different RMSD values were calculated by two different programs. Therefore, further examination would be required to gain more confidence in this observation.
References
<references />