Homology modelling TSD


The protocol for this task can be found here.


Since similar sets were already collected for Task 2, the information was reused. In addition searches with HHpred on pdb70 and with COMA on pdb40 were performed. If two structures were mapped to the same Uniprot entry, only one, the 'most native' one, was used. The set of chosen templates is displayed in <xr id="tab:templates" />.
None of the searches revealed any structure with >80% sequence identity other than the two already known structures 2gjx and 2gk1, which share 100% sequence identity. To still perform the task, 2gjx, which is the native structure <ref name="2gjxref">Lemieux,M. et al. (2006) Crystallographic Structure of Human beta-Hexosaminidase A: Interpretation of Tay-Sachs Mutations and Loss of GM2 Ganglioside Hydrolysis. Journal of molecular biology, 359, 913-29.</ref>, was chosen as reference and 2gk1, which has the inhibitor NGT bound, was used as template. In the range of 40-80% sequence identity only one additional entry could be added from COMA. Most of the hits found by either COMA or HHpred turned out to have a sequence identity below 25%.

<figtable id="tab:templates">

Identity range | PDB id | Sequence identity | Method
> 80% identity | 2gk1 chain A | 100% | Task 2/HHpred/COMA
40% - 80% identity | 1o7a chain D | 56.6% | Task 2
40% - 80% identity | 3lmy chain A | 54% | COMA
< 30% identity | 3nsm chain A | 27.5% | Task 2
< 30% identity | 3gh5 chain A | 20.7% | Task 2
Table 1: Overview of possible templates and how they were found. Bold structures are the ones used for the rest of the task unless denoted otherwise.
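The grouping of templates by sequence identity can be illustrated with a small sketch. It is not the procedure used by HHpred or COMA. The function names are hypothetical, and the identity is computed over alignment columns in which both sequences have a residue, which is only one of several common definitions.

```python
def percent_identity(target_aln: str, template_aln: str) -> float:
    """Percent identity over columns where both aligned sequences have a residue.

    Assumption: gaps are written as '-' and both strings have equal length.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b)
        for a, b in zip(target_aln.upper(), template_aln.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def identity_class(identity: float) -> str:
    """Assign an identity value to the ranges used in Table 1."""
    if identity > 80.0:
        return "> 80%"
    if identity >= 40.0:
        return "40% - 80%"
    if identity < 30.0:
        return "< 30%"
    return "unassigned (30% - 40%)"  # range not covered by the task


# Identity values as listed in Table 1
templates = {"2gk1_A": 100.0, "1o7a_D": 56.6, "3lmy_A": 54.0,
             "3nsm_A": 27.5, "3gh5_A": 20.7}
classes = {name: identity_class(value) for name, value in templates.items()}
for name, label in classes.items():
    print(name, label)
```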


Important residues

Based on previous analyses in the introduction to Tay-Sachs disease, the following residues were deemed important and are therefore specifically observed in the subsequent alignments and structures: R178, D207, E323, N423, E462. The numbers are based on the Uniprot entry P06865. The residue numbering in the reference structure 2gjx corresponds to this as well.
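Checking which template residue is aligned to an important target residue amounts to counting non-gap characters in the target row of the alignment. The following sketch shows this on a toy alignment. The function name and the example sequences are hypothetical and are not taken from the alignments discussed here.

```python
def aligned_partner(target_aln, template_aln, target_pos, first_target_resnum=1):
    """Return (target residue, template residue) for a target residue number.

    target_pos is given in the numbering of the target sequence (e.g. Uniprot),
    first_target_resnum is the number of the first target residue in the alignment.
    The template residue is '-' if the position is aligned to a gap.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    resnum = first_target_resnum - 1
    for t, s in zip(target_aln, template_aln):
        if t == "-":
            continue  # gap in the target: no target residue consumed
        resnum += 1
        if resnum == target_pos:
            return t, s
    raise KeyError(f"position {target_pos} is not covered by the alignment")


# Toy alignment (hypothetical): target residue 4 (D) is aligned to N
toy_target = "MK-RDE"
toy_template = "MKARNE"
print(aligned_partner(toy_target, toy_template, 4))
```

With a real alignment, the same lookup could be repeated for R178, D207, E323, N423 and E462 to see whether each is aligned to an identical or a similar residue.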


Automatic alignments


The alignments produced by the simple and the 2D alignment methods within Modeller are very similar. One differing position is located at the end of the first alpha helix: since the preceding loop region is not part of the structure, a large gap is needed and is found by both alignment methods. However, the first residue after the helix, which therefore belongs to the loop region, is positioned after the gap in the simple alignment and before it in the structure based one. Clearly this is a borderline case and the choice should not make a difference. Placing the threonine after the gap, however, aligns it with another threonine and is therefore the slightly more convincing choice.

The models produced by Modeller from these two alignments are shown in <xr id="fig:modeller_2gk1_overall"/>. As can be seen, the two models are very similar to each other and are also very well aligned to the reference structure 2gjx. The loop regions with the important residues in particular show only marginal deviation from the reference. The only difference is found at the loop region that is not part of the reference structure. This region is shown for the two models in <xr id="fig:modeller_2gk1_loop_compare"/>. Placing the single threonine before or after the gap in the alignment has a larger effect than expected. Placing it after the gap (simple alignment method) elongates the alpha-helix by one residue, leading to a loop running along the surface of the protein, while placing it before the gap (2D-Aligner) leads to a slightly shorter helix and a loop that partly runs buried in the protein and ends with a short beta-strand. The structure based on the simple alignment approximates the reference structure slightly better, and there seems to be no apparent reason for the placement of a beta-strand at the position given by the 2D-Aligner based model. Since the loop is not resolved in the reference structure, there is no clear answer as to which model is better. However, the loop running inside the protein creates atomic clashes with another loop region, so that specific model is not plausible (see <xr id="fig:modeller_2gk1_2d_clashes"/>). In any case it is interesting to see what large effects the placement of a single, comparatively neutral residue like threonine can have.

<figure id="fig:modeller_2gk1_overall">
Figure 1: Structural alignment of models created with Modeller using 2gk1 as template. The reference structure is denoted in green, the models created using the simple and 2D-Aligner are coloured in magenta and cyan respectively. Important residues are highlighted in yellow and the essential residue E323 in orange.


<figure id="fig:modeller_2gk1_loop_compare">
Figure 2: Comparison of Modeller models for the loop that is not resolved in the reference structure. Color coding is equal to the one in Figure 1.


<figure id="fig:modeller_2gk1_2d_clashes">
Figure 3: Detail of the Modeller model based on the structure based alignment. Atoms of the loop region that is not present in the reference are highlighted in cyan. Atoms of the same model at another part of the strand that clash with the loop atoms are shown in orange. The reference is always shown in green and the end of the resolved part can be seen at the top left of the figure.
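Atomic clashes such as the ones highlighted in Figure 3 can also be searched for automatically. The sketch below flags atom pairs from two regions of a model that lie closer together than a distance cutoff. The cutoff of 2.0 Å and the coordinates are illustrative assumptions, not values used in this analysis.

```python
import math


def find_clashes(atoms_a, atoms_b, cutoff=2.0):
    """Return (index_a, index_b, distance) for all atom pairs closer than cutoff.

    atoms_a, atoms_b: lists of (x, y, z) coordinates in Angstrom.
    cutoff: hypothetical clash threshold in Angstrom.
    """
    clashes = []
    for i, a in enumerate(atoms_a):
        for j, b in enumerate(atoms_b):
            distance = math.dist(a, b)
            if distance < cutoff:
                clashes.append((i, j, distance))
    return clashes


# Made-up coordinates: one atom of the modelled loop overlaps with another loop
modelled_loop = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
other_loop = [(1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(find_clashes(modelled_loop, other_loop))
```

The quadratic loop is sufficient for a single loop against its surroundings. For whole structures a spatial grid or a neighbour search from a structure library would be the more practical choice.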



The two alignments (simple, 2D) of 1o7a, the template with intermediate sequence identity, exhibit the same difference in the placement of the threonine before or after the gap. Two gaps in the middle of the sequence, both of length three, do not differ between the two alignments. A second difference is found at the end of the alignment, where the 2D-Aligner inserts a five residue gap before the last two residues, while the simple alignment simply ends in the gap. Since this is the loop region at the end of the structure, there is nothing to support the shift of the last two residues, and the simple alignment is the more convincing one. It also aligns asparagine with glutamine instead of the more distant glutamic acid. Whatever the choice, one would not expect a difference in the produced models.

<figure id="fig:modeller_1o7a_overall">

Figure 4: Structural alignment of models created with Modeller using 1o7a as template. Color coding is equal to the one used in Figure 1.


The two models are shown in <xr id="fig:modeller_1o7a_overall"/>. Again, there is generally a large agreement between the models and the reference structure. However, compared to the high sequence identity template, stronger deviations in loop regions can be observed for both models. The loops containing the important residues, however, are very well aligned and undistorted. The largest differences are again found at the region of the loop that has not been resolved in the reference. Again, placing the threonine after the gap in the alignment leads to a slightly longer alpha-helix. This time, however, the resulting loop runs 'above' the alpha helix at the surface of the protein and towards its end enters the same cavity that the loop entered in the model based on 2gk1 and the 2D-Aligner. For the 2D-Aligner based model, the loop begins and ends similarly as for 2gk1, although it differs in the path it takes inside the protein. Atomic clashes seem very likely for the latter model (not shown), so the one based on the simple alignment again seems more convincing.
The loop region at the end of the sequence, which showed a difference in the placement of the last residues in the alignments, also differs in the models. However, since it is not resolved in the reference structure, there is no telling which one is better.


The sequence similarity is very low in this case, which is reflected in the differing alignments. It is easily observable that the simple alignment method tries to group gaps together instead of creating short breaks. The structure based alignment introduces several short gaps in loop regions for no apparent reason. More importantly, the 2D-Aligner introduces a gap in the middle of the first helix, while the helix is preserved in the simple alignment. This is very unexpected, since the whole point of the 2D-Aligner is that such cases are smoothed out instead of created in the first place. Another helix is broken by both methods. However, the 2D-Aligner performs worse again, since it introduces two gaps in the helix instead of only one. The last residue of 3gh5 is separated by a gap in both alignment methods for no apparent reason. The structural alignment method, however, again introduces two gaps instead of one, with no clear explanation.

<figure id="fig:modeller_3gh5_overall">
Figure 5: Structural alignment of models created with Modeller using 3gh5 as template. Color coding is equal to the one used in Figure 1.


<figure id="fig:modeller_3gh5_loopCompare">

Figure 6: Detail of Structural alignment of models created with Modeller using 3gh5 as template. The upper beta-strand is added in both models but not present in the reference. Color coding is equal to the one used in Figure 1.


<figure id="fig:modeller_3gh5_e462">
Figure 7: Detail of Structural alignment of models created with Modeller using 3gh5 as template. Shown are the differences in positioning of E462. Color coding is equal to the one used in Figure 1.


The two models created using the low sequence identity template 3gh5 are shown in <xr id="fig:modeller_3gh5_overall" />. As expected, there are larger differences to the reference than before. While all large alpha-helices and the central beta-barrel are correctly aligned, four smaller helices and several loop regions are not aligned correctly. The important residues are mostly very well placed. E462 is slightly recessed. The increase in distance (cf. <xr id="fig:modeller_3gh5_e462"/>) could in theory already lead to a deterioration of the hydrogen bond at this position. However, this bonding was based on a substitution ligand and the real importance of this residue is not known. The same goes for N423, which is completely misaligned. The single, essential <ref name="2gjxref"/> residue E323 is correctly aligned.
The first helix is bent and too long in the direction of the N-terminus. In addition, the gap introduced in the structure based alignment leads to a premature end of the helix in the C-terminal direction. The other helix, which contains only one gap in the simple alignment and two in the structure based one, ends prematurely in both models. The loop missing in the reference structure is modelled very similarly in both models by extending the nearby beta-sheet by one strongly bent strand. Given the orientation of the strands and the preceding alpha-helix in the reference, this is clearly wrong. A detailed view of this part is shown in <xr id="fig:modeller_3gh5_loopCompare"/>.

Evaluation of alignment methods

While the templates with higher sequence similarity result in almost equal alignments with both methods, the low similarity template shows that the 2D-Aligner exhibits exactly the opposite of the expected behaviour: it makes the alignment worse by introducing more gaps that do not conserve but destroy secondary structure elements. Therefore the simple alignment method is much more convincing.

This is reflected in the models created from these alignments. While differences are only present in places where correct and incorrect are hard to judge, the models created with the simple alignment always seem slightly more convincing. In any case the ones created with the structure based method are not better than the simple ones.

Manually edited alignments

All the alignments created in the previous step were analysed with respect to the correct alignment of important residues. For the 2gk1 and 1o7a alignments there is nothing to improve upon. The only editing operations that come to mind would turn the simple method based alignment into the 2D based one and vice versa. Since this has already been discussed above, no further analysis is performed for these two templates.

The alignment of 3gh5 allowed for some editing operations. Taking the alignment produced by the simple method as a basis, all important residues except N423 are already correctly aligned. Before that residue, however, there is the alpha helix that is broken by both alignment methods. In the edited alignment this helix has been completely aligned and a gap placed before and after it, thereby only affecting loop regions. N423 is left as is, since there is no asparagine nearby that would result in a better alignment and since it is aligned to an aspartic acid, which is a likely substitution that should not affect the formation of the hydrogen bond. Finally, the last residue is not placed separately and the alignment instead ends with gaps. The resulting alignment is shown in <xr id="fig:modeller3gh5edited"/>.

<figure id="fig:modeller3gh5edited">


Figure 8: Edited Alignment of 3gh5_A with P06865 based on the simple (non-structural) alignment produced by Modeller. The helix region and single residue at the end that were changed are highlighted in red. Bold, blue residues are those deemed important (c.f. above)


The model created from the edited alignment is very similar to the model from the simple alignment it is based on. The helix that was edited so that it contains no gaps is indeed correctly modelled and no longer ends prematurely. Apart from that, however, none of the errors present in the initial model are corrected. In conclusion, the improvements only seem to extend as far as the edited region and do not propagate further into the structure. Since the most important regions were already correct in the initial alignment, the model based on the manual alignment improved only slightly.

Multiple templates

Modeller allows performing the homology modelling on the basis of several templates. Such a prediction was performed on the one hand using the two low sequence identity templates 3gh5 and 3nsm, and on the other hand combining 3gh5 with the high sequence identity template 2gk1.

The resulting model of the combination of low sequence identity templates is shown in <xr id="fig:modeller_3gh5_3nsm"/>. The model is clearly worse than the one obtained from using 3gh5 alone, although 3nsm actually has a higher sequence identity. All the active site residues, especially the very important E323, are dislocated. In addition, most major secondary structure elements are shifted, and the beta-sheet near the loop that is not present in the reference has disappeared and was substituted by a seemingly random accumulation of loops. Both original structures contain a beta-sheet in this region, and although they show several differences, it is surprising to see that not a single strand is present in the model created.

The model from combining the high and low sequence identity templates is shown in <xr id="fig:3gh5_2gk1"/>. It is very good, with even loop regions being modelled close to perfect. The above mentioned beta-strand shows some deviations, however these are minor and should not impair function in any way. Apparently Modeller recognizes the high similarity between 2gk1 and the target and puts more emphasis on this template.

<figure id="fig:modeller_3gh5_3nsm">
Figure 9: Structural alignment of models created with Modeller using 3gh5 and 3nsm as template. The reference is shown in green and the model in magenta. Important residues are highlighted as in Figure 1.


<figure id="fig:3gh5_2gk1">

Figure 10: Structural alignment of models created with Modeller using 3gh5 and 2gk1 as template. The reference is shown in green and the model in magenta. Important residues are highlighted as in Figure 1.


Score based evaluation

Looking at the structures is a very good way to assess the quality of a model, however automatic rankings are highly desirable. Therefore the models from before are evaluated in the following with several scores: TM-score, GDT_HA, GDT_TS, weighted and unweighted C-alpha RMSD, as well as a full atom RMSD considering only the residues within 6Å of the catalytically important residue E323.
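To make the difference between these scores more tangible, the following sketch computes them from a list of per-residue distances between model and reference. It is a simplification and not the TM-score program itself: the real tools search for the superposition that maximises each score, whereas here the distances are assumed to come from one fixed superposition. The function names are hypothetical. The cutoffs (1, 2, 4, 8 Å for GDT_TS and 0.5, 1, 2, 4 Å for GDT_HA) and the TM-score distance scale follow the commonly published definitions.

```python
import math


def rmsd(distances):
    """Root mean square deviation over paired residues."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def gdt(distances, cutoffs):
    """Mean fraction of residues within each distance cutoff."""
    n = len(distances)
    fractions = [sum(1 for d in distances if d <= c) / n for c in cutoffs]
    return sum(fractions) / len(fractions)


def gdt_ts(distances):
    return gdt(distances, (1.0, 2.0, 4.0, 8.0))


def gdt_ha(distances):
    return gdt(distances, (0.5, 1.0, 2.0, 4.0))


def tm_score(distances, target_length):
    """TM-score for one fixed superposition (no search over superpositions)."""
    if target_length <= 15:
        raise ValueError("distance scale d0 is only defined here for lengths > 15")
    d0 = max(1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8, 0.5)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Made-up example: 95 well modelled residues and 5 residues of a misplaced loop
example = [0.5] * 95 + [20.0] * 5
print("RMSD  ", round(rmsd(example), 3))
print("TM    ", round(tm_score(example, len(example)), 3))
print("GDT_TS", round(gdt_ts(example), 3))
print("GDT_HA", round(gdt_ha(example), 3))
```

In this made-up example the five outliers raise the RMSD to roughly 4.5 Å, while the TM-score stays at about 0.93 and both GDT values at 0.95. This illustrates why RMSD reacts much more strongly to a few badly placed loop residues than the other measures.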


The scores calculated by TM-score are shown in <xr id="tbl:modeller_tmscore"/>. Generally the scores behave as expected given the previous analyses of the model structures: the high sequence identity template yields very good results, the medium identity template only slightly worse ones, and there is a strong deterioration for the template with low sequence identity.
For the high identity template, there are only small differences between the two models assessed. From the scores alone it is not possible to favour one model over the other, since TM-score and GDT-TS are almost equal, RMSD favours the simple alignment based model and GDT-HA favours the 2D-Aligner based model. The simple alignment based model seemed more convincing in the visual analysis, but since the loop region that led to this conclusion is not part of the reference structure, none of the scores can capture this. In addition, the question remains whether the more convincing loop modelling really is the correct one.
It can be observed that the RMSD is already comparably bad for the medium identity model, although the visual analysis suggested that this model is mostly very good. This fact is better captured by the TM-score and GDT-TS, while the criteria for the more rigorous <ref name="gdtha">Read,R. (2007) Assessment of CASP7 predictions in the high accuracy template‐based modeling category. Proteins: Structure, Function, and, 27-37.</ref> GDT-HA already seem to be too strict. In the visual comparison the model based on the simple alignment seemed more convincing. This is, if at all, only captured by the common residue RMSD.
For the low sequence identity model it can be seen that the TM-score is comparably high. Since it was noted in the visual analysis that the largest deviations are found in loops and this measure is less sensitive to deviations in loops, this is to be expected. However, GDT should also be less sensitive to loop deviations and reports a significantly larger drop in modelling performance. RMSD is definitely affected too much by the loops and is not a suitable measure for models such as this one, which come from a lower identity template. The visual analysis of the model based on the manually edited alignment showed no major improvements over the simple alignment model it was based upon. This is reflected in the scores, which are very similar between the two models.

<figtable id="tbl:modeller_tmscore">

Model | Residues in common | Common residue RMSD (Å) | TM-score | GDT-TS | GDT-HA
2gk1 | 492 | 0.446 | 0.9968 | 0.9970 | 0.9472
2gk1 2D | 492 | 0.531 | 0.9972 | 0.9970 | 0.9802
1o7a | 492 | 1.521 | 0.9784 | 0.9482 | 0.8262
1o7a 2D | 492 | 1.642 | 0.9765 | 0.9472 | 0.8293
3gh5 | 492 | 8.005 | 0.7614 | 0.5762 | 0.4192
3gh5 2D | 492 | 7.219 | 0.7792 | 0.5971 | 0.4334
3gh5 Edited | 492 | 7.696 | 0.7551 | 0.5579 | 0.4004
Table 2: Calculated TM scores for models created with Modeller.


SAP Scores and Pymol

The SAP scores are shown in <xr id="tbl:modeller_sap"/>. The increase in the unweighted RMSD when descending to lower sequence identity templates is comparable to that of the full atom RMSD shown above in <xr id="tbl:modeller_tmscore"/>. However, the absolute values are lower, suggesting that the backbone is better approximated than it might seem from the full atom RMSD. The weighted RMSD, which gives less importance to regions of lower similarity <ref name="sap">Taylor,W.R. and Stoye,J.P. (2004) Consensus structural models for the amino terminal domain of the retrovirus restriction gene Fv1 and the murine leukaemia virus capsid proteins. BMC structural biology, 4, 1.</ref>, delivers more useful results for the medium identity template. For this template the full atom RMSD already seemed fairly high, which did not agree with the visual analysis that found these two models to be mostly well aligned to the reference. However, for the low sequence identity template the weighted RMSD remains very low as well. Given that there are not only loop regions but also several small secondary structure elements deviating in these two models, this score actually seems to be too optimistic.
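The effect of down-weighting poorly matching regions can be sketched with a generic weighted RMSD. This is not the weighting scheme implemented in SAP. The function name, the weights and the distances below are hypothetical and only illustrate why a weighted RMSD can stay low even when some regions deviate strongly.

```python
import math


def weighted_rmsd(distances, weights):
    """Weighted RMSD: sqrt(sum(w_i * d_i^2) / sum(w_i))."""
    if len(distances) != len(weights):
        raise ValueError("one weight per distance is required")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(w * d * d for d, w in zip(distances, weights))
    return math.sqrt(weighted_sum / total_weight)


# Made-up distances in Angstrom: two well matched residues and one outlier
distances = [1.0, 1.0, 10.0]
print(weighted_rmsd(distances, [1.0, 1.0, 1.0]))   # unweighted
print(weighted_rmsd(distances, [1.0, 1.0, 0.1]))   # outlier down-weighted
```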

The all atom RMSD calculated only for atoms within 6Å of the important residue E323 behaves as expected from the previous visual analysis. This region is very well modelled with all alignments. While the RMSD is slightly higher in the models from lower sequence identity templates, the deviation is not so strong that one would expect a detrimental effect on the protein's function. The high conservation even in distantly related templates also supports the importance of this residue.

<figtable id="tbl:modeller_sap">

Model | Weighted RMSD | Unweighted RMSD (all matched) | RMSD 6Å around active site
2gk1 | 0.365 | 0.434 | 0.049
2gk1 2D | 0.246 | 0.304 | 0.122
1o7a | 0.509 | 0.980 | 0.155
1o7a 2D | 0.496 | 0.862 | 0.221
3gh5 | 1.132 | 2.671 | 0.607
3gh5 2D | 1.060 | 3.147 | 0.510
3gh5 Edited | 1.085 | 3.261 | 0.459
Table 3: RMSD scores for Modeller models calculated with SAP and Pymol.
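The local RMSD around E323 was calculated with Pymol. The idea behind it can be sketched in plain Python as follows. The sketch assumes that model and reference are already superimposed and that atoms can be matched by residue number and atom name. The data layout, the function name and the coordinates are hypothetical.

```python
import math


def local_rmsd(ref_atoms, model_atoms, center_resnum=323, radius=6.0):
    """All-atom RMSD over reference atoms within `radius` of the centre residue.

    ref_atoms, model_atoms: lists of (residue number, atom name, (x, y, z)).
    Assumption: both structures are already superimposed.
    """
    center = [xyz for resnum, _, xyz in ref_atoms if resnum == center_resnum]
    if not center:
        raise ValueError("centre residue not found in reference")
    model_by_key = {(resnum, name): xyz for resnum, name, xyz in model_atoms}
    squared = []
    for resnum, name, xyz in ref_atoms:
        near = any(math.dist(xyz, c) <= radius for c in center)
        if near and (resnum, name) in model_by_key:
            squared.append(math.dist(xyz, model_by_key[(resnum, name)]) ** 2)
    if not squared:
        raise ValueError("no matching atoms in the selected region")
    return math.sqrt(sum(squared) / len(squared))


# Made-up coordinates: residue 400 is far away and deviates strongly,
# but does not enter the local RMSD
reference = [(323, "CA", (0.0, 0.0, 0.0)), (324, "CA", (3.0, 0.0, 0.0)),
             (400, "CA", (20.0, 0.0, 0.0))]
model = [(323, "CA", (0.0, 0.0, 1.0)), (324, "CA", (3.0, 0.0, 1.0)),
         (400, "CA", (20.0, 0.0, 9.0))]
print(local_rmsd(reference, model))
```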




The SWISS-MODEL server provides as output a model in pdb format and the corresponding target-template alignment. In addition it offers various scores and functions for the model evaluation.
ANOLEA is an atomic empirical mean force potential that is used to perform energy calculations on a protein chain. Negative energy values (in green) represent a favourable energy environment, whereas positive values (in red) represent an unfavourable energy environment.
There is also a local model reliability score, the residue error computed along the sequence.
QMEAN is a composite scoring function for both the estimation of the global quality of the entire model as well as for the local per-residue analysis of different regions within a model.
QMEAN4 is a scoring function to describe the model quality which can be used in order to compare and rank alternative models of the same target. It is a linear combination of the 4 statistical potential terms C_beta interaction energy, all-atom pairwise energy, solvation energy and torsion angle energy. Hereby the QMEAN raw score ranges from 0 to 1 and indicates the reliability of the model. The QMEAN Z-score represents the absolute quality of the model by describing the likelihood that a given model is of comparable quality to experimental structures. It is calculated by comparison to reference structures and has a range of -4 to 4; the smaller the value the worse the model quality <ref name="swissmodel">Arnold K., Bordoli L., Kopp J., and Schwede T. (2006). The SWISS-MODEL Workspace: A web-based environment for protein structure homology modelling. Bioinformatics, 22,195-201.</ref>. For a more detailed explanation, see [1].
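The idea of a Z-score can be sketched as the raw score of a model expressed in standard deviations from the mean of a reference set. This is only the generic definition. How QMEAN selects and normalises its reference structures is not reproduced here, and the reference values below are made up.

```python
import statistics


def z_score(raw_score, reference_scores):
    """Number of standard deviations between raw_score and the reference mean."""
    mean = statistics.mean(reference_scores)
    sd = statistics.stdev(reference_scores)  # sample standard deviation
    if sd == 0:
        raise ValueError("reference scores must not all be identical")
    return (raw_score - mean) / sd


# Hypothetical reference set of raw scores from experimental structures
reference_scores = [0.70, 0.75, 0.80]
print(round(z_score(0.68, reference_scores), 2))
```

Under this definition a negative Z-score only states that the model scores below the average of the reference set. Experimental structures themselves scatter around zero, so a moderately negative value does not by itself indicate a poor model.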

Default Modelling

2gjx chain E is automatically assigned as template. This is not surprising, as 2gjx_A is the reference and all the chains are virtually identical. The resulting alignment between target and template consists of identical matches. Only in the first region a loop is missing in the template and therefore a gap is inserted (see <xr id="fig:2gjxali"/>).

<figure id="fig:2gjxali">

TARGET 1 LWPWPQNF QTSDQRYVLY PNNFQFQYDV SSAAQPGCSV LDEAFQRYRD
2gjxE 23 lwpwpqnf qtsdqryvly pnnfqfqydv ssaaqpgcsv ldeafqryrd
TARGET ss sss sssss hh hhhhhhhhhh
2gjxE ss sss sssss hh hhhhhhhhhh

TARGET 49 LLFGSGSWPR PYLTGKRHTL EKNVLVVSVV TPGCNQLPTL ESVENYTLTI
2gjxE 71 llfg------ --------tl eknvlvvsvv tpgcnqlptl esvenytlti
TARGET hh ! sssss ssssss
2gjxE hh sssss ssssss

Figure 11: Alignment of 2gjx_E with P06865. The gap is highlighted with an exclamation mark.


For the whole alignment see Swissmodel 2gjx alignment.
<figure id="fig:2gjxanolea">

Figure 12: ANOLEA score for the first region of the 2gjx model.

</figure> As the 2gjx chain is our chosen reference structure, it is expected to provide a perfect model. ANOLEA displays only one weakness of the model, which is found at the beginning between positions 70 and 90 (see <xr id="fig:2gjxanolea"/>). This is also supported by QMEAN, which signals a potential error at this site. This is the only gap region in the alignment and thus it receives little support from the template structure.
The further scores (see <xr id="tab:swissscores2gjx"/>) indicate a good quality of the model: the energy values are comparably low, which signals a favourable structure, and the QMEAN4 score of 0.68 ranks the model well. With this "perfect match" template it is surprising that the Z-scores are negative, as negative values indicate a lower quality than that of experimental structures.

<figtable id="tab:swissscores2gjx">

Scoring function term | Raw score | Z-score
C_beta interaction energy | -106 | -0.99
All-atom pairwise energy | -13340 | -0.33
Solvation energy | -37 | -0.62
Torsion angle energy | -93 | -1.62
QMEAN4 score | 0.68 | -1.63
Table 4: Scores for default modelling of P06865 by SWISS-MODEL.


<figure id="fig:2gjxswissmodel">

Figure 13: SWISS-MODEL for the Hex A structure computed with the 2gjx template in comparison to the reference. The reference is displayed in green, and for the model the error color-coding from SWISS-MODEL is adopted, meaning that confident residues are colored blue and the higher the estimated error, the more red the residues are displayed. The active site is shown in yellow and the important H-bond residue in orange.

</figure> In <xr id="fig:2gjxswissmodel"/> the computed model is displayed, colored according to error from blue to red, together with the reference, which is shown in green. It is clear that the model fits the reference perfectly, which was expected as the template is the reference structure, just with another chain (template 2gjx_E, reference 2gjx_A). The only region where the model and the reference do not correspond is the loop region colored in red, which belongs to the gap in the template structure (see <xr id="fig:2gjxali"/>) and was correctly detected by the SWISS-MODEL error scores. The active site and the important residues are matched perfectly by the model. As the model is very accurate, the negative Z-scores remain difficult to explain.

High sequence identity

The model with 2gk1 as template is very much like the previous model, which is not surprising as the templates have 100% sequence identity. The alignment of the Hex A subunit and the template shows the same gap as above (for the whole alignment see Swissmodel 2gk1 alignment). This alignment also corresponds to the structural alignment from Modeller. <figure id="fig:2gk1anolea">

Figure 14: ANOLEA score for the first region of the 2gk1 model.

</figure> The ANOLEA and QMEAN scores are very good for the model, again except for the first region, where there are some signs of lower model quality (see <xr id="fig:2gk1anolea"/>). The scores are again very similar to those of the 2gjx template (see <xr id="tab:swissscores2gk1"/>), which confirms the quality of the model. The Z-scores are also negative but show a better overall tendency. It could be that individual geometrical features are slightly improved in this model.

<figtable id="tab:swissscores2gk1">

Scoring function term | Raw score | Z-score
C_beta interaction energy | -122 | -0.81
All-atom pairwise energy | -13468 | -0.34
Solvation energy | -44 | -0.12
Torsion angle energy | -109 | -1.06
QMEAN4 score | 0.69 | -0.96
Table 5: Scores for the model of P06865 with 2gk1 as template by SWISS-MODEL.


<figure id="fig:2gkiswissmodel">

Figure 15: SWISS-MODEL derived from the 2gk1 template in comparison to the reference. The reference is displayed in green and the model is shown in the error color-coding from SWISS-MODEL: confident residues are colored blue and the higher the estimated error, the more red the residues are displayed. The active site is shown in yellow and the important H-bond residue in orange.


The model with the reference is displayed in <xr id="fig:2gkiswissmodel"/>: There is very little red coloring which shows a low error. The only outlier is the wrongly assigned helix. Besides this the model is very accurate and conserves the active site as well as the important residues.

Medium Sequence identity

The alignment for the modelling with 1o7a as template matches well overall, as in the previous cases, but again most gaps appear at the beginning of the alignment (see <xr id="fig:1o7aali"/>).

<figure id="fig:1o7aali">

TARGET 1 TALWPWPQ NFQTSDQRYV LYPNNFQFQY DVSSAAQPGC SVLDEAFQRY
1o7aD 54 palwplpl svkmtpnllh lapenfyish spnstagpsc tlleeafr--
TARGET sssss ssss s ssss s hhhhhh
1o7aD sssss ssss s ssss s hhhhhhhh

TARGET 49 RDLLFGSGSW PRPYLTGKRH TLEKNVLVVS VVTPGCNQLP TLESVENYTL
1o7aD 100 -----ryhgy ifgtqvq--- q---llvsi- tlqsecdafp nissdesytl
TARGET sss hhhh sss ssssss sss
1o7aD hhhhh h s sssss s sss

Figure 16: Alignment of 1o7a with P06865.


For the whole alignment see Swissmodel 1o7a alignment.

<figure id="fig:1o7aanolea">

Figure 17: ANOLEA score for the first region of the 1o7a model.

</figure> ANOLEA signals a high error for the gapped alignment region (<xr id="fig:1o7aanolea"/>). A look at the Modeller alignment makes clear that this region is rather difficult to align, as the two alignments place different gaps. While the Modeller alignment contains only one big gap, SWISS-MODEL inserts multiple smaller gaps.
The scores are a little worse than for the model with the very high sequence identity, see <xr id="tab:swissscore1o7a"/>. The energies are slightly worse, the QMEAN4 lower and the Z-score more negative.

<figtable id="tab:swissscore1o7a">

Scoring function term | Raw score | Z-score
C_beta interaction energy | -97 | -1.15
All-atom pairwise energy | -11181 | -0.99
Solvation energy | -29 | -1.24
Torsion angle energy | -72 | -2.33
QMEAN4 score | 0.594 | -2.73
Table 6: Scores for 1o7a modelling of P06865 by SWISS-MODEL.


<figure id="fig:1o7aswissmodel">

Figure 18: SWISS-MODEL derived from the 1o7a template in comparison to the reference. The reference is displayed in green and the model is shown in the error color-coding from SWISS-MODEL: confident residues are colored blue and the higher the estimated error, the more red the residues are displayed. The active site is shown in yellow and the important H-bond residue in orange.


The model in comparison to the reference is shown in <xr id="fig:1o7aswissmodel"/>. It stands out that there is more error than in the previous models, as there is more red coloring in the model structure. The loop regions in particular point in wrong directions. The erroneous helix on the right of the figure, colored red, stands out as the greatest mistake of the model. It corresponds to the gap region of the alignment (see <xr id="fig:1o7aali"/>), which is also highlighted in red. It seems that the alignment of this region, and especially the placement of the gaps, is erroneous and thus the model is not accurate there. Apart from that, the important residues are all in their right place and overall the model fits the reference quite well.

Alignment mode

The model from the 1o7a template is the first one to have a different alignment than the one proposed by Modeller. Thus it is very interesting to find out which alignment leads to a better structure and therefore the SWISS-MODEL was employed with the alignment from Modeller. This alignment is different in the way that there is only one big gap and not several small ones in the first region (see <xr id="fig:1o7a_ali"/> and <xr id="fig:1o7aali"/>).

<figure id="fig:1o7a_ali">

TARGET 21 TALWPWPQ NFQTSDQRYV LYPNNFQFQY
1o7aD 54 palwplpl svkmtpnllh lapenfyish
TARGET sssss ssss s ssss
1o7aD sssss ssss s ssss

TARGET 49 DVSSAAQPGC SVLDEAFQRY RDLLFGSGSW PRPYLTGKRH TLEKNVLVVS
1o7aD 82 spnstagpsc tlleeafrry hgyifg---- ---------- tqvqqllvsi
TARGET s hhhhhhhhhh hhhh ssssss
1o7aD s hhhhhhhhhh hhhh ssssss

Figure 19: Alignment of 1o7a with P06865 as produced by Modeller and used as input for the SWISS-MODEL alignment mode.
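The difference between the two 1o7a alignments in this region can be made explicit by listing the gap runs in the template row. The sketch below uses the relevant template segments copied from Figure 16 (SWISS-MODEL) and Figure 19 (Modeller), with the spaces between blocks removed. The function name is hypothetical.

```python
def gap_runs(aligned_seq):
    """Return (start index, length) for every run of '-' in an aligned sequence."""
    runs = []
    start = None
    for i, ch in enumerate(aligned_seq):
        if ch == "-":
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # sequence ends inside a gap
        runs.append((start, len(aligned_seq) - start))
    return runs


# Template (1o7a) rows for the same region, taken from Figures 16 and 19
swissmodel_1o7a = "tlleeafr--" "-----ryhgy" "ifgtqvq---" "q---llvsi-" "tlqsecdafp"
modeller_1o7a = "tlleeafrry" "hgyifg----" "----------" "tqvqqllvsi"

for name, row in (("SWISS-MODEL", swissmodel_1o7a), ("Modeller", modeller_1o7a)):
    runs = gap_runs(row)
    print(name, len(runs), "gap(s), total length", sum(length for _, length in runs))
```

Both rows contain the same total number of gap positions in this excerpt. SWISS-MODEL distributes them over four runs, whereas Modeller places them in a single run.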


Indeed the scores as shown in <xr id="tab:swissscore1o7a_a"/> are better than the scores from the SWISS-MODEL alignment. The energy values are lower and the Z-score higher. The values are more comparable to those received from the models above with templates of high sequence identity.

<figtable id="tab:swissscore1o7a_a">

Scoring function term | Raw score | Z-score
C_beta interaction energy | -126 | -0.78
All-atom pairwise energy | -13033 | -0.48
Solvation energy | -35 | -0.78
Torsion angle energy | -90 | -1.67
QMEAN4 score | 0.648 | -1.87
Table 7: Scores for the modelling of P06865 by SWISS-MODEL in alignment mode, using the Modeller alignment with 1o7a.


<figure id="fig:1o7a_alswissmodel">

Figure 20: SWISS-MODEL model derived from the target-1o7a alignment in comparison to the reference. The reference is displayed in green; the model is shown in the error color-coding from SWISS-MODEL: confident residues are colored blue, and the higher the estimated error, the redder the residues are displayed. The active site is shown in yellow and the important H-bond residue in orange.


<xr id="fig:1o7a_alswissmodel"/> with the superimposed reference confirms it: the alignment provided by Modeller suits the structure of the Hex A subunit better than the alignment computed by SWISS-MODEL. There is very little error and no helices or sheets are erroneously modelled. This model is almost equal to the first one, which was computed with the reference sequence itself.

Low sequence identity

With 3gh5 as template the automated SWISS-MODEL was not able to calculate a model structure for the Hex A subunit. Two alignments were produced, one with Blast and one with HHsearch. While the quality of the Blast alignment between target and template was too low to begin with, the HHsearch alignment was passed on to the modelling stage, but no model could be built.
The same occurred for 3nsm, which has a sequence identity about 7 percentage points higher than 3gh5. A sequence identity lower than 30% seems to be too low for SWISS-MODEL.
As SWISS-MODEL performed better with the Modeller alignment, this was also tried for 3gh5 and 3nsm. Unfortunately SWISS-MODEL did not yield any results here either.


The main scores provided by SWISS-MODEL for all successful models are shown in <xr id="tab:swissOwneval"/>. The QMEAN raw scores are all in the same range; only the model from the 1o7a template in automated mode is slightly worse than the rest. The Z-scores are all negative, which can be explained by the diverse single atom RMSDs, which can have a strong negative effect on this measure. The model from 2gk1 has the highest Z-score and should therefore be considered the model of highest quality according to the internal scoring.

<figtable id="tab:swissOwneval">

QMEAN raw score QMEAN Z-score
2gjx 0.658 -1.63
2gk1 0.698 -0.96
1o7a 0.594 -2.73
Alignment mode 1o7a 0.648 -1.87
Table 8: Scores provided by SWISS-MODEL.


The RMSD as well as the GDT-TS, GDT-HA and TM scores all rate the 1o7a model as the worst and the 2gk1 model as the best (see <xr id="tab:swisseval"/>), which corresponds to the scores calculated by SWISS-MODEL. This is a little surprising, as the actual reference is 2gjx, yet the model created with it as template does not outperform the 2gk1 model, which seems to be more suitable for modelling P06865. The difference is small, though, as 2gjx, 2gk1 and P06865 share 100% sequence identity. All scores rank the model from 1o7a with the Modeller alignment better than the model without it. The RMSD exhibits the greatest deviations, which shows its sensitivity to erroneous outlier loop regions. The GDT-HA score also separates the high sequence identity models from the medium sequence identity models, showing a stronger variation than the GDT-TS score. On the one hand it is justifiable that the 1o7a models are ranked worse, as they do not have the same number of shared residues; on the other hand the models from 1o7a are also very exact. Thus the GDT-TS score could be the most appropriate one in this comparison.

<figtable id="tab:swisseval">

Residues in common Common residue RMSD TM GDT-TS GDT-HA
2gjx 492 0.573 0.995 0.983 0.924
2gk1 492 0.213 0.999 1.000 0.999
1o7a 486 2.411 0.952 0.913 0.802
Alignment mode 1o7a 487 1.344 0.973 0.950 0.834
Table 9: Calculated TM scores.
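The different behaviour of RMSD, GDT-TS and GDT-HA towards outlier loops can be illustrated with a small sketch. It is not the program used for the table above: it assumes that model and reference are already superimposed and that the per-residue C-alpha distances are given, whereas the real GDT calculation searches over many superpositions. The distance values are hypothetical and only chosen to mimic a well-modelled core with one misplaced loop.

```python
import math

# Distance cut-offs in Å as commonly used for the two GDT variants.
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)
GDT_HA_CUTOFFS = (0.5, 1.0, 2.0, 4.0)


def rmsd(distances):
    """Root mean square of per-residue distances (fixed superposition)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def gdt(distances, cutoffs):
    """Average, over all cut-offs, of the fraction of residues within the cut-off."""
    n = len(distances)
    fractions = [sum(1 for d in distances if d <= c) / n for c in cutoffs]
    return sum(fractions) / len(fractions)


# Hypothetical example: 95 core residues at 0.7 Å and a 5-residue loop at 12 Å.
core = [0.7] * 95
loop = [12.0] * 5

for label, dist in (("core only", core), ("core + loop", core + loop)):
    print(label,
          "RMSD=%.3f" % rmsd(dist),
          "GDT-TS=%.3f" % gdt(dist, GDT_TS_CUTOFFS),
          "GDT-HA=%.3f" % gdt(dist, GDT_HA_CUTOFFS))
```

In this toy case the five loop residues raise the RMSD almost fourfold, while GDT-TS only drops by the fraction of residues affected. GDT-HA is lower than GDT-TS even for the core alone, because its strictest cut-off of 0.5 Å is missed by every residue.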


For completeness the SAP scores and the RMSD within a 6 Å radius around the catalytic site are shown in <xr id="tab:swissevalsap"/>. The results correlate with those from <xr id="tab:swissOwneval"/> as well as <xr id="tab:swisseval"/>. 2gk1 receives the best values and 1o7a the worst. It should be noted here that the model from 2gk1 stands out as the best by all scores, although it has a completely erroneous helix. The weighted RMSD makes no distinction between the 1o7a models and the rest and is thus comparable to the GDT-TS score, except for the high rank it gives to 2gk1. The RMSD around the catalytic site behaves slightly differently, as here the 2gjx model is assigned by far the worst RMSD.

<figtable id="tab:swissevalsap">

Matched atoms Weighted RMSD Unweighted RMSD RMSD 6Å around active site
2gjx 492 0.414 0.573 0.590
2gk1 491 0.189 0.212 0.131
1o7a 471 0.486 1.455 0.291
Alignment mode 1o7a 473 0.484 0.744 0.205
Table 10: Calculated RMSD scores with SAP and Pymol.
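The idea behind the last column, an RMSD restricted to a 6 Å sphere around the active site, can be sketched as follows. The values in the table were obtained with SAP and Pymol; the code below is not that procedure but a simplified illustration that assumes the two structures are already superimposed and that their atoms are paired one-to-one. All coordinates are hypothetical toy values.

```python
import math


def within_radius(coords, centres, radius=6.0):
    """Indices of atoms lying within `radius` Å of any centre atom."""
    return [i for i, p in enumerate(coords)
            if any(math.dist(p, c) <= radius for c in centres)]


def rmsd(a, b):
    """RMSD between two equally long, already superimposed coordinate lists."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


# Hypothetical reference: 10 atoms on a line, 3 Å apart.
reference = [(float(x), 0.0, 0.0) for x in range(0, 30, 3)]
# Hypothetical model: every atom shifted by 0.3 Å, the last one far off (a "loop").
model = [(x, y + 0.3, z) for x, y, z in reference]
model[-1] = (27.0, 5.0, 0.0)

# Assume the third atom represents an active site residue.
active_centres = [reference[2]]
site = within_radius(reference, active_centres)

global_rmsd = rmsd(reference, model)
local_rmsd = rmsd([reference[i] for i in site], [model[i] for i in site])
print("atoms in sphere:", site)
print("global RMSD: %.3f  active site RMSD: %.3f" % (global_rmsd, local_rmsd))
```

Because the outlier lies outside the sphere, the local RMSD stays at the small uniform shift while the global RMSD is dominated by the single misplaced atom. This is why the active site column can rank the models differently from the whole-structure columns.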


<figtable id="tab:activesite">

TSD activesite default.png
TSD activesite 2GK1.png
Table 11: 6 Å around active site from default modelling (left) and modelling with template 2gk1 (right).


The direct comparison of the active site modelling from the 2gjx and the 2gk1 models is displayed in <xr id="tab:activesite"/>. It is easy to see why the model from 2gjx (left) has the worst RMSD in the catalytic site region: the residues point in the same directions but are disrupted otherwise. The modelling from 2gk1 (right) seems to fit the active site very well. Here the residues overlay accurately, which leads to an RMSD of merely 0.131.

Overall, all measures correlate and are well suited for evaluation. For the modelling performed here a less strict measure seems preferable, so that an exact model such as the one from 1o7a is recognised as such.
The performance of SWISS-MODEL was satisfactory, provided that a model could be created at all. The high sequence identity that SWISS-MODEL requires could make it difficult to model structures of new proteins if no similar sequences are available.



iTasser returns 5 models together with the corresponding templates and scores. In addition it provides various predictions, e.g. for secondary structure or GO terms.
There are 3 scores for evaluation: RMSD, TM-score and C-score.
The C-score is a confidence score for estimating the quality of predicted models. It lies in the range [-5,2], where a higher C-score indicates a model with higher confidence<ref name="itasser">A Roy, A Kucukural, Y Zhang (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nature Protocols, vol 5, 725-738 . </ref>. For more detailed information see [2].

As presenting all 5 models would go beyond the scope of this page, the models are evaluated with respect to superposition, C-score, TM-score and GDT-HA score, and only the best model is presented. This is possible in particular because the C-score ranking correlates strongly with the TM and GDT scores, even for the low sequence similarity modelling.
The complete results can for now be viewed here: default, 2gk1, 1o7a, 3gh5.


The chosen templates were 2gjxA, 3s6tA, 2gk1A, 1o7aA and 1nowA.
The best model is the first one, with an iTasser C-score of 0.04, an RMSD of 7.3 and a TM-score of 0.72. The C-score and TM-score predict a reliable model, but the high RMSD value somewhat contradicts that. <xr id="fig:default_itasser"/> shows the model superimposed on the reference. Overall it is an accurate model and seems to fit the reference structure very well apart from some loop outliers. A closer look at the catalytic site shows that the structures do not overlay perfectly but are slightly displaced (see <xr id="fig:defaultactive_itasser"/>).

<figure id="fig:default_itasser">
Figure 21: iTasser default model (purple) in comparison to the reference (green). The important residues are colored yellow, the active site orange.


<figure id="fig:defaultactive_itasser">
Figure 22: Active site with 6 Å radius.


High sequence identity

The input template was 2gk1 and chosen were again the templates 2gjxA, 3s6tA, 2gk1A, 1o7aA and 1nowA.
The most accurate model is model 1 with the scores C-score=1.29, RMSD=4.7 and TM-score=0.89. The C-score is even better than in the default run and the RMSD has also improved a lot. Overall all models here are very accurate and resemble each other. This is remarkable, as the top 10 templates are the same as in the default run. The model fits the reference structure very well at first sight (see <xr id="fig:2gk1_itasser"/>), but as <xr id="fig:2gk1active_itasser"/> shows, the active site is a little more shifted than in the default run. The Asp262 from the reference (green ring structure) even stands out without a match from the model.

<figure id="fig:2gk1_itasser">
Figure 23: iTasser default model (purple) in comparison to the reference (green). The important residues are colored yellow, the active site orange.


<figure id="fig:2gk1active_itasser">
Figure 24: Active site with 6 Å radius.


Medium sequence identity

The input template was 1o7a and the employed templates were 1o7aA, 1nowA, 3s6tA, 3nsmA and 3ozpA.
Again the best model is the first one, with C-score=0.87, RMSD=5.6 and TM-score=0.83. The C-scores of the models vary from 0.87 to -2.72, but the TM and GDT scores are all equally good. <xr id="fig:1o7a_itasser"/> shows how closely the model fits the reference. The important residues all agree with the reference and the active site is approximated satisfactorily, see <xr id="fig:3gh5active_itasser"/>.

<figure id="fig:1o7a_itasser">
Figure 25: iTasser 1o7a model (blue) in comparison to the reference (green). The important residues are colored yellow, the active site orange.


<figure id="fig:3gh5active_itasser">
Figure 26: Active site with 6 Å radius.


Low sequence identity

The input template was 3gh5 and the chosen templates were 3rcnA, 3gh7A and 1c7sA. It is puzzling that the input template was not chosen among the top 10 templates, but it is possible that the employed 30% sequence identity cut-off caused the template to be overridden.
The best model is model 1 with a C-score of 0.30, an RMSD of 6.8 and a TM-score of 0.75. Given such a low sequence similarity it is striking that the scores deteriorate only slightly. As displayed in <xr id="fig:3gh5_itasser"/>, the model agrees well with the reference and shows no drastic outliers. A closer look reveals that the active site is a little more shifted than in the previous models (see <xr id="fig:3gh5active_itasser"/>), but it still resembles the reference well enough for the model to be considered valid and useful.

<figure id="fig:3gh5_itasser">
Figure 27: iTasser 3gh5 model (red) in comparison to the reference (green). The important residues are colored yellow, the active site orange.


<figure id="fig:3gh5active_itasser">
Figure 28: Active site with 6 Å radius.



As all models from iTasser look comparably good at first sight, it is especially important to employ different scores in order to check the model quality. According to the internal iTasser evaluation (see <xr id="tab:itasserOwneval"/>), the 2gk1 model scores best and the 3gh5 and default models score worst. It is very hard to see why a model built from the same templates as the 2gk1 model is assigned such poor scores.

<figtable id="tab:itasserOwneval">

C-score RMSD TM-score
Default 0.04 7.3 0.72
2gk1 1.29 4.7 0.89
1o7a 0.87 5.6 0.83
3gh5 0.30 6.8 0.75
Table 12: Scores provided by iTasser.


The TM and GDT scores computed against the experimental structure (see <xr id="tab:itassereval"/>) contradict the ranking of the iTasser evaluation in <xr id="tab:itasserOwneval"/>. Here the two high sequence similarity templates score much better than the lower sequence identity templates. These groups are separated not only by the RMSD measure, which is known to react strongly to erroneous outliers, but also by the GDT-HA score. While the TM score and the GDT-TS score attribute a similar quality to the 1o7a model as to the high similarity models, the 3gh5 model receives significantly lower values. Nonetheless the scores indicate a satisfactory quality for all models.

<figtable id="tab:itassereval">

Residues in common Common residue RMSD TM GDT-TS GDT-HA
Default 492 0.573 0.996 0.997 0.936
2gk1 492 0.530 0.995 0.993 0.913
1o7a 492 2.480 0.966 0.917 0.737
3gh5 492 5.837 0.853 0.678 0.488
Table 13: Calculated TM scores.


In addition the SAP RMSD scores yield the same ranking as the TM and GDT scores, see <xr id="tab:itasserevalsap"/>. The first two models are the best with an RMSD around 0.4, while the 1o7a model is a little worse and the 3gh5 model receives an RMSD above 1. Here again, as in the SWISS-MODEL evaluation, the weighted RMSD correlates with the GDT-TS score and the unweighted RMSD with the stricter GDT-HA score. The catalytic site region follows the same pattern as the whole structures, and here the default model seems to fit the reference most accurately.

<figtable id="tab:itasserevalsap">

Weighted RMSD Unweighted RMSD RMSD 6Å around active site
Default 0.392 0.444 0.516
2gk1 0.434 0.469 0.699
1o7a 0.625 1.432 1.268
3gh5 1.156 2.767 2.716
Table 14: Calculated RMSD scores with SAP and Pymol.


Altogether the internal iTasser scoring agrees with the TM and GDT measures against the reference structure within one run, meaning that the five models are ranked in the same order. Between the models of different runs, however, the iTasser evaluation failed to score the models accurately for the tested templates. Furthermore the RMSD estimated by iTasser seems to be consistently too high.
The models from iTasser nevertheless show good quality, even for the low similarity templates.
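The disagreement between runs can be quantified with a rank correlation between the iTasser C-scores (Table 12) and the TM-scores measured against the reference (Table 13). The following is a minimal sketch using the plain Spearman formula without tie handling; the function names are chosen for illustration only, and with just four data points the result is an illustration rather than a statistical statement.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


runs = ["Default", "2gk1", "1o7a", "3gh5"]
c_score = [0.04, 1.29, 0.87, 0.30]           # Table 12, estimated by iTasser
tm_measured = [0.996, 0.995, 0.966, 0.853]   # Table 13, against the reference

rho = spearman(c_score, tm_measured)
print("Spearman correlation between C-score and measured TM-score: %.2f" % rho)
```

The slightly negative value is mainly caused by the default run, which receives the lowest C-score but the highest measured TM-score.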

Evaluation of scores

Generally all scores are correlated, although the quality scores that are only estimates made by the methods themselves sometimes fall out of order. This is not surprising, as these calculations are conducted without the actual reference. Thus the overall error and quality assessment of all methods can be credited with a good performance. The GDT score is more robust towards deviating loops than the RMSD, which can be very sensitive to single outliers. The GDT-HA score shows some of the same sensitivity, as it has stricter distance cut-offs. In some parts it seemed too strict already and GDT-TS was more helpful. The TM-score behaves comparably but seems to be more optimistic, although not to a degree where one could clearly favour one score over the other.
The unweighted and especially the weighted C-alpha RMSD scores are also less sensitive towards deviating loops; however, they were too optimistic for the low resolution models. The RMSD score for the active site region proved to be very helpful, but it can only act as a supplement, since a correct active site alone does not prove that the protein would actually function as a whole.
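Why the TM-score is comparatively tolerant towards deviating loops follows from its definition: every residue contributes a value between 0 and 1 that decays smoothly with its distance, scaled by a length-dependent d0. The sketch below uses the commonly used formula d0 = 1.24 (L - 15)^(1/3) - 1.8. It assumes one fixed superposition with given per-residue distances, whereas the actual TM-score is the maximum over superpositions, and the distance values are hypothetical.

```python
def tm_d0(length):
    """Length-dependent distance scale in Å (intended for length > 15)."""
    return 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score for one fixed superposition, normalised by the target length."""
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Hypothetical example with the 492 common residues of the tables above:
# 470 residues at 0.5 Å and 22 loop residues at 10 Å.
length = 492
distances = [0.5] * 470 + [10.0] * 22

print("d0 for L=%d: %.2f Å" % (length, tm_d0(length)))
print("TM-score: %.3f" % tm_score(distances, length))
```

A residue that is 10 Å off still contributes a sizeable fraction of a full residue for a protein of this size, so a few misplaced loops lower the TM-score only moderately.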

In conclusion GDT-TS and TM-score seemed to be the most stable and helpful scores complemented by the RMSD for the active site region. However, none of them can convey the information gained from visually observing the structure.

Evaluation of modelling methods

Modeller performed very well at all levels of sequence identity. While the loops deteriorate more the lower the sequence identity becomes, the active site regions remained mostly well aligned, especially the important residue E323. The choice between the simple alignment method and the structure-based one had mostly little effect on the models that could actually be judged (the most deviating part is not resolved in the reference). In total the simple alignment method seemed slightly more convincing. Another big advantage of Modeller is its very short runtime.
SWISS-MODEL was very fast as well and outperformed iTasser in accuracy, especially concerning the active site region. Surprisingly, using the Modeller alignment even improved its performance. Its big deficiency was the lack of results for the low similarity templates, which the other methods handled with good results.
iTasser ran for more than 5 days. Unfortunately this long runtime could not be justified by the results. While certainly satisfactory, they are not significantly better than the results obtained by SWISS-MODEL or Modeller.

<figure id="fig:3gh5_2gjx_simplysuperimposed">

Figure 29: Comparison of the reference structure 2gjx and the low sequence similarity template 3gh5.


Altogether the three methods were able to provide useful and accurate models in all cases. This suggests a good performance of the homology modelling tools currently available, but it is also an indicator of the structural conservation of the Hex A alpha subunit. The active site region was approximated very well by all methods. The main outliers were located in loop regions, which should not affect the main structure to any great extent. To depict the structural relationship between the Hex A alpha subunit and the low sequence identity templates, 3gh5_A was superimposed with the reference structure 2gjx_A (see <xr id="fig:3gh5_2gjx_simplysuperimposed"/>). The fairly accurate match shows the high structural conservation of this enzyme.