Homology-based structure prediction (PKU)

From Bioinformatikpedia

Short Task Description

After the sequence based predictions of function and secondary structure for our protein we will determine the 3D structure of the wild type protein. Of the variety of methods to be used for tertiary structure prediction, we choose homology modeling as a first approach to our goal. Read the complete task description here. The protocol of commands and scripts can be found in our journal.


<figure id="fig:1pahstruct">

The Phenylalaininehydroxylasemonomer

</figure><figure id="fig:2pahstruct">

The Phenylalaininehydroxylasetetramere, this is the active polymere of the reference protein P00439


Due to our prior knowledge of the protein responsible for PKU, the evaluation of the methods is easier than for a truly unknown sequence. In <xr id="fig:1pahstruct" /> one can see the monomer and the active site of Phenylalaninehydroxylase. On the right side ( <xr id="fig:2pahstruct" />) one can see the polymere in its active form which is found in the human body.

Model Construction

Here we will show the steps we took building the models we then use and evaluate. In order to start the mere model-building, we first have to construct some datasets of structures, which will be the founding of our models.


These datasets were derived from several sources. They all consist of PDB-entries, but we ensured to not include the already known structure of our protein, so we have a better insight in the topic of homology modeling with a completely unknown sequence.


<figtable id="tab:datasetpdbe"> Dataset PDBe

pdb ID E-value Identity in %
> 80% sequence identity
2phm 4.1e-148 95.5
40% - 80% sequence identity
2xsn 6e-100 61.1
1toh 1e-99 60.8
3e2t 8.5e-99 64.4
1mlw 1.1e-95 66.1
3hf8 1.5e-92 66.4
< 30% sequence identity
3l0i 6.7 25
3uan 18 24.8
1vkj 20 24.8
3hv0 71 21.7


For this set of datasets we used the webservice of sequence similarity search provided by the pdb called PDBeXplore, which can be accessed here. In the used dataset (see <xr id="tab:datasetpdbe" /> we filtered the received data from pdb, e.g. we did not use the structure of both the monomer and the dimer etc. We also did not use the structure with different ligands to keep the variability high.
In the dataset of sequences above 80% we only found one significant hit, which is the structure for Phenylalanine Hydroxilase dephosphorylated. This is a marginal case for the noninclusion of the protein itself, but we decided to include it, since it's from another organism.
The dataset with sequence identity from 40% to 80% sequence identity only contains structures in connection with hydroxylases, namely tryptophan and tyrosine hydroxylases from chicken and rat though the structure gained from the rat also contains the tetramerisation domain we also find in our reference structure. But we also found tryptophan and tyrosine hydroxylase structures in the pdb derived from human.
As for the lower than 30% dataset, we can not really expect to find useful output here, because the best E-value we could find is 6.7. We will still use these structures to test our methods and see if the results are better than randomly expected.


<figtable id="tab:datasetHHPred"> Dataset HHPred

pdb ID E-value Identity in %
> 80% sequence identity
1phz 1.5e-159 92
1j8u 2.7e-143 100
40% - 80% sequence identity
1toh 4.9e-147 60
1mlw 4.1e-137 65
< 30% sequence identity
3luy 1.3e-17 21
3mwb 7.1e-15 20
2ko1 0.38 11
1zpv 0.26 12


The dataset with the highest sequence similarity in <xr id="tab:datasetHHPred"/> contains two structures with a very high similarity, with is due to the fact, that the structure is that of the original protein in different states. One is the protein in complex with Tetrahydrobiopterin (BH4), which is a co-factor for the PheOH-activity. The other is the phosphorylated protein structure.
In the second dataset (40%-80%) we find two of the structures which were alread explained above.
The last dataset from HHPred contains five structures which only descend from bacteria where only one of the structures has a direct connection with PheOH as this one binds L-Phe. The others all are connected or part of the ACT-domain which is known to be controlled by amino-acid concentration, which relates to our target protein.


<figtable id="tab:datasetComa"> Dataset Coma

pdb ID E-value Identity in %
> 80% sequence identity
1phz 3.4e-126 87
40% - 80% sequence identity
< 30% sequence identity
2v27 1.9e-79 28
1ltu 6.9e-77 27
3luy 2.6e-21 16
3mwb 2.8e-20 16
2qmx 3.7e-20 20
2qmw 2.2e-19 21


In the above-80%-identity-dataset we find again our known structure as seen above.
Unfortunately we did not receive any result for our second dataset.
But the choice for our third dataset was great. We chose the PheOH-counterpart from the CHROMOBACTERIUM VIOLACEUM namely 1ltu and one (2v27) from COLWELLIA PSYCHRERYTHRAEA 34H which is a version of the protein, that works in a much colder environment. 3luy as well as [1] plus the new 2qmx and 2qmw fall into the ACT-domain-group as mentioned at the HHPreddataset

Comparison of datasets

In summary, one can say that the approaches all find similar results for the dataset with 80% sequence similarity or above. Those are all structures of Phenylalaninehydroxylase with different modifications or from different organisms. But only Coma and HHPred find the same (1phz) structure, whereas PDBeXplore finds a completely different identifier, which is the protein with its co-substrate.
Most differences occur in the second dataset, in which PDBeXplore finds a lot of possible candidates with a very good e-Value range. But the other two do only find two or even no result at all.
In the third datset (<30% identity) PDBeXplore finds some candidates, but the e-Values are to high too be considered good hits. In contrary to HHPred and Coma which both found good hits with a low e-Value as well as identity in the desired range.
What also might have been observed by an interested reader, are the differences in identity and e-Value throughout the supposed to be identical hits, like 3luy. We are not entirely sure why these might arise, but since the difference is not that significant we expect them to descend from different alignment scores, database size corrections and/or weighting.


Parsing of HHPred and Coma was performed with help from the scripts of group Fabry Disease. Thank you for your inspiration.


We created homology models with different methods, in order to examine the structure of our unknown protein.


Here we will show you the models one can gain from Modeller <ref name="modeller">A. Šali and T. L. Blundell. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815, 1993. </ref>. We used a local version at our private machines.
Modeller offers mainly two possibilities:

  • single template modelling
  • multi-template modelling

We are going to show you the differences and possibilities this offers.

<figure id="fig:compare1and2pah">

structure of 1pah (blue) and 2pah (red) superimposed

</figure>In order to assess the quality of the models we will show the reference protein in blue for all the following images. These were aligned using the pymol alignment algorithm. You might see some modelled domains which do not appear in the 1PAH structure shown in <xr id="fig:1pahstruct" />. This stems from the pdb entry, which only contains the catalytic domain structure of the protein. So we decided to show in <xr id="fig:compare1and2pah" /> the two 'different' structures so you know the difference when we refer to the binding domain or the catalytic domain of the protein.

Single Template

In this part you choose one sequence which you believe to be the closest relative to your target sequence and model it after the alignment of those two. As we did our dataset creation before, the choice of sequences is already done. We now only have to use the single template-script to first align and then actually model this pairwise alignment. Then it will be assessed with DOPE-score, GA341<ref name="GA341">Francisco Melo, Roberto Sánchez, Andrej Sali; Protein science : a publication of the Protein Society, Vol. 11, No. 2. (February 2002), pp. 430-448, doi:10.1002/pro.110430</ref><ref name="DOPE">Min-yi Shen and Andrej Sali Protein Sci. 2006 November; 15(11): 2507–2524.doi: 10.1110/ps.062416606</ref>, TM- and GDT-scores.
In the following we show you three exemplary models from each of the sequence identity sections.


To see the optimal quality one can gain from a modeling we used 1phz which we expected to have the highest similarity to 1pah. In <xr id="fig:1phzmodelling" /> one sees the simliarity to <xr id="fig:1pahstruct" />

<figure id="fig:1phzmodelling">

Modelling of 1pah (blue) with 1phz (red) imposed


The structures almost match perfectly in the catalytic domain, but the coiled ends and beginnings differ. The binding domain is missing completely.


With the lower sequence identity we used the one with the highest e-Value in order to see the optimal performance here. The colorcoding is the same as above.

<figure id="fig:2xsnmodelling">

Modelling of 1pah with 1toh (red) and the reference 1pah (blue) superimposed


Again we see the coiled ends, which in this part are even longer than before, which is due to the fact, that the 1TOH related sequence is about 100 residues shorter than the 1pah. But we also see some differences in the catalytic domain as well. For example the helix left of the binding domain is a bit more kinked and therefore overlaps the helix of the reference. A bit on top of this, there is a sheet, which only exists in the model, which might be a result of the different substrates the two domains bind. Overall, the differences are really minor and the catalytic domain is almost peferctly matched. In contrast to the modelling with 1phz one has to point out the binding domain, which is modeled here.


With a sequence identity as low as in this model one would expect very bad results. Our results can bee seen in <xr id="fig:2v27modelling" />

<figure id="fig:2v27modelling">

Modelling of 1pah with 2v27 (blue) and the reference 1pah (red) overlayed


They are really of low quality, as almost no helix or sheet from the model matches one from the reference and even the binding domain is just coiled and seperated. This time the catalytic domain is not matched very well and the binding domain is not matched at all.

Multiple Templates

We created three models using multiple templates. The procedure follows the tutorial from previous year students.

Used templates are:

  • 1VKJ and 3HV0

<xr id="tab:modeller_multi"/> on the left shows the overlay of a model created with two templates of low (<30%) sequence similarity in red with the true structure in blue. As can be seen, much of the model is rather randomly folded, especially the loop formed by residues approximately 320 to 380 is badly misshaped.

  • 1VKJ and 2PHM

<xr id="tab:modeller_multi"/> in the middle shows the overlay of a model created with one templates of low (<30%) and one with high (<80%) sequence similarity in red with the true structure in blue. Here we see a big improvment to the model before. The catalytic domain is almost perfectly matched and the coiled part which could be found at the left picture has been reduced.

  • 2PHM and 1TOH

<xr id="tab:modeller_multi"/> on the right shows the overlay of a model created with two templates of high (one >80% and one >60%) sequence similarity in red with the true structure in blue. We almost got the same structure as in <xr id="tab:modeller_multi"/> in the middle. But this time there is even a well shaped bindingdomain modelled. Here you can see that modeller is able to get better results with a multi template version than with one single. In <xr id="fig:1phzmodelling" /> the catalytic domain was matched equally well, but the binding domain was missing, which is now completely well formed.

<figtable id="tab:modeller_multi"> Modeller predictions with multiple templates

Overlay of modeller prediction of PAH using 1VKJ and 3HV0 PDB structures as template in blue with the 2PAH PDB structure in red
Overlay of modeller prediction of PAH using 1VKJ and 2PHM PDB structures as template in blue with the 2PAH PDB structure in red
Overlay of modeller prediction of PAH using 1TOH and 2PHM PDB structures as template in blue with the 2PAH PDB structure in red



We submitted several templates to the Swiss Model server, the resulting models are shown in <xr id="tab:Swiss_Model"/>. As Swiss-Model results were mostly dimeric with identical chains, we restricted the results to only using one chain.

  • Fully automatic prediction

The server choose the best template on its own and used the PDB structure 1J8U:A. Here again we find a perfect match as we could see in Modeller before.
This is not extremely surprising, since the identity of 1J8U and 1PAH is 100%, but due to the co-substrate binding at 1J8U some differences might be expected.

  • 1TOH:A

Feeding 1TOH manually as template results in the models seen in <xr id="tab:Swiss_Model"/> middle and right. Again we see a very good matching of the structures, apart from the length of some helices of sheets, but the positioning of the domains and even the bindingdomain is matched very exactly. The results are even closer to the reference than they were with modeller. Aligning the template with T-Coffee did not change the result significantly as the automatic alignment appears already very accurate.

We also tried a number of other structures with lower sequence similarity as templates, both with and without a prealigned sequence. In each case, the server aborted the prediction because the alignment quality was too low to be used as template properly.

<figtable id="tab:Swiss_Model"> Swiss Model server predictions with and without prealigned templates

Overlay of automatic Swiss Model server prediction of PAH in blue with the 2PAH PDB structure in red
Overlay of automatic Swiss Model server prediction of PAH using the 1TOH PDB structure as template in blue with the 2PAH PDB structure in red
Overlay of automatic Swiss Model server prediction of PAH using the 1TOH PDB structure as prealigned template in blue with the 2PAH PDB structure in red



We submitted the following templates to the I-Tasser web server:

  • fullly automatic with a cut-off to exclude sequences above 95% sequence identity (chosen by server: 2PHM:A)

Here again we find a very nice matching of reference and model, which is even better than the modeller approach, because here the binding domain is nicely formed.

  • 1TOH:A

In this modelling approach the catalytic domain is matched quite similar to the one with 2phm, but the binding domain is not matched at all. Though it is much better than the related modeller model, which predicted this as one long coiled region.

  • 1VKJ:A

Surprisingly with this template I-Tasser matches the reference better than with 1toh, but in the model evaluation we will discuss possible explanations for that. The result is almost as good as the one with template 2phm, but for a few small differences in helix and sheet positioning.

<figtable id="tab:I-Tasser"> I-Tasser server predictions

Overlay of automatic I-Tasser server prediction of PAH in red with the 2PAH PDB structure in blue
Overlay of automatic I-Tasser server prediction of PAH using the 1TOH PDB structure as template in red with the 2PAH PDB structure in blue
Overlay of automatic I-Tasser server prediction of PAH using the 1VKJ PDB structure as template in red with the 2PAH PDB structure in blue


Model evaluation

We show the TM scores, GDT scores and RMSD value for all presented models in <xr id="tab:modelling_scores"/> and comment on the individual methods or models in the appropriate subsections.

<figtable id="tab:modelling_scores"> Achieved scores of the prediction methods

template ID TM-Score GDT-TS Score GDT-HA Score RMSD value
Swiss Model server
1J8U 0.904 0.890 0.795 0.508
1TOH 0.916 0.872 0.718 0.879
1TOH (prealigned) 0.916 0.871 0.720 0.884
1VKJ invalid template for Swiss Model
I-Tasser server
2PHM 0.925 0.899 0.727 0.828
1TOH 0.931 0.857 0.677 0.831
1VKJ 0.925 0.895 0.725 0.909
Modeller single templates
1PHZ 0.912 0.891 0.790 0.544
1TOH 0.914 0.868 0.717 0.893
2V27 0.681 0.575 0.422 1.224
Modeller multiple templates
2PHM, 1TOH 0.928 0.904 0.785 0.751
2PHM, 1VKJ 0.913 0.886 0.768 0.601
1VKJ, 3VH0 0.166 0.075 0.046 20.423



Single Templates

<figure id="fig:1phzdope">

dopescore per residue for template and model produced by the scripts from the tutorial for 1toh


Modeling with a high similarity is considered easy, but especially in those cases errors can have more weight than in other. For the model in <xr id="fig:1phzdope"/>, the dopescore per residue of template and target is in all cases higher for the model, but the difference is never high. Like mentioned above, both ends of the model are only existent in the model due to smaller sequenceparts in the pdb.

<figure id="fig:1tohdope">

dopescore per residue for template and model produced by the scripts from the tutorial for 1toh


As expected, the scores are better if the template is more similar to the target, but the difference in scores between the highly similar template and the moderately similar model is very small. Looking at the models on the other hand reveals that the models both are near the optimum, but the one from the highly similar template is more accurate in the (most important) functional domain than the scores suggest. The model from 2V27 with low similarity does very poorly in comparison, which is reflected in the scores. A TM-Score of 0.681 puts the proteins in the same fold, but suggests room for error that can indeed be found in the overlay in <xr id="fig:2v27modelling" /> The dopescore in <xr id="fig:2v27dope" /> also shows room for improvement of the model.

<figure id="fig:2v27dope">

dopescore per residue for template and model produced by the scripts from the tutorial for 2v27


Multiple aligned templates

Using two templates with high similarity produced the model with narrowly the best scores of all methods. We expected good results from this approach but are astonished, that it beats even the much more sophisticated and computationally expensive web servers.

Using one highly and one barely similar template (2PHM and 1VKJ) achieves very good results. We assume that the more reliable information of the better template is used preferentially, as it simply fits the target sequence much better. If there are only low quality template as input, the resulting model is even worse than with one low quality template alone. Possibly conflicting and unreliable information leads to a barely above random modelling of the target.

Swiss Model

The Swiss Model was not able to produce models working from templates with low similarity, which makes this a poor tool for modelling proteins from little known or unknown folds. Overall, the models are on par with the ones produced from Modeller with single templates. Aligning the template prior to the modelling did not change the results significantly, but since we did this only with highly similar sequences, it can be assumed that the automatic alignment done by Swiss Model is almost identical to ours.


I-Tasser provides five different models which can be used and their C-Score to evaluate them, but in real cases there wouldn't be a reference to compare them with to choose the best model. For the pictures, we choose simply the (suggested) model with the best score, but large differences in the models built from one template might make it difficult to evaluate the accuracy of the prediction. To get a feeling about this, we imposed all the models we derived from I-Tasser, to see the conserved and the unsure regions.
We always show the five models colored automaticly by pymol, since the models do not need to be adressed separately, but rather as a set of models. But the reference will be colored in the same blue color as before.


<figure id="fig:automaticmodels">

all five models gained from I-Tasser automatic (randomly colored) superimposed with 1pah in blue


So we see in <xr id="fig:automaticmodels" /> that the catalytic domain is well conserved and the differences between the models are minor, the only big difference between them is the positioning of the bindingdomain, which can be seen at the right side of the picture.


<figure id="fig:1tohmodels">

all five models gained from I-Tasser with template 1toh (randomly colored) superimposed with 1pah in blue


In <xr id="fig:1tohmodels" /> we also see a rather conserved catalytic domain, with more differences than in <xr id="fig:automaticmodels"/>. Additionally we also have some coiled region, which do not match the reference.


<figure id="fig:1vkjmodels">

all five models gained from I-Tasser with template 1vkj (randomly colored) superimposed with 1pah in blue


In the modelling part we were surprised about the extremly well results I-Tasser produced with its modell, but when we now look at <xr id="fig:1vkjmodels" /> we see that this was just due to the choice of the one model with the best score of the five I-Tasser provided.
We see just a fuzzy ball, with no conserved regions throughout the models. Most of them barely match the reference structure. So the extremly good result even with a low quality template has to be dampened, but human choice allows for a decent model building.