Homology based structure prediction (Phenylketonuria)

Summary

The 3D structure of proteins is very important for many biological experiments. However, the number of known protein structures (91190 Structures on PDB <ref> RCSB Protein Data Bank, retrieved on Tuesday June 04, 2013 at 5 PM PDT </ref>) is very low in comparison to the number of known protein sequences (540261 Sequences of UniProtKB/Swiss-Prot<ref> http://web.expasy.org/docs/relnotes/relstat.html, Release 2013_06 of 29-May-13 of UniProtKB/Swiss-Prot contains 540261 sequence entries, comprising 191876607 amino acids abstracted from 219534 references. </ref> and 35502518 of UniProtKB/TrEMBL <ref> http://www.ebi.ac.uk/uniprot/TrEMBLstats/, Release 2013_06 of 29-May-2013 of UniProtKB/TrEMBL contains 35502518 sequence entries, comprising 11384440438 amino acids.</ref> on Uniprot). Since the quantity of folds in nature is limited and the 3D structure of proteins is more conserved than the appendant sequence, it is often possible to find a homologous protein with a known structure (template) for an interesting protein sequence (target). So, homology modelling is one of the best methods to predict 3D structure of proteins depending on their sequence<ref name="swissmodel"> Konstantin Arnold, Lorenza Bordoli, Jürgen Kopp and Torsten Schwede (2006): "The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling". Bioinformatics Vol.22(2): 195-201. doi:10.1093/bioinformatics/bti770 </ref>.

After predicting the structural and functional features of our protein as well as relationships to other proteins via structural alignments, we now want to generate an homology modelling for the secondary structure prediction of our protein. This is an important function, since in real life analysis the 3D structure of proteins is often unknown. For this purpose, we used for the model creation the three tools Modeller, Swissmodel and iTasser with different input settings. We also created some multiple template alignments with the tool Modeller. Then, we evaluated our generated models with the resulted scores from the modelling tools and the additional calculated GDT scores, C_alpha RMSDs and the all atom RMSDs. Last but not least, we will compare and discuss our results in more detail.

Model calculation

Lab journal
In our last task we already found some protein structures we now can discover a bit more. Therefore, we first build a dataset with four structures with a sequence identity bigger than 60% and another one also with four structures and a sequence similarity between 19 and 36% relative to our reference structure 2PAH.

Modeller

Modeller is a tool for homology or comparative modeling of protein 3D-structure. It uses a provided alignment of sequences to be modeled with known related structures and calculates automatically a model. Modeller has a lot of functions available, like including de novo modeling of loops, clustering and multiple alignment of protein sequences or structures etc. <ref> http://salilab.org/modeller/about_modeller.html, retrieved June 08, 2013. </ref> However, many users find it difficult to use Modeller as it is based on the command line and Python scripting skills are required to use this tool efficiently. <ref> Bhusan K Kuntal, Polamarasetty Aparoy and Pallu Reddanna(2010): "EasyModeller: A graphical interface to MODELLER". BioMed Central (BMC) Research Notes Vol.3: 226. doi:10.1186/1756-0500-3-226. PubMed: 20712861]. </ref> So, in the year 2011 some students wrote a basic tutorial for Modeller .

Single template modeling

</figure> </figure> </figure>

Structure of target 2PAH (green), model (purple) and template 1J8U (blue) superimposed. For the creation of the model the tool `Modeller` was used.
<figure id="mod_1j8u">
<figure id="mod_2phm">
<figure id="mod_3luy">

2PAH shows the 3D structure of the protein P00439 residues 118-452. Therefore, only the biopterin domain is included not the ACT domain. It includes the catalytic domain. This is the part which gets aligned with the other structures in the figures above.
A quite good structure consensus was reached with template 1J8U which was expected as it has a sequence identity of 100% to 2PAH (<xr id="mod_1j8u"/>). It almost matches perfectly with the catalytic domain and even forms the binding site. Only at the end the structure is not coiled any more and differs from 2PAH. The model based on 2PHM with a sequence identity of 89.7% also was aligned into the catalytic domain with quite good results, although it is a bit worse than the model generated with 1J8U (<xr id="mod_2phm"/>. The binding site is visible, but smaller and therefore more difficult to access. For the structure build with template 3LUY, which only has a sequence identity of 22%, a lot of differences to the target structure can be viewed (<xr id="mod_3luy"/>). The binding site is completely missing, but again the structure is linked to the binding site of 2PAH. In the first two models the templates themselves are matched quite good to 2PAH, whereas for the third model the template is not very well aligned.

Sequence identity and combined Z-score of the models generated with Modeller
Template	Sequence Identity	Model
1J8U	100.00	-11.54
2PHM	93.30	-12.37
3LUY	22.09	-1.24

The templates and their sequence identities as well as the combined Z-Score for each one.

</figtable> Looking at <xr id="modeller_single_zscore"/> you can see that for higher sequence identities lower Z-Scores are achieved. Remarkably for the model based on 2PHM it is lower than for 1J8U although the sequence identity is a bit lower. This might be caused by the uncoiled end (<xr id="mod_1j8u"/>). Z-scores represent the divergence of the standard deviation to the mean. The further away of zero the more likely it is no random similarity. As 3LUY is close to zero you can see that the accordances between this structure and 2PAH is more likely to be caused by random than for 1J8U or 2PHM.

Multiple alignments

For multiple alignments we choose four structures and then superimpose them with 2PAH three times. At first we take four structures with sequence identities bigger than 60% (1J8U, 2PHM, 1QEY and 1PHZ). The second time we use two of those sequences (1J8U and 2PHM) and join them with two structures with a sequence identity beneath 30% (3LUY and 1WYP). Last, we align four structures with sequence identities of about 30% (3LUY, 1WYP, 2V27 and 2QMX) to each other.

</figure> </figure> </figure>

Structure of target 2PAH (green) and model of the multiple alignment of templates 1J8U, 2PHM, 1QEY and 1PHZ (purple) superimposed. For the creation of the model the tool `Modeller` was used.
<figure id="mult_big">
<figure id="mult_mix">
<figure id="mult_low">

For the first alignment the structure matches nearly perfectly with the catalytic domain of 2PAH (<xr id="mult_big"/>), where only a small part is not aligned. The mixed alignment also has reached a pretty good result (<xr id="mult_mix"/>) and is nearly identical to the first one. Both show a high accordance at the binding site. Only the alignment of the four structures with low sequence identities could not match very well to 2PAH (<xr id="mult_low"/>). In fact it seems that it does not form any structure at all and just remains uncoiled even at the small part aligned to the catalytic domain and forms no binding site.

Alltogether Modeller seems to be a good tool to perform a structure prediction for sequences with high identities for which the structure already is known. Especially for multiple alignments good results can be achieved as it seems to be able to filter bad parts of structures if both sequences with high and low identities are combined.

Swissmodel

Another tool for protein structure homology modelling is the integrated web-based SwissModel workspace system. For the interesting protein (target), a library of protein structures is searched to find suitable templates. Based on a sequence alignment between target sequence and template structure, a 3D-model for the target protein is generated. For the reliability estimation of the generated models some quality assessment tools are used. <ref name="swissmodel2"> Lorenza Bordoli, Florian Kiefer, Konstantin Arnold, Pascal Benkert, James Battey and Torsten Schwede (2009): "Protein structure homology modeling using SWISS-MODEL workspace". Nature Protocols Vol.4: 1-13. doi:10.1038/nprot.2008.197 </ref>

The resulting models as well as the reference protein 2PAH and the used templates can be seen in <xr id="swiss_1j8u"/>, <xr id="swiss_2phm"/> and <xr id="swiss_3luy"/>.

</figure> </figure> </figure>

Structure of target 2PAH (green), model (purple) and template 1J8U (blue) superimposed. For the creation of the model the tool `Swissmodell` was used.
<figure id="swiss_1j8u">
<figure id="swiss_2phm">
<figure id="swiss_3luy">

Like in Modeller the model generated with template 1J8U reaches a good match with 2PAH even it is not as good. Remarkably this time not only the catalytic domain was matched (<xr id="swiss_1j8u"/>). However, for 2PHM which still has a sequence identity of about 90% to 2PAH it is much worse and seems to be random (<xr id="swiss_2phm"/>) and again the 3LUY generated model shows barely a convincing superimposed structure (<xr id="swiss_3luy"/>).

Besides of the structure Swiss-Model also gives ANOLEA and GROMOS values: <figure id="anolea">

a ANOLEA and GROMOS results for 1J8U calculated by Swiss-Model.

b ANOLEA and GROMOS results for 2PHM calculated by Swiss-Model.

c ANOLEA and GROMOS results for 3LUY calculated by Swiss-Model.

ANOLEA and GROMOS results for 1J8U (a), 2PHM (b) and 3LUY (c) calculated by Swiss-Model.

</figure> ANOLEA is the atomic empirical mean force potential and can be used to judge the packing quality of the models. Positive and negative energies are calculated for the whole protein chain. Thereby, the "Non- Local Environment" (NLE) of each heavy atom in the molecule is evaluated. Negative energy values are represented in green and are preferable whereas red shows unfavourable positive energy. In <xr id="anolea"/> for the model of 1J8U (a) mostly for each amino acid negative energy was meassured, for the model of 2PHM (b) it already is a bit less, but still very good, whereas for the model of 3LUY (c) many positive energies are found. GROMOS like ANOLEA represents the positive or negative energy for each amino acid only without evaluation the NLE. Thereby negative energy values (green) are favourable. Therefore the results of GROMOS are not that different to ANOLEA, so the models created with 1J8U and 2PHM show lot of negative energies, wheras the model created with 3luy again has many positive energies. Including NLE in ANOLEA leads to pyromidal negative energy formation if looking not just at one amino acid but also at is neighbors.
Additionally Swiss-Model gives information if the ligands included in the template is used in the model or not:

1J8U:
- Ligands in the template: FE2: 2, H4B: 2.
- Ligands in the model: FE2: 2
2PHM:
- Ligands in the template: FE: 2.
- Ligands in the model: FE: 2
3LUY:
- Ligands in the template: PPY: 1.
- Ligands in the model: none.

Here you can see that some ligands are not used for the model. For 1J8U the ligand H4B was removed and only FE2 remains. For 3LUY with excluding the ligand PPY no ligand was left at all. Only for 2PHM there was no change of the ligand between template and model.

I-TASSER

I-TASSER is a webserver for an hierarchical protein structure modeling approach. It is based on the Profile-Profile threading Alignment (PPA) and the iterative implementation of the Threading ASSEmbly Refinement (TASSER) program. I-TASSER has a new confidence score (C-score) included, adapted from the output values of PPA and Monte Carlo simulation. The output of I-TASSER includes five predicted models and their related C-score as well as only for the first model the estimated TM-score and RMSD value. Although, TM- and C-score possess a significant correlation, they were introduced for different purposes. The C-score gives a judgement about the confidence of the server prediction and the TM-score the absolute quality of the final model in comparison to the native structure (estimated through the C-score calculation). <ref name="itasser"> Yang Zhang (2008): "[zhanglab.ccmb.med.umich.edu/papers/2008_2.pdf I-TASSER server for protein 3D structure prediction]". BMC Bioinformatics Vol.9: 40. doi:10.1186/1471-2105-9-40 </ref>

I-TASSER generates structure and function prediction with following steps:

Schematic representation of the steps used in I-TASSER for the protein structure and function prediction.

</figure>

Threading: Retrieve template proteins of similar fold as the query protein sequence from PDB by the locally installed meta-threading approach LOMETS
Structure assembly: continous fragments in the threading alignments are cut off from the template structure and used for merging structural conformations of well aligned sections with unaligned sections built by ab initio modelling.
Model selection and refinement: second iteration of the fragment assembly simulation starting from the selected cluster centroids to remove some inconsistencies and to generate the final structural models
Structure-based functional annotation: the function of the interesting protein is derived by the structural matching of the 3D models against proteins with known structure and function in PDB <ref name="itasser2"> Ambrish Roy, Alper Kucukura and Yang Zhang (2010): "I-TASSER: a unified platform for automated protein structure and function prediction". Nature Protocols Vol.5: 725-738. doi:10.1038/nprot.2010.5 </ref>

A description about I-TASSER as well as the graphical protocol shown in <xr id="itasser_protocol"/> can be found here.

For the modelling with I-TASSER we used the following different cutoff values:

for 1J8U we did not set any cutoff since the sequence has an identity of 100%
for 2PHM we used a cutoff of 90% (all templates are excluded, which have a sequence identity >90% to the query protein)
for 3LUY we used a cutoff of 35% (all templates are excluded, which have a sequence identity >35% to the query protein)

In the <xr id="itasser_1j8u"/>, <xr id="itasser_2phm"/> and <xr id="itasser_3luy"/> below, one can see the superimposed structures of 2PAH and the first generated model resulting from I-TASSER for the query P00439 and the templates 1J8U, 2PHM and 3LUY.

</figure> </figure> </figure>

Structure of target 2PAH (green), model 1 (purple) and template 1J8U (blue) superimposed. For the creation of the model the tool `I-TASSER` was used.
<figure id="itasser_1j8u">
<figure id="itasser_2phm">
<figure id="itasser_3luy">

An interesting observation is that the model generated with template 1J8U in <xr id="itasser_1j8u"/> is apparently less consistent to the template itself than the model generated with 2PHM in <xr id="itasser_2phm"/>, which overlaps a little bit more. But both models are very good predictions and match a huge range of the reference protein 2PAH, which cannot be said about the <xr id="itasser_3luy"/>. In this image the model of template 3LUY hardly overlaps with the structure of 2PAH and with the template itself, too. But in comparison to the other two tools this predicted model is more precisely.

C-score of the five models generated with I-TASSER
Template	Sequence Identity	Model 1	Model 2	Model 3	Model 4	Model 5
1J8U	100.00	-0.10	-1.16	-3.72	-4.08	-4.19
2PHM	93.30	-0.04	-1.12	-0.62	-1.17	-1.32
3LUY	16.00	-3.09	-3.42	-3.57	-3.63	-3.58

</figtable>

In <xr id="itasser_cscores"/> one can discover, that the C-score is getting lower from Model 1 to 5. This means, that the lower the C-score gets, the lower the confidence of the predicted models will be. But it is interesting, that the models generated with the template 2PHM with a sequence identity of 93.30% get better C-scores than the models generated with the template 1J8U with a 100% sequence identity. Moreover, the fourth and fifth model of template 3LUY with 16% identity have a better C-score than these models of template 1J8U. Inter alia, a lower number of the model does not immediately mean that it has to be the best model at all. In our term, Model 1 is the best one generated with I-TASSER, but in the models generated with 2PHM one can recognize for example, that the third model has a better score than the second one.

Model evaluation

Lab journal
For the evaluation we only used the first generated model of the three tools.

Scores

Model-Scoring
Template	Seq.Identity	GDT_TS	Weighted RMSD	Un-weighted RMSD
Swiss-Model
1J8U	100.00	88.98	0.51	0.63
2PHM	91.93	88.45	0.59	0.90
3LUY	16.79	4.10	13.52	9.64
iTasser
1J8U	100.00	86.47	0.50	0.62
2PHM	93.30	90.05	0.87	2.66
3LUY	16.00	60.11	1.30	6.10
Modeller - Single template modeling
1J8U	100.00	89.89	0.50	0.62
2PHM	93.30	88.53	0.65	1.67
3LUY	22.09	7.30	21.06	19.61
Modeller - Multiple alignments
1J8U,2PHM,1QEY,1PHZ	>60	89.29	0.52	0.73
1J8U,2PHM,3LUY,1WYP	...	89.67	0.53	0.82
3LUY,1WYP,2V27,2QMX	<30	7.83	17.22	15.51

Sequence identities, GDT-TS, weighted and unweighted RMSD scores for the different structure model tools.

</figtable> In <xr id="eval_scores"/> one can see, that the RMSD value (weighted and un-weighted) is getting higher the lower the sequence identity of the models, generated with the given templates, gets. In contrast, the GDT value increases at the same observation. The RMSD measures the average of the distance between the atoms of the superimposed structures and the GDT searches the largest set of equivalent residues and deviates them by a specified distance cutoff (mostly four ones). Thus, the GDT takes only regions into account where residues show a low distance to each other. So, the GDT is better for finding the best fitting model. In comparison to the quality scores of the modelling tools, the Z-score of Modeller in <xr id="modeller_single_zscore"/> seems to be like the RMSD score, since it does getting lower the higher the identity of the templates gets. But the Z-score has a negative factor and the RMSD a positive one and therefore the range of this two scores has a high difference. The C-score from the I-TASSER tool in contrast to the Z-score is more similar to the GDT values. The lower the C-score the higher the sequence identity of the templates will be. Here, the C-score has also negative values and the GDT positive ones. So, the correlation between Z-score and C-score is the same as between GDT and RMSD.

If one compares the output of the three tools, one can recognize immediately that I-TASSER creates better models as Modeller and this in return better models than Swissmodel. Nevertheless, I-TASSER is very slow and needs a lot of time (until 60 hours for one job). One good thing of this tool in contrast to the other ones is the output of five models and the calculated values of several models. In Modeller and just a little bit in I-TASSER the prediction depends on the similarity of the templates, because the higher the similarity the more overlapping is the structure of the model with the reference structure. Only Swissmodel does not have any dependency, since all generated models have a bad conformance to 2PAH.

For improving the generated models, one could use other predictions like disorder, transmembrane helices or other structure predictions.

In the models created with Modeller and multiple alignment, one can see, that the model generated from both (low and high sequence identity templates) shows quite good results like the model from only high sequence identity templates. So, it is a good thing to include more than one template, since the created models show a higher similarity to the reference structure.

References

Homology based structure prediction (Phenylketonuria)

Contents

Summary

Model calculation

Modeller

Single template modeling

Multiple alignments

Swissmodel

I-TASSER

Model evaluation

Scores

References

Navigation menu

Views

Personal tools

Bioinformatik navigation

MediaWiki navigation

Search

Tools