Difference between revisions of "Homology based structure prediction (Phenylketonuria)"

From Bioinformatikpedia
(Summary)
(Modeller)
Line 9: Line 9:
   
 
=== Modeller ===
 
=== Modeller ===
Modeller is a tool for homology or comparative modeling of protein 3D-structure. It uses a provided alignment of sequence to be modeled with known related structures and calculates automatically a model. Modeller has a lot of functions available, like including de novo modeling of loops, clustering and multiple alignment of protein sequences or structures etc. <ref> http://salilab.org/modeller/about_modeller.html, retrieved June 08, 2013. </ref> However, many users find it difficult to use MODELLER as it is based on the command line and Python scripting skills are required to use this tool efficiently. <ref> Bhusan K Kuntal, Polamarasetty Aparoy and Pallu Reddanna(2010): "[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2936912/ EasyModeller: A graphical interface to MODELLER]". BioMed Central (BMC) Research Notes Vol.3: 226. [http://en.wikipedia.org/wiki/Digital_object_identifier doi]:[http://www.biomedcentral.com/1756-0500/3/226 10.1186/1756-0500-3-226]. PubMed: 20712861]. </ref> So, in the year 2011 some students wrote a [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Using_Modeller_for_TASK_4 basic tutorial] for MODELLER.
+
Modeller is a tool for homology or comparative modeling of protein 3D-structure. It uses a provided alignment of sequence to be modeled with known related structures and calculates automatically a model. Modeller has a lot of functions available, like including de novo modeling of loops, clustering and multiple alignment of protein sequences or structures etc. <ref> http://salilab.org/modeller/about_modeller.html, retrieved June 08, 2013. </ref> However, many users find it difficult to use Modeller as it is based on the command line and Python scripting skills are required to use this tool efficiently. <ref> Bhusan K Kuntal, Polamarasetty Aparoy and Pallu Reddanna(2010): "[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2936912/ EasyModeller: A graphical interface to MODELLER]". BioMed Central (BMC) Research Notes Vol.3: 226. [http://en.wikipedia.org/wiki/Digital_object_identifier doi]:[http://www.biomedcentral.com/1756-0500/3/226 10.1186/1756-0500-3-226]. PubMed: 20712861]. </ref> So, in the year 2011 some students wrote a [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Using_Modeller_for_TASK_4 basic tutorial] for Modeller.
   
 
==== Single template modeling ====
 
==== Single template modeling ====

Revision as of 11:45, 12 June 2013

Summary

The 3D structure of proteins is very important for many biological experiments. However, the number of known protein structures (91190 Structures on PDB <ref> RCSB Protein Data Bank, retrieved on Tuesday June 04, 2013 at 5 PM PDT </ref>) is very low in comparison to the number of known protein sequences (540261 Sequences of UniProtKB/Swiss-Prot<ref> http://web.expasy.org/docs/relnotes/relstat.html, Release 2013_06 of 29-May-13 of UniProtKB/Swiss-Prot contains 540261 sequence entries, comprising 191876607 amino acids abstracted from 219534 references. </ref> and 35502518 of UniProtKB/TrEMBL <ref> http://www.ebi.ac.uk/uniprot/TrEMBLstats/, Release 2013_06 of 29-May-2013 of UniProtKB/TrEMBL contains 35502518 sequence entries, comprising 11384440438 amino acids.</ref> on Uniprot). Since the quantity of folds in nature is limited and the 3D structure of proteins is more conserved than the appendant sequence, it is often possible to find a homologous protein with a known structure (template) for an interesting protein sequence (target). So, homology modelling is one of the best methods to predict 3D structure of proteins depending on their sequence<ref name="swissmodel"> Konstantin Arnold, Lorenza Bordoli, Jürgen Kopp and Torsten Schwede (2006): "The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling". Bioinformatics Vol.22(2): 195-201. doi:10.1093/bioinformatics/bti770 </ref>.

After predicting the structural and functional features of our protein as well as relationships to other proteins via structural alignments, we now want to generate an homology modelling for the secondary structure prediction of our protein. This is an important function, since in real life analysis the 3D structure of proteins is often unknown. For this purpose, we used for the model creation the three tools Modeller, Swissmodel and iTasser with different input settings. We also created some multiple template alignments with the tool Modeller. Then, we evaluated our generated models with the resulted scores from the modelling tools and the additional calculated GDT scores, C_alpha RMSDs and the all atom RMSDs. Last but not least, we will compare and discuss our results in more detail.

Model calculation

Lab journal
In our last task we already found some protein structures we now can discover a bit more. Therefore we first build a dataset with four structures with a sequence identity bigger than 60% and another also with four structures and a sequence similarity between 19 and 36% relative to our reference structure 2pah.

Modeller

Modeller is a tool for homology or comparative modeling of protein 3D-structure. It uses a provided alignment of sequence to be modeled with known related structures and calculates automatically a model. Modeller has a lot of functions available, like including de novo modeling of loops, clustering and multiple alignment of protein sequences or structures etc. <ref> http://salilab.org/modeller/about_modeller.html, retrieved June 08, 2013. </ref> However, many users find it difficult to use Modeller as it is based on the command line and Python scripting skills are required to use this tool efficiently. <ref> Bhusan K Kuntal, Polamarasetty Aparoy and Pallu Reddanna(2010): "EasyModeller: A graphical interface to MODELLER". BioMed Central (BMC) Research Notes Vol.3: 226. doi:10.1186/1756-0500-3-226. PubMed: 20712861]. </ref> So, in the year 2011 some students wrote a basic tutorial for Modeller.

Single template modeling

</figure> </figure> </figure>
<figure id="mod_1j8u">
Structure of target 2PAH (green), model (purple) and template 1J8U (blue) superimposed. For the creation of the model the tool Modeller was used.
<figure id="mod_2phm">
Structure of target 2PAH (green), model (purple) and template 2PHM (blue) superimposed. For the creation of the model the tool Modeller was used.
<figure id="mod_3luy">
Structure of target 2PAH (green), model (purple) of template 3LUY (blue) superimposed. For the creation of the model the tool Modeller was used.

2PAH shows the 3D structure of the protein P00439 residues 118-452. Therefore, only the biopterin domain is included not the ACT domain. It includes the catalytic domain. This is the part which gets aligned with the other structures in the figures above.
A quite good structure consensus was reached with template 1J8U which was expected as it has a sequence identity of 100% to 2PAH (<xr id="mod_1j8u"/>). It almost matches perfectly with the catalytic domain and even forms the binding site. Only at the end is the structure not coiled any more and differs from 2PAH.
The model based on 2PHM with a sequence identity of 89.7% also was aligned into the catalytic domain with quite good results, although it is a bit worse than the model generated with 1J8U (<xr id="mod_2phm"/>. The binding site is visible, but smaller and therefore more difficult to access.
For the structure build with template 3LUY, which only has a sequence identity of 22%, a lot of differences to the target structure can be viewed (<xr id="mod_3luy"/>). The binding site is completely missing, but again the structure is linked to the binding site of 2PAH. In the first two models the templates themselves are matched quite good to 2PAH, whereas for the third model the template is not very well aligned.

<figtable id="modeller_single_zscore">

Sequence identity and combined Z-score of the models generated with Modeller
Template Sequence Identity Model
1J8U 100.00 -11.54
2PHM 93.30 -12.37
3LUY 22.09 -1.24
The templates and their sequence identities as well as the combined Z-Score for each one.

</figtable> Looking at <xr id="modeller_single_zscore"/> you can see that for higher sequence identities lower Z-Scores are achieved. Remarkably for the model based on 2PHM it is lower than for 1J8U although the sequence identity is a bit lower. This might be caused by the uncoiled end (<xr id="mod_1j8u"/>). TODO Z-score good or bad???

Multiple alignments

For multiple alignments we choose four structures and then superimpose them with 2PAH three times. At first we took four structures with sequence identities bigger than 60% (1J8U, 2PHM, 1QEY and 1PHZ). The second time we use two of those sequences (1J8U and 2PHM) and join them with two structures with a sequence identity beneath 30% (3LUY and 1WYP). Last, four structures with sequence identities of about 30% are aligned (3LUY, 1WYP, 2V27 and 2QMX) to each other.

</figure> </figure> </figure>
<figure id="mult_big">
Structure of target 2PAH (green) and model of the multiple alignment of templates 1J8U, 2PHM, 1QEY and 1PHZ (purple) superimposed. For the creation of the model the tool Modeller was used.
<figure id="mult_mix">
Structure of target 2PAH (green) and model of the multiple alignment of templates 1J8U, 2PHM, 3LUY and 1WYP (purple) superimposed. For the creation of the model the tool Modeller was used.
<figure id="mult_low">
Structure of target 2PAH (green) and model and the multiple alignment of templates 3LUY, 1WYP, 2V27 and 2QMX (purple) superimposed. For the creation of the model the tool Modeller was used.

For the first alignment the structure matches nearly perfectly with the catalytic domain of 2PAH (<xr id="mult_big"/>). Only a small part is not aligned. The mixed alignment also has reached a pretty good result (<xr id="mult_mix"/>) and is nearly identical to the first one. Both show a high accordance at the binding site. Only the alignment of the four structures with low sequence identities could not match very well to 2PAH (<xr id="mult_low"/>). In fact it seems that it does not form any structure at all and just remains uncoiled even at the small part aligned to the catalytic domain and forms no binding site.

Alltogether Modeller seems to be a good tool to perform a structure prediction for sequences with high identities for which structure already is known. Especially for multiple alignments good results can be achieved as it seems to be able to filter bad parts of structures if both sequences with high and low identities are combined.

Swissmodel

Another tool for protein structure homology modelling is the integrated web-based SwissModel workspace system. For the interesting protein (target), a library of protein structures is searched to find suitable templates. Based on a sequence alignment between target sequence and template structure, a 3D-model for the target protein is generated. For the reliability estimation of the generated models some quality assessment tools are used. <ref name="swissmodel2"> Lorenza Bordoli, Florian Kiefer, Konstantin Arnold, Pascal Benkert, James Battey and Torsten Schwede (2009): "Protein structure homology modeling using SWISS-MODEL workspace". Nature Protocols Vol.4: 1-13. doi:10.1038/nprot.2008.197 </ref>

The resulting models as well as the reference protein 2PAH and the used templates can be seen in <xr id="swiss_1j8u"/>, <xr id="swiss_2phm"/> and <xr id="swiss_3luy"/>.

</figure> </figure> </figure>
<figure id="swiss_1j8u">
Structure of target 2PAH (green), model (purple) and template 1J8U (blue) superimposed. For the creation of the model the tool Swissmodell was used.
<figure id="swiss_2phm">
Structure of target 2PAH (green), model (purple) and template 2PHM (blue) superimposed. For the creation of the model the tool Swissmodell was used.
<figure id="swiss_3luy">
Structure of target 2PAH (green), model (purple) and template 3LUY (blue) superimposed. For the creation of the model the tool Swissmodell was used.

Like in Modeller the model generated with template 1J8U reaches a good match with 2PAH even it is not as good. Remarkably this time not only the catalytic domain was matched (<xr id="swiss_1j8u"/>). However, for 2PHM which still has a sequence identity of about 90% to 2PAH it is much worse and seems to be random (<xr id="swiss_2phm"/>) and again the 3LUY generated model shows barely a convincing superimposed structure (<xr id="swiss_3luy"/>).

Besides of the structure Swiss-Model also gives: <figure id="anolea">

a ANOLEA and GROMOS results for 1J8U calculated by Swiss-Model.
b ANOLEA and GROMOS results for 2PHM calculated by Swiss-Model.
c ANOLEA and GROMOS results for 3LUY calculated by Swiss-Model.

ANOLEA and GROMOS results for 1J8U (a), 2PHM (b) and 3LUY (c) calculated by Swiss-Model. </figure>

Anolea The atomic empirical mean force potential ANOLEA (Melo et al.) is used to assess packing quality of the models. The program performs energy calculations on a protein chain, evaluating the "Non- Local Environment" (NLE) of each heavy atom in the molecule. The y-axis of the plot represents the energy for each amino acid of the protein chain. Negative energy values (in green) represent favourable energy environment whereas positive values (in red) unfavourable energy environment for a given amino acid.

Gromos GROMOS (van Gunsteren et al.) is a general-purpose molecular dynamics computer simulation package for the study of biomolecular systems and can be applied to the analysis of conformations obtained by experiment or by computer simulation. The y-axis of the plot represents the energy for each amino acid of the protein chain. Negative energy values (in green) represent favourable energy environment whereas positive values (in red) unfavourable energy environment for a given amino acid.

Swiss-Model:

  • 1J8U:
    • Ligands in the template: FE2: 2, H4B: 2.
    • Ligands in the model: FE2: 2
  • 2PHM:
    • Ligands in the template: FE: 2.
    • Ligands in the model: FE: 2
  • 3LUY:
    • Ligands in the template: PPY: 1.
    • Ligands in the model: none.


I-TASSER

I-TASSER is a webserver for a hierarchical protein structure modeling approach. It is based on the Profile-Profile threading Alignment (PPA) and the iterative implementation of the Threading ASSEmbly Refinement (TASSER) program. I-TASSER has a new confidence score (C-score) included, adapted from the output values of PPA and Monte Carlo simulation. The output of I-TASSER includes five predicted models and their related C-score as well as only for the first model the estimated TM-score and RMSD value. Although, TM- and C-score possess a significant correlation, they were introduced for different purposes. The C-score gives a judgement about the confidence of the server prediction and the TM-score the absolute quality of the final model in comparison to the native structure (estimated through the C-score calculation). <ref name="itasser"> Yang Zhang (2008): "[zhanglab.ccmb.med.umich.edu/papers/2008_2.pdf I-TASSER server for protein 3D structure prediction]". BMC Bioinformatics Vol.9: 40. doi:10.1186/1471-2105-9-40 </ref>

I-TASSER generates structure and function prediction with following steps:

<figure id="itasser_protocol">

Schematic representation of the steps used in I-TASSER for the protein structure and function prediction.

</figure>

  1. Threading: Retrieve template proteins of similar fold as the query protein sequence from PDB by the locally installed meta-threading approach LOMETS
  2. Structure assembly: continous fragments in the threading alignments are cut off from the template structure and used for merging structural conformations of well aligned sections with unaligned sections built by ab initio modelling.
  3. Model selection and refinement: second iteration of the fragment assembly simulation starting from the selected cluster centroids to remove some inconsistencies and to generate the final structural models
  4. Structure-based functional annotation: the function of the interesting protein is derived by the structural matching of the 3D models against proteins with known structure and function in PDB <ref name="itasser2"> Ambrish Roy, Alper Kucukura and Yang Zhang (2010): "I-TASSER: a unified platform for automated protein structure and function prediction". Nature Protocols Vol.5: 725-738. doi:10.1038/nprot.2010.5 </ref>

A description about I-TASSER as well as the graphical protocol shown in <xr id="itasser_protocol"/> can be found here.

For the modelling with I-TASSER we used the following different cutoff values:

  • for 1J8U we did not set any cutoff since the sequence has an identity of 100%
  • for 2PHM we used a cutoff of 90% (all templates are excluded, which have a sequence identity >90% to the query protein)
  • for 3LUY we used a cutoff of 35% (all templates are excluded, which have a sequence identity >35% to the query protein)

In the <xr id="itasser_1j8u"/>, <xr id="itasser_2phm"/> and <xr id="itasser_3luy"/> below, one can see the superimposed structures of 2PAH and the first generated model resulting from I-TASSER for the query P00439 and the templates 1J8U, 2PHM and 3LUY.

</figure> </figure> </figure>
<figure id="itasser_1j8u">
Structure of target 2PAH (green), model 1 (purple) and template 1J8U (blue) superimposed. For the creation of the model the tool I-TASSER was used.
<figure id="itasser_2phm">
Structure of target 2PAH (green), model 1 (purple) and template 2PHM (blue) superimposed. A cutoff of 90% was utilized for the model creation, which was done by the tool I-TASSER.
<figure id="itasser_3luy">
Structure of target 2PAH (green), model 1 (purple) and template 3LUY (blue) superimposed. A cutoff of 90% was utilized for the model creation, which was done by the tool I-TASSER.

An interesting observation is that the model generated with template 1J8U in <xr id="itasser_1j8u"/> is apparently less consistent to the template itself than the model generated with 2PHM in <xr id="itasser_2phm"/>, which overlaps a little bit more. But both models are very good predictions and match a huge range of the reference protein 2PAH, which cannot be said about the <xr id="itasser_3luy"/>. In this image does the model of template 3LUY hardly overlap with the structure of 2PAH and with the template itself, too. But in comparison to the other two tools is this predicted model more precisely.

<figtable id="itasser_cscores">

C-score of the five models generated with I-TASSER
Template Sequence Identity Model 1 Model 2 Model 3 Model 4 Model 5
1J8U 100.00 -0.10 -1.16 -3.72 -4.08 -4.19
2PHM 93.30 -0.04 -1.12 -0.62 -1.17 -1.32
3LUY 16.00 -3.09 -3.42 -3.57 -3.63 -3.58

</figtable>

In <xr id="itasser_cscores"/> one can discover, that the C-score is getting lower from Model 1 to 5. This means, that the lower the C-score gets, the lower the confidence of the predicted models will be. But it is interesting, that the models generated with the template 2PHM with a sequence identity of 93.30% get better C-scores than the models generated with the template 1J8U with a 100% sequence identity. Moreover, the fourth and fifth model of template 3LUY with 16% identity have a better C-score than these models of template 1J8U. Inter alia, a lower number of the model does not immediately mean that it has to be the best model at all. In our term, Model 1 is the best one generated with I-TASSER, but in the models generated with 2PHM one can recognize for example, that the third model has a better score than the second one.

Model evaluation

Lab journal
For the evaluation we only used the first generated model of the three tools.

Scores

GDT-TS possible to calculate with LGA. <figtable id="eval_scores">

Model-Scoring
Template Seq.Identity GDT_TS Weighted RMSD Un-weighted RMSD
Swiss-Model
1J8U 100.00 88.98 0.51 0.63
2PHM 91.93 88.45 0.59 0.90
3LUY 16.79 4.10 13.52 9.64
iTasser
1J8U 100.00 86.47 0.50 0.62
2PHM 93.30 90.05 0.87 2.66
3LUY 16.00 60.11 1.30 6.10
Modeller - Single template modeling
1J8U 100.00 89.89 0.50 0.62
2PHM 93.30 88.53 0.65 1.67
3LUY 22.09 7.30 21.06 19.61
Modeller - Multiple alignments
1J8U,2PHM,1QEY,1PHZ >60 89.29 0.52 0.73
1J8U,2PHM,3LUY,1WYP ... 89.67 0.53 0.82
3LUY,1WYP,2V27,2QMX <30 7.83 17.22 15.51
Sequence identities, GDT-TS, weighted and unweighted RMSD scores for the different structure model tools.

</figtable> In <xr id="eval_scores"/> one can see, that the RMSD value (weighted and un-weighted) is getting higher the lower the sequence identity of the models, generated with the given templates, gets. In contrast, the GDT value increases at the same observation. The RMSD measures the average of the distance between the atoms of the superimposed structures and the GDT searches the largest set of equivalent residues and deviates them by a specified distance cutoff (mostly four ones). Thus, the GDT takes only regions into account where residues show a low distance to each other. So, the GDT is better for finding the best fitting model. In comparison to the quality scores of the modelling tools, the Z-score of Modeller in <xr id="modeller_single_zscore"/> seems to be like the RMSD score, since it does getting lower the higher the identity of the templates gets. But the Z-score has a negative factor and the RMSD a positive one and therefore the range of these two scores has a high difference. The C-score from the I-TASSER tool in contrast to the Z-score is more similar to the GDT values. The lower the C-score is the higher the seqeunce identity of the templates will be. Here, the C-score has also negative values and the GDT positive ones. So, the correlation between Z-score and C-score is the same as between GDT and RMSD.

If one compares the output of the three tools, one can recognize immediately that I-TASSER creates better models as Modeller and this in return better than Swissmodel. But I-TASSER is very slow and needs a lot of time (until 60 hours for one job). One good thing of this tool in return to te other ones is the output of five models and the calculated values ​​of several models. In Modeller and just a little bit in I-TASSER does the prediction depend on the similarity of the templates, because the higher the similarity gets the more overlapping is the structure of the model with the reference structure. Only Swissmodel does not have any dependency, since all generated models have a bad conformance to 2PAH.

For improving the generated models, one could use other predictions like disorder, transmembrane helices or other structure predictions.

In the models created with Modeller and multiple alignment, one can see, that the model generated from both (low and high sequence identity templates) shows quite good results like the model from only high sequence identity templates. So, it is a good thing to include more than one template, since the created models show a higher similarity to the reference structure.

References

<references/>