Homology-based structure prediction (PKU)

From Bioinformatikpedia
Revision as of 23:02, 3 June 2012 by Hollizeck (talk | contribs) (Single Template)

Short Task Description

After the sequence-based predictions of function and secondary structure for our protein, we will determine the 3D structure of the wild-type protein and observe the influence that one or several SNPs have on this structure. Of the variety of methods available for tertiary structure prediction, we chose homology modeling as a first approach to our goal. Read the complete task description here. The protocol of commands and scripts can be found in our journal.

Reference

<figure id="fig:1pahstruct">

The phenylalanine hydroxylase monomer

</figure><figure id="fig:2pahstruct">

The phenylalanine hydroxylase tetramer, the active oligomer of the reference protein P00439

</figure>

Because we already know the protein responsible for PKU, evaluating the applied methods is easier than for a completely unknown sequence. <xr id="fig:1pahstruct" /> shows the monomer and the active site of phenylalanine hydroxylase. <xr id="fig:2pahstruct" /> shows the tetramer, the active form found in the human body.

Model Construction

Here we show the steps we took to build the models that we then use and evaluate. Before the actual model building we first have to construct some datasets, which form the foundation of our models.

Datasets

These datasets were derived from several sources. They all consist of PDB entries, but we made sure not to include the already known structure of our protein, so that we gain better insight into homology modeling of a completely unknown sequence.

PDBe

<figtable id="tab:datasetpdbe"> Dataset PDBe

pdb ID E-value Identity in %
> 80% sequence identity
2phm 4.1e-148 95.5
40% - 80% sequence identity
2xsn 6e-100 61.1
1toh 1e-99 60.8
3e2t 8.5e-99 64.4
1mlw 1.1e-95 66.1
3hf8 1.5e-92 66.4
< 30% sequence identity
3l0i 6.7 25
3uan 18 24.8
1vkj 20 24.8
3hv0 71 21.7

</figtable>

For this dataset we used the sequence similarity search web service provided by the PDBe, called PDBeXplore, which can be accessed here. In the dataset used (see <xr id="tab:datasetpdbe" />) we restricted the data received from the PDB: for example, we did not use the structures of both the monomer and the dimer of the same protein. We also did not use the same structure with different ligands, in order to keep the variability high.
In the dataset with more than 80% sequence identity we found only one significant hit, the structure of dephosphorylated phenylalanine hydroxylase. This is a borderline case with respect to excluding the protein itself, but since it comes from another organism we decided to include it.
The dataset with 40% to 80% sequence identity contains only structures related to aromatic amino acid hydroxylation, namely tryptophan and tyrosine hydroxylases from chicken and rat; the rat structure also contains the tetramerisation domain that we find in our reference structure. We also found human tryptophan and tyrosine hydroxylase structures in the PDB.
As for the dataset with less than 30% identity, we cannot really expect useful output here, because the best E-value we could find is 6.7.
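The sorting of hits into the three identity classes can be sketched as follows. This is only an illustration and not one of the scripts from our journal: the function names `identity_class` and `bin_hits` and the `exclude` parameter are assumptions, as is the choice of 1pah and 2pah as the entries to leave out. The three example hits are taken from the PDBe table.

```python
# Minimal sketch: sort search hits into the three sequence identity classes.
# Function names and the exclude list are illustrative assumptions.

def identity_class(identity):
    """Return the class label for an identity in %, or None if it fits no class."""
    if identity > 80:
        return "> 80%"
    if 40 <= identity <= 80:
        return "40% - 80%"
    if identity < 30:
        return "< 30%"
    return None  # 30% to 40% is not covered by any of the three classes


def bin_hits(hits, exclude=("1pah", "2pah")):
    """Group (pdb_id, evalue, identity) tuples by class, skipping excluded IDs."""
    bins = {"> 80%": [], "40% - 80%": [], "< 30%": []}
    for pdb_id, evalue, identity in hits:
        if pdb_id.lower() in exclude:
            continue  # assumed: the known target structure is left out
        label = identity_class(identity)
        if label is not None:
            bins[label].append(pdb_id)
    return bins


# Three hits copied from the PDBe table: (PDB ID, E-value, identity in %)
PDBE_HITS = [("2phm", 4.1e-148, 95.5), ("2xsn", 6e-100, 61.1), ("3l0i", 6.7, 25.0)]

if __name__ == "__main__":
    print(bin_hits(PDBE_HITS))
```

Note that a hit with an identity between 30% and 40% falls into none of the three classes and is dropped by this sketch.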


HHPred

<figtable id="tab:datasetHHPred"> Dataset HHPred

pdb ID E-value Identity in %
> 80% sequence identity
1phz 1.5e-159 92
1j8u 2.7e-143 100
40% - 80% sequence identity
1toh 4.9e-147 60
1mlw 4.1e-137 65
< 30% sequence identity
3luy 1.3e-17 21
3mwb 7.1e-15 20
2ko1 0.38 11
1zpv 0.26 12

</figtable>

The dataset with the highest sequence similarity in <xr id="tab:datasetHHPred"/> contains two structures with very high similarity, because both are structures of the original protein in different states. One is the protein in complex with tetrahydrobiopterin (BH4), a cofactor for PheOH activity. The other is the phosphorylated protein structure.
In the second dataset (40%-80%) we find two of the structures already explained above.
The last HHPred dataset contains only structures from bacteria, and only one of them has a direct connection to PheOH, as it binds L-Phe. The others are all connected to or part of the ACT domain, which is known to be regulated by amino acid concentration and thereby relates to our target protein.


Coma

<figtable id="tab:datasetComa"> Dataset Coma

pdb ID E-value Identity in %
> 80% sequence identity
1phz 3.4e-126 87
40% - 80% sequence identity
NO RESULT
< 30% sequence identity
2v27 1.9e-79 28
1ltu 6.9e-77 27
3luy 2.6e-21 16
3mwb 2.8e-20 16
2qmx 3.7e-20 20
2qmw 2.2e-19 21

</figtable>

In the dataset above 80% we again find the structure 1phz from above.
Unfortunately we did not receive any result for our second dataset.
For the third dataset, however, we had plenty of candidates to choose from. We chose the PheOH counterpart from Chromobacterium violaceum, namely 1ltu, and one (2v27) from Colwellia psychrerythraea 34H, a version of the protein that works in a much colder environment. 3luy as well as [1], plus the new 2qmx and 2qmw, fall into the ACT domain group mentioned for the HHPred dataset.

Comparison of datasets

In summary, all approaches find similar results for the dataset with 80% sequence identity or above: structures of phenylalanine hydroxylase with different modifications or from different organisms. However, only Coma and HHPred find the same structure (1phz), whereas PDBeXplore finds a different identifier (2phm), the dephosphorylated form of the protein.
Most differences occur in the second dataset, in which PDBeXplore finds many candidates with very good E-values (6e-100 to 1.5e-92), whereas the other two find only two results or none at all.
In the third dataset (<30% identity) PDBeXplore finds some candidates, but their E-values (6.7 and above) are too high to be considered good hits. HHPred and Coma, in contrast, both found hits with low E-values and an identity in the desired range.
An attentive reader may also have noticed that identity and E-value differ between the tools for supposedly identical hits such as 3luy. We are not entirely sure where these differences arise, but since they are small we expect them to stem from different alignment scoring and/or weighting schemes.
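The overlap between the three result lists can be checked with a few lines of Python. This is a minimal sketch rather than the parsing scripts acknowledged below: the helper names `shared_hits` and `compare_shared` are assumptions, and the values are copied by hand from the three dataset tables.

```python
# Minimal sketch: which PDB IDs are shared between the tools, and how do
# their E-values and identities differ? Helper names are illustrative.

# PDB ID -> (E-value, identity in %), copied from the three dataset tables.
PDBE = {
    "2phm": (4.1e-148, 95.5), "2xsn": (6e-100, 61.1), "1toh": (1e-99, 60.8),
    "3e2t": (8.5e-99, 64.4), "1mlw": (1.1e-95, 66.1), "3hf8": (1.5e-92, 66.4),
    "3l0i": (6.7, 25), "3uan": (18, 24.8), "1vkj": (20, 24.8), "3hv0": (71, 21.7),
}
HHPRED = {
    "1phz": (1.5e-159, 92), "1j8u": (2.7e-143, 100), "1toh": (4.9e-147, 60),
    "1mlw": (4.1e-137, 65), "3luy": (1.3e-17, 21), "3mwb": (7.1e-15, 20),
    "2ko1": (0.38, 11), "1zpv": (0.26, 12),
}
COMA = {
    "1phz": (3.4e-126, 87), "2v27": (1.9e-79, 28), "1ltu": (6.9e-77, 27),
    "3luy": (2.6e-21, 16), "3mwb": (2.8e-20, 16), "2qmx": (3.7e-20, 20),
    "2qmw": (2.2e-19, 21),
}


def shared_hits(a, b):
    """Sorted list of PDB IDs found in both result sets."""
    return sorted(set(a) & set(b))


def compare_shared(a, b):
    """For every shared ID, pair the (E-value, identity) values of both tools."""
    return {pdb_id: (a[pdb_id], b[pdb_id]) for pdb_id in shared_hits(a, b)}


if __name__ == "__main__":
    print("PDBe / HHPred:", shared_hits(PDBE, HHPRED))
    print("HHPred / Coma:", shared_hits(HHPRED, COMA))
    print("PDBe / Coma:  ", shared_hits(PDBE, COMA))
    for pdb_id, (hh, coma) in compare_shared(HHPRED, COMA).items():
        print(pdb_id, "HHPred:", hh, "Coma:", coma)
```

With the values from the tables, PDBe and HHPred share 1mlw and 1toh, HHPred and Coma share 1phz, 3luy and 3mwb, and PDBe and Coma share no hit at all.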

Acknowledgment

Parsing of HHPred and Coma was performed with help from the scripts of group Fabry Disease. Thank you for your inspiration.

Models

In this part we create homology models with different methods in order to examine the structure of our unknown protein.

Modeller

Here we show the models one can obtain from Modeller <ref name="modeller">A. Šali and T. L. Blundell. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815, 1993. </ref>. We used a locally installed version.
Modeller mainly offers two possibilities:

  • single template modelling
  • multi-template modelling

We are going to show you the differences and possibilities this offers.

Single Template

For single-template modelling you choose the one sequence you believe to be the closest relative of your target sequence and build the model from the alignment of those two. Since we created the datasets beforehand, the choice of templates is already made. We only have to align the target with the single template and then build the model from this pairwise alignment. The model is then assessed with the DOPE score and GA341<ref name="GA341">Francisco Melo, Roberto Sánchez, Andrej Sali; Protein science : a publication of the Protein Society, Vol. 11, No. 2. (February 2002), pp. 430-448, doi:10.1002/pro.110430</ref><ref name="DOPE">Min-yi Shen and Andrej Sali Protein Sci. 2006 November; 15(11): 2507–2524.doi: 10.1110/ps.062416606</ref>.
In the following we show three exemplary models, one from each sequence identity class.
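A single-template run with the Modeller Python interface might look like the sketch below. It is not the exact script from our journal: the alignment file name `pah-1phz.ali` and the alignment codes `pah` and `1phz` are hypothetical, the number of models is an arbitrary choice, and `build_models` and `best_by_dope` are illustrative helper names. The lowercase class names `environ` and `automodel` are those of the older Modeller releases; Modeller itself is only imported inside `build_models`, so the selection helper also works without a Modeller installation.

```python
# Minimal sketch of single-template modelling with Modeller.
# File names, alignment codes and helper names are assumptions.

def build_models(alnfile, template, target, n_models=5):
    """Build n_models models of `target` from one template; needs a local Modeller."""
    from modeller import environ
    from modeller.automodel import automodel, assess

    env = environ()
    env.io.atom_files_directory = ['.']  # directory with the template PDB file
    a = automodel(env,
                  alnfile=alnfile,       # pairwise alignment in PIR format
                  knowns=template,       # alignment code of the template
                  sequence=target,       # alignment code of the target
                  assess_methods=(assess.DOPE, assess.GA341))
    a.starting_model = 1
    a.ending_model = n_models
    a.make()
    return a.outputs  # one dictionary per model, including its scores


def best_by_dope(outputs):
    """Pick the successfully built model with the lowest (best) DOPE score."""
    ok = [m for m in outputs if m.get('failure') is None]
    if not ok:
        raise ValueError("no model was built successfully")
    return min(ok, key=lambda m: m['DOPE score'])


if __name__ == "__main__":
    # Hypothetical inputs: alignment file and codes for target and template.
    outputs = build_models('pah-1phz.ali', '1phz', 'pah')
    best = best_by_dope(outputs)
    print(best['name'], best['DOPE score'])
```

Passing a tuple of alignment codes as `knowns` is how Modeller is given several templates at once, provided the alignment file contains all of them.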

1phz

To see the best quality one can obtain from modelling, we used 1phz, which we expected to have the highest similarity to 1pah. <xr id="fig:1phzmodelling" /> shows the similarity to <xr id="fig:1pahstruct" />.

<figure id="fig:1phzmodelling">

Modelling of 1pah (blue) with 1phz (red) overlayed

</figure>

The similarity is clear, except for the coiled termini, which are not contained in the PDB files, since the PDB entries cover a shorter sequence than the one found in UniProt.

2xsn

For the lower sequence identity class we used the template with the best (lowest) E-value in order to see the optimal performance here. The color coding is the same as above.

<figure id="fig:2xsnmodelling">

Modelling of 1pah with 2xsn (blue) and the reference 1pah (red) overlayed

</figure>

2v27

With a sequence identity as low as in this model one would expect very bad results. Our results can be seen in <xr id="fig:2v27modelling" />.

<figure id="fig:2v27modelling">

Modelling of 1pah with 2v27 (blue) and the reference 1pah (red) overlayed

</figure>


Multi Template

We created three models using multiple templates. The procedure follows the tutorial from previous students.

The templates used are:

  • 1VKJ and 3HV0

<xr id="tab:modeller_multi"/> on the left shows the overlay of a model created with two templates of low (<30%) sequence similarity in blue with the true structure in red. As can be seen, much of the model is folded rather randomly; in particular, the loop formed by residues approximately 320 to 380 is badly misshapen.

  • 1VKJ and 2PHM

<xr id="tab:modeller_multi"/> in the middle shows the overlay of a model created with one template of low (<30%) and one of high (>80%) sequence similarity in blue with the true structure in red. The N-terminal helix is not modelled correctly, and one larger stretch in the bottom right corner is missing in the reference structure. The remaining parts are modelled very close to the true structure.

  • 2PHM and 1TOH

<xr id="tab:modeller_multi"/> on the right shows the overlay of a model created with two templates of high (one >80% and one >60%) sequence similarity in blue with the true structure in red. The errors are much the same as in the middle panel of <xr id="tab:modeller_multi"/>: the N-terminal helix is folded correctly but kinked, and a longer stretch is missing in the reference.


<figtable id="tab:modeller_multi"> Modeller predictions with multiple templates

Overlay of the Modeller prediction of PAH using the 1VKJ and 3HV0 PDB structures as templates in blue with the 2PAH PDB structure in red
Overlay of the Modeller prediction of PAH using the 1VKJ and 2PHM PDB structures as templates in blue with the 2PAH PDB structure in red
Overlay of the Modeller prediction of PAH using the 1TOH and 2PHM PDB structures as templates in blue with the 2PAH PDB structure in red

</figtable>

Swiss-Model

We submitted several templates to the Swiss Model server; the resulting models are shown in <xr id="tab:Swiss_Model"/>.

  • Fully automatic prediction

The server chose the best template on its own and used the PDB structure 1J8U:A. The N-terminal helix is again not modelled correctly; the remaining parts are modelled very close to the true structure.

  • 1TOH:A

Supplying 1TOH manually as template results in the models shown in the middle and on the right of <xr id="tab:Swiss_Model"/>. The terminal helix is modelled correctly but is kinked again. Aligning the template with T-Coffee did not change the result significantly, as the automatic alignment already appears very accurate.

We also tried a number of other structures with lower sequence similarity as templates, both with and without a prealigned sequence. In each case, the server aborted the prediction for reasons we summarize as too low alignment quality.


<figtable id="tab:Swiss_Model"> Swiss Model server predictions with and without prealigned templates

Overlay of automatic Swiss Model server prediction of PAH in blue with the 2PAH PDB structure in red
Overlay of automatic Swiss Model server prediction of PAH using the 1TOH PDB structure as template in blue with the 2PAH PDB structure in red
Overlay of automatic Swiss Model server prediction of PAH using the 1TOH PDB structure as prealigned template in blue with the 2PAH PDB structure in red

</figtable>

I-Tasser

We submitted the following templates to the I-Tasser server:

  • full automatic (chosen by server: 2PHM:A)
  • 1TOH:A (complete)
  • 1VKJ:A (pending)


<figtable id="tab:I-Tasser"> I-Tasser server predictions

Overlay of automatic I-Tasser server prediction of PAH in blue with the 2PAH PDB structure in red
Overlay of automatic I-Tasser server prediction of PAH using the 1TOH PDB structure as template in blue with the 2PAH PDB structure in red

</figtable>

Model evaluation

A monk once asked Ummon, "What is this place where knowledge is useless?" Ummon answered him: "Knowledge and emotion cannot fathom it!"

Modeller

1phz

2xsn

2v27

Will probably be completed by Monday evening, sorry! :-/

References

<references/>