Homology based structure predictions BCKDHA

!!! This task has to be re-done. The template used for the "good" category is a structure of BCKDHA itself, and when running iTasser the self hit was NOT excluded!!!

1. Calculation and evaluation of models

Template selection

Homology modelling is a technique for predicting the three-dimensional (tertiary) structure of a target protein. It is based on an alignment of the target sequence with one or more template sequences whose structures are known. The target sequence is assigned a structure based on the template structure. The better the alignment, the better the predicted structure for our target. Template selection is therefore a crucial step in homology modelling.

To find similar structures to BCKDHA we ran HHsearch using the following command:
hhsearch -i query -d database -o output

It found the following 10 hits in the pdb70 database.

No	Hit	Prob	E-value	P-value	Score	Cols	Query HMM	Template HMM	Identity
1	2bfd_A 2-oxoisovalerate dehydr	1.0	1	1	791.3	400	1-400	1-400 (400)	99%
2	1qs0_A 2-oxoisovalerate dehydr	1.0	1	1	571.5	349	32-382	52-407 (407)	39%
3	1w85_A Pyruvate dehydrogenase	1.0	1	1	530.8	356	8-382	6-362 (368)	34%
4	1umd_A E1-alpha, 2-OXO acid de	1.0	1	1	521.8	351	34-386	16-367 (367)	37%
5	2ozl_A PDHE1-A type I, pyruvat	1.0	1	1	482.7	331	46-380	25-356 (365)	27%
6	3l84_A Transketolase; TKT, str	1.0	1	1	85.4	133	161-297	113-252 (632)	21%
7	2r8o_A Transketolase 1, TK 1;	1.0	1	1	74.5	121	161-285	113-245 (669)	33%
8	2o1x_A 1-deoxy-D-xylulose-5-ph	1.0	1	1	74.2	127	161-287	122-254 (629)	18%
9	1gpu_A Transketolase; transfer	1.0	1	1	74.2	140	161-302	115-265 (680)	22%
10	3m49_A Transketolase; alpha-be	1.0	1	1	68.8	121	161-285	139-271 (690)	31%

Before we can start working with these hits we have to check whether any of them is a PDB structure of BCKDHA itself. This is the case for 2bfd_A, so this hit cannot be used as a template.
Without it, all remaining structures have a sequence identity below 40%. Since only templates from this range are available, we chose two of them so that the identities still vary: one with 39% identity and one with 18% identity.
In the following we worked with 1qs0_A (39%) and 2o1x_A (18%).
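
The selection step described above can be sketched in a few lines of Python. This is only an illustration, not the procedure that was actually scripted: the hit list is copied from the HHsearch table, while the function name `select_templates` and the 95% identity cutoff used to recognise a self hit are assumptions made for this example.

```python
# Hits copied from the HHsearch table above: (PDB chain, percent identity).
HITS = [
    ("2bfd_A", 99), ("1qs0_A", 39), ("1w85_A", 34), ("1umd_A", 37),
    ("2ozl_A", 27), ("3l84_A", 21), ("2r8o_A", 33), ("2o1x_A", 18),
    ("1gpu_A", 22), ("3m49_A", 31),
]

# Assumption for illustration: a hit at or above this identity is treated
# as a structure of the target itself and is excluded.
SELF_HIT_IDENTITY = 95


def select_templates(hits, self_cutoff=SELF_HIT_IDENTITY):
    """Return the highest- and lowest-identity hits after removing self hits."""
    usable = [h for h in hits if h[1] < self_cutoff]
    if not usable:
        raise ValueError("no usable templates left after removing self hits")
    best = max(usable, key=lambda h: h[1])
    worst = min(usable, key=lambda h: h[1])
    return best, worst


best, worst = select_templates(HITS)
print(best, worst)  # ('1qs0_A', 39) ('2o1x_A', 18)
```

With the table values this reproduces the two templates used in the rest of this page.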

General information for the evaluation

A detailed description of how the created models were evaluated can be found in the Evaluation Protocol. The following section presents only the modelling and evaluation results.

Two useful scores for comparing the structural similarity of two structures are the RMSD and the TM-score. Both are commonly used to measure the accuracy of a model when the native structure is known.

The RMSD (root-mean-square deviation) is the square root of the mean squared distance between all corresponding residue pairs in two superimposed structures. The C-alpha RMSD is computed over the aligned alpha-carbons only. The smaller the RMSD value, the better the predicted structure. A local error (e.g. misorientation of the tail) will result in a high RMSD value even if the global structure is correct.
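
As a minimal sketch of the RMSD calculation, assuming the two structures are already superimposed and the coordinate lists are in aligned order (the function name `rmsd` and the toy coordinates are made up for illustration):

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Assumes both structures are already superimposed and that coords_a[i]
    and coords_b[i] form an aligned atom pair.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy example: three atom pairs match exactly, one "tail" atom is 10 A off.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 10.0, 0.0)]
print(rmsd(native, model))  # 5.0
```

The toy example shows the local error problem: a single misplaced atom out of four already raises the RMSD to 5 Å although the rest of the structure is identical.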

As the RMSD is sensitive to local errors, the TM-score was proposed. The TM-score weights close matches more strongly than distant matches and thereby overcomes the local error problem. A TM-score between 0.5 and 1.00 means that the two compared structures have about the same fold and that the predicted model therefore has a correct topology. Scores below 0.5 do not indicate a shared fold, and scores below about 0.17 correspond to random structural similarity.
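
The weighting can be made concrete with the published TM-score formula, TM = (1/L) Σ 1/(1 + (d_i/d0)²) with d0 = 1.24 (L − 15)^(1/3) − 1.8, where L is the target length and d_i the distance of the i-th aligned residue pair. The sketch below evaluates this formula for one fixed superposition only, whereas the real TM-score program searches for the superposition that maximises the value. The function name `tm_score` and the lower bound of 0.5 Å on d0 are assumptions of this example.

```python
def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: distances in Angstrom between aligned residue pairs.
    target_length: number of residues in the target (native) structure.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    # Length-dependent distance scale; clamped so that very short chains
    # do not produce a negative or complex d0 (assumption of this sketch).
    d0 = max(1.24 * max(target_length - 15, 0) ** (1.0 / 3.0) - 1.8, 0.5)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# 400 residues: 390 superimpose perfectly, a 10-residue tail is 20 A off.
distances = [0.0] * 390 + [20.0] * 10
print(round(tm_score(distances, 400), 3))
```

For this toy case the C-alpha RMSD would be about 3.2 Å, yet the TM-score stays close to 1 because the distant tail residues contribute little to the sum.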

Modeller

MODELLER is used for homology or comparative modelling of protein three-dimensional structures. It calculates a model containing all non-hydrogen atoms. There are also many other tasks provided by MODELLER like de novo modelling of loops in protein structures, optimization of various models of protein structure with respect to a flexibly defined objective function, multiple alignment of protein sequences and/or structures, clustering, searching of sequence databases, comparison of protein structures, etc.[1]

Tutorials are provided at [2] and [3].

To run MODELLER with more than one template we used the following templates (the percentage values are the sequence identities to the target taken from the HHsearch table above):

3m49:A (31%)
2r8o:A (33%)
2o1x:A (18%)
1w85:A (34%)
1qs0:A (39%)

Protocol Modeller

Results

Numeric evaluation

template	molpdf	DOPE score	GA341 score
1QS0_A	2650.72876	-40503.06250	1.00000
2O1X_A	2958.02075	-30294.51758	0.41955
3M49_A, 2R8O_A, 2O1X_A, 1W85_A, 1QS0_A	123913.8	-19573.67	0.00147

The DOPE (Discrete Optimized Protein Energy) score is calculated to assess homology models. The lower the DOPE score, the better the model. This can also be seen in our three models. The model built on 2O1X_A, the template with the lowest sequence identity (18%), has a clearly higher DOPE score (-30294.5) than the model built on 1QS0_A (-40503.1), the template with the highest sequence identity (39%). Interestingly, the model built with five templates has the highest, i.e. worst, DOPE score (-19573.7) and is worse than either single-template model. A possible explanation is the influence of the templates that have a low sequence identity to the target.

GA341 is calculated to decide whether the result is a good model or not. A good model has a score near one; a model with a score lower than 0.6 is considered a bad model. This is also reflected in our results. The model with 1QS0_A as template has a GA341 score of 1.0, which marks it as a good model. The model with 2O1X_A as template has a score of 0.42: the sequence identity was low and the DOPE score is also higher, so it falls below the cutoff. The multi-template model has a GA341 score of nearly 0 (0.00147), which shows that it is a really bad model.
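
Applied to the values in the numeric evaluation table above, the two criteria (lower DOPE is better, GA341 below 0.6 is bad) can be sketched as follows. The numbers are copied from the table; the label "multi-template" and the helper `summarize` are names introduced only for this example.

```python
# (model label, DOPE score, GA341 score) copied from the table above.
MODELS = [
    ("1QS0_A", -40503.06250, 1.00000),
    ("2O1X_A", -30294.51758, 0.41955),
    ("multi-template", -19573.67, 0.00147),
]

GA341_CUTOFF = 0.6  # threshold named in the text


def summarize(models, cutoff=GA341_CUTOFF):
    """Sort models by DOPE score (lowest first) and flag them by GA341."""
    ranked = sorted(models, key=lambda m: m[1])
    return [(name, dope, ga341, ga341 >= cutoff) for name, dope, ga341 in ranked]


for name, dope, ga341, ok in summarize(MODELS):
    verdict = "good" if ok else "bad"
    print(f"{name:15s} DOPE={dope:10.2f} GA341={ga341:.5f} {verdict}")
```

Both criteria agree on the ranking: only the 1QS0_A model passes the GA341 cutoff, and it also has the lowest DOPE score.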

Comparison to experimental structure

experimental structure	model with template	RMSD (DaliLite)	RMSD (sap)	TM-score
1U5B_A	1QS0_A	2.3	0.829	0.8504
1U5B_A	2O1X_A	3.5	2.727	0.1592
1U5B_A	3M49_A, 2R8O_A, 2O1X_A, 1W85_A, 1QS0_A	no score	11.398	0.1719

C-alpha RMSD is a measure of the deviation in distance between aligned alpha-carbons. The higher this value, the worse the model. For the multi-template model DaliLite reported no RMSD, presumably because it could not find enough significant similarities as the structures are too dissimilar; sap gives an RMSD of 11.398 and the TM-score is 0.1719. The model built with 1QS0_A has a C-alpha RMSD of 2.3 (DaliLite) and 0.829 (sap) and a TM-score of 0.8504, which are good scores. The model built with 2O1X_A is clearly worse (3.5, 2.727 and a TM-score of 0.1592). It is interesting that again the multi-template model is not better than the best single-template model, although 1QS0_A was among its templates. This suggests that the templates with low sequence identity have too much influence on the final model. The model built with 1QS0_A is the best prediction by MODELLER for our target.

all atom RMSD

position	1qs0	2o1x	multi
161	0.332	6.172	2.607
166	0.668	3.697	3.208
167	0.656	6.759	7.962

To calculate the all atom RMSD we first had to find out where the catalytic centers are located. This can be looked up in UniProt.

Superposition

Figure 1: Superimposed structures of 1U5B and the MODELLER model with template 1QS0

Figure 2: Superimposed structures of 1U5B and the MODELLER model with template 2O1X

Figure 3: Superimposed structures of 1U5B and the MODELLER model with more than one template

SWISS-MODEL

SWISS-MODEL server page

SWISS-MODEL can be used to build protein structure homology models. It is a fully automated protein structure homology-modelling server and is accessible via the ExPASy web server or from the program DeepView (Swiss Pdb-Viewer).
It provides three different modelling modes:

Automated Mode
Alignment Mode
Project Mode

The Automated Mode uses fully automated modelling and can therefore only be used when the template is very similar to the target (see http://swissmodel.expasy.org/?pid=smd03&uid=&token=).
As input for the Automated Mode, only an amino acid sequence (raw or FASTA format) or the UniProt AC of the target is required. Optionally, a template PDB ID can be given. If no template is given, SWISS-MODEL automatically selects suitable templates from a BLAST run based on their E-values. The Alignment Mode has to be used for templates with a low sequence identity. Since we only have hits in the range below 40%, we used this mode.

Protocol Swissmodel

Results

Numeric evaluation

Global Model Quality Estimation

The following analyses show the global quality of the SWISS-MODEL results and compare the models built with the two templates 1qs0 and 2o1x. For these analyses the QMEAN4 global scores and the local scores are used.

QMEAN4 global scores

1qs0

Comparison of the model with non-redundant set of PDB structures; the red x stands for the Z-score of this model

Density plot for QMEAN scores of the reference set where the red line stands for the QMEAN score of this model

Z-score of the individual components of QMEAN

Scoring function term	Raw score	Z-score
C_beta interaction energy	-153.75	0.05
All-atom pairwise energy	-9712.58	-0.49
Solvation energy	-29.76	-0.69
Torsion angle energy	-32.75	-3.30
QMEAN4 score	0.568	-3.28

2o1x

Comparison of the model with non-redundant set of PDB structures; the red x stands for the Z-score of this model

Density plot for QMEAN scores of the reference set where the red line stands for the QMEAN score of this model

Z-score of the individual components of QMEAN

Scoring function term	Raw score	Z-score
C_beta interaction energy	69.70	-4.21
All-atom pairwise energy	328.76	-4.29
Solvation energy	30.38	-6.23
Torsion angle energy	48.91	-7.04
QMEAN4 score	0.181	-9.89

Local scores

1qs0	Coloring of the 1qs0 model by residue error	Estimated per-residue inaccuracies along the sequence for the 1qs0 model
2o1x	Coloring of the 2o1x model by residue error	Estimated per-residue inaccuracies along the sequence for the 2o1x model

Local Model Quality Estimation: Anolea / QMEAN

1qs0	2o1x
Local Model Quality Estimation with Anolea and QMEAN for the 1qs0 model	Local Model Quality Estimation with Anolea and QMEAN for the 2o1x model

Comparison to experimental structure

experimental structure	model with template	RMSD (DaliLite)	RMSD (sap)	TMscore
1U5B_A	1QS0_A	3.4	0.766	0.8771
1U5B_A	2O1X_A	3.3	14.305	0.1686

C-alpha RMSD is a measure of the deviation in distance between aligned alpha-carbons. The higher this value, the worse the model. The DaliLite RMSD values of the two models are similar (3.4 for the 1QS0_A model and 3.3 for the 2O1X_A model), but the sap RMSD (0.766 versus 14.305) and the TM-score (0.8771 versus 0.1686) show clearly that the first model is much better than the second one.

all atom RMSD

position	1QS0_A	2O1X_A
161	0.337	3.258
166	0.585	1.028
167	0.594	1.309

Superposition

Superposition of the Swissmodel model using template 2o1x and target 1U5B

Superposition of the Swissmodel model using template 1qs0 and target 1U5B

iTasser

Numeric evaluation

C-score

1qs0					2o1x
model1	model2	model3	model4	model5	model1	model2	model3	model4	model5
1.174	-0.190	-0.718	0.200	-5	-0.150	-1.276	-1.863	-2.155	-3.208

The C-score is a confidence measure for the quality of the models predicted by I-TASSER. It ranges over [-5,2], and a higher C-score signifies a model with higher confidence.
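
As a small illustration, the five models per template can be ranked by their C-score. The values are copied from the table above; the helper `rank_by_c_score` is a name made up for this sketch, and the range check simply mirrors the interval stated in the text.

```python
# C-scores copied from the table above, models 1 to 5 for each template.
C_SCORES = {
    "1qs0": [1.174, -0.190, -0.718, 0.200, -5.0],
    "2o1x": [-0.150, -1.276, -1.863, -2.155, -3.208],
}

C_SCORE_RANGE = (-5.0, 2.0)  # interval stated in the text


def rank_by_c_score(scores):
    """Return 1-based model numbers sorted from highest to lowest C-score."""
    low, high = C_SCORE_RANGE
    for s in scores:
        if not low <= s <= high:
            raise ValueError(f"C-score {s} outside [{low}, {high}]")
    return sorted(range(1, len(scores) + 1),
                  key=lambda i: scores[i - 1], reverse=True)


for template, scores in C_SCORES.items():
    print(template, rank_by_c_score(scores))
# 1qs0 [1, 4, 2, 3, 5]
# 2o1x [1, 2, 3, 4, 5]
```

For both templates model 1 has the highest confidence according to the C-score.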

Comparison to experimental structure

	1qs0			2o1x
No	TMscore	RMSD (DaliLite)	RMSD (sap)	TMscore	RMSD (DaliLite)	RMSD (sap)
1	0.8539	2.2	0.869	0.5377	3.3	2.671
2	0.8627	1.9	0.834	0.8598	1.6	1.056
3	0.8437	2.1	0.940	0.4688	3.0	2.354
4	0.8523	2.2	0.880	0.4904	4.0	2.840
5	0.8363	2.4	0.984	0.4938	3.3	3.123

To calculate the RMSD within a 6 Å radius of the catalytic center we first had to locate the catalytic center. There are three catalytic centers, at positions 161, 166 and 167. We calculated the RMSD for each of them.
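
A minimal sketch of such a local all-atom RMSD is given below. It assumes that the model is already superimposed on the experimental structure, that both use the same residue numbering, and that the 6 Å neighbourhood is defined around the atoms of the catalytic residue in the experimental structure. How the sphere was defined in the actual evaluation is not stated here, so these choices, the function name `local_rmsd` and the toy coordinates are assumptions.

```python
import math


def _dist(a, b):
    """Euclidean distance between two (x, y, z) points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def local_rmsd(native, model, center_res, radius=6.0):
    """All-atom RMSD over atoms within `radius` of residue `center_res`.

    native, model: dicts mapping (residue_number, atom_name) -> (x, y, z),
    already superimposed and using the same residue numbering.
    The neighbourhood is selected on the native structure.
    """
    center_atoms = [xyz for (res, _), xyz in native.items() if res == center_res]
    if not center_atoms:
        raise ValueError(f"residue {center_res} not found in native structure")
    selected = [
        key for key, xyz in native.items()
        if key in model and any(_dist(xyz, c) <= radius for c in center_atoms)
    ]
    if not selected:
        raise ValueError("no shared atoms inside the selection radius")
    sq_sum = sum(_dist(native[k], model[k]) ** 2 for k in selected)
    return math.sqrt(sq_sum / len(selected))


# Toy data: residue 161 plus one neighbour inside and one atom outside 6 A.
native = {
    (161, "CA"): (0.0, 0.0, 0.0),
    (162, "CA"): (3.8, 0.0, 0.0),
    (200, "CA"): (30.0, 0.0, 0.0),
}
model = {
    (161, "CA"): (0.0, 1.0, 0.0),
    (162, "CA"): (3.8, 1.0, 0.0),
    (200, "CA"): (30.0, 9.0, 0.0),
}
print(local_rmsd(native, model, 161))  # 1.0
```

The badly placed atom of residue 200 lies outside the sphere and does not affect the result, which is the point of restricting the RMSD to the catalytic site.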

all atom RMSD

	1qs0			2o1x
model	161	166	167	161	166	167
1	0.739	0.826	0.786	1.009	1.542	1.807
2	0.700	0.759	0.590	0.592	0.771	0.581
3	1.177	0.786	0.844	2.363	4.685	5.078
4	0.906	0.852	0.989	0.798	1.211	2.984
5	0.739	0.926	0.830	1.609	1.174	3.539

Superposition

1qs0

iTasser model 1 for template 1qs0 superimposed on target 1U5B

iTasser model 2 for template 1qs0 superimposed on target 1U5B

iTasser model 3 for template 1qs0 superimposed on target 1U5B

iTasser model 4 for template 1qs0 superimposed on target 1U5B

2o1x

iTasser model 1 for template 2o1x superimposed on target 1U5B

iTasser model 2 for template 2o1x superimposed on target 1U5B

iTasser model 3 for template 2o1x superimposed on target 1U5B

iTasser model 4 for template 2o1x superimposed on target 1U5B

3DJigsaw

3DJigsaw is a server which builds protein models based on already predicted models for a specific target. It recombines the models and optimizes them.

We started 3DJigsaw for different categories of sequence identity. The first category used models created by MODELLER, SWISS-MODEL and iTasser for the 2bfd template. The second 3DJigsaw run recombined models for a template with low sequence identity (2r8o).

high sequence-identity category: The following models were chosen to build a recombined model with 3DJigsaw:

modeller model for template 2bfd
modeller model for multiple templates
swissmodel model for template 2bfd
iTasser model 1 for template 2bfd
iTasser model 3 for template 2bfd

As the predicted models have quite bad TM-scores (around 0.3), another 3DJigsaw run was started using the five iTasser models for 2bfd as input. The first run was not evaluated further, as the new results are expected to be better. The following models were chosen to build a better recombined model with 3DJigsaw:

iTasser model 1 for template 2bfd
iTasser model 2 for template 2bfd
iTasser model 3 for template 2bfd
iTasser model 4 for template 2bfd
iTasser model 5 for template 2bfd

low sequence-identity category: The following models were chosen to build recombined models with 3DJigsaw (inferred from models created with templates with low sequence identity):

iTasser model 1 for template 2r8o
iTasser model 2 for template 2r8o
iTasser model 3 for template 2r8o
iTasser model 4 for template 2r8o
iTasser model 5 for template 2r8o

These models were chosen because of their high TM-score.

Prediction for the high-sequence identity-category

AA:   SSLDDKPQFPGASAEFIDKLEFIQPNVISGIPIYRVMDRQGQIINPSEDPHLPKEKVLKL
Pred: CCCCCCCCCCCCCHHHHHHHCCCCHHHCCCCCEEEEECCCCCCCCCCCCCCCCHHHHHHH
Conf: 987768999799968982200028344168867999989985867653479988999999

AA:   YKSMTLLNTMDRILYESQRQGRISFYMTNYGEEGTHVGSAAALDNTDLVFGQYREAGVLM
Pred: HHHHHHHHHHHHHHHHHHHCCCEEEEECCCCHHHHHHHHHHHCCCCCEEEEECCCHHHHH
Conf: 999999999999999999879869982125848999999973586879997303389999

AA:   YRDYPLELFMAQCYGNISDLGKGRQMPVHYGCKERHFVTISSPLATQIPQAVGAAYAAKR
Pred: HCCCCHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCEECCCCHHHHHHHHHHHHHHHHHH
Conf: 879998999999707788888999872123446778104620467559999999999996

AA:   ANANRVVICYFGEGAASEGDAHAGFNFAATLECPIIFFCRNNGYAISTPTSEQYRGDGIA
Pred: CCCCCEEEEEECCHHHHCCHHHHHHHHHHHCCCCEEEEEECCCCCCCCCCCCCCCHHHHH
Conf: 599818999986427738849999999986399889999888811246765456806899

AA:   ARGPGYGIMSIRVDGNDVFAVYNATKEARRRAVAENQPFLIEAMTYRIGHHSTSDDSSAY
Pred: HHHHHCCCCEEEECCCCHHHHHHHHHHHHHHHHHCCCCEEEEEEECCCCCCCCCCCCCCC
Conf: 999863995999977189999999999999998539989999953378787899996628

AA:   RSVDEVNYWDKQDHPISRLRHYLLSQGWWDEEQEKAWRKQSRRKVMEAFEQAERKPKPNP
Pred: CCHHHHHHHHHCCCHHHHHHHHHHHCCCCCHHHHHHHHHHHHHHHHHHHHHHHHCCCCCH
Conf: 998999998853997999999999878999899999999999999999999971688898

AA:   NLLFSDVYQEMPAQLRKQQESLARHLQTYGEHYPLDHFDK
Pred: HHHHHHHHHHCCHHHHHHHHHHHHHHHHCCCCCCHHHHCC
Conf: 9999989774898999999999999987775578658709

Prediction for the low-sequence identity-category

AA:   SSLDDKPQFPGASAEFIDKLEFIQPNVISGIPIYRVMDRQGQIINPSEDPHLPKEKVLKL
Pred: CCCCCCCCCCCCCHHHHHHHCCCCHHHCCCCCEEEEECCCCCCCCCCCCCCCCHHHHHHH
Conf: 987768999799968982200028344168867999989985867653479988999999

AA:   YKSMTLLNTMDRILYESQRQGRISFYMTNYGEEGTHVGSAAALDNTDLVFGQYREAGVLM
Pred: HHHHHHHHHHHHHHHHHHHCCCEEEEECCCCHHHHHHHHHHHCCCCCEEEEECCCHHHHH
Conf: 999999999999999999879869982125848999999973586879997303389999

AA:   YRDYPLELFMAQCYGNISDLGKGRQMPVHYGCKERHFVTISSPLATQIPQAVGAAYAAKR
Pred: HCCCCHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCEECCCCHHHHHHHHHHHHHHHHHH
Conf: 879998999999707788888999872123446778104620467559999999999996

AA:   ANANRVVICYFGEGAASEGDAHAGFNFAATLECPIIFFCRNNGYAISTPTSEQYRGDGIA
Pred: CCCCCEEEEEECCHHHHCCHHHHHHHHHHHCCCCEEEEEECCCCCCCCCCCCCCCHHHHH
Conf: 599818999986427738849999999986399889999888811246765456806899

AA:   ARGPGYGIMSIRVDGNDVFAVYNATKEARRRAVAENQPFLIEAMTYRIGHHSTSDDSSAY
Pred: HHHHHCCCCEEEECCCCHHHHHHHHHHHHHHHHHCCCCEEEEEEECCCCCCCCCCCCCCC
Conf: 999863995999977189999999999999998539989999953378787899996628 

AA:   RSVDEVNYWDKQDHPISRLRHYLLSQGWWDEEQEKAWRKQSRRKVMEAFEQAERKPKPNP
Pred: CCHHHHHHHHHCCCHHHHHHHHHHHCCCCCHHHHHHHHHHHHHHHHHHHHHHHHCCCCCH
Conf: 998999998853997999999999878999899999999999999999999971688898

AA:   NLLFSDVYQEMPAQLRKQQESLARHLQTYGEHYPLDHFDK
Pred: HHHHHHHHHHCCHHHHHHHHHHHHHHHHCCCCCCHHHHCC
Conf: 9999989774898999999999999987775578658709

2. Evaluation of models

General

A detailed description of how the created models were evaluated can be found in the Evaluation Protocol. The following section presents only the modelling and evaluation results.

Two useful scores for comparing the structural similarity of two structures are the RMSD and the TM-score. Both are commonly used to measure the accuracy of a model when the native structure is known.

The RMSD (root-mean-square deviation) is the square root of the mean squared distance between all corresponding residue pairs in two superimposed structures. The C-alpha RMSD is computed over the aligned alpha-carbons only. The smaller the RMSD value, the better the predicted structure. A local error (e.g. misorientation of the tail) will result in a high RMSD value even if the global structure is correct.

As the RMSD is sensitive to local errors, the TM-score was proposed. The TM-score weights close matches more strongly than distant matches and thereby overcomes the local error problem. A TM-score between 0.5 and 1.00 means that the two compared structures have about the same fold and that the predicted model therefore has a correct topology. Scores below 0.5 do not indicate a shared fold, and scores below about 0.17 correspond to random structural similarity.