Difference between revisions of "Homology based structure predictions"

From Bioinformatikpedia
Line 1: Line 1:
<sup>by [[User:Greil|Robert Greil]] and [[User:Landerer|Cedric Landerer]]</sup>

== Homologous ==
[[Image:1bii.PNG|thumb|right|Figure 1: 1BII, the template structure]]

Because we found no homologous structures in Task 2, we extended our list by using [http://toolkit.tuebingen.mpg.de/hhpred HHSearch].

HHSearch found only sequences with an identity below 40%; therefore we use the 12 proteins shown below to create a multiple alignment for homology modeling. We choose sequences that cover the whole protein and pay particular attention to the transmembrane region. In the cases where we can use only one template, we use 1BII (Figure 1).

Line 10: Line 15:
 
|'''Description'''
|-
| [http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=1S79 1S79] || 37% || human La protein
|-
| [http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=2WY3 2WY3] || 29% || HCMV UL16-MICB complex
|-
| [http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=3P73 3P73] || 28% || classical MHC class I molecule
|-
| [http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=3JTS 3JTS] || 25% || Mamu A*2
|-
| [http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=1KCG 1KCG] || 22% || NKG2D
|-
| [http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=1BII 1BII] || 22% || H-2DD MHC CLASS I
|-
| [http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=1OW0 1OW0] || 22% || human FcaRI
|-
| [http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=2P24 2P24] || 21% || alphabeta TCR
|-
| [http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=1CD1 1CD1] || 21% || MHC-like fold with a large hydrophobic binding groove
|-
| [http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=1HXM 1HXM] || 18% || Human Vgamma9/Vdelta2 T Cell Receptor
|-
| [http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=1LQV 1LQV] || 14% || Endothelial protein C receptor
|-
| [http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=1JFM 1JFM] || 14% || MURINE NK CELL LIGAND RAE-1 BETA
|-
|}
   
With these sequences, including HFE (Q30201), we built a multiple sequence alignment with T-Coffee (EXPRESSO). This alignment is later used as the raw alignment in the Alignment Mode of SwissModel and in Modeller. Later on, we will try to obtain better models by editing the alignment so that functional regions are kept together.
   
 
DSSP --EEEEEEEEEEB-SS-SSB--EEEEEETTEEEEEEESSS--EEE--STTS-SSTTTTHHHHHHHHHHHHHHHHH
 
Line 114: Line 119:
   
   
Based on the secondary structure of HFE assigned by DSSP from the PDB structure (1a6z), the multiple sequence alignment conserves most of the secondary structure.<br>
<br>
As HHSearch found only weak homologs, we searched CATH for structural homologs. The BLAST search in CATH found sequence homologs with identities ranging from 22% to 49%. The HFE protein is classified as a two-domain protein (Alpha Beta, Mainly Beta)<ref>http://www.cathdb.info/domain/1a6zA01</ref>. We found both domains with a sequence similarity of 100%. We then used BLAST for a spot check by running further searches against CATH, and found the same sequence identity distribution for several proteins. After this BLAST search we are confident that HFE is a protein with highly conserved structural elements but only weak sequence conservation. Therefore we would recommend a new acceptance range of about 20% to 40% sequence similarity for this protein.
   
 
== I-Tasser ==
[[Image:Casp789_web.gif|thumb|right|Figure 2: Performance of I-TASSER at CASP<br> Source: http://zhanglab.ccmb.med.umich.edu/I-TASSER/about.html]]
[[Image:Tasser_workflow.PNG|thumb|right|Figure 3: Workflow of the I-Tasser server<br> Source: http://zhanglab.ccmb.med.umich.edu/I-TASSER/about.html]]

I-Tasser<ref>Yang Zhang. Template-based modeling and free modeling by I-TASSER in CASP7. Proteins</ref> is a web service for protein structure prediction provided and published by Ambrish Roy, Alper Kucukural and Yang Zhang at http://zhanglab.ccmb.med.umich.edu/I-TASSER/. It performed outstandingly in the CASP competition (Figure 2). <br>
<br>
The I-Tasser protocol consists of several steps:<br>
* Thread the sequence onto different structures to create an initial template.
* Break the template apart into fragments that match the structure (leaving out the parts of the structure to which no sequence is assigned).
* Assemble and cluster the structures.
* Use the cluster centroid for structure reassembly.
* Search for the structure with the lowest energy and apply REMO H-bond optimization to obtain the final model.
<br>
A graphical workflow is shown in Figure 3.

In addition, I-Tasser predicts GO terms and binding sites. To do so, it uses the final model to search for global and local matches in the PDB.
<br>

For us, a problem is that I-Tasser only generates complete models, but the PDB structure of our protein is not complete. Therefore we compared the predicted secondary structure with the one from UniProt.

'''Compare secondary structure''' of the model and the structure assigned in UniProt:<br>
 Seq:  MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMEN
 Pred: CCCCHHHHHHHHHHHHHHHHHHHHCCCCCCCEEEEECCCCCCCCCCEEEEEEECCCEEEECCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHH
 UniP: ---------------------------EEEEEEEEEEE----EEE--EEEEEE--EEEEEEEEEE--EEE--------TTTHHHHHHHHHHHHHHHHHHHHHHHHHHT
 
 Seq:  HNHSKESHTLQVILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHV
 Pred: HCCCCCCEEEEEEECCCCCCCCCCCCCCCCCCCCCCEEEECCCHHHCHHHHHHHHHHHHHHHHCCCHHHHHHHHHCCCCHHHHHHHHHCCHHHHHCCCCCCCCCCCCC
 UniP: TT-EEE--EEEEEEEEEE-----EEEEEEEEE--EEEEEEEHHH-EEEEEE---HHHHHHHH---HHHHHHHHHHH-HHHHHHHHHHHHHTTT-------EEEEEEEE
 
 Seq:  TSSVTTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPSPSGTLVIGVISGIAVFVVILFIGI
 Pred: CCCHHHHCHHHHCCCCCCEEEEEEECCCCCCCCCCEEEECCCCCCCCCCCEEEEECCCCCCCCEEEECCCCCCCCCEEEECCCCCCCCCCCCCCCCHHHHHHHCCHHH
 UniP: ----EEEEEEEEEEEEE--EEEEEE------HHH----EEEE-----EEEEEEEEE---HHHHEEEEEE---EEE-EEEE----------------------------
 
 Seq:  LFIILRKRQGSRGAMGHYVLAERE
 Pred: HHHHHHCCCCCCCCCCCCCHCCCC
 UniP: ------------------------
For easier comparison, we replaced the S that I-Tasser uses for sheet with an E, as in the UniProt secondary structure.<br>
<br>
As we can see, the secondary structure predicted by I-Tasser is mostly correct. In some places the elements are slightly shifted, and in others they do not have the correct length. As this model is also based on a self hit, a good result like this one is not surprising.
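
The visual comparison above could be backed by a simple per-residue agreement value. The following is only a minimal sketch: the function names are hypothetical, the three-state mapping (T and any other letter counted as coil) is an assumption, and positions marked '-' in the reference are treated as "no annotation" and skipped rather than counted as coil.

```python
# Hypothetical helpers for comparing two secondary structure strings.
# Assumption: H/G/I count as helix, E/B as strand, everything else as coil.
HELIX = set("HGI")
STRAND = set("EB")


def to_three_state(ss, unassigned="-"):
    """Reduce a secondary structure string to H, E and C; keep unassigned positions."""
    states = []
    for letter in ss.upper():
        if letter == unassigned:
            states.append(unassigned)
        elif letter in HELIX:
            states.append("H")
        elif letter in STRAND:
            states.append("E")
        else:
            states.append("C")
    return "".join(states)


def agreement(pred, ref, unassigned="-"):
    """Fraction of annotated positions where prediction and reference agree."""
    if len(pred) != len(ref):
        raise ValueError("strings must have the same length")
    pred3 = to_three_state(pred, unassigned)
    ref3 = to_three_state(ref, unassigned)
    # Only positions that carry an annotation in both strings are compared.
    compared = [(p, r) for p, r in zip(pred3, ref3)
                if p != unassigned and r != unassigned]
    if not compared:
        return 0.0
    return sum(p == r for p, r in compared) / len(compared)


if __name__ == "__main__":
    # Toy strings, not taken from the HFE comparison.
    print(f"{agreement('CCHHHHEEEC', '--HHHTEEE-'):.2f}")
```

Applied to the Pred and UniP rows, such a value would only cover the residues that have a UniProt annotation, so the signal peptide and the C-terminal region would not enter the calculation.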
   
 
'''Predicted Secondary Structure by I-Tasser'''<br>
Line 150: Line 179:
Prediction: CCCCCCHHHHHHHCCHHHHHHHHHCCCCCCCCCCCCCHCCCC
Conf-Score: 010211112222100246665443013678898651020169
Secondary structure elements are shown as H for alpha helix, S for beta sheet and C for coil.
   
 
'''Predicted Solvent Accessibility by I-Tasser'''
Line 166: Line 195:
Values range from 0 (buried residue) to 9 (highly exposed residue)
   
'''I-Tasser predicted five models''' with C-Scores from -0.557 to -3.298. They are ranked from one to five as shown below. As cutoff for the C-Score we use -1.5, as recommended by the Zhang group<ref>Ambrish Roy, Alper Kucukural, Yang Zhang. I-TASSER: a unified platform for automated protein structure and function prediction. Nature Protocols, vol 5, 725-738 (2010). [http://zhanglab.ccmb.med.umich.edu/papers/2010_3.pdf Zhang et al.]</ref>, which is reported to give a false-positive and a false-negative rate of about 0.1. That means that more than 90% of the quality predictions are correct. Therefore we use only Model1 for the comparison with the other methods. All resulting models are shown below in Figure 4 to Figure 8.
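
The selection by C-Score can be written down in a few lines. This is a sketch only: the dictionary and the function name are hypothetical, while the scores and the cutoff of -1.5 are the values given in the text.

```python
# C-Scores of the five I-Tasser models as reported above.
C_SCORES = {
    "Model1": -0.557,
    "Model2": -2.539,
    "Model3": -2.266,
    "Model4": -2.772,
    "Model5": -3.298,
}

C_SCORE_CUTOFF = -1.5  # cutoff used in the text


def models_above_cutoff(scores, cutoff=C_SCORE_CUTOFF):
    """Return the model names with a C-Score above the cutoff, best first."""
    kept = [name for name, score in scores.items() if score > cutoff]
    return sorted(kept, key=lambda name: scores[name], reverse=True)


if __name__ == "__main__":
    print(models_above_cutoff(C_SCORES))  # prints ['Model1']
```

With these values only Model1 passes the cutoff, which matches the decision to use this model for the comparison.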
   
 
{| class="centered"
|[[Image:Model1_HFE.gif|thumb|Figure 4: Model 1 with a C-Score of -0.557]]
|[[Image:Model2_HFE.gif|thumb|Figure 5: Model 2 with a C-Score of -2.539]]
|[[Image:Model3_HFE.gif|thumb|Figure 6: Model 3 with a C-Score of -2.266]]
|[[Image:Model4_HFE.gif|thumb|Figure 7: Model 4 with a C-Score of -2.772]]
|[[Image:Model5_HFE.gif|thumb|Figure 8: Model 5 with a C-Score of -3.298]]
|}
   
Model1 has a TM-Score of about 0.64 and an RMSD of 7.7 Å. For the prediction, I-Tasser used 1a6zA, 1s7qA, 1i4fA, 1de4A, 2vabA and 2bckA as templates. The templates have an identity of about 40%, except for the self hit 1a6z. A special case is 1de4, the transferrin receptor in complex with the HFE protein (chain A), which is therefore a self hit as well. The sequence in this case is also identical, but we cannot draw any conclusion about the 3D structure of the protein when it is bound to the receptor. Because of the self hit, we ran I-Tasser a second time with the constraint to exclude all templates with a sequence identity > 80%.

'''I-Tasser using templates with a sequence identity below 80%''' to avoid self hits. <br>
To our surprise, the second run produced the same results, again based on the same self hit. At this point we do not know what went wrong, but because the self hit is only one of several templates used to create the model, we decided to keep the best model (Model1) for the comparison with the other methods.
   
 
== SwissModel ==
'''SwissModel'''<ref>Benkert P, Künzli M, Schwede T. QMEAN server for protein model quality estimation.</ref> is a server-based tool provided by the SIB. It combines tools like PSI-PRED and DISOPRED for secondary structure and disordered region prediction.<br>
'''The SwissModel workspace''' is a web-based service dedicated to protein structure homology modeling. It provides a personal working environment in which several projects can be calculated in parallel. The environment also provides tools for template selection, model building and structure quality evaluation. To find suitable templates for a given target protein, a library of experimental protein structures is searched<ref>http://bioinformatics.oxfordjournals.org/cgi/content/short/22/2/195</ref>. <br>
'''The SwissModel repository''' is a database of annotated 3D protein structure models. The database consists of more than 3.4 million structures<ref>http://nar.oxfordjournals.org/content/37/suppl_1/D387.full</ref>.
All models were generated from the UniProt database with the SwissModel pipeline. From the SwissModel repository, the density of the QMEAN score is estimated to give an indication of the quality of the predicted model.

<br>
The model created by SwissModel is based on a self hit, and we had no way to exclude the protein itself from the prediction. We could only set a specific template; therefore we also ran SwissModel in Alignment Mode, which gave us the chance to influence the alignment. As one can see, the density of the QMEAN score is the same for the Automated mode and the Alignment mode. Therefore the target (1a6z) and the template (1bii) are part of the same reference set. We take this as an indicator of a good template choice, because the template, which is also used in the Alignment mode, is in the same set as the target. We also rate this as evidence for the high diversity of the MHC class I family.
   
 
===Automated Mode===
[[Image:16az.jpg|thumb|Figure 9: predicted model]]
   
 
 
Line 201: Line 234:
   
 
{| class="centered"
|[[Image:QMEAN_plots_Batch.1.short.pdb_plot.png|thumb|Figure 10: Estimated absolute model quality]]
|[[Image:QMEAN_plots_Batch.1.short.pdb_plot.png_density_plot.png|thumb|Figure 11: Estimated density of model quality]]
|[[Image:QMEAN_plots_Batch.1.short.pdb_plot.png_slider.png|thumb|Figure 12: Z-Score by category]]
|[[Image:QMEAN_plots_energy_profile_plots_Batch.1.short.pdb_local_energy_profile_QMEANlocal.png|thumb|Figure 13: predicted error for each position.]]
|}

Even though the model is based on a self hit, the Z-Score is about -1, which means that the model lies one standard deviation below the mean of the reference set. The model is therefore not unlikely, but also not the most probable one. Figure 9 shows the predicted structure based on the template. The typical two parallel helices on a beta-sheet are clearly observable. Figure 10 shows the QMEAN4 score distribution over protein size. Figure 11 shows the density plot of the reference set, which contains the scores of structures of similar size. Figure 12 shows the individual scores that are used to calculate the final QMEAN score. Here we can see that the torsion angles cause the most issues, which leads to a lower QMEAN score. Figure 13 shows the predicted error for each residue on an arbitrary scale. We see a higher error at the beginning, but otherwise more or less the same pattern of error values (pattern size of about 100 aa) over the whole protein.
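
To make the meaning of a Z-Score of about -1 concrete, the sketch below computes a Z-Score as the distance of one score from the mean of a reference set, in units of the standard deviation. The function name and the reference values are hypothetical and are not QMEAN data; the actual QMEAN reference set and its normalization are not reproduced here.

```python
from statistics import mean, pstdev


def z_score(value, reference_scores):
    """Number of standard deviations by which value differs from the reference mean."""
    mu = mean(reference_scores)
    sigma = pstdev(reference_scores)  # population standard deviation
    if sigma == 0:
        raise ValueError("reference scores have zero variance")
    return (value - mu) / sigma


if __name__ == "__main__":
    # Hypothetical reference scores with mean 5 and standard deviation 2.
    reference = [2, 4, 4, 4, 5, 5, 7, 9]
    print(z_score(3, reference))  # prints -1.0
```

A value of -1 therefore says that the model scores one standard deviation worse than the average structure in the reference set.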
   
 
===Alignment Mode===
[[Image:Swissmodel_aliMode_model.jpg|thumb|Figure 14: predicted model]]
'''Model information:'''<br>
Modelled residue range: 1 to 272<br>
Based on template: 1bii_A

'''Quality information:'''<br>
QMEAN Z-Score: -2.065

{| class="centered"
|[[Image:QMEAN_plots_Batch.1.short.pdb_plot_aliM.png|thumb|Figure 15: Estimated absolute model quality]]
|[[Image:QMEAN_plots_Batch.1.short.pdb_plot.png_density_plot_aliM.png|thumb|Figure 16: Estimated density of model quality]]
|[[Image:QMEAN_plots_Batch.1.short.pdb_plot.png_slider_aliM.png|thumb|Figure 17: Z-Score by category]]
|[[Image:QMEAN_plots_energy_profile_plots_Batch.1.short.pdb_local_energy_profile_QMEANlocal_aliM.png|thumb|Figure 18: predicted error for each position]]
|}

 TARGET 26 RSH SLHYLFMGAS EQDLGLSLFE
 1biiA 1 gsh slryfvtavs rpgfgeprym
 TARGET sss ssssssssss sss
 1biiA sss ssssssssss sss
 
 TARGET 49 ALGYVDDQLF VFYDHES--R RVEPRTPWVS SRISSQMWLQ LSQSLKGWDH
 1biiA 24 evgyvdntef vrfdsdaenp ryeprarwie -qegpeywer etrrakgneq
 TARGET ssssss sss sssss sss hhh hh hhhhh hhhhhhhhhh
 1biiA ssssss sss sssss sss hh hhhhh hhhhhhhhhh
 
 TARGET 97 MFTVDFWTIM ENH-NHSKES HTLQVILGCE MQEDNST-EG YWKYGYDGQD
 1biiA 73 sfrvdlrtal ryynqsaggs htlqwmagcd vesdgrllrg ywqfaydgcd
 TARGET hhhhhhhhhh hhh ssssssssss sss sss ss sssssss ss
 1biiA hhhhhhhhhh hhh ssssssssss sss sssss sssssss ss
 
 TARGET 145 HLEFCPDTLD WRAAEPRAWP TKLEWERHKI RARQNRAYLE RDCPAQLQQL
 1biiA 123 yialnedlkt wtaadmaaqi trrkweqa-g aaerdrayle gecvewlrry
 TARGET sssss s ss hh hhhhh hhhhhhhhh hhhhhhhhhh
 1biiA sssss s ss hhh hhhhhhh hhhhhhhhh hhhhhhhhhh
 
 TARGET 195 LELGRGVLDQ QVPPLVKVTH HVTS-SVTTL RCRALNYYPQ NITMKWLKDK
 1biiA 172 lkngnatllr tdppkahvth hrrpegdvtl rcwalgfypa ditltwqln-
 TARGET hhh ssssss sss ssss ssssss sssssss
 1biiA hhh ssssss sss ssss ssssss sssss
 
 TARGET 244 QPMDAKEFEP KDVLPNGDGT YQGWITLAVP PGEEQRYTCQ VEHPGLDQPL
 1biiA 221 geeltqemel vetrpagdgt fqkwasvvvp lgkeqkytch veheglpepl
 TARGET ss s sss s sssssssss sssss ss s
 1biiA ss s sss s sssssssss sss ss s
 
 TARGET 294 IVIW
 1biiA 271 tlrw-
 TARGET ss
 1biiA ss

As one can see, the alignment shows a very similar secondary structure, and the 3D structure is also very similar. The RMSD for the model is about 2.9 Å. This is a quite good result, but only the residues that are superimposed are used for the calculation, so the missing beta-sheet is not part of it. In general, the results are comparable to those of the self-hit model. Figure 15 and Figure 16 show the score distribution compared to other models of the same size. In Figure 17 we can see that the torsion angles again cause the main issues, as in the self-hit model. This could mean that the torsion angles in this protein are not straightforward to model. The predicted error shown in Figure 18 has a pattern comparable to that of the self-hit model shown in Figure 13, but the high peak at the beginning is missing.
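
The remark that only superimposed residues enter the RMSD can be illustrated with a short sketch. It assumes that the two coordinate lists are already superimposed and paired residue by residue (for example one C-alpha atom per residue); the function name is hypothetical and no optimal superposition is computed here.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("at least one coordinate pair is required")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Toy coordinates: the second set is shifted by 3 along z.
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    template = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
    print(rmsd(model, template))  # prints 3.0
```

Residues without a partner in the other structure simply do not appear in the lists, which is why a missing beta-sheet does not raise the reported value.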

== MODELLER ==
MODELLER<ref>Eswar N. et al. Comparative protein structure modeling using MODELLER.</ref> is a standalone application for protein structure modeling by satisfaction of spatial restraints. These restraints are derived from different types of information, so the model does not have to be based only on the target-template alignment (although it can be). MODELLER is capable of pairwise and multiple alignment, fold assignment and loop modeling.

We downloaded and installed Modeller locally on our Windows PC and used the examples given at the [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Workflow_homology_modelling_glucocerebrosidase#MODELLER Workflow homology modeling glucocerebrosidase].

Our target is the FASTA sequence of [http://www.uniprot.org/uniprot/Q30201.fasta HFE_HUMAN].
Our standard template for the single template-target alignment is chain A of '''1BII''', because it covers the whole sequence of HFE_HUMAN. For the multiple sequence alignment we used, in addition to 1BII, the protein structures '''1S79''' and '''3P73'''. 1S79 was chosen because of its relatively high sequence identity of about 37%, and 3P73 because it is a classical MHC class I molecule with a function similar to that of the HFE_HUMAN protein.

=== Single template-target ===

==== Scripts ====

'''script_pairwise-alignment-template-target.py'''
 from modeller import *
 
 env = environ()
 aln = alignment(env)
 # Read the template structure and add it to the alignment
 mdl = model(env, file='1BII.pdb', model_segment=('FIRST:A', 'END:A'))
 aln.append_model(mdl, align_codes='1BII', atom_files='1BII.pdb')
 # Add the target sequence
 aln.append(file='hfe_human.pir', align_codes='HFE_HUMAN')
 # Alignment that uses structural information of the template
 aln.align2d()
 aln.check()
 aln.write(file='pairwise-2d.ali', alignment_format='PIR')
 # Alignment without structural information
 aln.align()
 aln.check()
 aln.write(file='pairwise.ali', alignment_format='PIR')

'''script_pairwise-to-model.py'''
 from modeller import *
 from modeller.automodel import *
 
 env = environ()
 # Model 1: based on the alignment without structural information
 a = automodel(env,
               alnfile  = 'pairwise.ali',      #file:pir:alignment
               knowns   = '1BII',              #file:pdb:template
               sequence = 'HFE_HUMAN',         #id:target
               assess_methods=(assess.DOPE, assess.GA341))
 a.starting_model = 1
 a.ending_model = 1
 a.make()
 # Model 2: based on the alignment with structural information
 b = automodel(env,
               alnfile  = 'pairwise-2d.ali',   #file:pir:alignment
               knowns   = '1BII',              #file:pdb:template
               sequence = 'HFE_HUMAN',         #id:target
               assess_methods=(assess.DOPE, assess.GA341))
 b.starting_model = 2
 b.ending_model = 2
 b.make()

==== Alignments ====

We used two different alignments for Modeller, one that does not use structural information from the template:

'''pairwise.ali'''
 >P1;1BII
 structureX:1BII.pdb: 1 :A:+383 :P:MOL_ID 1; MOLECULE MHC CLASS I H-2DD; CHAIN A; FRAGMENT HEAVY CHAIN, EXTRACELLULAR DOMAINS; SYNONYM DD; ENGINEERED YES; MOL_ID 2; MOLECULE BETA-2 MICROGLOBULIN; CHAIN B;
 ENGINEERED YES; MOL_ID 3; MOLECULE DECAMERIC PEPTIDE; CHAIN P; ENGINEERED YES:MOL_ID 1; ORGANISM_SCIENTIFIC MUS MUSCULUS; ORGANISM_COMMON HOUSE MOUSE; ORGANISM_TAXID 10090; CELL_LINE BL21;
 EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID PET-3A; MOL_ID 2; ORGANISM_SCIENTIFIC MUS MUSCULUS; ORGANISM_COMMON
 HOUSE MOUSE; ORGANISM_TAXID 10090; CELL_LINE BL21; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID PET-8C; MOL_ID 3: 2.40: 0.28
 -------------------------GSHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYEPRAR
 WIEQE-GPEYWERETRRAKGNEQSFRVDLRTALRYYNQSAGGSHTLQWMAGCDVESDGRLLRGYWQFAYDGCDYI
 ALNEDLKTWTAADMAAQITRRKWEQAGAAER-DRAYLEGECVEWLRRYLKNGNATLLRTDPPKAHVTHHRRPEGD
 VTLRCWALGFYPADITLTWQLNGEEL-TQEMELVETRPAGDGTFQKWASVVVPLGKEQKYTCHVEHEGLPEPLTL
 RW/IQKTPQIQVYSRHPPENGKPNILNCYVTQFHPPHIEIQMLKNGKKIPKVEMSDMSFSKDWSFYILAHTEFTP
 TETDTYACRVKHDSMAEPKTVYWDRDM/RGPGRAFVTI*
 
 >P1;HFE_HUMAN
 sequence:reference: : : : :::-1.00:-1.00
 MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDH--ESRRVEPRTP
 WVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKE-SHTLQVILGCEMQEDNST-EGYWKYGYDGQDHL
 EFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTS-SV
 TTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIV
 IW-------------EPSPSGTLVI---------------------GVISGIAVFVVILFIGILFIILRKRQGSR
 GAMGHYV-------LAERE-------------------*

and one that uses structural information from the template:

'''pairwise-2d.ali'''
 >P1;1BII
 structureX:1BII.pdb: 1 :A:+383 :P:MOL_ID 1; MOLECULE MHC CLASS I H-2DD; CHAIN A; FRAGMENT HEAVY CHAIN, EXTRACELLULAR DOMAINS; SYNONYM DD; ENGINEERED YES; MOL_ID 2; MOLECULE BETA-2 MICROGLOBULIN; CHAIN B;
 ENGINEERED YES; MOL_ID 3; MOLECULE DECAMERIC PEPTIDE; CHAIN P; ENGINEERED YES:MOL_ID 1; ORGANISM_SCIENTIFIC MUS MUSCULUS; ORGANISM_COMMON HOUSE MOUSE; ORGANISM_TAXID 10090; CELL_LINE BL21;
 EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID PET-3A; MOL_ID 2; ORGANISM_SCIENTIFIC MUS MUSCULUS; ORGANISM_COMMON
 HOUSE MOUSE; ORGANISM_TAXID 10090; CELL_LINE BL21; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID PET-8C; MOL_ID 3: 2.40: 0.28
 ---------------------G----SHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYEPRAR
 WIEQE-GPEYWERETRRAKGNEQSFRVDLRTALRYYNQSAGGSHTLQWMAGCDVESDGRLLRGYWQFAYDGCDYI
 ALNEDLKTWTAADMAAQITRRKWE-QAGAAERDRAYLEGECVEWLRRYLKNGNATLLRTDPPKAHVTHHRRPEGD
 VTLRCWALGFYPADITLTWQLNGEELT-QEMELVETRPAGDGTFQKWASVVVPLGKEQKYTCHVEHEGLPEPLTL
 RW/I---QKTPQIQVYSRHPPENGKPNILNCYVTQFHPPHIEIQMLKNGKKIPKVEMSDMSFSKDWSFYILAHTE
 FTPTETDTYACRVKHDSMAEPKTVYWDRDM/RGPGRAFVTI*
 
 >P1;HFE_HUMAN
 sequence:reference: : : : :::-1.00:-1.00
 MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYD--HESRRVEPRTP
 WVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSK-ESHTLQVILGCEMQEDNS-TEGYWKYGYDGQDHL
 EFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTS-SV
 TTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIV
 IW-EPSPSGTLVIGVIS---------GIAVFVVILF--IGILFIILRK-RQGSRGAMGH---------YVLAERE
 -----------------------------------------*

The models are presented under [[Task_4#Model_comparison|model comparison]]. Surprisingly, the model built with structural information is worse than the model built without it. We think Modeller has difficulty threading the sequence of HFE_HUMAN onto the given structure of 1BII. From this we hypothesize that 1a6z, which has a structure very similar to 1bii, has a different amino acid composition for this type of structure. At the moment, however, we have no way to test this.
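
One way to compare the two pairwise alignments is to count identical residues over the aligned columns. The sketch below is an illustration only: the function name is hypothetical, and the choice to ignore every column that contains a gap is an assumption, since other definitions of sequence identity use a different denominator.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # columns with a gap are not counted
        aligned += 1
        if res_a.upper() == res_b.upper():
            identical += 1
    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned


if __name__ == "__main__":
    # First ten aligned residues of 1BII chain A and HFE_HUMAN.
    template = "GSHSLRYFVT"
    target = "RSHSLHYLFM"
    print(percent_identity(template, target))  # prints 50.0
```

Run on the complete rows of pairwise.ali and pairwise-2d.ali, the same function would show how much the two alignments differ in the number of identical columns.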

=== Alignment: multiple template-target ===

==== Scripts ====

'''script_msa-align-templates.py'''
 from modeller import *
 
 env = environ()
 aln = alignment(env)
 # Read chain A of each template and add it to the alignment
 for (code, chain) in (('1BII', 'A'), ('1S79', 'A'), ('3P73', 'A')):
     mdl = model(env, file=code, model_segment=('FIRST:'+chain, 'LAST:'+chain))
     aln.append_model(mdl, atom_files=code, align_codes=code+chain)
 # Align the templates with each other
 aln.salign()
 aln.check()
 aln.write(file='MSA.ali', alignment_format='PIR')

'''script_msa-align-target-to-msa.py'''
 from modeller import *
 
 env = environ()
 aln = alignment(env)
 # Load the template alignment written by the previous script
 aln.append(file='MSA.ali', align_codes='all')
 aln_block = len(aln)  # number of template sequences in the alignment
 # Add the target sequence and align it together with the templates
 aln.append(file='hfe_human.pir', align_codes='HFE_HUMAN')
 aln.salign()
 aln.check()
 aln.write(file='MSA.ali', alignment_format='PIR')

'''script_msa-to-model.py'''
 from modeller import *
 from modeller.automodel import *
 
 env = environ()
 # Build one model from the multiple template-target alignment
 a = automodel(env,
               alnfile  = 'MSA.ali',                       #file:pir:alignment
               knowns   = ('1BIIA', '1S79A', '3P73A'),     #file:pdb:template
               sequence = 'HFE_HUMAN',                     #id:target
               assess_methods=(assess.DOPE, assess.GA341))
 a.starting_model = 1
 a.ending_model = 1
 a.make()

==== Alignment ====
  +
  +
The MSA used by Modeller is:
  +
  +
>P1;1BIIA
  +
structureX:1BII:1 :A:+274 :A:MOL_ID 1; MOLECULE MHC CLASS I H-2DD; CHAIN A; FRAGMENT HEAVY CHAIN, EXTRACELLULAR DOMAINS; SYNONYM DD; ENGINEERED YES; MOL_ID 2; MOLECULE BETA-2 MICROGLOBULIN; CHAIN B;
  +
ENGINEERED YES; MOL_ID 3; MOLECULE DECAMERIC PEPTIDE; CHAIN P; ENGINEERED YES:MOL_ID 1; ORGANISM_SCIENTIFIC MUS MUSCULUS; ORGANISM_COMMON HOUSE MOUSE; ORGANISM_TAXID 10090; CELL_LINE BL21;
  +
EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID PET-3A; MOL_ID 2; ORGANISM_SCIENTIFIC MUS MUSCULUS; ORGANISM_COMMON HOUSE
  +
MOUSE; ORGANISM_TAXID 10090; CELL_LINE BL21; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID PET-8C; MOL_ID 3: 2.40: 0.28
  +
-------------------------GSHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYEPRAR
  +
WIEQEGPEYWERETRRAKGNEQSFRVDLRTALRYYNQSAGGSHTLQWMAGCDVESDGRLLRGYWQFAYDGCDYIA
  +
LNEDLKTWTAADMAAQITRRKWEQAGAAERDRAYLEGECVEWLRRYLKNGNATLLRTDPPKAHVTHHRRPEGDVT
  +
LRCWALGFYPADITLTWQLNGEELTQ-EMELVETRPAGDGTFQKWASVVVPLGKEQKYTCHVEHEGLPEPLTLRW
  +
---------------------------------------------------*
  +
  +
>P1;1S79A
  +
structureX:1S79:100 :A:+103 :A:MOL_ID 1; MOLECULE LUPUS LA PROTEIN; CHAIN A; FRAGMENT CENTRAL RRM; SYNONYM SJOGREN SYNDROME TYPE B ANTIGEN, SS-B, LA RIBONUCLEOPROTEIN, LA AUTOANTIGEN; ENGINEERED YES:MOL_ID
  +
1; ORGANISM_SCIENTIFIC HOMO SAPIENS; ORGANISM_COMMON HUMAN; ORGANISM_TAXID 9606; GENE SSB; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21(DE3)PLYSS;
  +
EXPRESSION_SYSTEM_VECTOR PET28:-1.00:-1.00
  +
-------------------------GRWILKNDVKNRSVYIKGFPTDATLDDIK---------------------
  +
---------------------------------------------------------------------------
  +
----------------------------------------EWLEDKGQVLNIQMRRT------------------
  +
--------------LHKAFKGSIFVV-FDSIESAKKFVETPGQKYKETDLLILFKDDYFAKKNEERKQNKVE---
  +
---------------------------------------------------*
  +
  +
>P1;3P73A
  +
structureX:3P73:-1 :A:+275 :A:MOL_ID 1; MOLECULE MHC RFP-Y CLASS I ALPHA CHAIN; CHAIN A; FRAGMENT UNP RESIDUES 20-294; ENGINEERED YES; MOL_ID 2; MOLECULE BETA-2-MICROGLOBULIN; CHAIN B; ENGINEERED
  +
YES:MOL_ID 1; ORGANISM_SCIENTIFIC GALLUS GALLUS; ORGANISM_COMMON BANTAM,CHICKENS; ORGANISM_TAXID 9031; GENE YFV; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN
  +
TB1; EXPRESSION_SYSTEM_VECTOR_TYPE PLASMID; EXPRESSION_SYSTEM_PLASMID PMAL-P4X; MOL_ID 2; ORGANISM_SCIENTIFIC GALLUS GALLUS; ORGANISM_COMMON BANTAM,CHICKENS; ORGANISM_TAXID 9031; GENE B2M; EXPRESSION_SYSTEM
  +
ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN TB1; EXPRESSION_SYSTEM_VECTOR_TYPE PLASMID; EXPRESSION_SYSTEM_PLASMID PMAL-P4X: 1.32: 0.16
  +
-----------------------EFGSHSLRYFLTGMTDPGPGMPRFVIVGYVDDKIFGTYNSKSRTA--QPIVE
  +
MLPQEDQEHWDTQTQKAQGGERDFDWNLNRLPERYNKSKG-SHTMQMMFGCDILEDGS-IRGYDQYAFDGRDFLA
  +
FDMDTMTFTAADPVAEITKRRWETEGTYAERWKHELGTVCVQNLRRYLEHGKAALKRRVQPEVRVWGKEADGILT
  +
LSCHAHGFYPRPITISWMKDGMVRDQ-ETRWGGIVPNSDGTYHASAAIDVLPEDGDKYWCRVEHASLPQPGLFSW
  +
EP------------------------------------------------Q*
  +
  +
>P1;HFE_HUMAN
  +
sequence:reference: : : : :::-1.00:-1.00
  +
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVE-PRTPW
  +
VSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKE-SHTLQVILGCEMQEDNS-TEGYWKYGYDGQDHLE
  +
FCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTT
  +
LRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIW
  +
EPSPSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE*
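
Because MODELLER requires every entry of the alignment file to have the same aligned length, a quick consistency check before running the script can save time. The following is a minimal sketch, not part of MODELLER itself; the function names are our own, and it assumes that each description line is a single line directly after the '>P1;' line (the description lines shown above are wrapped for display only).

```python
def parse_pir(text):
    """Parse PIR-format alignment text into {code: aligned_sequence}."""
    entries = {}
    code = None
    header_seen = False
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">P1;"):
            code = line[4:]
            header_seen = False
            chunks = []
        elif code is not None and not header_seen:
            header_seen = True  # description line right after '>P1;code'
        elif code is not None:
            chunks.append(line)
            if line.endswith("*"):  # '*' terminates the sequence
                entries[code] = "".join(chunks).rstrip("*")
                code = None
    return entries


def check_equal_length(entries):
    """Return (all_equal, {code: length}) for the parsed entries."""
    lengths = {code: len(seq) for code, seq in entries.items()}
    return len(set(lengths.values())) == 1, lengths


# Toy alignment (hypothetical codes, not the real MSA.ali)
example = """>P1;tmplA
structureX:tmpl:1:A:+5:A::::
--ACD
EF*
>P1;target
sequence:target::::::::
MKACD
EF*
"""
ok, lengths = check_equal_length(parse_pir(example))
```

If `ok` is false, `lengths` shows which entry deviates.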
  +
  +
==== Model editing ====
  +
We tried to edit the alignment of the single-template MODELLER model, because it is one of our best models. When we inspected the alignment in Jalview 2.6 (Figure 19), we found that it is already very well defined, so changes would only lead to worse results. The average conservation score is about 7 to 8 and the quality score around 5 to 6.
  +
  +
[[Image:Model_modeller_pw.png|thumb|450px|Figure 19: visualization of the single-template model of MODELLER done by Jalview]]
  +
  +
The hydrophobic groups are also very well aligned, so we decided to leave that model as it is, because there is nothing to edit. Only the end of the alignment has many gaps, but shifting them would break the conserved block in the middle of the alignment.
  +
  +
The only difference between the sse-supported model and the single-template model is the set of differently aligned residues (Figure 20). These result from the secondary structure information of the template that is incorporated into the model, so we did not edit them.
  +
  +
[[Image:Model_modeller_pw-2d.png|thumb|450px|Figure 20: visualization of the sse-supported model of MODELLER done by Jalview]]
  +
  +
The msa model is hard to edit because every change affects the alignment of several sequences at once. We tried moving some aligned groups to different positions in the sequence alignment, but could not keep the corresponding positions in the other sequences consistent. After several unsuccessful attempts we decided to leave that alignment as it is, too (Figure 21).
  +
  +
[[Image:Model_modeller_msa.png|thumb|450px|Figure 21: visualization of the multi-template model of MODELLER done by Jalview]]
  +
  +
  +
In summary, we did not successfully edit any alignment: either there was nothing to edit, or editing was too complicated and introduced too many errors.
   
 
==Model comparison==
 
  +
===3D-Jigsaw===
  +
We had several issues running 3D-Jigsaw<ref>Bates, P.A., Kelley, L.A., MacCallum, R.M. and Sternberg, M.J.E. (2001) Enhancement of Protein Modelling by Human Intervention in Applying the Automatic Programs 3D-JIGSAW and 3D-PSSM.</ref>, such as unexplained error messages and rejected input. We finally got it to work with the following settings:
  +
  +
* Server: http://bmm.cancerresearchuk.org/~populus/populus_submit.html
* Mode: upload
* Sequence box: FASTA sequence of [http://www.pdb.org/pdb/files/fasta.txt?structureIdList=1A6Z 1A6Z_A]
* Own models: one PDB file containing all of our models (except the SwissModel self hit), separated by 'TER' records
* Predicted runtime: 18.33 hours
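
Combining the models into one upload file can be scripted. The following is only a sketch of how we understand the "separated by TER" requirement; the function name is our own, and the exact layout the server accepts is an assumption that should be checked against a small test upload.

```python
def concatenate_models(pdb_texts):
    """Join several PDB model texts into one text, closing each model with TER."""
    parts = []
    for text in pdb_texts:
        # Drop blank lines and any existing TER/END/ENDMDL records of the model
        lines = [ln for ln in text.splitlines()
                 if ln.strip() and not ln.startswith(("END", "TER"))]
        parts.extend(lines)
        parts.append("TER")  # separator after every model
    parts.append("END")
    return "\n".join(parts) + "\n"
```

The result can be written to a single file and uploaded in the "own models" field.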
  +
  +
==== Result ====
  +
  +
{| border="1"
  +
!Name
  +
!Data
  +
|-
  +
| Length || <font face="Courier New">_________10________20________30________40________50________60________70________80________90________100_______110_______120_______130_______140_______150_______160_______170_______180_______190_______200_______210_______220_______230_______240_______250_______260_______270___ </font>
  +
|-
  +
| AA || <font face="Courier New">RLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQVILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIW </font>
  +
|-
  +
| Prediction || <font face="Courier New"> CCCC<font color = blue>EEEEEEEEEE</font>CCCCCCCCCC<font color = blue>EEEEEEE</font>CCCC<font color = blue>EEEE</font>CCCCCCCCC<font color= red>HHHHHH</font>CCCCC<font color= red>HHHHHHHHHHHHH</font>CC<font color= red>HHHHHHHHHHHHH</font>CCCCCC<font color = blue>EEEE</font>CCCCC<font color = blue>EE</font>CCCCCCCC<font color = blue>EEE</font>CCCCCC<font color = blue>EEEEE</font>C<font color= red>HHHHHH</font>CCCCC<font color= red>HHH</font>CC<font color= red>HHHH</font>CCC<font color= red>HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH</font>CCCCCC<font color = blue>EEEEEEEEE</font>CC<font color = blue>EEEEEEEE</font>CCCCCCC<font color = blue>EEEEEEE</font>CC<font color = blue>EE</font>CC<font color= red>HHH</font><font color = blue>EEEEEEE</font>CCCCCCCCC<font color = blue>EEEEEE</font>CCCCCCC<font color = blue>EEEEEEE</font>CCCCCC<font color = blue>EEEE</font>C </font>
  +
|-
  +
| Confidence || <font face="Courier New">93303453556763258999987358999894885499848988867403652146670222210276553000136609989986528798168852425544699858524402146712899971352101652001125776224548999999999998999999999987887332589869999951389499999762611761499996677566832279987550889983003699826988533699999504888767859 </font>
  +
|-
  +
| Disorder || <font face="Courier New"><font color = orange>DDDD</font><font color = green>OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO</font><font color = orange>D</font><font color = green>OOOOOOOOOO</font><font color = orange>DDDDDDDDDD</font><font color = green>OOOO</font><font color = orange>DDD</font><font color = green>OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO</font><font color = orange>DD</font><font color = green>OOO</font><font color = orange>DDDDD</font><font color = green>OOOOOOOOOOOOOOOOOOOOOOOOOOOO</font><font color = orange>DD</font><font color = green>OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO</font><font color = orange>DDD</font><font color = green>OOOOOOOOOOOO</font><font color = orange>D</font><font color = green>OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO</font><font color = orange>D</font><font color = green>OOO</font><font color = orange>DD</font> </font>
  +
|-
  +
|}
  +
  +
3D-Jigsaw also gave us information about the predicted secondary structure and the ordered and disordered regions. It used that information to optimize all of our submitted models. All optimized models have an energy of about -340 and a coverage of 0.99, which is very good. Their data and pictures are shown in the table below.
  +
  +
===Model evaluation===
  +
  +
After trying several tools (RasWin, JMol, SwissPDB-Viewer), we decided to use PyMol for superimposing and displaying the model-target alignment of the proteins. We removed chains C and D from the original HFE_HUMAN structure (PDB ID: 1a6z), so that only chains A and B are displayed. The original HFE_HUMAN is always shown in green and the model in red (see table below).
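
A file such as '1A6Z_AB.pdb' can also be produced outside PyMol. The sketch below keeps only the coordinate records of the selected chains; it relies on the fixed-column PDB format, in which the chain identifier is the 22nd character of ATOM, HETATM and TER records. The function name is our own.

```python
def keep_chains(pdb_text, chains=("A", "B")):
    """Keep coordinate records of the given chains; pass other records through."""
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM", "TER")):
            # Chain identifier is column 22 of the record (index 21)
            if len(line) > 21 and line[21] in chains:
                kept.append(line)
        else:
            kept.append(line)
    return "\n".join(kept) + "\n"
```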
  +
  +
We created our images using PyMol as follows:
* load '1A6Z_AB.pdb' into PyMol (alternatively: command 'fetch 1A6Z' and then hide chains C and D)
* hide everything
* show cartoon
* color the structure green
* load 'model.pdb' into PyMol
* hide everything
* show cartoon
* color the model red
* align 'model.pdb' to '1A6Z_AB.pdb'
* command 'ray' (nicer output!)
* save the image
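
The same procedure can be written as a PyMol command script, so that every model is rendered in the same way. The helper below is a sketch that only generates the script text; the object names 'target' and 'model' and the output file name are our own choices, and the colors follow the convention of the figure captions (1A6Z green, model red).

```python
def pymol_script(target_pdb, model_pdb, image_png):
    """Return PyMol commands that superimpose a model on the target and render it."""
    return "\n".join([
        f"load {target_pdb}, target",
        f"load {model_pdb}, model",
        "hide everything",
        "show cartoon",
        "color green, target",
        "color red, model",
        "align model, target",   # superimpose the model onto the target
        "ray",
        f"png {image_png}",
    ]) + "\n"


script = pymol_script("1A6Z_AB.pdb", "model.pdb", "superimposed.png")
```

Saved as a '.pml' file, the script can be opened directly in PyMol.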
  +
  +
  +
To evaluate our models by RMSD and TM-Score we used TMalign. We were advised to use SAP for the RMSD and TMScore for the TM-Score, but TMScore failed: our target is the UniProt sequence of HFE_HUMAN and is therefore longer than the '1BII' template. TMScore needs PDB files of the same length, so its superposition does not really work in this case.
  +
  +
TMalign can handle PDB files of different lengths and normalizes the scores by the second structure. We use '1A6Z' as the second structure to obtain comparable scores for all our models. Modeling HFE_HUMAN is very difficult because it is a multi-domain protein, and none of the methods supports multi-domain modeling.
  +
  +
[http://zhanglab.ccmb.med.umich.edu/TM-align/ TMalign] can be found at the website of the Zhang-Lab.
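
To make clear why the normalization length matters, the sketch below evaluates the standard TM-score formula for a fixed superposition: each aligned residue pair at distance d contributes 1/(1+(d/d0)^2), the sum is divided by the normalization length L, and d0 = 1.24*(L-15)^(1/3) - 1.8. It is only an illustration: TMalign additionally searches for the alignment and superposition that maximize this value, which is not done here. The function names are our own, and L is assumed to be well above 15 residues.

```python
import math


def d0(l_norm):
    """Length-dependent distance scale of the TM-score (l_norm > 15 assumed)."""
    return 1.24 * (l_norm - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, l_norm):
    """TM-score of one fixed superposition, normalized by l_norm residues."""
    scale = d0(l_norm)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / l_norm


def rmsd(distances):
    """Root mean square deviation over the aligned residue pairs only."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))
```

Residues of the reference structure that have no aligned partner add nothing to the sum but still count in L, so a model that covers only part of the reference gets a lower TM-score, whereas the RMSD is computed over the aligned pairs only.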
  +
  +
  +
{| border="1" style="text-align:center; border-spacing:0;"
  +
|'''Picture'''
  +
|'''Model'''
  +
|'''RMSD'''
  +
|'''TM-Score'''
  +
|'''Optimized picture'''
  +
|'''Optimized RMSD'''
  +
|'''Optimized TM-Score'''
  +
|'''3D-JigSaw energy calculation'''
  +
|-
  +
| [[Image:Modeller_1a6z(green,AB)_model(red).png|thumb| MODELLER: superimposed, green:1a6z, red:model(1BII)]]|| MODELLER: superimpose, template:1BII || 2.58 || 0.86468 || [[Image:3dj_m1_pw_1a6z(green,AB)_model(red).png|thumb| Optimized MODELLER pw-model by 3D-Jigsaw]]|| 1.70 || 0.95082 || -341.87
  +
|-
  +
| [[Image:Modeller_sse-support_1a6z(green,AB)_model(red).png|thumb| MODELLER: sse-support, superimposed, green:1a6z, red:model(1BII)]] || MODELLER: superimpose, sse-support, template:1BII || 3.42 || 0.59586 || [[Image:3dj_m2_pw2d_1a6z(green,AB)_model(red).png|thumb| Optimized MODELLER sse-model by 3D-Jigsaw]] || 0.98 || 0.96990 || -341.41
  +
|-
  +
| [[Image:Modeller_msa_1a6z(green,AB)_model(red,1BII,1S79,3P73).png|thumb| MODELLER: msa, superimposed, green:1a6z, red:model(1BII,1S79,3P73)]] || MODELLER: superimpose, msa, template:1BII,1S79,3P73 || 2.05 || 0.89042 || [[Image:3dj_m3_msa_1a6z(green,AB)_model(red).png|thumb| Optimized MODELLER msa-model by 3D-Jigsaw]] || 1.70 || 0.95087 || -341.23
  +
|-
  +
| [[Image:I_tasser_1a6z_superimpose.PNG|thumb| I_Tasser: superimposed, green:1a6z, red:model]] || I-Tasser || 1.61 || 0.93760 || [[Image:3dj_m4_itas_1a6z(green,AB)_model(red).png|thumb| Optimized I_Tasser model by 3D-Jigsaw]] || 2.48 || 0.87855 || -339.33
  +
|-
  +
| [[Image:Swissmodel_1a6z_superimposed.PNG|thumb| SwissModel: superimposed, green:1a6z, red:model]] || SwissModel || 2.67 || 0.85048 || [[Image:3dj_m5_sm_1a6z(green,AB)_model(red).png|thumb| Optimized SwissModel model by 3D-Jigsaw]] || 2.48 || 0.87851 || -339.17
  +
|-
  +
| [[Image:I_tasser_swiss_self_superimpose.PNG|thumb| SwissModel: self-hit, superimposed, green:1a6z, red:model]] || SwissModel self || 0.08 || 0.99984
  +
|-
  +
|}
  +
  +
As one can clearly see, the I-Tasser model is the best with a TM-Score of ~0.94, followed by the MSA model of MODELLER with a TM-Score of ~0.89, the single-template MODELLER model with ~0.86 and the SwissModel model with ~0.85.
  +
  +
The worst model is the sse-supported MODELLER model, which incorporates secondary structure information of the template, with a TM-Score of ~0.6. We are sure that the low sequence identity and secondary structure similarity of only 22% hurt this model, because the standard single-template model is based on the same template and achieves a significantly higher TM-Score.
  +
  +
All of our models are really good, except for the sse-supported model of MODELLER.
  +
  +
After optimization by 3D-Jigsaw all MODELLER models are much better, because 3D-Jigsaw cut off the clearly wrongly modeled strands of superfluous amino acids. Surprisingly, the previously worst sse-supported model is now the best of all MODELLER models and even better than the previous best model, the one from I-Tasser. It is not surprising that 3D-Jigsaw was also able to optimize the SwissModel model. For the I-Tasser model, however, the optimization failed: its RMSD and TM-Score got worse, presumably because 3D-Jigsaw incorporated information from the less accurate models.
  +
  +
Our models are still all very good, but the best one is now the sse-supported model of MODELLER with an RMSD below 1 and a TM-Score of almost 1; it is now an almost perfect model. The second and third best models are the standard pairwise model and the msa model of MODELLER, which are now very similar in RMSD and TM-Score. The I-Tasser and SwissModel models are now very similar to each other, too.
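
The rankings discussed above can be reproduced directly from the TM-Scores in the table. The short sketch below sorts the models before and after the 3D-Jigsaw optimization; the dictionary layout and the function name are our own, while the numbers are copied from the table.

```python
# TM-Scores from the table above: before and after 3D-Jigsaw optimization
tm_scores = {
    "MODELLER single-template": {"before": 0.86468, "after": 0.95082},
    "MODELLER sse-supported":   {"before": 0.59586, "after": 0.96990},
    "MODELLER msa":             {"before": 0.89042, "after": 0.95087},
    "I-Tasser":                 {"before": 0.93760, "after": 0.87855},
    "SwissModel":               {"before": 0.85048, "after": 0.87851},
}


def rank_by_tm(scores, column):
    """Return the model names sorted from highest to lowest TM-Score."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1][column], reverse=True)
    return [name for name, _ in ordered]


ranking_before = rank_by_tm(tm_scores, "before")
ranking_after = rank_by_tm(tm_scores, "after")
```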
  +
  +
==Discussion==
  +
  +
The I-Tasser protocol does not allow choosing a specific template, so we ran I-Tasser twice: first with standard parameters and then with a similarity threshold of 80%. In the second case we again got a model based on a self hit, so we repeated the prediction a third time, with the same result. We were not able to find out why the given threshold was ignored.
  +
  +
Our attempts to find homologs in all given categories (>60%, >40%, >20%) were not successful, because HHSearch did not list matching ones. A BLAST search against the NR database also failed to provide acceptable results and returned only proteins with 40% or less sequence identity. We therefore conclude that the HFE family must have a very high sequence diversity combined with a high structural conservation. This hypothesis was supported when we aligned the structural homologs listed in [http://www.cathdb.info/ CATH].
  +
  +
We had chosen the templates from the HHSearch results so that they cover the whole protein sequence, with special coverage of the transmembrane region. As we saw later, however, the tools do not support multiple sequence alignments. Therefore we decided to use '1BII' as template for SwissModel and Modeller, because it covers the sequence completely; with a sequence identity of 22% it is in the lower midrange of the HHSearch results. '1S79' has a higher sequence identity of 37% but is very poorly conserved with respect to HFE_HUMAN. We decided to rank coverage of the whole sequence higher than sequence identity.
  +
  +
After this task, we would suggest using SwissModel first to get a quick overview and a first idea of the protein structure. We would also recommend I-Tasser because of its good usability. We would recommend the Modeller approach only to experts who are really interested in a specific alignment, as its usability is poor for laypeople.
  +
  +
==== Extra diligence task ====
  +
We were not able to perform the task of calculating the RMSD of all atoms within a 6 Ångström threshold of the catalytic core, because no catalytic core is defined at [http://www.uniprot.org/uniprot/Q30201#section_features UniProt:Q30201(HFE_HUMAN)] or at [http://www.pdb.org/pdb/explore/explore.do?structureId=1A6Z PDB:1A6Z].
   
 
== References ==
 
 
<references />
 
  +
  +
[[Category : Hemochromatosis]]

Latest revision as of 14:59, 30 August 2011

by Robert Greil and Cedric Landerer

Homologous

Figure 1: 1BII, the template structure

Because we found no homologous structures in Task 2, we extended our list by using HHSearch.

HHSearch found only sequences with an identity below 40%; therefore we use the 12 proteins shown below to create a multiple alignment for homology modeling. We chose sequences that cover the whole protein and paid specific attention to the transmembrane region. In cases where we can use only one template, we use 1BII (Figure 1).


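{| border="1" style="text-align:center; border-spacing:0;"
|'''PDB-ID'''
|'''Identity'''
|'''Description'''
|-
| 1S79 || 37% || human La protein
|-
| 2WY3 || 29% || HCMV UL16-MICB complex
|-
| 3P73 || 28% || classical MHC class I molecule
|-
| 3JTS || 25% || Mamu A*2
|-
| 1KCG || 22% || NKG2D
|-
| 1BII || 22% || H-2DD MHC CLASS I
|-
| 1OW0 || 22% || human FcaRI
|-
| 2P24 || 21% || alphabeta TCR
|-
| 1CD1 || 21% || MHC-like fold with a large hydrophobic binding groove
|-
| 1HXM || 18% || Human Vgamma9/Vdelta2 T Cell Receptor
|-
| 1LQV || 14% || Endothelial protein C receptor
|-
| 1JFM || 14% || MURINE NK CELL LIGAND RAE-1 BETA
|}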

With these sequences, including the HFE gene product (Q30201), we built a multiple sequence alignment with T-Coffee (Expresso). This multiple sequence alignment is later used as a raw alignment in the Alignment Mode of SwissModel and in Modeller. Later on, we will try to obtain better models by editing the alignment so that functional regions are kept together.

  DSSP                                   --EEEEEEEEEEB-SS-SSB--EEEEEETTEEEEEEESSS--EEE--STTS-SSTTTTHHHHHHHHHHHHHHHHH
Q30201          MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHES--RRVE-PRTPWVSSRISSQMWLQLSQSLKGWDHM
1S79_A          --------------------------------------------------------------------GRW-IL-KNDVKNRSVYIKGFPTDATLDDIKE
3P73_A          -----------------------EFGSHSLRYFLTGMTDPGPGMPRFVIVGYVDDKIFGTYNSKS--RTAQ-PIVEML-PQEDQEHWDTQTQKAQGGERD
1KCG_C          -------------------------DAHSLWYNFTIIHLPRHGQQWCEVQSQVDQKNFLSYDCGS--DKVLSMGHL-EEQLYATDAWGKQLEMLREVGQR
1JFM_A          -------------------------DAHSLRCNLTIKDPTPADPLWYEAKCFVGEILILHLSNIN--KTMT-SG-DPGETANATEVKKCLTQPLKNLCQK
1BII_A          -MGAMAPRTLLLLLAAALGPTQTRAGSHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYE-PRARWIE-QEGPEYWERETRRAKGNEQS
2P24_A          ----------------------------------------------------------------------------M----AIMAPRTLVLLLSGALALT
1CD1_A          -----------------------QQKNYTFRCLQMSSFANR-SWSRTDSVVWLGDLQTHRWSNDS--ATIS-FTKPWSQGKLSNQQWEKLQHMFQVYRVS
2WY3_A          ------------------------MEPHSLRYNLMVLSQDESVQSGFLAEGHLDGQPFLRYDRQK--RRAK-PQGQWAEDVLGAETWDTETEDLTENGQD
1LQV_A          -------------------SQDASDGLQRLHMLQISYFR-DPYHVWYQGNASLGGHLTHVLEGPDTNTTII-QLQPL----QEPESWARTQSGLQSYLLQ
3JTS_A          -------------------------GSHSMRYFYTSMSRPGRWEPRFIAVGYVDDTQFVRFDSDAASQRME-PRAPWVE-QEGPEYWDRETRNMKAETQN
1OW0_A          ----------------------------------------------------------------------------------------------------
1HXM_A          -------------------------------------------------------------------------------------AIELVPEHQTVPVSI
                                                                 
  DSSP          HHHHHHHHTTT-SSS--E--------EEEEEE-EEE-TTS-E-EEE-E------------EEEETTEE----------------EEEEEGGGTEEEES--
Q30201          FTVDFWTIMENHN-HSKE--------SHTLQV-ILGCEMQED-NST-E------------GYWKYGYD----------------GQDHLEFCPDTLDW--
1S79_A          WLEDKGQV-LNIQMRRTL--------HKAFKG-SIFVVFDSI-ESA-KKFVETPGQKYKETDLLILFKDDYFAKKNEERKQNKVE---------------
3P73_A          FDWNLNRLPERYN-KSKG--------SHTMQM-MFGCDILED-GSI-R------------GYDQYAFD----------------GRDFLAFDMDTMTF--
1KCG_C          LRLELADT---------ELEDFTPSGPLTLQV-RMSCECEAD-GYI-R------------GSWQFSFD----------------GRKFLLFDSNNRKW--
1JFM_A          LRNKVSNT-KVDTHKTNG--------YPHLQV-TMIYPQSQG-RTP-S------------ATWEFNIS----------------DSYFFTFYTENMSW--
1BII_A          FRVDLRTALRYYNQSAGG--------SHTLQW-MAGCDVESD-GRLLR------------GYWQFAYD----------------GCDYIALNEDLKTW--
2P24_A          QTWAGSHSRGEDD--IEA--------DHVGSYGIVVYQSP----GD-I------------GQYTFEFD----------------GDELFYVDLDKKET--
1CD1_A          FTRDIQELVKMMSPKEDY--------PIEIQL-SAGCEMYPG-NAS-E------------SFLHVAFQ----------------GKYVVRFWG--TSWQT
2WY3_A          LRRTLTHI----KDQKGG--------LHSLQE-IRVCEIHED-SST-R------------GSRHFYYN----------------GELFLSQNLETQES--
1LQV_A          FHGLVRLVHQERT--LAF--------PLTIRC-FLGCELPPEGSRA-H------------VFFEVAVN----------------GSSFVSFRPERALW--
3JTS_A          APVNLRNLRGYYNQSEAG--------SHTIQR-MYGCDLGPD-GRLLR------------GYHQSAYD----------------GKDYIALNEDLRSW--
1OW0_A          -----ACHPRLSLHRPAL--------EDLLLG-SEANLTCTL-TGLRD------------ASGVTFTW----------------TPSSGKSAV--QGPPE
1HXM_A          GVPATLRCSMKGEAIGNY--------YINWYR-KTQGNTMTF-IYRE-------------KDIYGPGF----------------KDNFQGDIDIAKNL--
                                                                 
  DSSP          SGG-G----HHH-HHHHHSSTHHH--HHHHHHHHTHHHHHHHHHHHHHTTTSS--B--EEEEEEEE-SS-----E-EEEEEEEEEBSS--EEEEEETTEE
Q30201          RAA-E----PRA-WPTKLEWERHK--IRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVT----S-SVTTLRCRALNYYPQNITMKWLKD
1S79_A          ----------------------------------------------------------------------------------------------------
3P73_A          TAA-D----PVA-EITKRRWETEG--TYAERWKHELGTVCVQNLRRYLEHGKAALKRRVQPEVRVWGKEA----D-GILTLSCHAHGFYPRPITISWMKD
1KCG_C          TVV-H----AGA-RRMKEKWEKDS--GLTTFFKMVSMRDCKSWLRDFLMHRKKRLE--------------------------------------------
1JFM_A          RSA-N----DES-GVIMNKWKDDG--EFVKQLKFLI-HECSQKMDEFLKQSKEK----------------------------------------------
1BII_A          TAA-D----MAA-QITRRKWEQA---GAAERDRAYLEGECVEWLRRYLKNGNATLLRTDPPKAHVTHHRR----PEGDVTLRCWALGFYPADITLTWQLN
2P24_A          IWM-------------LPEFAQLR--SFDPQGGLQNIATGKHNLGVLTKRSNSTPATNEAPQATVFPKSP--VLLGQPNTLICFVDNIFPPVINITWLRN
1CD1_A          VPGAP----SWL-DLPIKVLNADQ--GTSATVQMLLNDTCPLFVRGLLEAGKSDLEKQEKPVAWLSSVP---SSAHGHRQLVCHVSGFYPKPVWVMWMRG
2WY3_A          TVP-QSSRAQTLAMNVTNFW-KEDAMKTKTHYRAMQ-ADCLQKLQRYLKSGVAIRRTVPPMVNVTCSEVS----EGNITVTCRASSFYPRNITLTWRQDG
1LQV_A          QAD-TQVTSGVV-TFTLQQLNAYN--RTRYELREFLEDTCVQYVQKHISAENTKGSQTSRSYTS------------------------------------
3JTS_A          TAA-D----MAA-QNTQRKWEAA---GEAEQHRTYLEGECLEWLRRYLENGKETLQRADPPKTHVTHHPV----SDQEATLRCWALGFYPAEITLTWQRD
1OW0_A          R--DL----CGC-YSVSSVLPGCA--EPWNHGKTFTCTAAYPESKTPLTATLSKSGNTFRPEVHLLPPPSEELALNELVTLTCLARGFSPKDVLVRWLQG
1HXM_A          AVL-K----ILA-PSERDEGSYYC--ACDTLGMGGEYTDKLIFGKGTRVTVEPRSQPHTKPSVFVMKNG---------TNVACLVKEFYPKDIRINLVSS
                                                                  
  DSSP          --GGGS---EEEE-TTS-E----EEEEEEEE-TTGGGGEE---EEEE-TTSSS-EEE-E-
Q30201          K-QPMDAKEFEPKDVLPNG----DGTYQGWITLAVPPGEE---QRYTCQVEHPGLDQ-PLIVIWEPSPSGTLVIGVISGIAVFVVILFIGILFIILRKRQ
1S79_A          ----------------------------------------------------------------------------------------------------
3P73_A          --GMVRDQETRWGGIVPNS----DGTYHASAAIDVLPEDG---DKYWCRVEHASLPQ-PGLFSWEPQ---------------------------------
1KCG_C          ----------------------------------------------------------------------------------------------------
1JFM_A          ----------------------------------------------------------------------------------------------------
1BII_A          --GEELTQEMELVETRPAG----DGTFQKWASVVVPLGKE---QKYTCHVEHEGLPE-PLTLRWGKEEPPSSTKTNTVIIAVPVVLGAVVILGAVMAFVM
2P24_A          --SKSVADGVYETSFFVNR----DYSFHKLSYLTFIPSDD---DIYDCKVEHWGLEE-PVLKHWEPEIPAPMSELTETSGSRLEVLFQ------------
1CD1_A          --DQ-EQQGTHRGDFLPNA----DETWYLQATLDVEAGEE---AGLACRVKHSSLGG-QDIILYWDARQAPVGLIVFIVLIMLVVVGAVVYYIWRRRSAY
2WY3_A          --VSLSHNTQQWGDVLPDG----NGTYQTWVATRIRQGEE---QRFTCYMEHSGNHG-THPVPSGKVLVLQSQRTDFPYVSAAMPCFVIIIILCVPCCKK
1LQV_A          ----------------------------------------------------------------------------------------------------
3JTS_A          --GEDQTQDTELVETRPAG----DGTFQKWAAVVVPSGKE---QRYTCHVQHEGLRE-PLTLRWEP----------------------------------
1OW0_A          SQEL-PREKYLTW-ASRQEPSQGTTTFAVTSILRVAAEDWKKGDTFSCMVGHEALPLAFTQKTIDRLAGK------------------------------
1HXM_A          -----KKITEFDPAIVISP----SGKYNAVKLGKYE--DS---NSVTCSVQHDNK---TVHSTDFEVKTDSTDHVKPKETENTKQPSKS-----------
                                                                  
  DSSP
Q30201          GSRGAMGHYVLAERE----------------
1S79_A          -------------------------------
3P73_A          -------------------------------
1KCG_C          -------------------------------
1JFM_A          -------------------------------
1BII_A          KRRRNTGGKGGDYALAPGSQSSDMSLPDCKV
2P24_A          -------------------------------
1CD1_A          QDIR---------------------------
2WY3_A          KTSAAEGP-----------------------
1LQV_A          -------------------------------
3JTS_A          -------------------------------
1OW0_A          -------------------------------
1HXM_A          -------------------------------


The multiple sequence alignment conserves most parts of the secondary structure that DSSP assigns to the HFE protein from its PDB structure (1a6z).
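
The pairwise sequence identity between two rows of such an alignment can be computed column by column. The sketch below counts identical residues over all columns in which neither sequence has a gap; this is one common definition, and it is an assumption that it does not necessarily match the way HHSearch computes the identities listed in the table. The function name is our own.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over the columns where both aligned sequences have a residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Aligned fragment of Q30201 and 1BII_A taken from the first alignment block
fragment_identity = percent_identity("RSHSLHYLFMG", "GSHSLRYFVTA")
```

Applied to full rows of the alignment, the function gives one identity value per template.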

As HHSearch found only weak homologs, we searched CATH for structural homologs. The BLAST search in CATH found sequence homologs in a range from 49% to 22%. The HFE protein is classified as a two domain protein (Alpha Beta, Mainly Beta)<ref>http://www.cathdb.info/domain/1a6zA01</ref>. We found both domains with a sequence similarity of 100%. We then used BLAST to spot-check the results with further searches against CATH and found the same sequence identity distribution for several proteins. With this BLAST search, we are now sure that HFE is a protein with high conservation of structural elements but very weak sequence conservation. Therefore we would recommend a new acceptance range of about 20% to 40% sequence similarity for this protein.

I-Tasser

Figure 2: performance of I-TASSER at CASP
Source: http://zhanglab.ccmb.med.umich.edu/I-TASSER/about.html
Figure 3: Workflow of the I-Tasser server
Source: http://zhanglab.ccmb.med.umich.edu/I-TASSER/about.html

I-Tasser<ref>Yang Zhang. Template-based modeling and free modeling by I-TASSER in CASP7. Proteins</ref> is a web service for protein structure prediction provided and published by Ambrish Roy, Alper Kucukural and Yang Zhang at http://zhanglab.ccmb.med.umich.edu/I-TASSER/. It has achieved outstanding results in the CASP competition (Figure 2).

The I-Tasser protocol consists of the following steps:

  • Thread the sequence onto different structures to create an initial template.
  • Break the template apart into fragments that match the structure (leaving out the parts of the structure to which no sequence is assigned).
  • Assemble and cluster the structures.
  • Use the cluster centroids for structure reassembly.
  • Search for the structure with the lowest energy and apply REMO H-bond optimization to obtain the final model.


A graphical workflow is shown in Figure 3.

Furthermore, I-Tasser predicts GO terms and binding sites. To do so, it uses the final model to search for global and local matches in the PDB.

A problem for us is that I-Tasser only generates complete models, whereas the PDB structure of our protein is not complete. Therefore we compared the predicted secondary structure with the one from UniProt.

Comparison of the secondary structure of the model with the structure assigned in UniProt:

Seq:  MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMEN
Pred: CCCCHHHHHHHHHHHHHHHHHHHHCCCCCCCEEEEECCCCCCCCCCEEEEEEECCCEEEECCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHH
UniP: ---------------------------EEEEEEEEEEE----EEE--EEEEEE--EEEEEEEEEE--EEE--------TTTHHHHHHHHHHHHHHHHHHHHHHHHHHT

Seq:  HNHSKESHTLQVILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHV
Pred: HCCCCCCEEEEEEECCCCCCCCCCCCCCCCCCCCCCEEEECCCHHHCHHHHHHHHHHHHHHHHCCCHHHHHHHHHCCCCHHHHHHHHHCCHHHHHCCCCCCCCCCCCC
UniP: TT-EEE--EEEEEEEEEE-----EEEEEEEEE--EEEEEEEHHH-EEEEEE---HHHHHHHH---HHHHHHHHHHH-HHHHHHHHHHHHHTTT-------EEEEEEEE

Seq:  TSSVTTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPSPSGTLVIGVISGIAVFVVILFIGI
Pred: CCCHHHHCHHHHCCCCCCEEEEEEECCCCCCCCCCEEEECCCCCCCCCCCEEEEECCCCCCCCEEEECCCCCCCCCEEEECCCCCCCCCCCCCCCCHHHHHHHCCHHH
UniP: ----EEEEEEEEEEEEE--EEEEEE------HHH----EEEE-----EEEEEEEEE---HHHHEEEEEE---EEE-EEEE----------------------------

Seq:  LFIILRKRQGSRGAMGHYVLAERE
Pred: HHHHHHCCCCCCCCCCCCCHCCCC
UniP: ------------------------

For a better overview we replaced the S that I-Tasser uses for sheet with an E, as in the UniProt secondary structure.

As we can see, the secondary structure predicted by I-Tasser is mostly correct. Sometimes a structure element is slightly shifted, and sometimes it does not have the correct length. As this model is also based on a self hit, such a good result is no surprise.
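
The visual comparison can be backed by a simple per-residue agreement value. The sketch below first maps the I-Tasser sheet symbol S to E and then counts matching positions. It assumes that '-' in the UniProt line marks positions without an annotated element, which are skipped, and it counts the UniProt turn symbol T as a mismatch because I-Tasser predicts no turns. The function names are our own.

```python
def normalize_itasser(pred):
    """Map the I-Tasser sheet symbol 'S' to 'E' as used in the UniProt line."""
    return pred.replace("S", "E")


def agreement(pred, ref, unassigned="-"):
    """Fraction of annotated reference positions with an identical predicted state."""
    if len(pred) != len(ref):
        raise ValueError("prediction and reference must have the same length")
    compared = [(p, r) for p, r in zip(pred, ref) if r != unassigned]
    if not compared:
        return 0.0
    return sum(1 for p, r in compared if p == r) / len(compared)


# Toy strings (not the real HFE data)
toy = agreement(normalize_itasser("CCSSHH"), "--EEHT")
```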

Predicted Secondary Structure by I-Tasser

Sequence:   MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDF
Predicted:  CCCCHHHHHHHHHHHHHHHHHHHHCCCCCCCSSSSSCCCCCCCCCCSSSSSSSCCCSSSSCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHH
Conf-Score: 985028899999999899875122045421036641367999985269985643743686068998778788540145583478888887676654315558

Sequence:   WTIMENHNHSKESHTLQVILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQ
Predicted:  HHHHHHHCCCCCCSSSSSSSCCCCCCCCCCCCCCCCCCCCCCSSSSCCCHHHCHHHHHHHHHHHHHHHHCCCHHHHHHHHHCCCCHHHHHHHHHCCHHHHHC
Conf-Score: 888755315777644463525565898763541000558873365263022202455666677878887004598888767064299999999747666642

Sequence:   QVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPSPSGTLV
Predicted:  CCCCCCCCCCCCCCCHHHHCHHHHCCCCCCSSSSSSSCCCCCCCCCCSSSSCCCCCCCCCCCSSSSSCCCCCCCCSSSSCCCCCCCCCSSSSCCCCCCCCCC
Conf-Score: 599877567699854442101541541332479864358754456553541024888652112699807986310267512589998726840688766531

Sequence:   IGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
Prediction: CCCCCCHHHHHHHCCHHHHHHHHHCCCCCCCCCCCCCHCCCC
Conf-Score: 010211112222100246665443013678898651020169

Secondary structure elements are shown as H for Alpha helix, S for Beta sheet & C for Coil

Predicted Solvent Accessibility by I-Tasser

Sequence:   MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDF
Prediction: 723312000000000101112222011200120120023333331200000102322003123724434241311436413610352044144313323230

Sequence:   WTIMENHNHSKESHTLQVILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQ
Prediction: 220132133351310001010021136231211333023032003016303403102321432433044143404422010333005103400630351154

Sequence:   QVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPSPSGTLV
Prediction: 342353313321443300000100101014010203346564435434135233334221320000000347533120214264144202020214542200

Sequence:   IGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
Prediction: 000001100000011100000001334446443132333438

Values range from 0 (buried residue) to 9 (highly exposed residue)

I-Tasser predicted five models with C-Scores from -0.557 to -3.298. They are ranked from one to five as seen below. As cutoff for the C-Score we use -1.5, as recommended by the Zhang group<ref>Ambrish Roy, Alper Kucukural, Yang Zhang. I-TASSER: a unified platform for automated protein structure and function prediction. Nature Protocols, vol 5, 725-738 (2010).</ref>; this cutoff is reported to give a false-positive and a false-negative rate of about 0.1 each, which means that more than 90% of the quality predictions are correct. Therefore we use only Model1 for the comparison with the other methods. All resulting models are shown below in Figure 4 to Figure 8.

Figure 4: Model 1 with a C-Score of -0.557
Figure 5: Model 2 with a C-Score of -2.539
Figure 6: Model 3 with a C-Score of -2.266
Figure 7: Model 4 with a C-Score of -2.772
Figure 8: Model 5 with a C-Score of -3.298
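
Applying the C-Score cutoff to the five models is a one-line filter. The sketch below uses the C-Scores from Figures 4 to 8; the variable and function names are our own.

```python
# C-Scores of the five I-Tasser models (Figures 4 to 8)
c_scores = {
    "Model1": -0.557,
    "Model2": -2.539,
    "Model3": -2.266,
    "Model4": -2.772,
    "Model5": -3.298,
}


def models_above_cutoff(scores, cutoff=-1.5):
    """Return the names of all models whose C-Score is above the cutoff."""
    return [name for name, score in scores.items() if score > cutoff]


accepted = models_above_cutoff(c_scores)
```

Only Model1 passes the cutoff of -1.5.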

Model1 has a TM-Score of about 0.64 and an RMSD of 7.7Å. For the prediction, I-Tasser used 1a6zA, 1s7qA, 1i4fA, 1de4A, 2vabA and 2bckA as templates. The templates have an identity of about 40%, except for the self hit 1a6z. A special case is 1de4, the transferrin receptor in complex with the HFE protein (chain A), which is a self hit as well. The sequence in this case is also identical, but we cannot draw any conclusion about the 3D structure of the protein bound to the receptor. Because of the self hit, we ran I-Tasser a second time with the constraint to exclude all templates with a sequence identity > 80%.

I-Tasser using templates with a sequence identity below 80% to avoid self hits.
To our surprise, the second run gave the same results, again based on the same self hit. At this point we do not know what went wrong, but because the self hit is just one out of five templates used to create the model, we decided to keep the best model (Model1) for the comparison with the other methods.

SwissModel

SwissModel<ref>Benkert P, Künzli M, Schwede T. QMEAN server for protein model quality estimation.</ref> is a server-based tool provided by the SIB. It combines tools like PSI-PRED and DISOPRED for the prediction of secondary structure and disordered regions.
The SwissModel workspace is a web-based service dedicated to protein structure homology modeling. It provides a personal working environment in which several projects can be calculated in parallel. The environment also provides tools for template selection, model building and structure quality evaluation. To find suitable templates for a given target protein, a library of experimental protein structures is searched<ref>http://bioinformatics.oxfordjournals.org/cgi/content/short/22/2/195</ref>.
The SwissModel repository is a database of annotated 3D protein structure models. The database consists of more than 3.4 million structures<ref>http://nar.oxfordjournals.org/content/37/suppl_1/D387.full</ref>. All models were generated from the UniProt database with the SwissModel pipeline. From the SwissModel repository the density of the QMEAN-Score is estimated, which gives a hint of the quality of the predicted model.


The model created by SwissModel in Automated mode is based on a self hit, and we had no way to exclude the protein itself from the prediction. We could only set a specific template, so we also ran SwissModel in Alignment mode, which allowed us to influence the alignment. As one can see, the QMEAN-Score densities of the Automated mode and the Alignment mode are the same, so the target (1a6z) and the template (1bii) are part of the same reference set. We take this as an indicator of a good template choice, because the template used in the Alignment mode is in the same set as the target. We also rate this as evidence for the high diversity of the MHC 1 family.

Automated Mode

Figure 9: predicted model


Model information: Modelled residue range: 26 to 297
Based on template: 1a6zC (2.60 Å)
Sequence Identity [%]: 100
Evalue: 7.66e-163

Quality information: QMEAN Z-Score: -1.035


Figure 10: Estimated absolute model quality
Figure 11: Estimated density of model quality
Figure 12: Z-Score by category
Figure 13: predicted error for each position.

Even though the model is based on a self hit, the Z-Score is about -1, which means that the model lies one standard deviation below the mean. The model is thus not very unlikely, but also not the most probable one. Figure 9 shows the predicted structure based on the template. The typical two parallel helices on a beta-sheet are clearly observable. Figure 10 shows the QMEAN4-score distribution over the protein size. Figure 11 shows the density plot of the reference set, which contains the scores of structures of similar size. Figure 12 shows the individual scores that are used to calculate the final QMEAN score. Here we can see that the torsion angles caused the most issues, which leads to a lower QMEAN score. Figure 13 shows the predicted error for each residue on an arbitrary scale. We see a higher error at the beginning, but otherwise more or less the same pattern of error values (pattern size of about 100 aa) over the whole protein.

Alignment Mode

Figure 14: predicted model

Model information:
Modelled residue range: 1 to 272
Based on template: 1bii_A

Quality information:
QMEAN Z-Score: -2.065


Figure 15: Estimated absolute model quality
Figure 16: Estimated density of model quality
Figure 17: Z-Score by category
Figure 18: predicted error for each position
TARGET    26                                 RSH SLHYLFMGAS EQDLGLSLFE
1biiA     1                                  gsh slryfvtavs rpgfgeprym                                                                     
TARGET                                       sss ssssssssss        sss
1biiA                                        sss ssssssssss        sss
TARGET    49    ALGYVDDQLF VFYDHES--R RVEPRTPWVS SRISSQMWLQ LSQSLKGWDH
1biiA     24    evgyvdntef vrfdsdaenp ryeprarwie -qegpeywer etrrakgneq                                                                      
TARGET          ssssss sss sssss        sss  hhh hh   hhhhh hhhhhhhhhh
1biiA           ssssss sss sssss        sss  hh       hhhhh hhhhhhhhhh
TARGET    97    MFTVDFWTIM ENH-NHSKES HTLQVILGCE MQEDNST-EG YWKYGYDGQD
1biiA     73    sfrvdlrtal ryynqsaggs htlqwmagcd vesdgrllrg ywqfaydgcd                                                                     
TARGET          hhhhhhhhhh hhh        ssssssssss sss sss ss sssssss ss
1biiA           hhhhhhhhhh hhh        ssssssssss sss  sssss sssssss ss
TARGET    145   HLEFCPDTLD WRAAEPRAWP TKLEWERHKI RARQNRAYLE RDCPAQLQQL
1biiA     123   yialnedlkt wtaadmaaqi trrkweqa-g aaerdrayle gecvewlrry                                                                     
TARGET          sssss    s ss      hh hhhhh       hhhhhhhhh hhhhhhhhhh
1biiA           sssss    s ss     hhh hhhhhhh     hhhhhhhhh hhhhhhhhhh
TARGET    195   LELGRGVLDQ QVPPLVKVTH HVTS-SVTTL RCRALNYYPQ NITMKWLKDK
1biiA     172   lkngnatllr tdppkahvth hrrpegdvtl rcwalgfypa ditltwqln-                                                                     
TARGET          hhh            ssssss sss   ssss ssssss       sssssss 
1biiA           hhh            ssssss sss   ssss ssssss       sssss   
TARGET    244   QPMDAKEFEP KDVLPNGDGT YQGWITLAVP PGEEQRYTCQ VEHPGLDQPL
1biiA     221   geeltqemel vetrpagdgt fqkwasvvvp lgkeqkytch veheglpepl                                                                     
TARGET                  ss s  sss   s sssssssss       sssss ss       s
1biiA                   ss s  sss   s sssssssss         sss ss       s
TARGET    294   IVIW                                                  
1biiA     271   tlrw-                                                                                                                      
TARGET          ss                                                    
1biiA           ss

As one can see, the alignment shows a very similar secondary structure, and the 3D structures are very similar as well. The RMSD of the model is about 2.9 Å. This is a quite good result, but only the residues that are superimposed are used for the calculation, so the missing beta-sheet is not part of it. In general, we see results comparable to the self-hit model. Figure 15 and Figure 16 show the score distribution compared to other models of the same size. In Figure 17 we can see that, as in the self-hit model, the torsion angles cause the main issues. This could mean that the torsion angles in this protein are not easy to model. The predicted error shown in Figure 18 has a pattern comparable to that of the self-hit model shown in Figure 13, but the high peak at the beginning is missing.
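How similar two aligned sequences are can be quantified as percent identity. The sketch below assumes one common definition (identical residues divided by the number of columns in which neither sequence has a gap); other tools use different denominators, so the numbers need not match those reported by HHSearch. The function name `percent_identity` is hypothetical, and the example uses only the first ten-residue block of the second alignment row above.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over all columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    # Compare case-insensitively, because the template is printed in lower case.
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # gap columns are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # First block of the alignment row starting at TARGET 49 / 1biiA 24.
    print(percent_identity("ALGYVDDQLF", "evgyvdntef"))
```

For a whole-protein value the complete aligned sequences would have to be passed in; a single block is shown here only to keep the example short.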

MODELLER

MODELLER<ref>Eswar N. et al. Comparative protein structure modeling using MODELLER.</ref> is a standalone application for protein structure modeling by satisfaction of spatial restraints. These restraints are derived from different types of information, so the model does not have to be based on the target-template alignment alone (although it can be). MODELLER is capable of pairwise and multiple alignment, fold assignment and loop modeling.

We downloaded and installed Modeller locally on our Windows PC and followed the examples given at the Workflow homology modeling glucocerebrosidase.

The target is the FASTA sequence of HFE_HUMAN. As standard template for the single template-target alignment we chose chain A of 1BII, because it covers the whole sequence of HFE_HUMAN. For the multiple sequence alignment we used the protein structures 1S79 and 3P73 in addition to 1BII. 1S79 was chosen because of its relatively high sequence identity of about 37%, and 3P73 because it is a classical MHC class I molecule with a function similar to that of the HFE_HUMAN protein.

Single template-target

Scripts

script_pairwise-alignment-template-target.py

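from modeller import *

env = environ()
aln = alignment(env)
# Read the template structure. Note: 'END' denotes the last residue in the
# PDB file, so chains B and P are read as well (see the "+383 :P" range in
# the alignment headers below); 'LAST:A' would stop at the end of chain A.
mdl = model(env, file='1BII.pdb', model_segment=('FIRST:A', 'END:A'))
aln.append_model(mdl, align_codes='1BII', atom_files='1BII.pdb')
# Add the target sequence
aln.append(file='hfe_human.pir', align_codes='HFE_HUMAN')
# Alignment that uses structural information of the template
aln.align2d()
aln.check()
aln.write(file='pairwise-2d.ali', alignment_format='PIR')
# Plain sequence-sequence alignment
aln.align()
aln.check()
aln.write(file='pairwise.ali', alignment_format='PIR')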

script_pairwise-to-model.py

from modeller import *
from modeller.automodel import *

env = environ() 
a = automodel(env,
            alnfile  = 'pairwise.ali', #file:pir:alignment
            knowns   = '1BII',               #file:pdb:template
            sequence = 'HFE_HUMAN',          #id:target
            assess_methods=(assess.DOPE, assess.GA341))
a.starting_model= 1                
a.ending_model  = 1                
a.make()
b = automodel(env,
            alnfile  = 'pairwise-2d.ali', #file:pir:alignment
            knowns   = '1BII',               #file:pdb:template
            sequence = 'HFE_HUMAN',          #id:target
            assess_methods=(assess.DOPE, assess.GA341))
b.starting_model= 2                
b.ending_model  = 2                
b.make()

Alignments

We used two different alignments for Modeller, one without use of structural information at the template side:

pairwise.ali

>P1;1BII
structureX:1BII.pdb:   1 :A:+383 :P:MOL_ID  1; MOLECULE  MHC CLASS I H-2DD; CHAIN  A; FRAGMENT  HEAVY CHAIN, EXTRACELLULAR DOMAINS; SYNONYM  DD; ENGINEERED  YES; MOL_ID  2; MOLECULE  BETA-2 MICROGLOBULIN; CHAIN  B;
ENGINEERED  YES; MOL_ID  3; MOLECULE  DECAMERIC PEPTIDE; CHAIN  P; ENGINEERED  YES:MOL_ID  1; ORGANISM_SCIENTIFIC  MUS MUSCULUS; ORGANISM_COMMON  HOUSE MOUSE; ORGANISM_TAXID  10090; CELL_LINE  BL21; 
EXPRESSION_SYSTEM  ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID  562; EXPRESSION_SYSTEM_STRAIN  BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID  PET-3A; MOL_ID  2; ORGANISM_SCIENTIFIC  MUS MUSCULUS; ORGANISM_COMMON 
HOUSE   MOUSE; ORGANISM_TAXID  10090; CELL_LINE  BL21; EXPRESSION_SYSTEM  ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID  562; EXPRESSION_SYSTEM_STRAIN  BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID  PET-8C; MOL_ID  3: 2.40: 0.28
-------------------------GSHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYEPRAR
WIEQE-GPEYWERETRRAKGNEQSFRVDLRTALRYYNQSAGGSHTLQWMAGCDVESDGRLLRGYWQFAYDGCDYI
ALNEDLKTWTAADMAAQITRRKWEQAGAAER-DRAYLEGECVEWLRRYLKNGNATLLRTDPPKAHVTHHRRPEGD
VTLRCWALGFYPADITLTWQLNGEEL-TQEMELVETRPAGDGTFQKWASVVVPLGKEQKYTCHVEHEGLPEPLTL
RW/IQKTPQIQVYSRHPPENGKPNILNCYVTQFHPPHIEIQMLKNGKKIPKVEMSDMSFSKDWSFYILAHTEFTP
TETDTYACRVKHDSMAEPKTVYWDRDM/RGPGRAFVTI*

>P1;HFE_HUMAN
sequence:reference:     : :     : :::-1.00:-1.00
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDH--ESRRVEPRTP
WVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKE-SHTLQVILGCEMQEDNST-EGYWKYGYDGQDHL
EFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTS-SV
TTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIV
IW-------------EPSPSGTLVI---------------------GVISGIAVFVVILFIGILFIILRKRQGSR
GAMGHYV-------LAERE-------------------*

And one with the use of structural information at the template side:

pairwise-2d.ali

>P1;1BII
structureX:1BII.pdb:   1 :A:+383 :P:MOL_ID  1; MOLECULE  MHC CLASS I H-2DD; CHAIN  A; FRAGMENT  HEAVY CHAIN, EXTRACELLULAR DOMAINS; SYNONYM  DD; ENGINEERED  YES; MOL_ID  2; MOLECULE  BETA-2 MICROGLOBULIN; CHAIN  B;
ENGINEERED  YES; MOL_ID  3; MOLECULE  DECAMERIC PEPTIDE; CHAIN  P; ENGINEERED  YES:MOL_ID  1; ORGANISM_SCIENTIFIC  MUS MUSCULUS; ORGANISM_COMMON  HOUSE MOUSE; ORGANISM_TAXID  10090; CELL_LINE  BL21; 
EXPRESSION_SYSTEM  ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID  562; EXPRESSION_SYSTEM_STRAIN  BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID  PET-3A; MOL_ID  2; ORGANISM_SCIENTIFIC  MUS MUSCULUS; ORGANISM_COMMON 
HOUSE MOUSE; ORGANISM_TAXID  10090; CELL_LINE  BL21; EXPRESSION_SYSTEM  ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID  562; EXPRESSION_SYSTEM_STRAIN  BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID  PET-8C; MOL_ID  3: 2.40: 0.28
---------------------G----SHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYEPRAR
WIEQE-GPEYWERETRRAKGNEQSFRVDLRTALRYYNQSAGGSHTLQWMAGCDVESDGRLLRGYWQFAYDGCDYI
ALNEDLKTWTAADMAAQITRRKWE-QAGAAERDRAYLEGECVEWLRRYLKNGNATLLRTDPPKAHVTHHRRPEGD
VTLRCWALGFYPADITLTWQLNGEELT-QEMELVETRPAGDGTFQKWASVVVPLGKEQKYTCHVEHEGLPEPLTL
RW/I---QKTPQIQVYSRHPPENGKPNILNCYVTQFHPPHIEIQMLKNGKKIPKVEMSDMSFSKDWSFYILAHTE
FTPTETDTYACRVKHDSMAEPKTVYWDRDM/RGPGRAFVTI*

>P1;HFE_HUMAN
sequence:reference:     : :     : :::-1.00:-1.00
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYD--HESRRVEPRTP
WVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSK-ESHTLQVILGCEMQEDNS-TEGYWKYGYDGQDHL
EFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTS-SV
TTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIV
IW-EPSPSGTLVIGVIS---------GIAVFVVILF--IGILFIILRK-RQGSRGAMGH---------YVLAERE
-----------------------------------------*

The models are presented under model comparison. Surprisingly, the model built with the structural information is worse than the model built without it. We think Modeller has problems threading the sequence of HFE_HUMAN into the given structure of 1BII. From this we derive the possibility that 1a6z, which has a structure very similar to 1bii, has a different amino acid composition for this type of structure. At the moment, however, we have no means to test this.

Alignment: multiple template-target

Scripts

script_msa-align-templates.py

from modeller import *

env = environ()
aln = alignment(env)
for (code, chain) in (('1BII', 'A'), ('1S79', 'A'), ('3P73', 'A')):
  mdl = model(env, file=code, model_segment=('FIRST:'+chain, 'LAST:'+chain))
  aln.append_model(mdl, atom_files=code, align_codes=code+chain)
aln.salign()
aln.check()
aln.write(file='MSA.ali', alignment_format='PIR')

script_msa-align-target-to-msa.py

from modeller import *

env = environ()
aln = alignment(env)
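# Read the template MSA created by the previous script
aln.append(file='MSA.ali', align_codes='all')
# Number of template sequences (not used further in this script)
aln_block = len(aln)
# Add the target sequence and align it together with the templates
aln.append(file='hfe_human.pir', align_codes='HFE_HUMAN')
aln.salign()
aln.check()
# Overwrites MSA.ali with the alignment that includes the target
aln.write(file='MSA.ali', alignment_format='PIR')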

script_msa-to-model.py

from modeller import *
from modeller.automodel import *

env = environ() 
a = automodel(env,
            alnfile  = 'MSA.ali', #file:pir:alignment
            knowns   = ('1BIIA', '1S79A', '3P73A'),               #file:pdb:template
            sequence = 'HFE_HUMAN',          #id:target
            assess_methods=(assess.DOPE, assess.GA341))
a.starting_model = 1
a.ending_model = 1
a.make()

Alignment

The MSA used by Modeller is:

>P1;1BIIA
structureX:1BII:1    :A:+274 :A:MOL_ID  1; MOLECULE  MHC CLASS I H-2DD; CHAIN  A; FRAGMENT  HEAVY CHAIN, EXTRACELLULAR DOMAINS; SYNONYM  DD; ENGINEERED  YES; MOL_ID  2; MOLECULE  BETA-2 MICROGLOBULIN; CHAIN  B; 
ENGINEERED  YES; MOL_ID  3; MOLECULE  DECAMERIC PEPTIDE; CHAIN  P; ENGINEERED  YES:MOL_ID  1; ORGANISM_SCIENTIFIC  MUS MUSCULUS; ORGANISM_COMMON  HOUSE MOUSE; ORGANISM_TAXID  10090; CELL_LINE  BL21; 
EXPRESSION_SYSTEM  ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID  562; EXPRESSION_SYSTEM_STRAIN  BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID  PET-3A; MOL_ID  2; ORGANISM_SCIENTIFIC  MUS MUSCULUS; ORGANISM_COMMON  HOUSE 
MOUSE; ORGANISM_TAXID  10090; CELL_LINE  BL21; EXPRESSION_SYSTEM  ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID  562; EXPRESSION_SYSTEM_STRAIN  BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID  PET-8C; MOL_ID  3: 2.40: 0.28
-------------------------GSHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYEPRAR
WIEQEGPEYWERETRRAKGNEQSFRVDLRTALRYYNQSAGGSHTLQWMAGCDVESDGRLLRGYWQFAYDGCDYIA
LNEDLKTWTAADMAAQITRRKWEQAGAAERDRAYLEGECVEWLRRYLKNGNATLLRTDPPKAHVTHHRRPEGDVT
LRCWALGFYPADITLTWQLNGEELTQ-EMELVETRPAGDGTFQKWASVVVPLGKEQKYTCHVEHEGLPEPLTLRW
---------------------------------------------------*

>P1;1S79A
structureX:1S79:100  :A:+103 :A:MOL_ID  1; MOLECULE  LUPUS LA PROTEIN; CHAIN  A; FRAGMENT  CENTRAL RRM; SYNONYM  SJOGREN SYNDROME TYPE B ANTIGEN, SS-B, LA RIBONUCLEOPROTEIN, LA AUTOANTIGEN; ENGINEERED  YES:MOL_ID
1; ORGANISM_SCIENTIFIC  HOMO SAPIENS; ORGANISM_COMMON  HUMAN; ORGANISM_TAXID  9606; GENE  SSB; EXPRESSION_SYSTEM  ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID  562; EXPRESSION_SYSTEM_STRAIN  BL21(DE3)PLYSS; 
EXPRESSION_SYSTEM_VECTOR  PET28:-1.00:-1.00
-------------------------GRWILKNDVKNRSVYIKGFPTDATLDDIK---------------------
---------------------------------------------------------------------------
----------------------------------------EWLEDKGQVLNIQMRRT------------------
--------------LHKAFKGSIFVV-FDSIESAKKFVETPGQKYKETDLLILFKDDYFAKKNEERKQNKVE---
---------------------------------------------------*

>P1;3P73A
structureX:3P73:-1   :A:+275 :A:MOL_ID  1; MOLECULE  MHC RFP-Y CLASS I ALPHA CHAIN; CHAIN  A; FRAGMENT  UNP RESIDUES 20-294; ENGINEERED  YES; MOL_ID  2; MOLECULE  BETA-2-MICROGLOBULIN; CHAIN  B; ENGINEERED 
YES:MOL_ID  1; ORGANISM_SCIENTIFIC  GALLUS GALLUS; ORGANISM_COMMON  BANTAM,CHICKENS; ORGANISM_TAXID  9031; GENE  YFV; EXPRESSION_SYSTEM  ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID  562; EXPRESSION_SYSTEM_STRAIN  
TB1; EXPRESSION_SYSTEM_VECTOR_TYPE  PLASMID; EXPRESSION_SYSTEM_PLASMID  PMAL-P4X; MOL_ID  2; ORGANISM_SCIENTIFIC  GALLUS GALLUS; ORGANISM_COMMON  BANTAM,CHICKENS; ORGANISM_TAXID  9031; GENE  B2M; EXPRESSION_SYSTEM 
ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID  562; EXPRESSION_SYSTEM_STRAIN  TB1; EXPRESSION_SYSTEM_VECTOR_TYPE  PLASMID; EXPRESSION_SYSTEM_PLASMID  PMAL-P4X: 1.32: 0.16
-----------------------EFGSHSLRYFLTGMTDPGPGMPRFVIVGYVDDKIFGTYNSKSRTA--QPIVE
MLPQEDQEHWDTQTQKAQGGERDFDWNLNRLPERYNKSKG-SHTMQMMFGCDILEDGS-IRGYDQYAFDGRDFLA
FDMDTMTFTAADPVAEITKRRWETEGTYAERWKHELGTVCVQNLRRYLEHGKAALKRRVQPEVRVWGKEADGILT
LSCHAHGFYPRPITISWMKDGMVRDQ-ETRWGGIVPNSDGTYHASAAIDVLPEDGDKYWCRVEHASLPQPGLFSW
EP------------------------------------------------Q*

>P1;HFE_HUMAN
sequence:reference:     : :     : :::-1.00:-1.00
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVE-PRTPW
VSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKE-SHTLQVILGCEMQEDNS-TEGYWKYGYDGQDHLE
FCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTT
LRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIW
EPSPSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE*

Model editing

We tried to edit the single-template model of MODELLER, because it is one of our best models. When we looked at the alignment with Jalview 2.6 (Figure 19), we noticed that it is already very well defined and that changes would only lead to worse results. The average conservation is about 7 to 8 and the quality around 5 to 6.

Figure 19: visualization of the single-template model of MODELLER done by Jalview

The hydrophobic groups are also very well aligned, so we decided to leave that model as it is, because there is nothing to edit. Only the end of the alignment has many gaps, but shifting the gaps would break the conserved block in the middle of the alignment.

The only difference between the sse-supported model and the single-template model lies in the differently aligned residues (Figure 20). These result from the information about the secondary structure of the template incorporated into the model, and thus we will not edit them.

Figure 20: visualization of the sse-supported model of MODELLER done by Jalview

It is hard to edit the msa model because of the multiple alignments between the different sequences. We tried moving some aligned groups to different positions inside the sequence alignment, but were not able to maintain the corresponding alignment in the other sequences. After some unsuccessful attempts we decided to leave that alignment as it is, too (Figure 21).

Figure 21: visualization of the multi-template model of MODELLER done by Jalview


In summary, we did not edit any alignment successfully, either because there was nothing to edit or because editing was too complicated and introduced too many errors.

Model comparison

3D-Jigsaw

We had several issues with the execution of 3D-Jigsaw<ref>Bates, P.A., Kelley, L.A., MacCallum, R.M. and Sternberg, M.J.E. (2001) Enhancement of Protein Modelling by Human Intervention in Applying the Automatic Programs 3D-JIGSAW and 3D-PSSM.</ref>, such as strange error messages and rejection of our input. Finally we got it to work with the following instruction:

Result

Name Data
Length _________10________20________30________40________50________60________70________80________90________100_______110_______120_______130_______140_______150_______160_______170_______180_______190_______200_______210_______220_______230_______240_______250_______260_______270___
AA RLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQVILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIW
Prediction CCCCEEEEEEEEEECCCCCCCCCCEEEEEEECCCCEEEECCCCCCCCCHHHHHHCCCCCHHHHHHHHHHHHHCCHHHHHHHHHHHHHCCCCCCEEEECCCCCEECCCCCCCCEEECCCCCCEEEEECHHHHHHCCCCCHHHCCHHHHCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCCEEEEEEEEECCEEEEEEEECCCCCCCEEEEEEECCEECCHHHEEEEEEECCCCCCCCCEEEEEECCCCCCCEEEEEEECCCCCCEEEEC
Confidence 93303453556763258999987358999894885499848988867403652146670222210276553000136609989986528798168852425544699858524402146712899971352101652001125776224548999999999998999999999987887332589869999951389499999762611761499996677566832279987550889983003699826988533699999504888767859
Disorder DDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODOOOOOOOOOODDDDDDDDDDOOOODDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDOOODDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOODDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDOOOOOOOOOOOODOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODOOODD

3D-Jigsaw also gave us information about the predicted secondary structure and the ordered and disordered regions. It used that information to successfully optimize all of our submitted models. All optimized models have an energy of about -340 and a coverage of 0.99, which is really good. Their data and pictures are shown in the table below.

Model evaluation

After trying several tools (RasWin, JMol, SwissPDB-Viewer), we decided to use PyMol for superimposing and displaying the model-target alignment of the proteins. We truncated the structure of the original HFE_HUMAN protein (pdbid: 1a6z) at chain C, so only chains A and B are used for displaying. The original HFE_HUMAN is always shown in green and the model in red (see table below).

We created the pictures of our models with PyMol as follows:

  • load '1A6Z_AB.pdb' into PyMol (alternatively: command 'fetch 1A6Z' and then hide chains C and D)
  • hide everything
  • show cartoon
  • color green
  • load 'model.pdb' into PyMol
  • hide everything
  • show cartoon
  • color red (model only)
  • align 'model.pdb' to '1A6Z_AB.pdb'
  • command 'ray' (nicer output!)
  • save the image
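The manual steps above could also be written into a PyMol command script. The following sketch only generates the text of such a script, using the colour convention stated in the text (1a6z green, model red). The helper name `pymol_superposition_script`, the object names `reference` and `model` and the output file names are assumptions made for this illustration.

```python
def pymol_superposition_script(reference_pdb, model_pdb, image_png):
    """Return PyMol commands that superimpose a model onto a reference."""
    lines = [
        f"load {reference_pdb}, reference",  # original structure, e.g. 1A6Z_AB.pdb
        f"load {model_pdb}, model",          # predicted model
        "hide everything",
        "show cartoon",
        "color green, reference",            # original structure in green
        "color red, model",                  # model in red
        "align model, reference",            # superimpose the model onto the reference
        "ray",                               # ray-trace for nicer output
        f"png {image_png}",                  # save the image
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    script = pymol_superposition_script("1A6Z_AB.pdb", "model.pdb", "model.png")
    with open("superimpose.pml", "w") as handle:
        handle.write(script)
    print(script)
```

The resulting file can then be opened in PyMol or, assuming a command-line installation, run without the GUI via `pymol -cq superimpose.pml`.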


To evaluate our models by RMSD and TM-Score we used TMalign. We were advised to use SAP for the RMSD and TMScore for the TM-Score, but TMScore failed because our target is the sequence of HFE_HUMAN from UniProt and is therefore longer than the '1BII' template. This is a problem because TMScore needs PDB files of the same length, so its superposition does not really work here.

TMalign can handle PDB files of different lengths, and the scores are normalized by the second structure. We use '1A6Z' as the second structure to obtain comparable scores for all our models. The modeling of HFE_HUMAN is very difficult because it is a multi-domain protein, and none of the methods supports multi-domain modeling.

TMalign can be found at the website of the Zhang-Lab.
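To illustrate what normalization by the second structure means, the sketch below computes the RMSD and the TM-Score from a list of distances between aligned residue pairs after superposition. It uses the published TM-Score formula with d0 = 1.24 (L - 15)^(1/3) - 1.8, but it is not the TMalign implementation: TMalign additionally searches for the optimal superposition and treats very short chains specially, which is not reproduced here. All function names and the length of 272 residues in the example are chosen for illustration only.

```python
import math


def tm_d0(norm_length):
    """Distance scale d0 of the TM-Score for a given normalization length."""
    if norm_length <= 21:
        # Short chains need special handling that this sketch does not cover.
        raise ValueError("this sketch only handles lengths above 21 residues")
    return 1.24 * (norm_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, norm_length):
    """TM-Score of aligned residue pairs, normalized by norm_length."""
    d0 = tm_d0(norm_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    # Dividing by the length of the reference structure (not by the number of
    # aligned pairs) penalizes models that cover only part of the reference.
    return total / norm_length


def rmsd(distances):
    """RMSD over the aligned residue pairs only."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


if __name__ == "__main__":
    # A perfect superposition of half of a 272-residue reference:
    # the RMSD is 0, but the TM-Score is only 0.5.
    half = [0.0] * 136
    print(rmsd(half), tm_score(half, 272))
```

The example shows why the RMSD alone can look too good: it ignores residues that are not superimposed, whereas the TM-Score normalized by the reference length does not.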


{| border="1"
| '''Picture''' || '''Model''' || '''RMSD''' || '''TM-Score''' || '''Optimized picture''' || '''Optimized RMSD''' || '''Optimized TM-Score''' || '''3D-Jigsaw energy calculation'''
|-
| MODELLER: superimposed, green:1a6z, red:model(1BII) || MODELLER: superimposed, template:1BII || 2.58 || 0.86468 || Optimized MODELLER pw-model by 3D-Jigsaw || 1.70 || 0.95082 || -341.87
|-
| MODELLER: sse-support, superimposed, green:1a6z, red:model(1BII) || MODELLER: superimposed, sse-support, template:1BII || 3.42 || 0.59586 || Optimized MODELLER sse-model by 3D-Jigsaw || 0.98 || 0.96990 || -341.41
|-
| MODELLER: msa, superimposed, green:1a6z, red:model(1BII,1S79,3P73) || MODELLER: superimposed, msa, template:1BII,1S79,3P73 || 2.05 || 0.89042 || Optimized MODELLER msa-model by 3D-Jigsaw || 1.70 || 0.95087 || -341.23
|-
| I-Tasser: superimposed, green:1a6z, red:model || I-Tasser || 1.61 || 0.93760 || Optimized I-Tasser model by 3D-Jigsaw || 2.48 || 0.87855 || -339.33
|-
| SwissModel: superimposed, green:1a6z, red:model || SwissModel || 2.67 || 0.85048 || Optimized SwissModel model by 3D-Jigsaw || 2.48 || 0.87851 || -339.17
|-
| SwissModel: self-hit, superimposed, green:1a6z, red:model || SwissModel self || 0.08 || 0.99984 || || || ||
|}

As one can clearly see, the I-Tasser model is the best with a TM-Score of ~0.94, followed by the MSA model of MODELLER with a TM-Score of ~0.89, the pairwise model of MODELLER with ~0.86 and the SwissModel model with a TM-Score of ~0.85.

The worst model is the MODELLER model built with secondary structure information on the template side, with a TM-Score of ~0.6. We are sure that the low sequence identity and secondary structure similarity of only 22% affected this model badly, because the normal model is based on the same template and achieves a significantly higher TM-Score.

All of our models are really good, except for the sse-supported model of MODELLER.

After optimization by 3D-Jigsaw all MODELLER models are much better, because 3D-Jigsaw cut off the clearly wrongly modeled strands of amino acids. Surprisingly, the previously worst model, the sse-supported one, is now the best of all MODELLER models and even better than the previous best model of I-Tasser. It is not surprising that 3D-Jigsaw was also able to optimize the SwissModel model. It failed at the I-Tasser model, because it incorporated the information of the less well done models; surprisingly, the RMSD of the I-Tasser model even got worse after the optimization.

Our models are still all very good, but the best one is now the sse-supported model of MODELLER with an RMSD below 1 and a TM-Score of almost 1; it is now an almost perfect model. The second and third best models are the standard pairwise and the msa model of MODELLER, which are now very similar in RMSD and TM-Score. The I-Tasser and SwissModel models are now very similar to each other, too.
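The rankings discussed here can be reproduced directly from the values in the table above. The following is a small sketch: the numbers are copied from the table, while the dictionary layout and the function name `rank_by_tm_score` are chosen for this illustration. The SwissModel self-hit model is left out because no optimized values are listed for it.

```python
# (RMSD, TM-Score) before and after the 3D-Jigsaw optimization,
# copied from the table above.
RESULTS = {
    "MODELLER pairwise": {"raw": (2.58, 0.86468), "optimized": (1.70, 0.95082)},
    "MODELLER sse": {"raw": (3.42, 0.59586), "optimized": (0.98, 0.96990)},
    "MODELLER msa": {"raw": (2.05, 0.89042), "optimized": (1.70, 0.95087)},
    "I-Tasser": {"raw": (1.61, 0.93760), "optimized": (2.48, 0.87855)},
    "SwissModel": {"raw": (2.67, 0.85048), "optimized": (2.48, 0.87851)},
}


def rank_by_tm_score(results, stage):
    """Return the model names sorted by TM-Score, best model first."""
    return sorted(results, key=lambda name: results[name][stage][1], reverse=True)


if __name__ == "__main__":
    for stage in ("raw", "optimized"):
        print(stage, rank_by_tm_score(RESULTS, stage))
```

Sorting by RMSD instead (index 0, ascending) gives the same best model in both stages, I-Tasser before and the sse-supported MODELLER model after the optimization.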

Discussion

In the I-Tasser protocol it is not possible to choose a specific template, so we ran I-Tasser twice, first with standard parameters and then with a sequence identity threshold of 80%. In the second case we again got a model based on a self hit, so we repeated the prediction a third time, with the same result. We were not able to find out why the given threshold was ignored.

Our attempts to find homologs in all given categories (>60%, >40%, >20%) were not successful, because HHSearch did not list matching ones. A Blast search against the NR database also failed to provide acceptable results and returned only proteins with 40% or less sequence identity. We therefore conclude that the HFE family must have a very high sequence diversity combined with a high structural conservation. This theory was supported when we aligned the structural homologs listed in CATH.

The templates we chose from the HHSearch results were meant to cover the whole protein sequence and to give special coverage of the transmembrane region. But as we saw later, the tools do not support multiple sequence alignments. Therefore we decided to use '1BII' as template for SwissModel and Modeller, because it covers the sequence completely and, with a sequence similarity of 22%, it is in the lower midrange of the HHSearch results. '1S79' has a higher sequence identity of 37%, but is overall much less conserved with respect to HFE_HUMAN. We decided to rank coverage of the whole sequence higher than sequence identity.

After this task, we would suggest using SwissModel first to get a quick overview and a first idea of the protein structure. We would also recommend I-Tasser because of its good usability. The Modeller approach we would recommend only to experts who are really interested in a specific alignment, as its usability is awful for laymen.

Extra diligence task

We were not able to perform the task of calculating the RMSD of all atoms within a 6 Ångström threshold of the catalytic core, because no catalytic core is defined at UniProt:Q30201 (HFE_HUMAN) or at PDB:1A6Z.

References

<references />