Difference between revisions of "Homology based structure predictions"
Revision as of 21:00, 19 June 2011
Homologous
Because we found no homologous structures in Task 2, we extended our list using HHSearch.
HHSearch found only sequences with an identity below 40%. We therefore use the 12 proteins shown below to create a multiple alignment for homology modelling. We chose sequences that together cover the whole protein, and we paid particular attention to the transmembrane region.
PDB-ID | Identity | Description |
1S79 | 37% | human La protein |
2WY3 | 29% | HCMV UL16-MICB complex |
3P73 | 28% | classical MHC class I molecule |
3JTS | 25% | Mamu A*2 |
1KCG | 22% | NKG2D |
1BII | 22% | H-2DD MHC CLASS I |
1OW0 | 22% | human FcaRI |
2P24 | 21% | alphabeta TCR |
1CD1 | 21% | MHC-like fold with a large hydrophobic binding groove |
1HXM | 18% | Human Vgamma9/Vdelta2 T Cell Receptor |
1LQV | 14% | Endothelial protein C receptor |
1JFM | 14% | MURINE NK CELL LIGAND RAE-1 BETA |
With these sequences and the HFE gene product (Q30201), we computed a multiple sequence alignment with T-Coffee (Expresso mode). This multiple sequence alignment is later used as the raw alignment in the Alignment Mode of SwissModel and in Modeller. Later on, we will try to obtain better models by editing the alignment so that functional regions stay together.
DSSP --EEEEEEEEEEB-SS-SSB--EEEEEETTEEEEEEESSS--EEE--STTS-SSTTTTHHHHHHHHHHHHHHHHH Q30201 MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHES--RRVE-PRTPWVSSRISSQMWLQLSQSLKGWDHM 1S79_A --------------------------------------------------------------------GRW-IL-KNDVKNRSVYIKGFPTDATLDDIKE 3P73_A -----------------------EFGSHSLRYFLTGMTDPGPGMPRFVIVGYVDDKIFGTYNSKS--RTAQ-PIVEML-PQEDQEHWDTQTQKAQGGERD 1KCG_C -------------------------DAHSLWYNFTIIHLPRHGQQWCEVQSQVDQKNFLSYDCGS--DKVLSMGHL-EEQLYATDAWGKQLEMLREVGQR 1JFM_A -------------------------DAHSLRCNLTIKDPTPADPLWYEAKCFVGEILILHLSNIN--KTMT-SG-DPGETANATEVKKCLTQPLKNLCQK 1BII_A -MGAMAPRTLLLLLAAALGPTQTRAGSHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYE-PRARWIE-QEGPEYWERETRRAKGNEQS 2P24_A ----------------------------------------------------------------------------M----AIMAPRTLVLLLSGALALT 1CD1_A -----------------------QQKNYTFRCLQMSSFANR-SWSRTDSVVWLGDLQTHRWSNDS--ATIS-FTKPWSQGKLSNQQWEKLQHMFQVYRVS 2WY3_A ------------------------MEPHSLRYNLMVLSQDESVQSGFLAEGHLDGQPFLRYDRQK--RRAK-PQGQWAEDVLGAETWDTETEDLTENGQD 1LQV_A -------------------SQDASDGLQRLHMLQISYFR-DPYHVWYQGNASLGGHLTHVLEGPDTNTTII-QLQPL----QEPESWARTQSGLQSYLLQ 3JTS_A -------------------------GSHSMRYFYTSMSRPGRWEPRFIAVGYVDDTQFVRFDSDAASQRME-PRAPWVE-QEGPEYWDRETRNMKAETQN 1OW0_A ---------------------------------------------------------------------------------------------------- 1HXM_A -------------------------------------------------------------------------------------AIELVPEHQTVPVSI DSSP HHHHHHHHTTT-SSS--E--------EEEEEE-EEE-TTS-E-EEE-E------------EEEETTEE----------------EEEEEGGGTEEEES-- Q30201 FTVDFWTIMENHN-HSKE--------SHTLQV-ILGCEMQED-NST-E------------GYWKYGYD----------------GQDHLEFCPDTLDW-- 1S79_A WLEDKGQV-LNIQMRRTL--------HKAFKG-SIFVVFDSI-ESA-KKFVETPGQKYKETDLLILFKDDYFAKKNEERKQNKVE--------------- 3P73_A FDWNLNRLPERYN-KSKG--------SHTMQM-MFGCDILED-GSI-R------------GYDQYAFD----------------GRDFLAFDMDTMTF-- 1KCG_C 
LRLELADT---------ELEDFTPSGPLTLQV-RMSCECEAD-GYI-R------------GSWQFSFD----------------GRKFLLFDSNNRKW-- 1JFM_A LRNKVSNT-KVDTHKTNG--------YPHLQV-TMIYPQSQG-RTP-S------------ATWEFNIS----------------DSYFFTFYTENMSW-- 1BII_A FRVDLRTALRYYNQSAGG--------SHTLQW-MAGCDVESD-GRLLR------------GYWQFAYD----------------GCDYIALNEDLKTW-- 2P24_A QTWAGSHSRGEDD--IEA--------DHVGSYGIVVYQSP----GD-I------------GQYTFEFD----------------GDELFYVDLDKKET-- 1CD1_A FTRDIQELVKMMSPKEDY--------PIEIQL-SAGCEMYPG-NAS-E------------SFLHVAFQ----------------GKYVVRFWG--TSWQT 2WY3_A LRRTLTHI----KDQKGG--------LHSLQE-IRVCEIHED-SST-R------------GSRHFYYN----------------GELFLSQNLETQES-- 1LQV_A FHGLVRLVHQERT--LAF--------PLTIRC-FLGCELPPEGSRA-H------------VFFEVAVN----------------GSSFVSFRPERALW-- 3JTS_A APVNLRNLRGYYNQSEAG--------SHTIQR-MYGCDLGPD-GRLLR------------GYHQSAYD----------------GKDYIALNEDLRSW-- 1OW0_A -----ACHPRLSLHRPAL--------EDLLLG-SEANLTCTL-TGLRD------------ASGVTFTW----------------TPSSGKSAV--QGPPE 1HXM_A GVPATLRCSMKGEAIGNY--------YINWYR-KTQGNTMTF-IYRE-------------KDIYGPGF----------------KDNFQGDIDIAKNL-- DSSP SGG-G----HHH-HHHHHSSTHHH--HHHHHHHHTHHHHHHHHHHHHHTTTSS--B--EEEEEEEE-SS-----E-EEEEEEEEEBSS--EEEEEETTEE Q30201 RAA-E----PRA-WPTKLEWERHK--IRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVT----S-SVTTLRCRALNYYPQNITMKWLKD 1S79_A ---------------------------------------------------------------------------------------------------- 3P73_A TAA-D----PVA-EITKRRWETEG--TYAERWKHELGTVCVQNLRRYLEHGKAALKRRVQPEVRVWGKEA----D-GILTLSCHAHGFYPRPITISWMKD 1KCG_C TVV-H----AGA-RRMKEKWEKDS--GLTTFFKMVSMRDCKSWLRDFLMHRKKRLE-------------------------------------------- 1JFM_A RSA-N----DES-GVIMNKWKDDG--EFVKQLKFLI-HECSQKMDEFLKQSKEK---------------------------------------------- 1BII_A TAA-D----MAA-QITRRKWEQA---GAAERDRAYLEGECVEWLRRYLKNGNATLLRTDPPKAHVTHHRR----PEGDVTLRCWALGFYPADITLTWQLN 2P24_A IWM-------------LPEFAQLR--SFDPQGGLQNIATGKHNLGVLTKRSNSTPATNEAPQATVFPKSP--VLLGQPNTLICFVDNIFPPVINITWLRN 1CD1_A 
VPGAP----SWL-DLPIKVLNADQ--GTSATVQMLLNDTCPLFVRGLLEAGKSDLEKQEKPVAWLSSVP---SSAHGHRQLVCHVSGFYPKPVWVMWMRG 2WY3_A TVP-QSSRAQTLAMNVTNFW-KEDAMKTKTHYRAMQ-ADCLQKLQRYLKSGVAIRRTVPPMVNVTCSEVS----EGNITVTCRASSFYPRNITLTWRQDG 1LQV_A QAD-TQVTSGVV-TFTLQQLNAYN--RTRYELREFLEDTCVQYVQKHISAENTKGSQTSRSYTS------------------------------------ 3JTS_A TAA-D----MAA-QNTQRKWEAA---GEAEQHRTYLEGECLEWLRRYLENGKETLQRADPPKTHVTHHPV----SDQEATLRCWALGFYPAEITLTWQRD 1OW0_A R--DL----CGC-YSVSSVLPGCA--EPWNHGKTFTCTAAYPESKTPLTATLSKSGNTFRPEVHLLPPPSEELALNELVTLTCLARGFSPKDVLVRWLQG 1HXM_A AVL-K----ILA-PSERDEGSYYC--ACDTLGMGGEYTDKLIFGKGTRVTVEPRSQPHTKPSVFVMKNG---------TNVACLVKEFYPKDIRINLVSS DSSP --GGGS---EEEE-TTS-E----EEEEEEEE-TTGGGGEE---EEEE-TTSSS-EEE-E- Q30201 K-QPMDAKEFEPKDVLPNG----DGTYQGWITLAVPPGEE---QRYTCQVEHPGLDQ-PLIVIWEPSPSGTLVIGVISGIAVFVVILFIGILFIILRKRQ 1S79_A ---------------------------------------------------------------------------------------------------- 3P73_A --GMVRDQETRWGGIVPNS----DGTYHASAAIDVLPEDG---DKYWCRVEHASLPQ-PGLFSWEPQ--------------------------------- 1KCG_C ---------------------------------------------------------------------------------------------------- 1JFM_A ---------------------------------------------------------------------------------------------------- 1BII_A --GEELTQEMELVETRPAG----DGTFQKWASVVVPLGKE---QKYTCHVEHEGLPE-PLTLRWGKEEPPSSTKTNTVIIAVPVVLGAVVILGAVMAFVM 2P24_A --SKSVADGVYETSFFVNR----DYSFHKLSYLTFIPSDD---DIYDCKVEHWGLEE-PVLKHWEPEIPAPMSELTETSGSRLEVLFQ------------ 1CD1_A --DQ-EQQGTHRGDFLPNA----DETWYLQATLDVEAGEE---AGLACRVKHSSLGG-QDIILYWDARQAPVGLIVFIVLIMLVVVGAVVYYIWRRRSAY 2WY3_A --VSLSHNTQQWGDVLPDG----NGTYQTWVATRIRQGEE---QRFTCYMEHSGNHG-THPVPSGKVLVLQSQRTDFPYVSAAMPCFVIIIILCVPCCKK 1LQV_A ---------------------------------------------------------------------------------------------------- 3JTS_A --GEDQTQDTELVETRPAG----DGTFQKWAAVVVPSGKE---QRYTCHVQHEGLRE-PLTLRWEP---------------------------------- 1OW0_A 
SQEL-PREKYLTW-ASRQEPSQGTTTFAVTSILRVAAEDWKKGDTFSCMVGHEALPLAFTQKTIDRLAGK------------------------------ 1HXM_A -----KKITEFDPAIVISP----SGKYNAVKLGKYE--DS---NSVTCSVQHDNK---TVHSTDFEVKTDSTDHVKPKETENTKQPSKS----------- DSSP Q30201 GSRGAMGHYVLAERE---------------- 1S79_A ------------------------------- 3P73_A ------------------------------- 1KCG_C ------------------------------- 1JFM_A ------------------------------- 1BII_A KRRRNTGGKGGDYALAPGSQSSDMSLPDCKV 2P24_A ------------------------------- 1CD1_A QDIR--------------------------- 2WY3_A KTSAAEGP----------------------- 1LQV_A ------------------------------- 3JTS_A ------------------------------- 1OW0_A ------------------------------- 1HXM_A -------------------------------
Judged against the secondary structure that DSSP assigns to HFE from the PDB structure 1a6z, the multiple sequence alignment conserves most of the secondary structure elements.
As HHSearch found only weak homologues, we searched CATH for structural homologues. The BLAST search in CATH found sequence homologues with identities ranging from 22% to 49%. The HFE protein is classified as a two-domain protein (Alpha Beta, Mainly Beta)<ref>http://www.cathdb.info/domain/1a6zA01</ref>. We found both domains with a sequence similarity of 100%. We then used BLAST to spot-check these results with another search against CATH, and found the same sequence identity distribution for several proteins. Based on this BLAST search, we conclude that HFE is a protein with highly conserved structural elements but very weak sequence conservation. We would therefore recommend a new acceptance range of about 20% to 40% sequence similarity for this protein.
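The identity values discussed above can be recomputed from any two rows of an alignment. The following is a minimal sketch, not the method used by HHSearch or BLAST: it assumes identity is counted only over columns where both rows carry a residue, and the function name `percent_identity` is our own. The two example strings are a short fragment of the Q30201 and 1BII_A rows of the alignment shown earlier.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the columns where both rows have a residue.

    Assumption of this sketch: columns with a gap in either row are ignored.
    Other tools normalise differently (e.g. by the shorter sequence).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a.upper() == b.upper():
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Fragment of the Q30201 (HFE) and 1BII_A rows from the alignment above.
hfe = "RSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHES--RRVE"
bii = "GSHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYE"
print(f"identity over this fragment: {percent_identity(hfe, bii):.1f}%")
```

Because the normalisation differs between tools, values from this sketch are only roughly comparable with the identities listed in the table.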
I-Tasser
I-Tasser is a web service for protein structure prediction, provided and published by Ambrish Roy, Alper Kucukural and Yang Zhang at http://zhanglab.ccmb.med.umich.edu/I-TASSER/. It has taken part in the CASP competition with outstanding results.
The I-Tasser protocol consists of the following steps:
- thread the sequence onto different structures to create an initial template
- break the template apart into fragments which match the structure (parts of the structure to which no sequence is assigned are left out)
- assemble and cluster the structures
- use the cluster centroid for structure reassembly
- search for the structure with the lowest energy and apply REMO H-bond optimization to get the final model
In addition, I-Tasser predicts GO terms and binding sites. To do so, it uses the final model to search the PDB for global and local matches.
A problem for us is that I-Tasser only generates complete models, whereas the PDB structure of our protein is not complete. We therefore compared the predicted secondary structure with the one from UniProt.
Comparison of the secondary structure of the model with the structure assigned in UniProt:
Seq:  MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMEN
Pred: CCCCHHHHHHHHHHHHHHHHHHHHCCCCCCCEEEEECCCCCCCCCCEEEEEEECCCEEEECCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHH
UniP: ---------------------------EEEEEEEEEEE----EEE--EEEEEE--EEEEEEEEEE--EEE--------TTTHHHHHHHHHHHHHHHHHHHHHHHHHHT

Seq:  HNHSKESHTLQVILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHV
Pred: HCCCCCCEEEEEEECCCCCCCCCCCCCCCCCCCCCCEEEECCCHHHCHHHHHHHHHHHHHHHHCCCHHHHHHHHHCCCCHHHHHHHHHCCHHHHHCCCCCCCCCCCCC
UniP: TT-EEE--EEEEEEEEEE-----EEEEEEEEE--EEEEEEEHHH-EEEEEE---HHHHHHHH---HHHHHHHHHHH-HHHHHHHHHHHHHTTT-------EEEEEEEE

Seq:  TSSVTTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPSPSGTLVIGVISGIAVFVVILFIGI
Pred: CCCHHHHCHHHHCCCCCCEEEEEEECCCCCCCCCCEEEECCCCCCCCCCCEEEEECCCCCCCCEEEECCCCCCCCCEEEECCCCCCCCCCCCCCCCHHHHHHHCCHHH
UniP: ----EEEEEEEEEEEEE--EEEEEE------HHH----EEEE-----EEEEEEEEE---HHHHEEEEEE---EEE-EEEE----------------------------

Seq:  LFIILRKRQGSRGAMGHYVLAERE
Pred: HHHHHHCCCCCCCCCCCCCHCCCC
UniP: ------------------------
For a better overview, we replaced the S that I-Tasser uses for sheet with an E, as in the UniProt secondary structure.
As we can see, the secondary structure predicted by I-Tasser is mostly correct. Sometimes the elements are slightly shifted, and sometimes they do not have the correct length. As this model is also based on a self hit, it is no surprise to see a result as good as this one.
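To put a number on "mostly correct", the two strings can be compared position by position. The sketch below is only an illustration under stated assumptions: positions where UniProt shows '-' are treated as unannotated and skipped, and every state other than H and E (for example T) is counted as coil. The function name `ss_agreement` is our own.

```python
def ss_agreement(predicted: str, reference: str, unassigned: str = "-") -> float:
    """Fraction of annotated reference positions where the prediction agrees.

    Assumptions of this sketch:
    - reference positions equal to `unassigned` carry no annotation and are skipped
    - both strings are reduced to three states: H (helix), E (strand), C (other)
    """
    if len(predicted) != len(reference):
        raise ValueError("strings must have the same length")

    def three_state(symbol: str) -> str:
        return symbol if symbol in ("H", "E") else "C"

    scored = 0
    matches = 0
    for pred, ref in zip(predicted, reference):
        if ref == unassigned:
            continue  # no reference annotation at this position
        scored += 1
        if three_state(pred) == three_state(ref):
            matches += 1
    return matches / scored if scored else 0.0


# Toy example: the turn 'T' is counted as coil, the '-' position is ignored.
print(ss_agreement("HHHEEC", "HH-EET"))
```

Applied to the Pred and UniP rows above, this gives an agreement value restricted to the residues that UniProt annotates.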
Predicted Secondary Structure by I-Tasser
Sequence:   MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDF
Predicted:  CCCCHHHHHHHHHHHHHHHHHHHHCCCCCCCSSSSSCCCCCCCCCCSSSSSSSCCCSSSSCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHH
Conf-Score: 985028899999999899875122045421036641367999985269985643743686068998778788540145583478888887676654315558

Sequence:   WTIMENHNHSKESHTLQVILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQ
Predicted:  HHHHHHHCCCCCCSSSSSSSCCCCCCCCCCCCCCCCCCCCCCSSSSCCCHHHCHHHHHHHHHHHHHHHHCCCHHHHHHHHHCCCCHHHHHHHHHCCHHHHHC
Conf-Score: 888755315777644463525565898763541000558873365263022202455666677878887004598888767064299999999747666642

Sequence:   QVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPSPSGTLV
Predicted:  CCCCCCCCCCCCCCCHHHHCHHHHCCCCCCSSSSSSSCCCCCCCCCCSSSSCCCCCCCCCCCSSSSSCCCCCCCCSSSSCCCCCCCCCSSSSCCCCCCCCCC
Conf-Score: 599877567699854442101541541332479864358754456553541024888652112699807986310267512589998726840688766531

Sequence:   IGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
Predicted:  CCCCCCHHHHHHHCCHHHHHHHHHCCCCCCCCCCCCCHCCCC
Conf-Score: 010211112222100246665443013678898651020169
Secondary structure elements are shown as H for Alpha helix, S for Beta sheet & C for Coil
Predicted Solvent Accessibility by I-Tasser
Sequence:   MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDF
Prediction: 723312000000000101112222011200120120023333331200000102322003123724434241311436413610352044144313323230

Sequence:   WTIMENHNHSKESHTLQVILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQ
Prediction: 220132133351310001010021136231211333023032003016303403102321432433044143404422010333005103400630351154

Sequence:   QVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPSPSGTLV
Prediction: 342353313321443300000100101014010203346564435434135233334221320000000347533120214264144202020214542200

Sequence:   IGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
Prediction: 000001100000011100000001334446443132333438
Values range from 0 (buried residue) to 9 (highly exposed residue)
I-Tasser predicted five models with C-scores from -0.557 to -3.298. They are ranked from one to five as seen below. As cutoff for the C-score we use -1.5, as recommended by the Zhang group<ref>Ambrish Roy, Alper Kucukural, Yang Zhang. I-TASSER: a unified platform for automated protein structure and function prediction. Nature Protocols, vol 5, 725-738 (2010).</ref>. This cutoff is reported to give a false-positive and a false-negative rate of about 0.1, which means that about 90% of the quality predictions are correct. We therefore use only Model1 for the comparison with the other methods.
For Model1, I-Tasser estimates a TM-Score of about 0.64 and an RMSD of 7.7 Å. For the prediction, I-Tasser used 1a6zA, 1s7qA, 1i4fA, 1de4A, 2vabA and 2bckA as templates. The templates have an identity of about 40%, except for the self hit 1a6z. A special case is 1de4, the transferrin receptor in complex with the HFE protein (chain A), which is a self hit as well. The sequence in this case is also identical, but we cannot draw any conclusion about the 3D structure of the protein when it is bound to the receptor. Because of the self hit, we ran I-Tasser a second time with the constraint to exclude all templates with a sequence identity > 80%.
I-Tasser using templates with a sequence identity below 80% to avoid self hits.
To our surprise, the second run produced the same results, again based on the same self hit. At this point we have no idea what went wrong. Because the self hit is just one of the templates used to create the model, we decided to keep the best model (Model1) for the comparison with the other methods.
SwissModel
SwissModel is a server-based tool provided by the SIB. It combines tools like PSI-PRED and DISOPRED for secondary structure and disordered region prediction.
The SwissModel workspace is a web-based service dedicated to protein structure homology modelling. It provides a personal working environment in which several projects can be calculated in parallel. The environment provides tools for template selection, model building and structure quality evaluation. To find suitable templates for a given target protein, a library of experimental protein structures is searched<ref>http://bioinformatics.oxfordjournals.org/cgi/content/short/22/2/195</ref>.
The SwissModel repository is a database of annotated 3D protein structure models. The database consists of more than 3.4 million structures<ref>http://nar.oxfordjournals.org/content/37/suppl_1/D387.full</ref>.
All models were generated from the UniProt database with the SwissModel pipeline. From the SwissModel repository, the density of the QMEAN score is estimated to give an indication of the quality of the predicted model.
The model created by SwissModel is based on a self hit, and we had no way to exclude the protein itself from the prediction. We could only specify a particular template, so we also ran SwissModel in Alignment Mode, which gave us the chance to influence the alignment. As one can see, the QMEAN score densities of the Automated Mode and the Alignment Mode are the same, so the target (1a6z) and the template (1bii) are part of the same reference set. We take this as an indicator of a good template choice, because the template used in Alignment Mode lies in the same set as the target. We also rate this as evidence for the high diversity of the MHC class I family.
Automated Mode
Model information:
Modelled residue range: 26 to 297
Based on template: 1a6zC (2.60 Å)
Sequence Identity [%]: 100
Evalue: 7.66e-163
Quality information:
QMEAN Z-Score: -1.035
Even though the model is based on a self hit, the Z-Score is about -1, which means that the model lies about one standard deviation below the mean. Such a model is not unlikely, but it is also not the most probable one.
Alignment Mode
Model information:
Modelled residue range: 1 to 272
Based on template: 1bii_A
Quality information:
QMEAN Z-Score: -2.065
TARGET 26 RSH SLHYLFMGAS EQDLGLSLFE 1biiA 1 gsh slryfvtavs rpgfgeprym TARGET sss ssssssssss sss 1biiA sss ssssssssss sss TARGET 49 ALGYVDDQLF VFYDHES--R RVEPRTPWVS SRISSQMWLQ LSQSLKGWDH 1biiA 24 evgyvdntef vrfdsdaenp ryeprarwie -qegpeywer etrrakgneq TARGET ssssss sss sssss sss hhh hh hhhhh hhhhhhhhhh 1biiA ssssss sss sssss sss hh hhhhh hhhhhhhhhh TARGET 97 MFTVDFWTIM ENH-NHSKES HTLQVILGCE MQEDNST-EG YWKYGYDGQD 1biiA 73 sfrvdlrtal ryynqsaggs htlqwmagcd vesdgrllrg ywqfaydgcd TARGET hhhhhhhhhh hhh ssssssssss sss sss ss sssssss ss 1biiA hhhhhhhhhh hhh ssssssssss sss sssss sssssss ss TARGET 145 HLEFCPDTLD WRAAEPRAWP TKLEWERHKI RARQNRAYLE RDCPAQLQQL 1biiA 123 yialnedlkt wtaadmaaqi trrkweqa-g aaerdrayle gecvewlrry TARGET sssss s ss hh hhhhh hhhhhhhhh hhhhhhhhhh 1biiA sssss s ss hhh hhhhhhh hhhhhhhhh hhhhhhhhhh TARGET 195 LELGRGVLDQ QVPPLVKVTH HVTS-SVTTL RCRALNYYPQ NITMKWLKDK 1biiA 172 lkngnatllr tdppkahvth hrrpegdvtl rcwalgfypa ditltwqln- TARGET hhh ssssss sss ssss ssssss sssssss 1biiA hhh ssssss sss ssss ssssss sssss TARGET 244 QPMDAKEFEP KDVLPNGDGT YQGWITLAVP PGEEQRYTCQ VEHPGLDQPL 1biiA 221 geeltqemel vetrpagdgt fqkwasvvvp lgkeqkytch veheglpepl TARGET ss s sss s sssssssss sssss ss s 1biiA ss s sss s sssssssss sss ss s TARGET 294 IVIW 1biiA 271 tlrw- TARGET ss 1biiA ss
As one can see, the alignment shows a very similar secondary structure, and the 3D structure is very similar as well. However, one beta sheet that is not connected to the rest of the protein (chain B) is not part of the SwissModel model. The RMSD for the model is about 2.9 Å. This is quite a good result, but only the superimposed residues are used for the calculation, so the missing beta sheet does not enter into it.
MODELLER
MODELLER is a standalone application for protein structure modelling by satisfaction of spatial restraints. These restraints derive from different types of information, so the model need not be based on the target-template alignment alone (although it can be). MODELLER is capable of pairwise and multiple alignment, fold assignment and loop modelling.
We downloaded Modeller, installed it locally on our Windows PC and followed the examples given on the page Workflow homology modelling glucocerebrosidase.
Our target is the FASTA sequence of HFE_HUMAN. Our standard template for the single template-target alignment is chain A of 1BII, because it covers the whole sequence of HFE_HUMAN. For the multiple sequence alignment we used the protein structures 1S79 and 3P73 in addition to 1BII. 1S79 was chosen because of its relatively high sequence identity of about 37%, and 3P73 because it is a classical MHC class I molecule with a function similar to that of the HFE_HUMAN protein.
Single template-target
Scripts
script_pairwise-alignment-template-target.py
from modeller import *

env = environ()
aln = alignment(env)
# Read the template structure 1BII, starting at chain A
mdl = model(env, file='1BII.pdb', model_segment=('FIRST:A', 'END:A'))
aln.append_model(mdl, align_codes='1BII', atom_files='1BII.pdb')
# Add the target sequence
aln.append(file='hfe_human.pir', align_codes='HFE_HUMAN')
# Alignment that uses structural information from the template
aln.align2d()
aln.check()
aln.write(file='pairwise-2d.ali', alignment_format='PIR')
# Alignment based on the sequences only
aln.align()
aln.check()
aln.write(file='pairwise.ali', alignment_format='PIR')
script_pairwise-to-model.py
from modeller import *
from modeller.automodel import *

env = environ()

# Model 1: built from the sequence-only alignment
a = automodel(env,
              alnfile = 'pairwise.ali',      #file:pir:alignment
              knowns = '1BII',               #file:pdb:template
              sequence = 'HFE_HUMAN',        #id:target
              assess_methods=(assess.DOPE, assess.GA341))
a.starting_model = 1
a.ending_model = 1
a.make()

# Model 2: built from the alignment that uses structural information
b = automodel(env,
              alnfile = 'pairwise-2d.ali',   #file:pir:alignment
              knowns = '1BII',               #file:pdb:template
              sequence = 'HFE_HUMAN',        #id:target
              assess_methods=(assess.DOPE, assess.GA341))
b.starting_model = 2
b.ending_model = 2
b.make()
Alignments
We used two different alignments for Modeller, one that does not use structural information on the template side:
pairwise.ali
>P1;1BII structureX:1BII.pdb: 1 :A:+383 :P:MOL_ID 1; MOLECULE MHC CLASS I H-2DD; CHAIN A; FRAGMENT HEAVY CHAIN, EXTRACELLULAR DOMAINS; SYNONYM DD; ENGINEERED YES; MOL_ID 2; MOLECULE BETA-2 MICROGLOBULIN; CHAIN B; ENGINEERED YES; MOL_ID 3; MOLECULE DECAMERIC PEPTIDE; CHAIN P; ENGINEERED YES:MOL_ID 1; ORGANISM_SCIENTIFIC MUS MUSCULUS; ORGANISM_COMMON HOUSE MOUSE; ORGANISM_TAXID 10090; CELL_LINE BL21; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID PET-3A; MOL_ID 2; ORGANISM_SCIENTIFIC MUS MUSCULUS; ORGANISM_COMMON HOUSE MOUSE; ORGANISM_TAXID 10090; CELL_LINE BL21; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID PET-8C; MOL_ID 3: 2.40: 0.28 -------------------------GSHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYEPRAR WIEQE-GPEYWERETRRAKGNEQSFRVDLRTALRYYNQSAGGSHTLQWMAGCDVESDGRLLRGYWQFAYDGCDYI ALNEDLKTWTAADMAAQITRRKWEQAGAAER-DRAYLEGECVEWLRRYLKNGNATLLRTDPPKAHVTHHRRPEGD VTLRCWALGFYPADITLTWQLNGEEL-TQEMELVETRPAGDGTFQKWASVVVPLGKEQKYTCHVEHEGLPEPLTL RW/IQKTPQIQVYSRHPPENGKPNILNCYVTQFHPPHIEIQMLKNGKKIPKVEMSDMSFSKDWSFYILAHTEFTP TETDTYACRVKHDSMAEPKTVYWDRDM/RGPGRAFVTI* >P1;HFE_HUMAN sequence:reference: : : : :::-1.00:-1.00 MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDH--ESRRVEPRTP WVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKE-SHTLQVILGCEMQEDNST-EGYWKYGYDGQDHL EFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTS-SV TTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIV IW-------------EPSPSGTLVI---------------------GVISGIAVFVVILFIGILFIILRKRQGSR GAMGHYV-------LAERE-------------------*
And one that uses structural information on the template side:
pairwise-2d.ali
>P1;1BII structureX:1BII.pdb: 1 :A:+383 :P:MOL_ID 1; MOLECULE MHC CLASS I H-2DD; CHAIN A; FRAGMENT HEAVY CHAIN, EXTRACELLULAR DOMAINS; SYNONYM DD; ENGINEERED YES; MOL_ID 2; MOLECULE BETA-2 MICROGLOBULIN; CHAIN B; ENGINEERED YES; MOL_ID 3; MOLECULE DECAMERIC PEPTIDE; CHAIN P; ENGINEERED YES:MOL_ID 1; ORGANISM_SCIENTIFIC MUS MUSCULUS; ORGANISM_COMMON HOUSE MOUSE; ORGANISM_TAXID 10090; CELL_LINE BL21; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID PET-3A; MOL_ID 2; ORGANISM_SCIENTIFIC MUS MUSCULUS; ORGANISM_COMMON HOUSE MOUSE; ORGANISM_TAXID 10090; CELL_LINE BL21; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID PET-8C; MOL_ID 3: 2.40: 0.28 ---------------------G----SHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYEPRAR WIEQE-GPEYWERETRRAKGNEQSFRVDLRTALRYYNQSAGGSHTLQWMAGCDVESDGRLLRGYWQFAYDGCDYI ALNEDLKTWTAADMAAQITRRKWE-QAGAAERDRAYLEGECVEWLRRYLKNGNATLLRTDPPKAHVTHHRRPEGD VTLRCWALGFYPADITLTWQLNGEELT-QEMELVETRPAGDGTFQKWASVVVPLGKEQKYTCHVEHEGLPEPLTL RW/I---QKTPQIQVYSRHPPENGKPNILNCYVTQFHPPHIEIQMLKNGKKIPKVEMSDMSFSKDWSFYILAHTE FTPTETDTYACRVKHDSMAEPKTVYWDRDM/RGPGRAFVTI* >P1;HFE_HUMAN sequence:reference: : : : :::-1.00:-1.00 MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYD--HESRRVEPRTP WVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSK-ESHTLQVILGCEMQEDNS-TEGYWKYGYDGQDHL EFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTS-SV TTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIV IW-EPSPSGTLVIGVIS---------GIAVFVVILF--IGILFIILRK-RQGSRGAMGH---------YVLAERE -----------------------------------------*
The models are presented under Model comparison. Surprisingly, the model built with the structural information is worse than the model built without it. We think Modeller has difficulty threading the sequence of HFE_HUMAN onto the given structure of 1BII. From this we derive the possibility that 1a6z, which has a structure very similar to 1bii, has a different amino acid composition for this type of structure. At the moment, however, we have no way to test and prove this.
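The PIR alignment files shown above can also be inspected outside of Modeller. The following is a minimal sketch of a reader for this layout (a `>P1;code` line, one description line, then sequence lines terminated by `*`). It is not part of the MODELLER API; the function name `parse_pir` and the tiny example record are hypothetical and only serve as illustration.

```python
def parse_pir(text: str) -> dict:
    """Parse PIR alignment text into {code: {"description", "sequence"}}.

    Assumes each record is '>P1;code', one description line, then sequence
    lines ending with '*'. Chain breaks ('/') and gaps ('-') are kept as-is.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    records = {}
    i = 0
    while i < len(lines):
        if not lines[i].startswith(">P1;"):
            i += 1
            continue
        code = lines[i][4:]
        description = lines[i + 1] if i + 1 < len(lines) else ""
        i += 2
        chunks = []
        while i < len(lines):
            line = lines[i]
            if line.startswith(">P1;"):
                break  # next record started without a closing '*'
            i += 1
            if line.endswith("*"):
                chunks.append(line[:-1])
                break
            chunks.append(line)
        records[code] = {"description": description, "sequence": "".join(chunks)}
    return records


# Hypothetical miniature alignment, only to show the call.
example = """>P1;TEMPL
structureX:templ.pdb:1:A:+7:A::::
AC-DE
FG*
>P1;TARGET
sequence:target::::::::
ACWDE
F-*
"""
records = parse_pir(example)
for code, record in records.items():
    residues = sum(1 for c in record["sequence"] if c not in "-/")
    print(code, len(record["sequence"]), "columns,", residues, "residues")
```

Reading 'pairwise.ali' and 'pairwise-2d.ali' this way makes it easy to see where the two alignments place their gaps differently.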
Alignment: multiple template-target
Scripts
script_msa-align-templates.py
from modeller import *

env = environ()
aln = alignment(env)
# Read chain A of each template and add it to the alignment
for (code, chain) in (('1BII', 'A'), ('1S79', 'A'), ('3P73', 'A')):
    mdl = model(env, file=code, model_segment=('FIRST:'+chain, 'LAST:'+chain))
    aln.append_model(mdl, atom_files=code, align_codes=code+chain)
# Align the templates to each other
aln.salign()
aln.check()
aln.write(file='MSA.ali', alignment_format='PIR')
script_msa-align-target-to-msa.py
from modeller import *

env = environ()
aln = alignment(env)
# Load the template alignment and remember how many templates it holds
aln.append(file='MSA.ali', align_codes='all')
aln_block = len(aln)
# Add the target sequence and align it to the templates
aln.append(file='hfe_human.pir', align_codes='HFE_HUMAN')
aln.salign()
aln.check()
aln.write(file='MSA.ali', alignment_format='PIR')
script_msa-to-model.py
from modeller import *
from modeller.automodel import *

env = environ()
# Build one model from the three templates in the MSA
a = automodel(env,
              alnfile = 'MSA.ali',                       #file:pir:alignment
              knowns = ('1BIIA', '1S79A', '3P73A'),      #file:pdb:template
              sequence = 'HFE_HUMAN',                    #id:target
              assess_methods=(assess.DOPE, assess.GA341))
a.starting_model = 1
a.ending_model = 1
a.make()
Alignment
The MSA used by Modeller is:
>P1;1BIIA structureX:1BII:1 :A:+274 :A:MOL_ID 1; MOLECULE MHC CLASS I H-2DD; CHAIN A; FRAGMENT HEAVY CHAIN, EXTRACELLULAR DOMAINS; SYNONYM DD; ENGINEERED YES; MOL_ID 2; MOLECULE BETA-2 MICROGLOBULIN; CHAIN B; ENGINEERED YES; MOL_ID 3; MOLECULE DECAMERIC PEPTIDE; CHAIN P; ENGINEERED YES:MOL_ID 1; ORGANISM_SCIENTIFIC MUS MUSCULUS; ORGANISM_COMMON HOUSE MOUSE; ORGANISM_TAXID 10090; CELL_LINE BL21; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID PET-3A; MOL_ID 2; ORGANISM_SCIENTIFIC MUS MUSCULUS; ORGANISM_COMMON HOUSE MOUSE; ORGANISM_TAXID 10090; CELL_LINE BL21; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS; EXPRESSION_SYSTEM_PLASMID PET-8C; MOL_ID 3: 2.40: 0.28 -------------------------GSHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYEPRAR WIEQEGPEYWERETRRAKGNEQSFRVDLRTALRYYNQSAGGSHTLQWMAGCDVESDGRLLRGYWQFAYDGCDYIA LNEDLKTWTAADMAAQITRRKWEQAGAAERDRAYLEGECVEWLRRYLKNGNATLLRTDPPKAHVTHHRRPEGDVT LRCWALGFYPADITLTWQLNGEELTQ-EMELVETRPAGDGTFQKWASVVVPLGKEQKYTCHVEHEGLPEPLTLRW ---------------------------------------------------* >P1;1S79A structureX:1S79:100 :A:+103 :A:MOL_ID 1; MOLECULE LUPUS LA PROTEIN; CHAIN A; FRAGMENT CENTRAL RRM; SYNONYM SJOGREN SYNDROME TYPE B ANTIGEN, SS-B, LA RIBONUCLEOPROTEIN, LA AUTOANTIGEN; ENGINEERED YES:MOL_ID 1; ORGANISM_SCIENTIFIC HOMO SAPIENS; ORGANISM_COMMON HUMAN; ORGANISM_TAXID 9606; GENE SSB; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN BL21(DE3)PLYSS; EXPRESSION_SYSTEM_VECTOR PET28:-1.00:-1.00 -------------------------GRWILKNDVKNRSVYIKGFPTDATLDDIK--------------------- --------------------------------------------------------------------------- ----------------------------------------EWLEDKGQVLNIQMRRT------------------ --------------LHKAFKGSIFVV-FDSIESAKKFVETPGQKYKETDLLILFKDDYFAKKNEERKQNKVE--- ---------------------------------------------------* >P1;3P73A 
structureX:3P73:-1 :A:+275 :A:MOL_ID 1; MOLECULE MHC RFP-Y CLASS I ALPHA CHAIN; CHAIN A; FRAGMENT UNP RESIDUES 20-294; ENGINEERED YES; MOL_ID 2; MOLECULE BETA-2-MICROGLOBULIN; CHAIN B; ENGINEERED YES:MOL_ID 1; ORGANISM_SCIENTIFIC GALLUS GALLUS; ORGANISM_COMMON BANTAM,CHICKENS; ORGANISM_TAXID 9031; GENE YFV; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN TB1; EXPRESSION_SYSTEM_VECTOR_TYPE PLASMID; EXPRESSION_SYSTEM_PLASMID PMAL-P4X; MOL_ID 2; ORGANISM_SCIENTIFIC GALLUS GALLUS; ORGANISM_COMMON BANTAM,CHICKENS; ORGANISM_TAXID 9031; GENE B2M; EXPRESSION_SYSTEM ESCHERICHIA COLI; EXPRESSION_SYSTEM_TAXID 562; EXPRESSION_SYSTEM_STRAIN TB1; EXPRESSION_SYSTEM_VECTOR_TYPE PLASMID; EXPRESSION_SYSTEM_PLASMID PMAL-P4X: 1.32: 0.16 -----------------------EFGSHSLRYFLTGMTDPGPGMPRFVIVGYVDDKIFGTYNSKSRTA--QPIVE MLPQEDQEHWDTQTQKAQGGERDFDWNLNRLPERYNKSKG-SHTMQMMFGCDILEDGS-IRGYDQYAFDGRDFLA FDMDTMTFTAADPVAEITKRRWETEGTYAERWKHELGTVCVQNLRRYLEHGKAALKRRVQPEVRVWGKEADGILT LSCHAHGFYPRPITISWMKDGMVRDQ-ETRWGGIVPNSDGTYHASAAIDVLPEDGDKYWCRVEHASLPQPGLFSW EP------------------------------------------------Q* >P1;HFE_HUMAN sequence:reference: : : : :::-1.00:-1.00 MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVE-PRTPW VSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKE-SHTLQVILGCEMQEDNS-TEGYWKYGYDGQDHLE FCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTT LRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIW EPSPSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE*
Analysis of models
All models generated by MODELLER are visible at Model comparison.
Model comparison
After trying several tools (RasWin, JMol, SwissPDB-Viewer), we decided to use PyMol for superimposing and displaying the model-target alignment of the proteins. We truncated the original HFE_HUMAN protein (pdbid: 1a6z) at chain C, so only chains A and B are displayed. The original HFE_HUMAN is always shown in green and the model in red.
We created the pictures of our models in PyMol as follows:
- load '1A6Z_AB.pdb' into PyMol (alternatively: command 'fetch 1A6Z' and then hide chain C and D)
- hide everything
- show cartoon
- color red
- load 'model.pdb' into PyMol
- hide everything
- show cartoon
- color green
- align 'model.pdb' to '1A6Z_AB.pdb'
- command 'ray' (nicer output!)
- save the image
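The steps above can also be written down as a PyMol command script, so that every picture is produced the same way. The sketch below only generates the script text; it assumes the file names from the list, while the object names `hfe_ref` and `hfe_model`, the output name `overlay.png` and the function name `build_pymol_script` are our own illustrative choices. The colours follow the list above.

```python
def build_pymol_script(reference_pdb: str, model_pdb: str, image_png: str,
                       reference_color: str = "red",
                       model_color: str = "green") -> str:
    """Return a PyMol command script that overlays a model on a reference."""
    commands = [
        f"load {reference_pdb}, hfe_ref",
        "hide everything, hfe_ref",
        "show cartoon, hfe_ref",
        f"color {reference_color}, hfe_ref",
        f"load {model_pdb}, hfe_model",
        "hide everything, hfe_model",
        "show cartoon, hfe_model",
        f"color {model_color}, hfe_model",
        "align hfe_model, hfe_ref",  # superimpose the model on the reference
        "ray",                       # ray-trace for nicer output
        f"png {image_png}",          # save the image
    ]
    return "\n".join(commands) + "\n"


script = build_pymol_script("1A6Z_AB.pdb", "model.pdb", "overlay.png")
print(script)
```

The printed text can be saved to a `.pml` file and run from within PyMol.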
To evaluate our models with RMSD and TM-score, we used TMalign. We were advised to use SAP for the RMSD and TMScore for the TM-score, but TMScore failed because our target is the UniProt sequence of HFE_HUMAN and is therefore longer than the '1BII' template. This is a problem because TMScore needs PDB files of the same length, so its superposition does not really work here.
TMalign can handle PDB files of different lengths, and its scores are normalized by the second structure. We use '1A6Z' as the second structure so that the scores of all our models are comparable. Modelling HFE_HUMAN is very difficult because it is a multi-domain protein, and none of the methods supports multi-domain modelling.
TMalign can be found at the website of the Zhang-Lab.
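To make the normalization by the second structure concrete, the sketch below evaluates RMSD and TM-score for a fixed superposition, given the distances between aligned C-alpha pairs. It is not a replacement for TMalign, which additionally searches for the alignment and superposition that maximize the score. The function names are our own, the lower bound on d0 is a guard chosen for this sketch, and the length 272 in the example is purely illustrative.

```python
import math


def tm_d0(norm_length: int) -> float:
    """Length-dependent distance scale d0 of the TM-score."""
    if norm_length <= 15:
        raise ValueError("norm_length must be greater than 15")
    d0 = 1.24 * (norm_length - 15) ** (1.0 / 3.0) - 1.8
    return max(d0, 0.5)  # guard for very short chains (choice of this sketch)


def tm_score(distances, norm_length: int) -> float:
    """TM-score of one fixed superposition.

    distances   -- C-alpha distances (in Angstrom) of the aligned residue pairs
    norm_length -- length of the structure used for normalization
    """
    d0 = tm_d0(norm_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / norm_length


def rmsd(distances) -> float:
    """Root mean square deviation over the aligned residue pairs only."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


# Illustrative numbers: half of a 272-residue reference aligned perfectly.
print(tm_score([0.0] * 136, 272))
print(rmsd([3.0, 4.0]))
```

Residues that are not aligned lower the TM-score through the normalization but do not enter the RMSD at all, which is why a model can combine a low RMSD with a mediocre TM-score.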
Model | RMSD (Å) | TM-Score |
Modeller: superimpose, template:1BII | 2.58 | 0.86468 |
Modeller: superimpose, sse-support, template:1BII | 3.42 | 0.59586 |
Modeller: superimpose, msa, template:1BII,1S79,3P73 | 2.05 | 0.89042 |
I-Tasser | 1.61 | 0.93760 |
SwissModel | 2.67 | 0.85048 |
SwissModel self | 0.08 | 0.99984 |
As one can clearly see, the I-Tasser model is the best with a TM-Score of ~0.94, followed by the MSA model of MODELLER with a TM-Score of ~0.89 and the SwissModel model with a TM-Score of ~0.85.
The worst model is the MODELLER model that uses secondary structure information on the template side, with a TM-Score of ~0.6. We attribute this to the low sequence identity of only 22%.
All of our models are really good, except for the sse-supported model of MODELLER.
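The ranking described above can be reproduced directly from the table. The following sketch uses the RMSD and TM-Score values from the table; the shortened model labels and the function name `rank_by_tm_score` are our own, and the SwissModel self hit is excluded because it is a control rather than a prediction.

```python
# (label, RMSD in Angstrom, TM-Score) taken from the table above
results = [
    ("Modeller, single template 1BII", 2.58, 0.86468),
    ("Modeller, sse-support, template 1BII", 3.42, 0.59586),
    ("Modeller, msa, templates 1BII/1S79/3P73", 2.05, 0.89042),
    ("I-Tasser", 1.61, 0.93760),
    ("SwissModel", 2.67, 0.85048),
    ("SwissModel self", 0.08, 0.99984),
]


def rank_by_tm_score(rows, exclude=()):
    """Sort rows by TM-Score (highest first), skipping excluded labels."""
    kept = [row for row in rows if row[0] not in exclude]
    return sorted(kept, key=lambda row: row[2], reverse=True)


for label, rmsd_value, tm in rank_by_tm_score(results, exclude=("SwissModel self",)):
    print(f"{label:42s} RMSD {rmsd_value:5.2f}  TM {tm:.5f}")
```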
Jigsaw
Alignments used in Jigsaw:
Q30201 MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHES--RRVE-PRTPWVSSRISSQMWLQLSQSLKGWDHM
1BII_A -MGAMAPRTLLLLLAAALGPTQTRAGSHSLRYFVTAVSRPGFGEPRYMEVGYVDNTEFVRFDSDAENPRYE-PRARWIE-QEGPEYWERETRRAKGNEQS

Q30201 FTVDFWTIMENHN-HSKE--------SHTLQV-ILGCEMQED-NST-E------------GYWKYGYD----------------GQDHLEFCPDTLDW--
1BII_A FRVDLRTALRYYNQSAGG--------SHTLQW-MAGCDVESD-GRLLR------------GYWQFAYD----------------GCDYIALNEDLKTW--

Q30201 RAA-E----PRA-WPTKLEWERHK--IRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVT----S-SVTTLRCRALNYYPQNITMKWLKD
1BII_A TAA-D----MAA-QITRRKWEQA---GAAERDRAYLEGECVEWLRRYLKNGNATLLRTDPPKAHVTHHRR----PEGDVTLRCWALGFYPADITLTWQLN

Q30201 K-QPMDAKEFEPKDVLPNG----DGTYQGWITLAVPPGEE---QRYTCQVEHPGLDQ-PLIVIWEPSPSGTLVIGVISGIAVFVVILFIGILFIILRKRQ
1BII_A --GEELTQEMELVETRPAG----DGTFQKWASVVVPLGKE---QKYTCHVEHEGLPE-PLTLRWGKEEPPSSTKTNTVIIAVPVVLGAVVILGAVMAFVM

Q30201 GSRGAMGHYVLAERE----------------
1BII_A KRRRNTGGKGGDYALAPGSQSSDMSLPDCKV
Discussion
For the I-Tasser protocol, it is not possible to choose a specific template, so we ran I-Tasser twice: first with standard parameters, and then with a sequence identity threshold of 80%. In the second case, we again got a model based on a self hit. We repeated the prediction a third time with the same result. We did not find out why the given threshold was ignored.
Our attempts to find homologues in all given categories (>60%, >40%, >20%) were not successful, because HHSearch was not able to list matching ones. A BLAST search against the NR database failed too and returned only proteins with 40% or less sequence identity. We therefore conclude that the HFE family must have a very high sequence diversity.
All of our models are built using only chain A and never chain B. We cannot fix that problem because, according to the PDB, there is no contact between chain A and chain B (this is also clearly visible if you look closely at the pictures above and search on the left for the plain beta sheets in red).
The templates we chose from the HHSearch results were meant to cover the whole protein sequence and to give special coverage of the transmembrane region. As we saw later, however, the tools do not support multiple sequence alignments. We therefore decided to use '1BII' as the template for SwissModel and Modeller, because it covers the sequence completely and, with a sequence similarity of 22%, lies in the lower midrange of the HHSearch results. '1S79' has a higher sequence identity of 37% but is very poorly conserved with respect to HFE_HUMAN. We decided to rank coverage of the whole sequence higher than sequence identity.
References
<references />