Difference between revisions of "Werner Syndrome"
As expected, the model contains 'unfolded' regions that result from the lack of template structures. The long disordered region cannot be modeled because no suitable template structure is available for it (such a flexible region cannot be crystallized). This is an obstacle for the modeling process: since it is not possible to generate a structure that covers the entire protein, the model will have to be built in parts. The schematic view of the protein sequence and its defined domains (figure 26) shows two portions of the protein, separated by the disordered region. The first portion contains an exonuclease domain, which is a unique feature of the WRN protein: among the five human RecQ DNA helicases, it is the only one with a functional exonuclease domain. This domain has been characterized in terms of its structure, biochemical activity and participation in DNA end joining [32]. It is isolated from the rest of the protein by the disordered region. The C-terminal half of the protein contains at least three domains and the nuclear localization signal (NLS). These domains are (in order from the N-terminus) the RecQ domain, which is the longest domain within the protein sequence; the helicase-and-ribonuclease D/C-terminal (HRDC) domain, of unknown function (recently reported to be important for the dissolution of double Holliday junctions [33]); and a domain that has only been characterized as a C-terminal domain. After these domains (which are separated from each other by a few residues predicted as secondary disordered regions) comes the nuclear localization signal. Therefore, one alternative would be to build two separate models: one covering the exonuclease domain and the other covering the RecQ, HRDC and C-terminal domains of the WRN protein.
Revision as of 16:00, 14 June 2011
ABOUT THIS WIKI
This Wiki Page was created and is maintained by Mainá Bitar for the "Protein Structure and Function Analysis Practical" of 2011 (SS).
WERNER SYNDROME
You can download a PDF version of this information here: Media:WS.pdf.
Introduction
Werner Syndrome (WS) is an autosomal recessive disorder, also known as Adult Progeria. The syndrome was first described in 1904 by Otto Werner (and is therefore named after him) in his PhD thesis entitled “Über katarakt in Verbindung mit Sklerodermie” (which can be translated as “On cataracts in connection with scleroderma”). In the first 90 years of research on WS, over 1000 patients were reported, 75% of whom were of Japanese descent (figure 1) [1]. WS is one of several types of segmental progeroid syndromes, which affect multiple tissues and organs (unimodal syndromes, in contrast, predominantly affect a single organ) [2].
<center|font size=2>Figure 1: Distribution of Werner Syndrome by nationality as registered from 1904 until 1994 [1]. All countries with at least one patient are shaded.
Symptoms
As one might expect, the most notable symptoms of WS mimic those of the more general condition called Progeria, with a complex phenotype of accelerated aging. Patients prematurely acquire the appearance of someone several decades older, accompanied by loss or graying of hair, scleroderma-like skin and voice alterations, usually around the second or third decade of life [3]. The phenotype of WS has been summarized as a “caricature of aging” [1]. In general, subjects develop normally until adolescence, when the usual growth spurt is absent. The clinical manifestations usually include atherosclerosis, osteoporosis, diabetes, lenticular cataracts, heart failure, cancer and other age-related conditions that appear during early adulthood, following puberty (figure 2) [1, 2]. The typical cause of death is cancer or cardiovascular disease, often occurring between the fourth and fifth decades of life. While in 1966 the median age of death was 47 years [4], in 1997 Makoto Goto reported a surprising median age of death of 54 years [1]. At the cellular level, a reduction in the replicative rate is often observed (cellular senescence), and genomic instability is present in the form of chromosome breaks and translocations, as well as large deletions at the molecular level. A higher occurrence of somatic mutations is also associated with the syndrome [1-4].
<center|font size=2>Figure 2: Appearance of symptoms in Werner Syndrome patients. The horizontal axis represents the average age at which each clinical manifestation occurs, according to Makoto Goto in his study from 1997 [1].
Related Gene and Protein
The gene related to WS (WRN) was mapped to chromosome 8p12 and cloned for the first time in 1996 [5], and was found to be homologous to members of the DNA helicase family. Later on, the protein was shown to catalyze DNA unwinding, as expected [5]. The gene has 35 exons, 34 of which are protein coding, and the predicted protein has 1,432 amino acids. At first, four mutations in this gene were identified in WS patients, all leading to truncated proteins of no more than 1,250 amino acids. The mutated proteins were shorter, but the mutations did not affect the helicase domain itself, which is situated between amino acids 569 and 859. Five additional mutations were identified shortly after, including two nonsense mutations, one mutation at a splice-junction site and a deletion leading to a frameshift [3]. The majority of these mutations directly affect the helicase domain: two are located within the domain and two others lead to its loss through truncation of the protein.
The WRN gene can be divided into three distinct regions. The N-terminal portion, comprising codons 1-539, is mainly acidic and was first described as having no homology to known genes. This region is now known to contain an exonuclease domain, an unusual feature among members of this protein family [6]. A similarly high concentration of acidic residues is also observed in genes associated with other DNA repair-deficiency disorders. The central portion of the gene, from codon 540 to 963, is closely related to helicases from several organisms and presents all seven conserved motifs that characterize this protein family. The C-terminal end of the WRN gene contains a nuclear localization signal (NLS) [7]. Two additional domains are found between the helicase domain and the NLS, namely a RecQ helicase conserved region (RQC) and a helicase RNaseD C-terminal conserved region (HRDC, figure 3) [2]. The WRN protein is likely to act in DNA repair, recombination and replication, as well as in the maintenance of telomeres.
<center|font size=2>Figure 3: Basic structure of the WRN gene, with known domains highlighted according to description made by Friedrich et al. [2].
The majority of the first 60 described mutations associated with WS lead to the loss of the NLS, causing the protein to accumulate outside the nucleus and therefore be unable to perform its function [2]. Nevertheless, in the past year, 18 new mutations were reported, yielding a total of 71 disease-associated WRN mutations identified in clinically diagnosed patients (table 1) [2].
<center|font size=2>Table 1: Mutations of the WRN gene associated with Werner Syndrome, according to the mutation report by Friedrich et al., 2010. The total number of known mutations by class is given [2].
References
1. Goto M (1997). Hierarchical deterioration of body systems in Werner’s syndrome: Implications for normal ageing. Mechanisms of Ageing and Development, 98:239–254
2. Friedrich K, Lee L, Leistritz DF, Nurnberg G, Saha B, Hisama FM, Eyman DK, Lessel D, Nurnberg P, Li C, Garcia FVMJ, Kets CM, Schmidtke J, Cruz VT, Van den Akker PC, Boak J, Peter D, Compoginis G, Cefle K, Ozturk S, Lopez N, Wessel T, Poot M, Ippel PF, Groff-Kellermann B, Hoehn H, Martin GM, Kubisch C and Oshima J (2010). WRN mutations in Werner syndrome patients: genomic rearrangements, unusual intronic mutations and ethnic-specific alterations. Human Genetics, 128:103-11
3. Yu CE, Oshima J, Wijsman EM, Nakura J, Miki T, Piussan C, Matthews S, Fu YH, Mulligan J, Martin GM, Schellenberg GD and the Werner's Syndrome Collaborative Group (1997). Mutations in the consensus helicase domains of the Werner Syndrome gene. American Journal of Human Genetics, 60:330-341
4. Epstein CJ, Martin GM, Schultz AL, Motulsky AG (1966). Werner's syndrome: a review of its symptomatology, natural history, pathologic features, genetics and relationship to the natural aging process. Medicine, 45:177-221
5. Yu CE, Oshima J, Fu YH, Wijsman EM, Hisama F, Alisch R, Matthews S, Nakura J, Miki T, Ouais S, Martin GM, Mulligan J, Schellenberg GD (1996). Positional cloning of the Werner’s syndrome gene. Science, 272:258–262
6. Huang S, Li B, Gray MD, Oshima J, Mian IS, Campisi J (1998). The premature ageing syndrome protein, WRN, is a 3' → 5' exonuclease. Nature Genetics, 20:114–116
7. Suzuki T, Shiratori M, Furuichi Y, Matsumoto T (2001). Diverged nuclear localization of Werner helicase in human and mouse cells. Oncogene, 20:2551-2558
A. Huang S, Lee L, Hanson NB, Lenaerts C, Hoehn H, Poot M, Rubin CD, Chen DF, Yang CC, Juch H, Dorn T, Spiegel R, Oral EA, Abid M, Battisti C, Lucci-Cordisco E, Neri G, Steed EH, Kidd A, Isley W, Showalter D, Vittone JL, Konstantinow A, Ring J, Meyer P, Wenger SL, von Herbay A, Wollina U, Schuelke M, Huizenga CR, Leistritz DF, Martin GM, Mian IS and Oshima J (2006). The spectrum of WRN mutations in Werner syndrome patients. Human Mutation, 27:558-567
SEQUENCE ALIGNMENTS
17.05
About Sequence Alignments
For the structural biologist involved in the study of proteins and interested in structure-function relationships or comparative modeling, an accurate alignment is as important as a significant similarity between the sequences [8]. For this reason, when studying protein structures, one should dedicate care and effort to the generation and analysis of such alignments. In this context, a sequence alignment can be defined as a method of comparison that takes into account typical conservations, substitutions, insertions and deletions of amino acids and represents them in a graphical format. In general, amino acid substitutions that do not change physico-chemical properties are considered conservative with respect to protein function and structure. On the other hand, a substitution that leads to physico-chemical changes is frequently non-conservative (although exceptions to both rules can be observed). For this reason, when evaluating the conservation of the structure and function of a protein from its amino acid substitutions, one should be able to evaluate the relationship between the residues involved. This can be accomplished, for example, by grouping amino acids according to their properties in a Venn diagram (figure 5). In this context, an optimal alignment can be defined as the one with the highest score, given a scoring system that considers these characteristics.
<center|font size=2>Figure 5: Venn diagram for the 20 most common amino acids. The diagram makes explicit, in graphical form, the structural and physico-chemical relationships among the residues.
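The idea of an optimal alignment as the highest-scoring one under a given scoring system can be illustrated with a minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). The match, mismatch and gap values below are illustrative assumptions and were not used in this work; practical tools typically replace the simple match/mismatch rule with a substitution matrix that reflects physico-chemical similarity.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of sequences a and b.

    Illustrative scoring only: +match for identical residues,
    +mismatch for different residues, +gap for each gap position.
    """
    # prev[j] holds the best score for aligning the residues of `a`
    # processed so far against b[:j]; initially a[:0] against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            ))
        prev = curr
    return prev[-1]
```

For example, two identical three-residue sequences score 3 under these assumed parameters, while deleting the middle residue from one of them costs one gap and lowers the score to 0.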
The first computer programs capable of comparing biological sequences in a reasonable amount of time were developed more than 25 years ago. From that point on, the process of inferring homology from sequence similarity became routine and considerably more reliable. However, although sequence alignment is a well-established method, homology inference from sequence similarity can be rather controversial. For this reason, to infer homology between two proteins, one should not only consider the similarity between them, but also evaluate how likely it is that the proteins are actually biologically related [8]. Usually, a sequence similarity of 30% is considered a reasonable cut-off above which two proteins can be considered possible homologues. The second step is then to conclude that this similarity is not the result of convergence from evolutionarily unrelated origins. To guide this evaluation, one can take advantage of specific features of related sequences that can be recognized by mathematical models and clearly distinguished from what is observed in comparisons of randomly generated sequences [9]. In 2005, Pearson and Sierk [8] presented important conclusions, including: i) the scores attributed to alignments of unrelated sequences are not distinguishable from scores of randomly generated sequences; ii) if the score obtained for similar sequences cannot be reproduced with a set of randomly generated sequences, then those sequences must be evolutionarily related; iii) sequences that share statistically significant similarity are considered homologues. In addition to these general rules, when proteins share an uncommonly high degree of structural similarity, it is suggested that they are unrecognized homologues.
One can therefore conclude that high-quality alignments are crucial to the analysis of the structure and function of a protein and to the search for homologues and related sequences. This statement also applies to the cases mentioned above, in which protein structures are more conserved than the corresponding sequences.
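The comparison against randomly generated sequences described above can be sketched as a simple shuffling test. This is only an illustration under stated assumptions: the local alignment scoring values, the number of shuffles and the function names are hypothetical, and published methods estimate significance with fitted statistical distributions rather than with this plain empirical count.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local (Smith-Waterman style) alignment score; illustrative scoring."""
    prev = [0] * (len(b) + 1)
    best = 0
    for residue_a in a:
        curr = [0]
        for j, residue_b in enumerate(b, 1):
            substitution = match if residue_a == residue_b else mismatch
            score = max(0,
                        prev[j - 1] + substitution,
                        prev[j] + gap,
                        curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffle_p_value(query, subject, n_shuffles=200, seed=0):
    """Fraction of shuffled subjects scoring at least as well as the real one.

    Shuffling keeps the residue composition of `subject` but destroys its
    order, which gives a crude model of an unrelated sequence.
    """
    rng = random.Random(seed)
    observed = local_score(query, subject)
    residues = list(subject)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if local_score(query, "".join(residues)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)
```

A small value suggests that the observed score is unlikely to arise from sequences of the same composition in random order. The smallest value this sketch can return is 1/(n_shuffles + 1).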
References
8. Pearson WR and Sierk ML (2005). The limits of protein sequence comparison? Current opinion in structural biology, 15(3):254–260
9. Pearson WR (1998). Empirical statistical estimates for sequence similarity searches. Journal of molecular biology, 276(1):71–84
Methodology
Reference Sequence: To acquire a reference sequence, a keyword search was performed at the National Center for Biotechnology Information (NCBI [1]) using the term "WRN Protein" and restricting the organism group to Homo sapiens. The retrieved sequence (figure 6) is 1432 amino acids long and is registered under the GI number 3719421. The sequence is as follows:
<center|font size=2>Figure 6: Amino acid sequence of the human WRN protein (GI 3719421) as retrieved from NCBI.
For this work the WRN protein was used as the query in a BLAST (Basic Local Alignment Search Tool [2]) search against the non-redundant protein sequence database (nr). This search retrieved 116 related proteins with E-values between 0 and 3e-91. The search also recognized different conserved domains within the protein sequence, as shown below (figure 7).
<center|font size=2>Figure 7: Conserved domains identified in the sequence of the WRN protein. <center|font size=2>Note: Compare this result with the structure of the WRN gene, presented in Figure 3.
Among the sequences retrieved, the first 20 will be further used to generate sequence alignments and help describe basic features of WRN proteins and related sequences. These 20 best hits are listed in the figure below (figure 8).
<center|font size=2>Figure 8: The 20 best results for the BLAST search performed for the human WRN protein.
From these 20 sequences, a curated input file was generated for further multiple sequence alignments. In this file, each represented species has only one WRN protein sequence (except for Homo sapiens, for which different isoforms are included). A long insertion was observed at the very beginning of the sequence from Canis familiaris. This region was deleted from the input file to facilitate the visualization of the alignments (the original sequence was saved for possible further analysis of the insertion). The final version of the input file has 16 sequences, namely 5 from Homo sapiens, one synthetic protein and 10 proteins from different mammalian species. The first feature to be noticed is that the proteins are highly conserved. This is expected from the results of the BLAST search, since the hits were characterized by a high percentage of identity/similarity and query coverage. Highly conserved proteins are often involved in crucial functions in the cell and can therefore bear only subtle changes in their sequence/structure throughout the course of evolution. This is probably the case for the WRN protein, since its function is related to DNA maintenance and replication.
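The curation rule described above (one sequence per species, with all Homo sapiens isoforms kept) can be sketched as follows. This is a hypothetical reconstruction, not the procedure actually used: it assumes FASTA headers that end with the organism name in square brackets, and all function names are invented for illustration.

```python
import re


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def species_of(header):
    """Organism name taken from a trailing '[...]' in the header (assumed format)."""
    found = re.search(r"\[([^\[\]]+)\]\s*$", header)
    return found.group(1) if found else "unknown"


def one_per_species(records, keep_all=("Homo sapiens",)):
    """Keep the first record of each species, and every record of `keep_all` species."""
    seen, kept = set(), []
    for header, sequence in records:
        species = species_of(header)
        if species in keep_all or species not in seen:
            kept.append((header, sequence))
        seen.add(species)
    return kept
```

Because BLAST hits are usually listed from best to worst, keeping the first record per species retains the best-scoring hit when the input preserves that order. Records without a recognizable organism are grouped under "unknown", so only the first of them is kept.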
Multiple sequence alignments were performed with 4 different tools: ClustalW, T-Coffee, Promals and Cobalt. Each of those alignments was subsequently analyzed with DNATagger [10], a web tool for colorizing alignments according to codon properties. In this case, the color scheme takes into account the physico-chemical properties of the residues, as listed by Lehninger.
Blue (Positive amino acids) - Histidine, Lysine and Arginine (depending on pH).
Red (Negative amino acids) - Aspartic acid and Glutamic acid (depending on pH).
Green (Polar but non-charged amino acids) - Cysteine, Glycine, Asparagine, Glutamine, Serine, Threonine and Tyrosine.
Yellow (Non-polar, non-charged amino acids) - Alanine, Phenylalanine, Isoleucine, Leucine, Methionine, Proline, Valine and Tryptophan.
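The color scheme listed above can be written down as a small lookup table, which also makes it easy to check whether an alignment column is conserved at the level of physico-chemical class. This sketch only encodes the four groups as given in the list; the function names are hypothetical and it does not reproduce how DNATagger works internally.

```python
# One-letter codes for the four classes listed above.
COLOR_GROUPS = {
    "blue": "HKR",         # positive
    "red": "DE",           # negative
    "green": "CGNQSTY",    # polar, non-charged
    "yellow": "AFILMPVW",  # non-polar, non-charged
}

# Invert the table: residue -> color.
RESIDUE_COLOR = {
    residue: color
    for color, residues in COLOR_GROUPS.items()
    for residue in residues
}


def color_of(residue):
    """Color class of a residue, or 'none' for gaps and unknown symbols."""
    return RESIDUE_COLOR.get(residue.upper(), "none")


def column_is_class_conserved(column):
    """True if every residue in an alignment column belongs to the same class."""
    colors = {color_of(residue) for residue in column}
    return len(colors) == 1 and "none" not in colors
```

For instance, a column containing isoleucine, leucine and valine is conserved at the class level (all yellow), whereas a column mixing isoleucine and lysine is not.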
The alignment performed by Promals is shown in the next figure (figure 9). The other multiple sequence alignment results (colored by DNATagger as explained above) can be downloaded in PDF format as listed below.
<center|font size=2>Figure 9: The first part of the multiple sequence alignment generated by Promals for the curated set of WRN protein homologues.
As each method has its own algorithm and structure from which it generates the alignments, one can expect their running times to differ. Indeed, when the same input file was submitted to the different programs, some were faster than others. A summary of the time needed to generate the results is presented below (figure 10).
<center|font size=2>Figure 10: Summary of time needed to generate multiple sequence alignments with each method, given the same input file.
SEQUENCE-BASED ANALYSIS
24.05
Unless otherwise stated, all the predictions completed in this task were performed with only one input file, the protein sequence in FASTA format [3].
Secondary Structure Prediction
In the protein world, the term secondary structure refers to stable local sub-structures [10], the basic three-dimensional patterns that form the overall protein structure. The most frequent and relevant secondary structure elements are the alpha helix and the beta strand (figure 11). These structural patterns arise from the formation of hydrogen bonds connecting the peptide groups of the main chain. Other protein regions are ordered but do not form any regular secondary structure. Those regions are often called loops and can have important roles in protein function.
<center|font size=2>Figure 11: Structural representations of the main secondary structure elements. An alpha helix (at the bottom) and a beta strand (at the top) coordinating a zinc ion in a so-called zinc finger structure.
Several bioinformatics tools are currently available to predict such secondary structure elements. One of the most widely used and accurate is PsiPred [11], which was used to infer the secondary structure of the WRN protein. PsiPred is available as an online server [4], together with other sequence-based analysis methods. The full result from PsiPred can be downloaded as a PDF (below). Another secondary structure prediction tool is Jpred [12], which is also available as an online server [5]. Unfortunately, Jpred cannot perform predictions for proteins longer than 800 amino acids. As the WRN protein has more than 1400 residues, the sequence was split (roughly in half) and two predictions were made.
Media:PsiPred.pdf Media:JpredFirstHalf.pdf Media:JpredSecondHalf.pdf
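Splitting a long sequence to respect a server length limit, as was done for Jpred, can be sketched as below. The function name is hypothetical and the exact cut point used in this work is not stated; the sketch simply divides the sequence into the smallest number of nearly equal parts that each fit the limit and records the original residue numbering of each part.

```python
def split_for_length_limit(sequence, max_len=800):
    """Split `sequence` into the fewest nearly equal parts of at most `max_len`.

    Returns a list of (first_residue, last_residue, fragment) tuples, with
    residue numbers given as 1-based positions in the original sequence.
    """
    if max_len < 1:
        raise ValueError("max_len must be positive")
    n_parts = -(-len(sequence) // max_len)  # ceiling division
    if n_parts == 0:
        return []
    base, extra = divmod(len(sequence), n_parts)
    parts, start = [], 0
    for index in range(n_parts):
        # The first `extra` parts get one additional residue.
        end = start + base + (1 if index < extra else 0)
        parts.append((start + 1, end, sequence[start:end]))
        start = end
    return parts
```

For a 1432-residue protein and a limit of 800, this yields two fragments covering residues 1-716 and 717-1432. Predictions close to the cut point may be less reliable, because each fragment lacks the sequence context on the other side of the cut.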
Transmembrane Helix Prediction
In the same online server that hosts PsiPred, one can run predictions of transmembrane helix regions with MEMSAT, a widely used transmembrane topology prediction method [13]. There are two different versions of this method, namely MEMSAT 3 and MEMSAT SVM (which uses support vector machines). As discussed before, the WRN protein is nuclear and therefore should not contain a transmembrane region. Nevertheless, this method points to the presence of one possible transmembrane helix, and both versions (3 and SVM) identify this helix in the same region (approximately between residues 550 and 600). The difference lies in the orientation of the protein (figure 12). While MEMSAT 3 classifies the N-terminal region as cytoplasmic, MEMSAT SVM suggests the exact opposite, placing the protein 'upside-down' with the N-terminus in the extracellular compartment. This inconsistency between the results points to a non-transmembrane protein, as should be the case for the WRN protein.
<center|font size=2>Figure 12: Summary of MEMSAT prediction results. The version (3 or SVM) is indicated and the orientation of the molecule is taken from the indication of the N/C terminal regions. The yellow box represents the predicted transmembrane helix and the numbers inside the box correspond to the first and last residue of the helix.
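The comparison of orientations between the two MEMSAT versions can be expressed as a small consistency check over topology predictions. This is a sketch under assumptions: the data structure and function name are hypothetical, and the helix boundaries (550, 600) are the approximate region quoted in the text, not the exact values shown in the figure.

```python
from collections import namedtuple

# tool: program name; n_terminus: predicted side of the N-terminus;
# helices: list of (first_residue, last_residue) for predicted helices.
Prediction = namedtuple("Prediction", "tool n_terminus helices")


def topology_conflicts(predictions):
    """Pairs of tools that predict the same number of helices (at least one)
    but place the N-terminus on opposite sides of the membrane."""
    conflicts = []
    for index, first in enumerate(predictions):
        for second in predictions[index + 1:]:
            same_count = len(first.helices) == len(second.helices)
            if same_count and first.helices and first.n_terminus != second.n_terminus:
                conflicts.append((first.tool, second.tool))
    return conflicts


# Approximate values taken from the description above.
memsat_results = [
    Prediction("MEMSAT 3", "cytoplasmic", [(550, 600)]),
    Prediction("MEMSAT SVM", "extracellular", [(550, 600)]),
]
```

Calling `topology_conflicts(memsat_results)` reports the single conflicting pair, which corresponds to the 'upside-down' disagreement discussed above.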
Another method for transmembrane helix prediction comes from the Stockholm Bioinformatics Center. It is available as an online server [6] in two versions, OCTOPUS [14] and SPOCTOPUS [15]. The main difference between these versions is that SPOCTOPUS has an integrated signal peptide prediction. One reason for this approach is a weakness of many transmembrane region prediction methods, which confuse these regions with signal peptides, since both are composed mainly of hydrophobic residues. When these two algorithms were used, the result was similar to the one observed with the MEMSAT predictions. Both OCTOPUS and SPOCTOPUS predicted a transmembrane helix approximately between residues 550 and 600. Again, the difference is that the two programs predict opposite orientations for the protein.
<center|font size=2>Figure 13: Summary of OCTOPUS prediction results. The version (OCTOPUS or SPOCTOPUS) is indicated and the orientation of the molecule is taken from the color code.
A third approach is to use the Phobius method. There are two versions of this algorithm, both available online [7]. Phobius [16] and PolyPhobius [17] predict transmembrane helices and signal peptides. From the results (figure 14) we can conclude that this approach predicts the same transmembrane helix as the other methods. On the other hand, no signal peptide was predicted.
<center|font size=2>Figure 14: Summary of Phobius prediction results. The version (Phobius or PolyPhobius) is indicated.
Although all of the aforementioned prediction methods show a transmembrane helix in the region around residues 550 - 600, it is unlikely that this is really a transmembrane region. First, this is a nuclear protein, which means a low probability of a transmembrane portion. Second, the predicted helix is contained within the ATP binding domain, a well-defined protein domain that takes part in protein function. A transmembrane helix in this region would disturb protein function, which makes its presence unlikely.
The last tool is also the most well-established of the available methods for transmembrane helix prediction. According to its authors, in 2001 TMHMM [20] was rated best in an independent comparison of programs for prediction of transmembrane helices [21]. The results from this method point to the absence of transmembrane helices in the WRN protein. This result is expected for a nuclear protein and, together with the other evidence, leads to the conclusion that no transmembrane helices are present in the WRN protein.
As the WRN protein was shown not to contain transmembrane helices, and therefore yields a negative result with TMHMM, known transmembrane proteins were chosen as examples of a positive result with this method (figure sup. 1). Five different proteins were selected, as follows:
1. The human Retinol-binding protein 4 (GI:62298174).
2. The human Insulin-like peptide INSL5 (GI:205371762).
3. A Bacteriorhodopsin from Halobacterium (GI:114811).
4. The human Lysosome-associated membrane glycoprotein 1 (GI:206729915).
5. The human Amyloid beta A4 protein (GI:112927).
<center|font size=2>Figure Sup. 1: TMHMM results for the five proteins briefly described above (in order of listing). The color scheme can be observed in the first plot, with red standing for the predicted transmembrane regions.
The same analysis was performed with OCTOPUS and SPOCTOPUS, and the results are as follows (figure sup. 2). The most interesting observation from these results is the advantage of the SPOCTOPUS prediction method. As is clear from the plots below, OCTOPUS erroneously predicts several transmembrane helices that are in fact signal peptides according to SPOCTOPUS. For the first two sequences, the retinol-binding protein and the insulin-like peptide, TMHMM predicts no transmembrane helices, while OCTOPUS predicts that each sequence contains one N-terminal helix. These helices are concluded to be artifacts after analysis of the SPOCTOPUS results, which point to the presence of a signal peptide in both cases. These results justify the development of methods that predict not only transmembrane helices but also signal peptides, which can otherwise appear as false positive helices in this context.
<center|font size=2>Figure Sup. 2: OCTOPUS (top 5 plots) and SPOCTOPUS (bottom 5 plots) results for the five proteins briefly described above (in order of listing). The color scheme can be observed in the first plot. It is interesting to compare the OCTOPUS results with the corresponding SPOCTOPUS results and observe the false positive transmembrane helices predicted by OCTOPUS and revealed (as signal peptides) by SPOCTOPUS.
Signal Peptide Prediction
Proteins are synthesized in the cytoplasm, but can later have different subcellular localizations. To correctly target proteins to a specific organelle inside the cell, or even to the extracellular space, there are signals within the sequence (or structure) of these proteins. Several computational methods were designed to search for such signals in protein sequences. These methods take advantage of specific properties of the localization signals to predict the subcellular localization of a protein, given its sequence. In this context, four different programs were used specifically to predict localization signals within the sequence of the WRN protein. Additionally (as discussed), some transmembrane helix prediction tools also report the presence of such signals as additional information. As discussed, the WRN protein is located inside the nucleus. Therefore, it has a Nuclear Localization Signal (NLS) within its sequence (figure 3). Although this is known and has been proven experimentally, none of the programs above (MEMSAT, SPOCTOPUS, Phobius and PolyPhobius) was able to predict the NLS. This is probably due to limitations of the methodologies, and to address this, other approaches were tested.
As an alternative, SignalP [18] was used. This is a well established and widely used program. Once again, the NLS was not identified by the method (results not shown).
One final attempt was to use the Predict Protein server [19]. This time, the NLS was found by the server. The short sequence (ERKRRL) starts at position 1401 in the protein sequence (figure 15). One possible explanation for the lack of results with the other methods is that they were not specifically trained to identify the NLS, which is a specific type of localization signal. Therefore, although suitable for the identification of several distinct localization signals, these methods can fail to predict the presence of a nuclear transport signal.
<center|font size=2>Figure 15: Prediction of a NLS within the WRN protein sequence, made by the Predict Protein server.
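Locating a known motif such as ERKRRL in a sequence, and scanning for short stretches rich in basic residues, can be sketched as follows. This is not the method used by the Predict Protein server: the window size and threshold are arbitrary assumptions, the function names are hypothetical, and the toy sequence is invented for illustration (it is not the WRN sequence).

```python
def find_motif(sequence, motif):
    """1-based start positions of every occurrence of `motif` in `sequence`."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = sequence.find(motif, start + 1)
    return positions


def basic_rich_windows(sequence, window=6, min_basic=4):
    """Windows with at least `min_basic` lysine/arginine residues.

    Returns (1-based start, window sequence) tuples. A crude heuristic only:
    basic-rich stretches are typical of NLSs but are not specific to them.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        if sum(segment.count(residue) for residue in "KR") >= min_basic:
            hits.append((start + 1, segment))
    return hits


# Invented toy sequence with the motif embedded in a neutral linker.
toy_sequence = "MSGSGSERKRRLGSGS"
```

With the toy sequence, `find_motif(toy_sequence, "ERKRRL")` reports position 7. Applied to the full WRN sequence, the same call would be expected to report the position given in the text, provided the motif occurs there exactly as written.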
The fact that the other predictive tools were not able to detect this signal (which was known to be present) can be attributed to limitations of the techniques. Some algorithms are not able to predict NLSs with the same accuracy with which they predict other localization signals. This hypothesis can be further checked by studying the references for each method.
GO Assignment
The Gene Ontology (GO) is (as defined in Wikipedia [8]) an initiative to unify the representation of gene and gene product attributes across all species. GO term assignment is therefore a classification of genes and proteins according to the biological roles they play. This classification is organized into three major contexts, namely:
Cellular component, the parts of a cell or its extracellular environment;
Molecular function, the elemental activities of a gene product at the molecular level, such as binding or catalysis;
Biological process, operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units.
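The three contexts listed above can be represented as a simple grouping of annotations by namespace. The sketch below is illustrative only: the function name is hypothetical, and the three example terms were chosen because they match what is said about the WRN protein elsewhere on this page (nuclear, helicase, DNA repair); they are not the output of any of the prediction servers.

```python
# Namespace identifiers as used in Gene Ontology files.
NAMESPACES = ("cellular_component", "molecular_function", "biological_process")


def group_by_namespace(annotations):
    """Group (go_id, term_name, namespace) tuples by their namespace."""
    grouped = {namespace: [] for namespace in NAMESPACES}
    for go_id, term_name, namespace in annotations:
        if namespace not in grouped:
            raise ValueError(f"unknown GO namespace: {namespace}")
        grouped[namespace].append((go_id, term_name))
    return grouped


# Illustrative annotations, one per namespace (not tool output).
example_annotations = [
    ("GO:0005634", "nucleus", "cellular_component"),
    ("GO:0004386", "helicase activity", "molecular_function"),
    ("GO:0006281", "DNA repair", "biological_process"),
]
```

Grouping predictions this way makes it easier to compare servers that report terms from different namespaces.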
A set of programs was used to assign GO terms to the WRN protein: Pfam [23], Protfun [24] and GOPET [25]. The predictions are represented in the figures below (figures 16, 17 and 18).
<center|font size=2>Figure 16: Domain class assignment results as predicted by the Pfam server.
<center|font size=2>Figure 17: Gene Ontology (GO) assignment results as predicted by the GOPET server.
<center|font size=2>Figure 18: GO assignment results as predicted by the GOPET server.
Disorder Prediction
One important observation about disorder prediction is that the terminal regions of the protein (N-terminal and C-terminal) are often reported as disordered due to the intrinsic flexibility of these regions [26]. It is important to consider the accuracy of such results and to treat the terminal regions separately when analyzing the predictions.
The method used by IUPRED is based on the assumption that globular proteins usually present a large number of interactions between residues. These interactions compensate for the loss of entropy associated with the folding process. In contrast, disordered regions have characteristic sequences that in general do not form the same level of inter-residue interactions. With this distinction and a set of known globular proteins (with no disorder), the algorithm can predict the presence of disordered regions [27].
<center|font size=2>Figure 19: Disorder prediction of short regions performed by IUPRED.
<center|font size=2>Figure 20: Disorder prediction of long regions performed by IUPRED.
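Per-residue disorder scores such as those plotted above are usually turned into discrete regions by applying a cutoff. The following is a minimal sketch of that post-processing step, not of the IUPRED algorithm itself. It assumes scores between 0 and 1, the commonly used cutoff of 0.5, and a hypothetical minimum region length (min_length); the scores in the example are toy values, not the WRN output.

```python
from typing import List, Tuple


def disordered_segments(
    scores: List[float],
    threshold: float = 0.5,   # assumed cutoff; scores above it count as disordered
    min_length: int = 30,     # hypothetical minimum length of a reported region
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) runs with score > threshold."""
    segments = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = position          # a new run begins here
        else:
            if start is not None and position - start >= min_length:
                segments.append((start, position - 1))
            start = None                  # the run (if any) has ended
    # Close a run that extends to the last residue.
    if start is not None and len(scores) - start + 1 >= min_length:
        segments.append((start, len(scores)))
    return segments


if __name__ == "__main__":
    toy_scores = [0.1] * 5 + [0.9] * 4 + [0.2] * 3 + [0.8] * 2
    print(disordered_segments(toy_scores, min_length=3))
```

With the toy scores and min_length=3, only the run at positions 6 to 9 is reported; the two-residue run at the end of the sequence is discarded as too short. The same filtering is one simple way to separate short from long predicted regions.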
The Disopred server [28] was used as another predictive tool for disordered regions (figures 21 and 22). The predictions were broadly consistent with the results produced by IUPRED. This agreement supports the validity of these predictions and points to a disordered region present in the protein.
<center|font size=2>Figure 21: Disorder prediction performed by Disopred. This partial result shows the predicted disordered region of the WRN protein.
<center|font size=2>Figure 22: Disorder prediction performed by Disopred.
As a third approach, Poodle was used. Poodle is a system that predicts disordered regions based on three strategies: short disordered region prediction, long disordered region prediction and unfolded protein prediction [29]. Once again, the same main region is predicted as disordered. Additionally, other shorter regions towards the C-terminal portion are also predicted, consistent with the results of the other predictors.
<center|font size=2>Figure 23: Disorder prediction performed by Poodle.
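The agreement between the three predictors can be made explicit with a simple per-residue vote. The sketch below is one possible way to do this and is not the procedure used on this page: it assumes each predictor's output has already been reduced to a list of 1-based inclusive (start, end) regions, and the predictor names and regions in the example are hypothetical toy values.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # 1-based inclusive (start, end)


def consensus_regions(
    predictions: Dict[str, List[Region]],
    length: int,
    min_votes: int = 2,
) -> List[Region]:
    """Merge residues called disordered by at least min_votes predictors."""
    votes = [0] * (length + 1)            # index 0 is unused
    for regions in predictions.values():
        called = set()                    # each predictor votes once per residue
        for start, end in regions:
            called.update(range(max(1, start), min(length, end) + 1))
        for position in called:
            votes[position] += 1

    merged = []
    start = None
    for position in range(1, length + 1):
        if votes[position] >= min_votes:
            if start is None:
                start = position
        elif start is not None:
            merged.append((start, position - 1))
            start = None
    if start is not None:
        merged.append((start, length))
    return merged


if __name__ == "__main__":
    toy = {
        "predictor_a": [(3, 8)],
        "predictor_b": [(5, 10)],
        "predictor_c": [(1, 2), (7, 12)],
    }
    print(consensus_regions(toy, length=12, min_votes=2))
```

For the toy input, residues 5 to 10 are supported by at least two predictors, and only residues 7 to 8 by all three. Raising min_votes gives a stricter consensus at the cost of shorter regions.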
To further confirm this prediction, the available protein structures were analyzed, and a structural gap can be observed precisely in the region predicted as disordered. There is no reliable entry in PDB containing the region between residues 250 and 500 of the WRN protein (figure 24), and this may be due to the presence of a disordered region. The same conclusion can be drawn from the observation of the shorter disordered regions. As the structure related to the best (first) hit in this BLAST search (against the PDB database) was solved by X-ray diffraction, a crystal had to be generated, containing several copies of the WRN protein. The presence of a disordered region can lead to a region of weak or missing electron density in the solved structure. In fact, disordered regions were once defined based on this assumption. According to Keith Dunker, a disordered region can be inferred from the observation of a structural gap in a crystal structure [30].
<center|font size=2>Figure 24: BLAST result for the WRN protein using the database of sequences with related structure present in PDB.
References
10. Wikipedia
11. Jones DT (1999). Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292:195-202.
12. Cole C, Barber JD and Barton GJ (2008). The Jpred 3 secondary structure prediction server. Nucleic Acids Res, 35(2):W197-W201
13. Nugent T and Jones DT (2009). Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics. 10:159
14. Viklund H and Elofsson A (2008). Improving topology prediction by two-track ANN-based preference scores and an extended topological grammar. Bioinformatics
15. Viklund H, Bernsel A, Skwark M and Elofsson A (2008). A combined predictor of signal peptides and membrane protein topology.
16. Käll L, Krogh A and Sonnhammer ELL (2004). A Combined Transmembrane Topology and Signal Peptide Prediction Method. Journal of Molecular Biology, 338(5):1027-1036.
17. Käll L, Krogh A and Sonnhammer ELL (2005). An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics, 21(1):i251-i257.
18. Nielsen H, Engelbrecht J, Brunak S and von Heijne G (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Engineering, 10:1-6.
19. Rost B, Yachdav G and Liu J (2003). The PredictProtein Server. Nucleic Acids Research, 32:W321-W326.
20. Krogh A, Larsson B, von Heijne G, and Sonnhammer ELL (2001). Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. Journal of Molecular Biology, 305(3):567-580.
21. Moller S, Croning MDR, Apweiler R (2001). Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics, 17(7):646-653.
22.
23. Sonnhammer ELL, Eddy SR and Durbin R (1997). Pfam: a comprehensive database of protein families based on seed alignments. Proteins, 28:405-420.
24. Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Stærfeldt HH, Rapacki K, Workman C, Andersen CAF, Knudsen S, Krogh A, Valencia A and Brunak S (2002). Ab initio prediction of human orphan protein function from post-translational modifications and localization features. Journal of Molecular Biology, 319:1257-1265.
25. Vinayagam A, del Val C, Schubert F, Eils R, Glatting KH, Suhai S and König R (2006). GOPET: A tool for automated predictions of Gene Ontology terms. BMC Bioinformatics, 7:161.
26. Christian Schaefer
27. Dosztányi Z, Csizmók V, Tompa P and Simon I (2005). IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21:3433-3434.
28. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF and Jones DT (2004). Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. Journal of Molecular Biology, 337:635-645.
29. Hirose S, Shimizu K, Kanai S, Kuroda Y and Noguchi T (2007). POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions. Bioinformatics, 23(16):2046-2053.
30. Rani MPR, Obradovic Z, and Dunker AK (1998) Annotation of PDB with respect to "Disordered Regions" in proteins. Genome Informatics 9:240-241.
COMPARATIVE MODELING
07.06
About Comparative Modeling
The functional characterization of a protein from its sequence is one of the most frequent problems in biology. This task usually becomes easier when a three-dimensional structure for the protein is available. In the absence of an experimentally solved structure, comparative modeling techniques can be employed, provided that the protein is related to at least one known structure (solved by X-ray crystallography or nuclear magnetic resonance). Comparative modeling methods are able to predict, with a certain level of confidence, the three-dimensional structure of a given protein sequence (target) based mainly on its alignment to one or more proteins of known structure (templates). The protocol for comparative modeling is classically divided into four steps: i) the identification of template sequence(s) with the highest level of similarity to the target protein; ii) the generation of a sequence alignment between those proteins; iii) the generation of a set of candidate structures for the target protein, based on the previously generated alignment; iv) the prediction of errors and the selection of the final three-dimensional model from the set of candidate structures.
Several computational approaches are available to deal with the modeling process. These tools can require a higher or lower level of intervention from the user, and this interaction is usually inversely proportional to the user's experience. Ultimately, manual intervention is still extremely important [31]. In the simplest cases, the required input data consists of the alignment between the target sequence and its template(s) and the atomic coordinates of the template structure(s). Some web servers, however, are able to start "from scratch", requiring only the protein sequence of the target.
Template Assignment
As mentioned, the first step is the template assignment, which can be performed, for example, by a BLAST search against the PDB database to find related sequences with known structures. Another useful tool for template assignment is HHPRED, which can be accessed through its web server. Both methods were used and the results were compared in order to reach a single set of template structures. The figure below (figure 25) shows the templates assigned by each of the methods. Although the results show obvious differences, a deeper analysis makes it possible to establish connections between the assigned templates within the PDB database.
<center|font size=2>Figure 25: Summary of results for template assignment obtained from the BLAST search against the PDB database and from HHPRED.
From this schematic view, it is easy to notice that the second result (HHPRED) is the most extensive, since its template proteins cover more of the target sequence. As stated, the templates assigned by each method (BLAST and HHPRED) are related. For the first template protein, 2FBT according to BLAST or 2E6M according to HHPRED, the information in PDB states that these are analogous proteins that share 70% sequence similarity. For the second template, the longest one, the proteins are actually 100% similar. For the last template, the PDB entry 1D8B is also related to the entry 2E1E. Additionally, for the result retrieved from HHPRED, the end portion of the WRN protein could be modeled by the protein with PDB identifier 3IUO, the C-terminal domain of the ATP-dependent DNA helicase from Porphyromonas gingivalis. Although, as stated, the templates suggested by HHPRED cover more of the target sequence, the similarity level between the template sequences and the target sequence is, in general, low. For that reason, the templates assigned by the BLAST search against the PDB database were chosen. They are as follows (and as summarized in figure 25):
2FBT - Covers the region between residues 38 and 236 of the WRN protein with 199 identical residues out of the 199 aligned residues (100% identity between the sequences). It corresponds to the exonuclease domain of the human WRN.
1OYY - Covers the region between residues 540 and 1030 of the WRN protein with 37% (183 out of 498) identical residues, 56% positive residues and 7% gaps in the alignment with the WRN protein. It is annotated as the catalytic core of the E. coli RecQ helicase.
2E1E - Covers the region between residues 1142 and 1242 of the WRN protein with 100% coverage and 100% identity. It stands for the crystal structure of the HRDC domain of the human WRN protein.
Below is a summary of the regions predicted from homology searches, disorder predictions and signal peptide searches. A direct consequence of the presence of disordered regions is that these portions of the protein will not be correctly modeled, since there is no suitable template for them.
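To see how much of the target the three chosen templates leave unmodeled, the covered ranges listed above can be compared against the full sequence. The sketch below uses the residue ranges from the list; the target length of 1432 residues is an assumption taken from the UniProt entry for human WRN and is not stated on this page, and the function names are illustrative.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # 1-based inclusive (start, end)

# Residue ranges of the WRN protein covered by each BLAST template (from the list above).
TEMPLATES: Dict[str, Region] = {
    "2FBT": (38, 236),
    "1OYY": (540, 1030),
    "2E1E": (1142, 1242),
}

# Assumption: length of human WRN as given in UniProt; not stated in the text.
TARGET_LENGTH = 1432


def uncovered_regions(templates: Dict[str, Region], length: int) -> List[Region]:
    """Return the stretches of the target not covered by any template."""
    gaps = []
    next_free = 1                              # first residue not yet accounted for
    for start, end in sorted(templates.values()):
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)    # also handles overlapping templates
    if next_free <= length:
        gaps.append((next_free, length))
    return gaps


def covered_residues(templates: Dict[str, Region], length: int) -> int:
    """Count target residues covered by at least one template."""
    missing = sum(end - start + 1 for start, end in uncovered_regions(templates, length))
    return length - missing


if __name__ == "__main__":
    print(uncovered_regions(TEMPLATES, TARGET_LENGTH))
    covered = covered_residues(TEMPLATES, TARGET_LENGTH)
    print(f"{covered} of {TARGET_LENGTH} residues covered ({covered / TARGET_LENGTH:.1%})")
```

Under these assumptions the templates cover 791 residues, and the largest uncovered stretch (residues 237 to 539) contains the region between residues 250 and 500 for which no PDB entry was found.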
<center|font size=2>Figure 26: Regions in the WRN protein.
Once the templates are assigned, the next step is to generate the sequence alignment between these proteins. Unfortunately, none of the aforementioned programs was able to correctly align the target sequence to the multiple templates.
Model Generation
The first approach to generating the three-dimensional models was to use the iTasser server, an automated method for generating models from protein sequences. The server generated five different structures, which are shown below (figure 27).
<center|font size=2>Figure 27: Miniature figure of the five models generated by iTasser.
As expected, the models contain 'unfolded' regions, which result from the lack of template structures. The long disordered region cannot be modeled since there is no suitable template structure available for it (it is not possible to crystallize such a flexible region). This is an obstacle for the modeling process: as it is not possible to generate a structure that covers the entire protein, the model will have to be built in parts.
The schematic view of the protein sequence and the defined domains it contains (figure 26) shows two portions of the protein, separated by the disordered region. The first portion encloses an exonuclease domain, which is a unique feature of the WRN protein: among the five human RecQ DNA helicases, this is the only one presenting a functional exonuclease domain. This domain was characterized in its structure, biochemical activity and participation in DNA end joining [32]. It is isolated from the rest of the protein by the disordered region. The C-terminal half of the protein contains at least three domains and the nuclear localization signal (NLS). These domains are (in order from the N-terminus) the RecQ domain, which is the longest domain within the protein sequence; the helicase-and-ribonuclease D/C-terminal (HRDC) domain, of unknown function (recently pointed out to be important for dissolution of double Holliday junctions [33]); and a domain which was only characterized as a C-terminal domain. After these domains (separated from each other by a few residues predicted as secondary disordered regions) is the nuclear localization signal. Therefore, one alternative would be to build two separate models: one covering the exonuclease domain and the other covering the RecQ, HRDC and C-terminal domains of the WRN protein.
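Building the model in parts starts with cutting the target sequence into the segments that will be modeled separately. The sketch below shows one way to do this; the segment names, the boundaries and the 20-residue toy sequence are all hypothetical placeholders, since the exact cut points for the two WRN models are not given here.

```python
from typing import Dict, Tuple

Region = Tuple[int, int]  # 1-based inclusive (start, end)


def split_for_modeling(sequence: str, segments: Dict[str, Region]) -> Dict[str, str]:
    """Cut named 1-based inclusive residue ranges out of a protein sequence."""
    parts = {}
    for name, (start, end) in segments.items():
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"segment {name!r} ({start}-{end}) is outside the sequence")
        parts[name] = sequence[start - 1:end]   # convert to 0-based slicing
    return parts


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


if __name__ == "__main__":
    toy_sequence = "ACDEFGHIKLMNPQRSTVWY"     # placeholder, not the WRN sequence
    toy_segments = {
        "n_terminal_part": (2, 6),            # hypothetical boundaries
        "c_terminal_part": (12, 20),
    }
    for name, part in split_for_modeling(toy_sequence, toy_segments).items():
        print(to_fasta(name, part))
```

Each resulting record can then be submitted to a modeling server or aligned to its own template independently of the other segment.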
References
32. Perry JJ, Yannone SM, Holden LG, Hitomi C, Asaithamby A, Han S, Cooper PK, Chen DJ and Tainer JA (2006). WRN exonuclease structure and molecular mechanism imply an editing role in DNA end processing. Nature Structural & Molecular Biology, 13:414-422.
33. Wu L, Chan KL, Ralf C, Bernstein DA, Garcia PL, Bohr VA, Vindigni A, Janscak P, Keck JL and Hickson ID (2005). The HRDC domain of BLM is required for the dissolution of double Holliday junctions. The EMBO Journal, 24:2679-2687.