Canavan Disease: Task 04 - Structural Alignments
Structural alignments are an important step in validating protein structure similarity. In this task, structures with differing sequence similarity to aspartoacylase (ASPA) are used to compare the results of different alignment approaches.
To assemble the dataset, a reference structure (2I3C) for aspartoacylase was chosen first. The remaining structures were then selected relative to this reference according to the criteria displayed in <xr id="dataset"></xr>. The full composition and additional information can also be found. <figtable id="dataset">
|2I3C||ASPA from Human||reference structure|
|2O4H||ASPA from Human with bound N-phosphonomethyl-L-aspartate||sequence identity 100% & bound active centre|
|2Q51||ASPA from Human (Ensemble refinement)||sequence identity 100% & unbound active centre|
|2GU2||ASPA from Rat||seq. identity >60%|
|2QJ8||ASPA family protein from Mesorhizobium loti||sequence identity <30%|
|1AYE||Procarboxypeptidase from Human||similar CATH classification for CAT|
|1BKJ||FMN Oxidoreductase from Vibrio harveyi||similar CATH classification for CA|
|1BD0||Alanine racemase||similar CATH classification for C|
|1B3U||Regulatory domain of human PP2A||completely different CATH classification|
Sequence identity and CATH classification similarities with respect to the reference structure 2I3C.
Structural Alignment Exploration
2O4H vs 2I3C
2O4H was found via the sequence search tab for the reference structure 2I3C. It was chosen because it belongs to the 100% sequence identity cluster and additionally has a compound bound at the active site. The compound is not the natural substrate N-acetyl-L-aspartate but N-phosphonomethyl-L-aspartate, which binds to the same active center. It is not degraded by the enzyme but "blocks" the active center, so that a potential conformational change of the protein upon binding can be captured by X-ray crystallography.
Because 2O4H and 2I3C have 100% sequence identity, the structural alignment with PyMOL is very accurate. Both structures are the same within the accuracy of X-ray crystallography. The RMSD between 2O4H and 2I3C, calculated by the alignment procedure of PyMOL, is 0.445 Å. As this divergence is smaller than the attainable resolution of the structures, they can safely be considered identical. The visual representation of the structural alignment is displayed in <xr id="2O4H_pymol"></xr>. Consequently, a conformational change of the protein due to the compound bound in the active site cannot be observed. <figure id="2O4H_pymol">
2Q51 vs 2I3C
2Q51 was chosen to complement 2O4H, as it is annotated with the same sequence (100% sequence identity to 2I3C) but has no compound bound in the active center. Since 2Q51 and 2I3C share the same sequence and both were crystallized without a compound at the active site, the two should be identical in 3D structure as well (within the limits of resolution). However, comparing both structures in PyMOL (see <xr id="2Q51_pymol"></xr>) shows that they differ at least slightly. Their RMSD of 0.223 Å is smaller than the RMSD between 2O4H and 2I3C (0.314 Å); nevertheless, a visual comparison shows beta-strands of different lengths and small variations in conformation. Checking the experimental origin of the PDB entry 2Q51 revealed that the atom coordinates and the conformation of the secondary structure elements were derived as a mean of multiple 3D structure assignments from X-ray crystallography. This is most likely the reason why the RMSD between the C-alpha atoms is that small while the visual difference between 2Q51 and 2I3C is larger than between 2O4H and 2I3C (see above). <figure id="2Q51_pymol">
2GU2 vs 2I3C
2GU2 is the ASPA ortholog in rat. With a sequence similarity of 84% to the human ASPA protein (2I3C), it was chosen to represent the group of protein structures with a sequence similarity between 60% and 100%. The structural alignment with PyMOL reveals that the sequence difference between the two proteins results from an extension of the N- and C-terminal ends, which form a beta sheet in 2GU2 that is not present in 2I3C (see <xr id="2GU2_pymol"></xr>). Otherwise the two structures are identical within the limits of resolution. This is also reflected in the RMSD of 0.493 Å between the two aligned structures. This example illustrates one important dogma very well, namely that structure is better conserved than sequence. Despite having only about 80% sequence similarity, the three-dimensional structures of the two proteins are nearly identical. <figure id="2GU2_pymol">
2QJ8 vs 2I3C
2QJ8 is a member of the ASPA protein family. The protein originates from Mesorhizobium loti, a Gram-negative bacterium, and has a sequence similarity below 30% when aligned to 2I3C. The superimposition of the two proteins in PyMOL demonstrates the previously mentioned dogma, that structure is far better conserved than sequence, even more clearly than the previous example (see <xr id="2GU2_pymol"></xr>). Despite the sequence similarity of less than 30%, the overall shape of 2QJ8 is very well conserved and closely resembles 2I3C, with a calculated RMSD of 3.474 Å. The conservation of the structure is even more visible when focusing on the active site, as can be seen in <xr id="2QJ8_pymol"></xr>. However, it has to be kept in mind that the two proteins belong to the same protein family, which explains the high structural resemblance despite the low sequence similarity. When comparing two proteins from distinct protein families, this effect is not likely to be observed. <figure id="2QJ8_pymol">
1AYE, 1BKJ, 1BD0 and 1B3U are, in this order, increasingly distantly related to 2I3C in terms of CATH classification. Superimposing these structures onto 2I3C in PyMOL shows that 1AYE, despite sharing class, architecture and topology with 2I3C, is already so distant in spatial arrangement that no well-overlapping structures can be found. The exceptions are small alpha-helical and beta-sheet elements that align around the active site of both structures (see <xr id="Remaining_pymol"></xr> A), which could result from the fact that both proteins bind a zinc ion and catalyze a hydrolysis reaction. The superimposition gets worse the fewer CATH levels the two proteins have in common (see <xr id="Remaining_pymol"></xr> B-D). 1BD0, for instance, shares only the CATH class with 2I3C, and the superimposition seems sterically arbitrary aside from some small overlaps. In conclusion, superimposition with PyMOL only works well if the two structures share functional features and belong to similar protein families.
A: 1AYE superimposed onto 2I3C. 1AYE shares CATH class, architecture and topology with 2I3C. Small alpha-helical and beta-sheet elements are aligned, primarily around the active site of both proteins. B: 1BKJ superimposed onto 2I3C. 1BKJ shares CATH class and architecture with 2I3C. No conformational overlap is detectable. C: 1BD0 superimposed onto 2I3C. 1BD0 shares the CATH class with 2I3C. No conformational overlap is detectable. D: 1B3U superimposed onto 2I3C. No CATH classifications are shared with 2I3C. Exactly one alpha helix is well aligned, whereas the superimposition of the rest seems completely arbitrary.
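The notion of "sharing CAT, CA or only C" used above can be made concrete by counting how many leading levels two CATH codes have in common. The following is a minimal sketch; the function name `shared_cath_levels` and all CATH codes in it are made-up placeholders for illustration, not the actual classifications of the structures in the dataset.

```python
def shared_cath_levels(code_a: str, code_b: str) -> int:
    """Count the leading CATH levels (C, A, T, H) that two codes share.

    Levels are compared from left to right; counting stops at the first
    level that differs, because CATH is a hierarchy.
    """
    shared = 0
    for level_a, level_b in zip(code_a.split("."), code_b.split(".")):
        if level_a != level_b:
            break
        shared += 1
    return shared


# Human-readable label for 0..4 shared levels
LEVEL_LABELS = ["none", "C", "CA", "CAT", "CATH"]

# Hypothetical codes, NOT looked up in CATH
reference = "3.40.1.1"
candidates = ["3.40.1.2", "3.40.2.1", "3.30.1.1", "1.10.1.1"]

for code in candidates:
    n = shared_cath_levels(reference, code)
    print(code, "shares", LEVEL_LABELS[n])
```

With real CATH codes for the dataset, this count would reproduce the ordering 1AYE (CAT), 1BKJ (CA), 1BD0 (C), 1B3U (none) described in the text.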
Comparison of SSAP, Topmatch, CE & LGA
Further analysis of structure similarity using spatial distance measures was performed with LGA, SSAP, Topmatch and CE. In general, the trend observed when superimposing the structures with PyMOL is visible with the different algorithms as well (see <xr id="comp"></xr>). The highly related proteins (2O4H, 2Q51 and 2GU2) show a very low calculated RMSD and a high calculated sequence identity, whereas the proteins that share only parts of the CATH classification with 2I3C (1AYE, 1BKJ, 1BD0 and 1B3U) show RMSDs that rise as the CATH annotation overlap decreases. The one protein that stands out is 2QJ8, which belongs to the same protein family as 2I3C but has little sequence identity. LGA, Topmatch and CE calculate fairly decent RMSDs, but SSAP seems to fail at creating the alignment: it reports a sequence identity of only 9%, whereas the true sequence identity is about 25%. The calculated RMSDs further suggest that SSAP is the only algorithm that treats the given atom records as fixed and does not try to find matching substructures that may be arranged differently in three-dimensional space. This would explain why the RMSD calculated by SSAP increases rapidly for the last four proteins. The same trend is visible in the results of LGA, Topmatch and CE, but it is not as strong.
|Comparison of LGA, SSAP, Topmatch & CE|
|Protein||RMSD (Å)||SeqId (%)||RMSD (Å)||SeqId (%)||Er (Å)||SeqId (%)||RMSD (Å)||SeqId (%)|
2O4H, 2Q51 and 2GU2 share the same protein family and a high sequence identity with 2I3C. 2QJ8 shares only the protein family with 2I3C and has a sequence identity of about 25%. 1AYE, 1BKJ, 1BD0 and 1B3U share different properties with 2I3C, leading to CATH classifications that are similar to a certain degree. Belonging to the same protein family generally results in a low RMSD. Sharing only similar CATH classifications does not yield a particularly good RMSD; however, the less similar the classification, the higher the RMSD. The SSAP result for 2QJ8 seems to be an outlier compared to the results of LGA, Topmatch and CE.
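The two quantities reported per method in the table, RMSD and sequence identity, both depend on which residues the method pairs up. The following is a minimal sketch of how they can be computed from a given pairing. It assumes that the coordinates are already superimposed and that sequence identity is normalized by the number of aligned (non-gap) columns. Both are assumptions made for illustration; the individual tools may use other conventions.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD (same unit as the input, e.g. Å) between paired 3D coordinates.

    Assumes both lists are already superimposed and that position i in
    coords_a is matched to position i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over the columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Toy input, not taken from any of the structures above
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))  # 1.0
print(sequence_identity("ACDE-F", "ACDQGF"))                 # 80.0
```

Because both values are computed over the paired residues only, two methods that pair different residues can report different sequence identities for the same two proteins, which is one possible reading of the low SSAP value for 2QJ8.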
Structural Alignment Evaluation
Superimposing the "pseudo-models" created by hhmakemodel onto the actual structure of 2I3C with the aid of LGA showed some interesting results. Firstly, models built from hhsearch hits with a very high probability and a very low (highly significant) e-value produce very good LGA results despite large differences in the calculated sequence identity (compare 2GU2 and 3NH4 in <xr id="eval_comp">Table </xr>). Looking at the remaining results, the primary influence on a well-performing LGA alignment (see the LGA_S and LGA_Q scores) seems to be the e-value and not the probability calculated by hhsearch.
|Comparison of HHSearch and LGA results|
|HHSearch results||LGA results|
|Protein||Probability||e-Value||Seq. identity (%)||RMSD (Å)||LGA_S||LGA_Q||Seq. identity (%)|
LGA results of superimposing the models created by hhmakemodel from the structures found by hhsearch onto 2I3C. The results suggest a correlation between the e-value of the hhsearch hits and the LGA scores.
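One way to check whether the e-value or the probability tracks the LGA scores more closely is a rank correlation, which is insensitive to the many orders of magnitude that e-values span. The following is a minimal sketch of Spearman's rank correlation in plain Python. The function names are illustrative, and the numbers in the usage example are made-up toy values, not the values from the table.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    return pearson(ranks(x), ranks(y))


# Made-up toy values for illustration only
e_values = [1e-80, 1e-40, 1e-10, 1e-3]
lga_s = [98.0, 90.0, 60.0, 20.0]
print(spearman(e_values, lga_s))  # -1.0: lower e-value, higher LGA_S
```

Applying `spearman` once to the e-value column and once to the probability column against LGA_S would allow the two hhsearch measures to be compared directly.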