Difference between revisions of "Canavan Disease: Task 04 - Structural Alignments"

From Bioinformatikpedia

Latest revision as of 12:09, 5 September 2013

Structural alignments are an important step in validating protein structure similarity. In this task, structures with varying degrees of sequence similarity to aspartoacylase (ASPA) are used to compare the results of different alignment approaches.

Dataset

To build the dataset, a reference sequence (2I3C) for aspartoacylase was chosen first. The remaining structures were then selected relative to this sequence according to the criteria listed in <xr id="dataset"></xr>, which also shows the full composition and additional information. <figtable id="dataset">

{| border="1" cellpadding="5" cellspacing="0" align="center"
! colspan="3" | Dataset Composition
|-
! PDB-id !! Description !! Criterion
|-
| 2I3C || ASPA from human || reference structure
|-
| 2O4H || ASPA from human with bound N-phosphonomethyl-L-aspartate || sequence identity 100% & bound active centre
|-
| 2Q51 || ASPA from human (ensemble refinement) || sequence identity 100% & unbound active centre
|-
| 2GU2 || ASPA from rat || sequence identity >60%
|-
| 2QJ8 || ASPA family protein from Mesorhizobium loti || sequence identity <30%
|-
| 1AYE || Procarboxypeptidase from human || similar CATH classification for C, A and T
|-
| 1BKJ || FMN oxidoreductase from Vibrio harveyi || similar CATH classification for C and A
|-
| 1BD0 || Alanine racemase || similar CATH classification for C
|-
| 1B3U || Regulatory domain of human PP2A || completely different CATH classification
|}
Overview of the dataset composition for Task 04, containing a brief description of the chosen structures.
Sequence identity and CATH classification similarities are given with respect to the reference sequence 2I3C.

</figtable>

Structural Alignment Exploration

Pymol

2O4H vs. 2I3C

2O4H was found via the sequence search tab for the reference sequence 2I3C. It was chosen because it belongs to the 100% sequence identity cluster and has a compound bound at the active site. This compound is not N-acetyl-L-aspartate but N-Hydroxy(methyl)phosphoric-L-aspartate, which binds to the same active center. It is not degraded by the enzymatic activity of the protein but "blocks" the active center, so a potential conformational change of the protein can be captured by X-ray crystallography.

Because 2O4H and 2I3C have 100% sequence identity, the structural alignment in Pymol is very accurate. Both structures are the same within the accuracy of X-ray crystallography. The RMSD between 2O4H and 2I3C calculated by the Pymol alignment is 0.445Å. As this divergence is smaller than the attainable resolution of the structures, they can safely be considered identical. The structural alignment is shown in <xr id="2O4H_pymol"></xr>. Accordingly, the possible conformational change of the protein due to the compound bound in the active site cannot be observed. <figure id="2O4H_pymol">

Representation of 2O4H aligned to 2I3C. Both structures are displayed as cartoon, 2O4H in black, 2I3C in orange. The zinc ion at the active site is represented as a gray sphere, and the N-Hydroxy(methyl)phosphoric-L-aspartate at the active site as balls and sticks. With a calculated RMSD of 0.445Å, both structures can be considered the same, as the divergence is even lower than the attainable resolution of the crystal structure.

</figure>
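
The RMSD quoted above is a root-mean-square distance over paired atoms after superposition. The following is a minimal sketch of that calculation, assuming the two structures are already superimposed and that the atoms are paired one-to-one. The function name `rmsd` and the toy coordinates are illustrative assumptions and are not taken from 2I3C or 2O4H.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumes the structures are already superimposed and that point i in
    coords_a is paired with point i in coords_b.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy coordinates (hypothetical, not from any PDB entry):
# every atom of the second set is displaced by 0.5 Å along y.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
shifted = [(0.0, 0.5, 0.0), (1.0, 0.5, 0.0), (2.0, 0.5, 0.0)]
print(rmsd(reference, shifted))
```

Because the distances are squared before averaging, a few strongly displaced atoms dominate the value, which is worth keeping in mind when a small RMSD is read as "identical".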

2Q51 vs 2I3C

2Q51 was chosen to complement 2O4H, as it is annotated with the same sequence (100% sequence identity to 2I3C) but has no compound bound in the active center. Since 2Q51 and 2I3C share the same sequence and both were crystallized without a bound compound at the active site, their 3D structures should be identical as well (within the limits of resolution). However, comparing both structures in Pymol (see <xr id="2Q51_pymol"></xr>) shows that they differ at least slightly. Their RMSD of 0.223Å is smaller than the RMSD between 2O4H and 2I3C (0.445Å, as reported above), yet visually they show beta-strands of different lengths and small variations in conformation. Checking the experimental origin of the PDB structure 2Q51 revealed that the atom coordinates and the conformation of the secondary structure elements were derived as a mean of multiple 3D structure assignments obtained by X-ray crystallography. This is most likely why the RMSD between the C-alpha atoms is so small while the visual difference between 2Q51 and 2I3C is larger than that between 2O4H and 2I3C (see above). <figure id="2Q51_pymol">

Representation of 2Q51 aligned to 2I3C. Both structures are displayed as cartoon, 2Q51 in blue, 2I3C in orange. The zinc ion at the active site is represented as a gray sphere. The calculated RMSD between the C-alpha atoms of the two structures is as small as 0.223Å, but the displayed secondary structure elements vary in length and spatial conformation (see the beta-strands in the loop region on the left of the protein). The reason may be that the atom coordinates of 2Q51 represent a mean of multiple X-ray crystallography experiments to determine the structure of ASPA.

</figure>

2GU2 vs 2I3C

2GU2 is the ASPA ortholog in rat. With a sequence similarity of 84% to the human ASPA protein (2I3C), it was chosen to represent the group of protein structures with a sequence similarity between 60% and 100%. The structural alignment in Pymol reveals that the sequence difference between the two proteins results from an extension of the N- and C-terminal ends, which form a beta sheet in 2GU2 that is not present in 2I3C (see <xr id="2GU2_pymol"></xr>). Otherwise the two structures are identical within the limits of resolution. This is also reflected in the RMSD of 0.493Å between the two aligned structures. The example illustrates very well an important principle, namely that structure is better conserved than sequence: despite a sequence similarity of only 84%, the three-dimensional structures of the two proteins are nearly identical. <figure id="2GU2_pymol">

Representation of 2GU2 aligned to 2I3C. Both structures are displayed as cartoon, 2GU2 in turquoise, 2I3C in orange. The zinc ion at the active site is represented as a gray sphere. Apart from the N- and C-terminal ends of the 2GU2 peptide chain, which form a beta-sheet, both structures are identical. The calculated RMSD between the two structures is 0.493Å.

</figure>
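
The percentages quoted for sequence similarity depend on how the aligned positions are counted. The following is a minimal sketch of one common convention, percent identity over the aligned columns in which neither sequence has a gap. The function name `percent_identity` and the two short fragments are hypothetical and do not come from 2GU2 or 2I3C.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # gapped columns are not counted in this convention
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Hypothetical aligned fragments, for illustration only
fragment_a = "MTSCH-IAEE"
fragment_b = "MTSCVAIAKE"
print(round(percent_identity(fragment_a, fragment_b), 1))
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, so identities reported by different tools for the same pair of proteins need not agree exactly.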

2QJ8 vs 2I3C

2QJ8 is a member of the ASPA protein family. The protein originates from Mesorhizobium loti, a Gram-negative bacterium, and has a sequence similarity of below 30% when aligned to 2I3C. The superimposition of the two proteins in Pymol demonstrates the previously mentioned principle, that structure is far better conserved than sequence, even more clearly than the previous example (see <xr id="2GU2_pymol"></xr>). Despite the sequence similarity of less than 30%, the overall shape of 2QJ8 is very well conserved and closely resembles 2I3C, with a calculated RMSD of 3.474Å. At the active site the conservation of the structure is even more apparent, as can be seen in <xr id="2QJ8_pymol"></xr>. However, it has to be kept in mind that the two proteins belong to the same protein family, which explains the high structural resemblance despite the low sequence similarity. When comparing two proteins from distinct protein families, this effect is not likely to be observed. <figure id="2QJ8_pymol">

Representation of 2QJ8 aligned to 2I3C. Both structures are displayed as cartoon, 2QJ8 in green, 2I3C in orange. The zinc ion at the active site is represented as a gray sphere. Both proteins belong to the same protein family while having a sequence similarity of less than 30%. Membership of the same family nevertheless results in a high structural resemblance of the two proteins despite the low sequence similarity. The RMSD calculated by the superimposition is 3.474Å.

</figure>

Remaining Proteins

1AYE, 1BKJ, 1BD0 and 1B3U are, in this order, increasingly distant from 2I3C in terms of CATH classification. Superimposing the structures to 2I3C in Pymol shows that 1AYE, despite sharing class, architecture and topology, is already so distant in its spatial arrangement that hardly any well-overlapping substructures can be found. The exception is a set of small alpha-helical and beta-sheet elements that align around the active site of both structures (see <xr id="Remaining_pymol"></xr> A), which could result from the fact that both proteins bind a zinc ion and catalyze a hydrolysis reaction. The superimposition becomes even less meaningful the fewer CATH levels the two proteins have in common (see <xr id="Remaining_pymol"></xr> B-D). 1BD0, for instance, shares only the CATH class with 2I3C, and the superimposition seems spatially arbitrary aside from some small overlaps. In conclusion, superimposition with Pymol only works well if the two structures share functional features and belong to similar protein families.

<figure id="Remaining_pymol">

A: 1AYE vs 2I3C
B: 1BKJ vs 2I3C
C: 1BD0 vs 2I3C
D: 1B3U vs 2I3C
Representation of multiple protein structures superimposed to 2I3C. All structures are represented as cartoon, 2I3C in orange and the superimposed structure in blue.
A: 1AYE superimposed to 2I3C. 1AYE shares CATH class, architecture and topology with 2I3C. Small alpha-helical and beta-sheet elements are aligned, primarily around the active site of both proteins.
B: 1BKJ superimposed to 2I3C. 1BKJ shares CATH class and architecture with 2I3C. No conformational overlap is detectable.
C: 1BD0 superimposed to 2I3C. 1BD0 shares the CATH class with 2I3C. No conformational overlap is detectable.
D: 1B3U superimposed to 2I3C. No CATH classifications are shared with 2I3C. Exactly one alpha helix is well aligned, whereas the superimposition of the rest seems completely arbitrary.

</figure>
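
The notion of sharing the CATH class, architecture and topology can be made concrete by counting how many leading levels two CATH codes have in common. The following is a minimal sketch, assuming codes are written as dot-separated numbers in the order class, architecture, topology, homologous superfamily. The function name and all codes below are made-up placeholders, not the real classifications of the proteins in the dataset.

```python
def shared_cath_levels(code_a, code_b):
    """Count the leading CATH levels (C, A, T, H) that two dot-separated codes share."""
    shared = 0
    for level_a, level_b in zip(code_a.split("."), code_b.split(".")):
        if level_a != level_b:
            break  # levels are hierarchical, so stop at the first mismatch
        shared += 1
    return shared


# Made-up CATH-style codes, for illustration only
reference_code = "3.40.630.10"
examples = {
    "same C, A and T": "3.40.630.20",
    "same C and A": "3.40.50.10",
    "same C only": "3.20.20.10",
    "nothing shared": "1.25.10.10",
}
for label, code in examples.items():
    print(label, shared_cath_levels(reference_code, code))
```

The count stops at the first mismatch because a lower level is only meaningful within its parent level.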

Comparison of SSAP, Topmatch, CE & LGA

Structure similarity was further analyzed with spatial distance measures using LGA, SSAP, Topmatch and CE. In general, the trend observed when superimposing the structures in Pymol is visible with the different algorithms as well (see <xr id="comp"></xr>). The closely related proteins (2O4H, 2Q51 and 2GU2) show a very low calculated RMSD and a high calculated sequence identity, whereas the proteins that share only parts of the CATH classification with 2I3C (1AYE, 1BKJ, 1BD0 and 1B3U) show RMSDs that rise as the CATH annotation overlap decreases. The one protein that stands out is 2QJ8, which belongs to the same protein family as 2I3C but has little sequence identity. LGA, Topmatch and CE calculate fairly low RMSDs for it (2.57Å to 3.51Å), but SSAP seems to fail to create the alignment: it reports an RMSD of 8.39Å and a sequence identity of only 9%, where the true sequence identity is about 25%. The calculated RMSDs also suggest that SSAP is the only algorithm that treats the given atom records as fixed and does not try to find matching substructures that may be arranged differently in three-dimensional space. As a result, the RMSD calculated by SSAP increases rapidly for the last four proteins. The same trend is visible in the results of LGA, Topmatch and CE, but it is much weaker.

<figtable id="comp">

{| border="1" cellpadding="5" cellspacing="0" align="center"
! colspan="9" | Comparison of LGA, SSAP, Topmatch & CE
|-
! rowspan="2" | Protein !! colspan="2" | LGA !! colspan="2" | SSAP (CATH) !! colspan="2" | Topmatch !! colspan="2" | CE
|-
! RMSD (Å) !! SeqId (%) !! RMSD (Å) !! SeqId (%) !! Er (Å) !! SeqId (%) !! RMSD (Å) !! SeqId (%)
|-
| 2O4H || 0.74 || 97.99 || 1.04 || 99 || 0.65 || 100 || 1.02 || 100
|-
| 2Q51 || 1.04 || 100 || 1.04 || 100 || 1.04 || 100 || 1.00 || 100
|-
| 2GU2 || 0.97 || 86.29 || 1.23 || 86 || 0.91 || 87 || 0.97 || 84.59
|-
| 2QJ8 || 2.57 || 21.53 || 8.39 || 9 || 3.51 || 17 || 3.14 || 13.86
|-
| 1AYE || 2.58 || 15.24 || 4.19 || 8 || 2.85 || 13 || 3.83 || 7.14
|-
| 1BKJ || 3.26 || 1.64 || 18.59 || 3 || 2.87 || 6 || 4.27 || 4.04
|-
| 1BD0 || 3.36 || 5.48 || 20.38 || 3 || 2.91 || 11 || 5.74 || 1.71
|-
| 1B3U || 3.57 || 11.11 || 28.05 || 6 || 3.34 || 9 || 6.05 || 6.30
|}
A comparison of RMSDs and sequence identities calculated by different algorithms for the superimposition of multiple proteins to 2I3C. 2O4H, 2Q51 and 2GU2 share the protein family and a high sequence identity with 2I3C. 2QJ8 shares only the protein family with 2I3C and has a sequence identity of about 25%. 1AYE, 1BKJ, 1BD0 and 1B3U share different properties with 2I3C, leading to CATH classifications that are similar to a certain degree. Belonging to the same protein family generally results in a low RMSD. Sharing only similar CATH classifications does not give a particularly good RMSD; however, the less similar the classification, the higher the RMSD. The SSAP result for 2QJ8 seems to be an outlier compared to the results of LGA, Topmatch and CE.

</figtable>
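
How strongly SSAP departs from the other three methods can be checked directly from the table. The following is a minimal sketch that divides the SSAP RMSD by the mean of the LGA, Topmatch and CE values for each protein. The numbers are copied from the table above; the cutoff of 2.0 and all names are arbitrary assumptions for illustration, and the Topmatch Er value is treated like an RMSD here, which is also an assumption.

```python
# RMSD-like values in Å, copied from the table above:
# protein -> (LGA, SSAP, Topmatch Er, CE)
rmsd_table = {
    "2O4H": (0.74, 1.04, 0.65, 1.02),
    "2Q51": (1.04, 1.04, 1.04, 1.00),
    "2GU2": (0.97, 1.23, 0.91, 0.97),
    "2QJ8": (2.57, 8.39, 3.51, 3.14),
    "1AYE": (2.58, 4.19, 2.85, 3.83),
    "1BKJ": (3.26, 18.59, 2.87, 4.27),
    "1BD0": (3.36, 20.38, 2.91, 5.74),
    "1B3U": (3.57, 28.05, 3.34, 6.05),
}

CUTOFF = 2.0  # arbitrary threshold, chosen only for this illustration


def ssap_ratio(values):
    """Ratio of the SSAP value to the mean of the LGA, Topmatch and CE values."""
    lga, ssap, topmatch, ce = values
    return ssap / ((lga + topmatch + ce) / 3)


ratios = {protein: ssap_ratio(values) for protein, values in rmsd_table.items()}
flagged = [protein for protein, ratio in ratios.items() if ratio > CUTOFF]

for protein, ratio in ratios.items():
    print(f"{protein}: SSAP / mean(others) = {ratio:.2f}")
print("above cutoff:", flagged)
```

With this cutoff, 2QJ8 is flagged together with 1BKJ, 1BD0 and 1B3U, while 1AYE stays below it. A different cutoff would draw the line elsewhere, so the result is only a rough reading aid.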

Structural Alignment Evaluation

Superimposing the "pseudo-models" created by hhmakemodel to the actual structure of 2I3C with the aid of LGA showed some interesting results. Firstly, models created from hhsearch hits with a very high probability and a very low e-value seem to give very good LGA results despite big differences in the calculated sequence identity (compare 2GU2 and 3NH4 in <xr id="eval_comp">Table </xr>). Looking at the rest of the results, the primary influence on a well-performing LGA alignment (see the LGA_S and LGA_Q scores) seems to be the e-value and not the probability calculated by hhsearch.

<figtable id="eval_comp">

{| border="1" cellpadding="5" cellspacing="0" align="center"
! colspan="8" | Comparison of HHSearch and LGA results
|-
! rowspan="2" | Protein !! colspan="3" | HHSearch results !! colspan="4" | LGA results
|-
! Probability !! e-Value !! Seq. identity (%) !! RMSD (Å) !! LGA_S !! LGA_Q !! Seq. identity (%)
|-
| 2GU2 || 100 || 8E-11 || 86 || 0.97 || 97.3 || 27.9 || 97.7
|-
| 3NH4 || 100 || 2.2E-84 || 43 || 1.28 || 94.3 || 21.9 || 97.7
|-
| 2QJ8 || 99.3 || 7E-16 || 25 || 2.61 || 37.6 || 6.8 || 57.61
|-
| 1UWY || 78.7 || 0.44 || 29 || 2.23 || 15.9 || 2.5 || 77.97
|-
| 2BOA || 64.9 || 1.6 || 19 || 1.85 || 16.1 || 2.7 || 71.7
|-
| 1JQG || 29.3 || 15 || 19 || 2.10 || 13.7 || 2.4 || 69.8
|}
An overview of the potential correlation between the scores calculated by hhsearch for the proteins it found and the LGA scores obtained by superimposing the models that hhmakemodel created from these hits against 2I3C. A trend that the e-value of the hhsearch results correlates with the LGA results could be inferred.

</figtable>
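
One way to put a number on the suspected relationship between the hhsearch e-value and the LGA_S score is a rank correlation. The following is a minimal sketch using Spearman's formula for untied data, with the values copied from the table above. The function names are illustrative assumptions, and with only six data points the result is a rough indication rather than evidence.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; ties are not handled in this sketch."""
    if len(set(values)) != len(values):
        raise ValueError("tied values are not handled in this sketch")
    order = sorted(values)
    return [order.index(value) + 1 for value in values]


def spearman_rho(xs, ys):
    """Spearman rank correlation for untied data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Values copied from the table above, in the order
# 2GU2, 3NH4, 2QJ8, 1UWY, 2BOA, 1JQG
e_values = [8e-11, 2.2e-84, 7e-16, 0.44, 1.6, 15]
lga_s = [97.3, 94.3, 37.6, 15.9, 16.1, 13.7]

rho = spearman_rho(e_values, lga_s)
print(round(rho, 3))
```

A negative value means that lower e-values tend to go with higher LGA_S scores. The sketch does not handle tied values, so it cannot be applied unchanged to the probability column, where 2GU2 and 3NH4 both have 100.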

Tasks