Difference between revisions of "Task 4: Structural Alignments"

Latest revision as of 14:12, 2 October 2013


Explore structural alignments

lab journal task 4

PDB structures selection

We first selected a set of structures that span different ranges of sequence identity to the reference structure (1A6Z). Domain A of the reference structure has the CATH annotation 3.30.500.10.9 (Murine Class I Major Histocompatibility Complex H2-DB subunit A domain 1), and domain B has the annotation 2.60.40.10 (immunoglobulins). We took domain A as the template and searched only for structures with an annotation similar to 3.30.500.10.9, because the immunoglobulin domain is only bound to the protein and not directly connected to it, and because the disease-causing mutations are all located in the MHC domain. <xr id="selected structures"/> lists the structures, their CATH numbers and their percent sequence identity to the reference. Unfortunately, apart from 1DE4, which has the identical sequence, we could not find a structure with a sequence identity above 60%. The most similar non-identical structure we could find was 1QVO with 39% identity.

**Table 1:** Selected PDB structures with their chain, domain, CATH annotation, sequence identity to the reference 1A6Z_A and protein type.
category	ID	chain	domain	CATH number	Sequence identity (%)	protein (organism)
reference	1A6Z	A	1	3.30.500.10	-	HFE (Homo sapiens)
identical sequence	1DE4	A	1	3.30.500.10	100	HFE (Homo sapiens)
> 30% SeqID	1QVO	A	01	3.30.500.10	39	HLA class I histocompatibility antigen, A-11 alpha chain (Homo sapiens)
< 30% SeqID	1S7X	A	00	3.30.500.10	29	H-2 class I histocompatibility antigen, D-B alpha chain (Mus musculus)
CAT	2IA1	A	01	3.30.500.20	11.1	BH3703 protein (Bacillus halodurans)
CA	3NCI	A	01	3.30.342.10	5.8	DNA polymerase (Enterobacteria phage RB69)
C	1VZY	A	01	3.55.30.10	2.8	33 KDA CHAPERONIN (Bacillus subtilis)
different CATH	1MUS	A	01	1.10.246.40	12.6	Tn5 transposase (Escherichia coli)

</figtable>

Results

In PyMOL, each structure from <xr id="selected structures"/> was aligned to the reference 1A6Z_A, once using only the C_alpha atoms and once using all atoms. The resulting RMSD values are listed in <xr id="score results"/>. The numbers in brackets after the RMSD values give the number of aligned residues (or, for the all-atom alignment, aligned atoms) used to compute the corresponding value.
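As a reminder of what the RMSD values measure, the following is a minimal sketch of the RMSD over a set of equivalent atom pairs. It assumes the two structures are already superimposed and that the pairing is known; it is not the PyMOL implementation. The function name `rmsd` and the toy coordinates are made up for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points.

    Assumes both structures are already superimposed and that point i in
    coords_a is the equivalent of point i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy coordinates in Å (made up): three "C_alpha atoms" per structure.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
target = [(0.0, 1.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 2.0)]

print(round(rmsd(reference, target), 3))
```

Because the sum is divided by the number of pairs, the value depends on which pairs are included, which is why each RMSD is reported together with the number of aligned residues or atoms.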

**Table 2:** Results of the structural alignments of the selected proteins to the template 1A6Z_A. The different alignment scores are listed for each method and the numbers of equivalent residues are stated in brackets after the RMSD.
PDB ID	Seq. identity (%)	Pymol		LGA		SSAP		TopMatch			CE
		RMSD (only C_alpha)	RMSD (all atom)	RMSD	LGA_S	RMSD	SSAP_Score	RMSD	S	S_r	RMSD	Score
1DE4_A	100	0.675 (237)	0.767 (1836)	1.14 (267)	95.77	1.60 (272)	93.07	1.08 (266)	260	1.03	1.19 (267)	543
1QVO_A	39	2.165 (233)	2.279 (1565)	2.29 (259)	67.86	2.58 (268)	86.39	2.62 (259)	228	2.50	2.44 (266)	432
1S7X_A	29	1.889 (233)	2.049 (1557)	2.12 (256)	71.90	2.36 (267)	86.25	2.66 (259)	227	2.56	2.29 (265)	342
2IA1_A	11.1	18.132 (74)	18.283 (501)	2.83 (86)	19.44	15.85 (140)	56.19	2.91 (89)	76	2.82	3.93 (93)	300
3NCI_A	5.8	16.561 (26)	17.329 (178)	3.11 (84)	17.19	14.54 (168)	30.18	3.05 (63)	53	2.94	4.47 (75)	333
1VZY_A	2.8	6.260 (29)	6.951 (168)	3.25 (63)	13.44	26.34 (208)	58.01	2.61 (77)	68	2.53	5.80 (91)	245
1MUS_A	12.6	23.521 (180)	23.891 (1143)	2.82 (69)	16.02	18.53 (215)	46.30	3.58 (88)	69	3.43	6.61 (78)	379

</figtable>

Images of the superimposed structures, using the C_alpha atoms, are shown in <xr id="pymol str. al."/>. The pictures show clearly that a successful superposition is only possible if the two structures share a certain level of sequence identity. 1QVO_A (39% sequence identity) could be aligned to the reference with a low RMSD, but 1S7X_A has an even lower value although its sequence identity is smaller (29%). This could be explained by 1S7X_A being the mouse ortholog (Class I Major Histocompatibility Complex H2-DB chain A) of the human reference 1A6Z_A, which would give it a nearly identical structure. Apart from the three structures 1DE4_A, 1QVO_A and 1S7X_A, the other proteins could not really be superimposed onto the reference. This is indicated by the high RMSD values in column 3 and by the low number of equivalent residues in <xr id="score results"/>. Using all atoms for the computation of the RMSD did not improve the alignments, see column 4 of <xr id="score results"/>. Instead, it led to an overall higher RMSD.

**Figure 1:** Visualisation of the pairwise structural alignments of all selected proteins to the template 1A6Z_A using the C_alpha atoms. The template is shown in green and the target in red.
	1DE4_A (red) aligned to 1A6Z_A (green). The sequences are identical and thus the alignment is nearly perfect.	1QVO_A (red) aligned to 1A6Z_A (green). Both proteins share a comparatively high sequence identity and could be aligned quite well.	1S7X_A (red) aligned to 1A6Z_A (green). Although the two sequences share only 29% sequence identity, they could be aligned very well. This can be explained by the fact that the proteins are orthologs from two different species.	2IA1_A (red) aligned to 1A6Z_A (green). The two proteins could not be aligned well despite the fact that they share the same CAT numbers.	3NCI_A (red) aligned to 1A6Z_A (green). The alignment was not successful.	1VZY_A (red) aligned to 1A6Z_A (green). Because the proteins share only the same C number, the alignment is not good.	1MUS_A (red) aligned to 1A6Z_A (green). The proteins have completely different CATH annotations and therefore different structures that cannot be aligned.

</figure>

Further structural alignment methods (LGA, SSAP, TopMatch and CE) were applied, in addition to PyMOL, to superimpose all the structures onto the reference. The resulting alignment scores are listed in <xr id="score results"/>. RMSD values vary between the methods, but this can be explained by the varying number of equivalent residues each method found: the more residues are aligned, the higher the RMSD tends to be.

LGA is the best method for finding good local superpositions, as can be seen from its very low RMSD values for structures with low sequence identity. Nevertheless, the LGA_S score gives a good impression of how similar the two structures are globally. Very similar structures get a high value near 100, whereas divergent structures only score 13-20 in our case.

SSAP aligned the most residues of all methods. However, the SSAP_Score, which ranges from 0 to 100, is relatively high for the structures with low sequence identity. For example, 1MUS_A has a score of 46.30 although the two proteins do not share a common fold. This gives a false impression of structural similarity.

TopMatch also has low RMSD values overall, and the numbers of equivalent residues are comparable to those of LGA. However, three values must be taken into account to get the right impression of how similar two structures are: the score S, S_r and the number of aligned residues. For example, both 1QVO_A and 1VZY_A have an S_r of approximately 2.5, and the difference between the number of equivalent residues and S is also low for both (1QVO_A: 31, 1VZY_A: 9). Only when one takes into account that both proteins have roughly the same length and that a higher percentage of the residues of 1QVO_A is aligned does it become obvious that 1QVO_A is structurally more closely related to the reference than 1VZY_A.
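The arithmetic behind this comparison can be retraced with a few lines of Python. This is only a sketch over the TopMatch columns of Table 2; the dictionary layout and the name `gap` are our own choices, not part of TopMatch.

```python
# (aligned residues, S, S_r) for TopMatch, copied from Table 2.
topmatch = {
    "1DE4_A": (266, 260, 1.03),
    "1QVO_A": (259, 228, 2.50),
    "1S7X_A": (259, 227, 2.56),
    "2IA1_A": (89, 76, 2.82),
    "3NCI_A": (63, 53, 2.94),
    "1VZY_A": (77, 68, 2.53),
    "1MUS_A": (88, 69, 3.43),
}

# Difference between the number of equivalent residues and the score S.
gap = {pdb_id: aligned - s for pdb_id, (aligned, s, _s_r) in topmatch.items()}

for pdb_id, (aligned, s, s_r) in topmatch.items():
    print(f"{pdb_id}: aligned={aligned:3d}  S={s:3d}  aligned-S={gap[pdb_id]:2d}  S_r={s_r:.2f}")
```

The output makes the point of the paragraph visible: 1QVO_A and 1VZY_A look alike in S_r and in the difference, and only the number of aligned residues (259 versus 77) separates them.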

CE is difficult to interpret since we could not find an explanation of the Score in any paper. Also, the RMSD does not seem to be strongly correlated with the Score. A low RMSD usually goes with a higher score, but there are exceptions. For example, the scores of the alignments with 1MUS_A (379) and 1S7X_A (342) are very similar, but the RMSD of the latter is more than 4 Å lower (2.29 versus 6.61).

Two LGA superpositions are shown in <xr id="lga examples"/>. LGA could find the local similarity between 2IA1_A and 1A6Z_A, which consists of the alpha helix at the top and some beta sheets. 1MUS_A could be aligned to 1A6Z_A based on one alpha helix and a few beta sheets. However, the relatively low LGA_S scores for both proteins (2IA1_A: 19.44, 1MUS_A: 16.02) make it clear that the similarity is only local and not global. Therefore, we find that LGA gives us the best impression of structural relatedness.

**Figure 2:** Visualisation of the pairwise structural alignments of 2IA1_A and 1MUS_A to the template 1A6Z_A using LGA. The template is shown in green and the target in red.
	LGA alignment of 2IA1_A (red) to 1A6Z_A (green).	LGA alignment of 1MUS_A (red) to 1A6Z_A (green).

</figure>

Structural alignments for evaluating sequence alignments

In this task, we wanted to evaluate sequence alignments with the help of LGA structural alignments. HHsearch and the PDB database were used to generate sequence alignments to the reference.

**Table 3:** The 9 sequences selected from the HHsearch sequence search against PDB structures. The alignment scores are listed together with the LGA scores of the corresponding model.
	LGA					hhsearch
PDB ID	superimposed residues	RMSD	Seq_id	LGA_S	LGA_Q	probability	E-value	seq. identity
3p73_A	257	1.99	96.11	75.9	12.3	100.0	4E-68	34.9
1uvq_B	147	2.71	74.15	34.9	5.2	100.0	3.5E-40	22.2
1bii_B	91	1.89	97.8	30.6	4.6	99.6	4.2E-19	18.0
1i1c_A	70	1.73	97.14	23.7	3.8	98.4	2.1E-10	25.0
2vol_A	68	1.85	89.7	22.4	3.5	97.6	9.4E-08	24.4
1iga_A	58	1.99	74.14	16.0	2.8	96.1	9.4E-05	21.3
2wng_A	62	1.91	95.2	19.1	3.1	94.5	0.0022	14.6
1wwc_A	55	2.52	63.64	14.0	2.1	92.9	0.01	6.9
1rhf_A	51	2.22	74.51	13.6	2.2	79.4	0.4	8.0

</figtable>

In total, we found 449 sequences, from which we selected 9, see <xr id="models"/>. The corresponding PDB structures were then used to construct simple 3D models with hhmakemodel.pl. The models were superimposed onto the reference structure 1A6Z_A with LGA. <xr id="models"/> also contains the resulting structural alignment scores.

**Table 4:** Pearson's correlation coefficient between all pairs of scores. The log10 of the E-value yields the highest correlation to the number of superimposed residues, the sequence identity (LGA) and the LGA_S score.
	superimposed residues	RMSD	Seq_id	LGA_S	LGA_Q
probability	0.4689	-0.1825	0.4517	0.4977	0.4794
log10( E-value )	-0.9919	-0.1500	-0.9663	-0.9544	-0.2915
seq. identity	0.733	-0.3858	0.5799	0.7784	0.7866

</figtable>

We evaluated whether there is a correlation between the sequence alignment scores and the similarity of the model structure to the experimental structure. Pearson's correlation coefficient was computed for all pairs of alignment and LGA scores, see <xr id="correlations"/>. The base-10 logarithm of the E-value has very strong negative correlations with the number of superimposed residues, the LGA sequence identity and the LGA_S score. This means that the smaller the E-value, the higher the number of superimposed residues, the sequence identity and the LGA_S, and therefore the quality of the superposition. The HHsearch sequence identity has the strongest correlation with the RMSD (-0.3858) and with the LGA_Q (0.7866): the higher the sequence identity, the smaller the RMSD and the higher the LGA_Q score. Hence, a high sequence identity also increases the quality of the superposition.
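As an illustration of how such a coefficient is obtained, the sketch below recomputes the correlation between log10(E-value) and the number of superimposed residues from the values in Table 3. The helper `pearson` is written out by hand for transparency; it is an assumption about the procedure, since the tool originally used for Table 4 is not stated.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# HHsearch E-values and LGA superimposed residues, in the row order of Table 3.
e_values = [4e-68, 3.5e-40, 4.2e-19, 2.1e-10, 9.4e-08, 9.4e-05, 0.0022, 0.01, 0.4]
superimposed = [257, 147, 91, 70, 68, 58, 62, 55, 51]

log_e = [math.log10(e) for e in e_values]
r = pearson(log_e, superimposed)
print(f"r(log10 E-value, superimposed residues) = {r:.4f}")
```

With the Table 3 values this comes out close to the -0.99 reported in Table 4. Taking the logarithm first matters here, because the E-values span almost 70 orders of magnitude.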

It is unexpected that all correlations with the RMSD are weak (between -0.15 and -0.39) and thus hardly informative. We would have expected stronger correlations, since the RMSD gives a good measure of how close two 3D structures are to each other. A possible explanation is that LGA uses only the residues under a defined distance cutoff (the superimposed residues) for the calculation of the RMSD. Models built from lower-probability alignments could only be superimposed in local regions, and thus the RMSD remains quite low although the global model is most likely not that good.
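The effect can be illustrated with a small sketch. The per-residue distances below are made up, the 5.0 Å cutoff is an assumed value, and the helper names are ours; this is not LGA's actual algorithm, only the idea of restricting the RMSD to pairs under a cutoff.

```python
import math


def rmsd_from_distances(distances):
    """RMSD from per-residue distances (Å) between equivalent atoms."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def rmsd_under_cutoff(distances, cutoff):
    """RMSD over the pairs within the cutoff, plus how many pairs were kept."""
    kept = [d for d in distances if d <= cutoff]
    if not kept:
        raise ValueError("no residue pair lies within the cutoff")
    return rmsd_from_distances(kept), len(kept)


CUTOFF = 5.0  # Å, assumed value for illustration

# Made-up distances after superposition: a globally good and a globally poor model.
good_model = [0.8, 1.1, 1.5, 2.0, 2.4, 1.9, 1.2, 2.8]
poor_model = [1.0, 1.6, 2.2, 9.5, 14.0, 18.3, 22.7, 30.1]

for name, distances in (("good", good_model), ("poor", poor_model)):
    local_rmsd, n_kept = rmsd_under_cutoff(distances, CUTOFF)
    print(
        f"{name}: RMSD under cutoff = {local_rmsd:.2f} Å over {n_kept} residues, "
        f"RMSD over all residues = {rmsd_from_distances(distances):.2f} Å"
    )
```

In this toy case the poor model even gets the lower RMSD under the cutoff, because only its three well-fitting residues are counted. The number of superimposed residues, not the RMSD, carries the information about global quality, which is consistent with the correlations in Table 4.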

