Structural Alignments (Phenylketonuria)
Latest revision as of 20:58, 5 September 2013
Summary
Structural alignments are used to determine the functional and evolutionary relationships between protein structures<ref name="struc_align"> Walter Pirovano, K Anton Feenstra and Jaap Heringa (2008): "The meaning of alignment: lessons from structural diversity". BMC Bioinformatics Vol.9:556. doi:10.1186/1471-2105-9-556 </ref>. In this task, we first generated a dataset of structures that are related or unrelated to our protein (PAH). We then applied several methods and measures to quantify the structural similarity between these structures. Finally, we used structural alignments to evaluate some of the sequence-based alignments from Task 2. The results and the accompanying discussion are shown below.
Explore structural alignments
Dataset generation
Our protein (PAH) has the CATH code 1.10.800.10 (Phenylalanine Hydroxylase). The dataset consists of structures that are similar and dissimilar to this protein. We included the following structures:
- reference structure of PAH: 2PAH (96.41% identity)
- identical sequence with filled binding site: 1LRM (100% identity; the 3D structure in the PDB entry shows two filled binding sites, with the ligands FE and HBI)
- identical sequence with unfilled binding site: none found
- low sequence identity: 3LUY (32.2%; no PDB entry below 30% was found)
- high sequence identity: 2PHM (89.7%)
- CAT: 1J8U (CATH code: 1.10.800.10); there is no other category than this one for CAT
- CA: 2B5U (CATH code: 1.10.287.620)
- C: 3BQO (CATH code: 1.25.40.210)
- other CATH category: 1V8H (CATH code: 2.60.40.10)
Next, we apply different structural alignment methods to this dataset. Each structure only has to be superimposed on the reference structure, not on the other structures.
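The C / CA / CAT labels in the list above refer to how many leading levels of the CATH code an entry shares with PAH. The following minimal sketch illustrates this with the codes given in the list; the function and constant names are our own and are not part of any CATH tool.

```python
# CATH code of the reference protein PAH, as stated in the text.
REFERENCE_CATH = "1.10.800.10"

# PDB ID -> CATH code, copied from the dataset list.
DATASET = {
    "1J8U": "1.10.800.10",
    "2B5U": "1.10.287.620",
    "3BQO": "1.25.40.210",
    "1V8H": "2.60.40.10",
}

# Index = number of shared leading levels (hypothetical labels for printing).
LEVEL_NAMES = ["none", "C", "CA", "CAT", "CATH"]


def shared_cath_levels(code_a, code_b):
    """Count how many leading CATH levels two codes have in common."""
    shared = 0
    for part_a, part_b in zip(code_a.split("."), code_b.split(".")):
        if part_a != part_b:
            break
        shared += 1
    return shared


if __name__ == "__main__":
    for pdb_id, code in DATASET.items():
        depth = shared_cath_levels(REFERENCE_CATH, code)
        print(pdb_id, code, "shares:", LEVEL_NAMES[depth])
```

Note that 1J8U shares all four levels with the reference, which matches the remark that no other category was available for the CAT entry.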
Pymol
Pymol is a Python-enhanced, open-source molecular visualization tool. It is particularly suitable for 3D visualization of proteins and small molecules as well as their densities, surfaces and trajectories. It also supports molecular editing and the alignment or superimposition of two molecules. <ref> http://sourceforge.net/projects/pymol/ short Pymol summary, retrieved June 02, 2013 </ref> Pymol can report RMSD values either for all atoms or for Cα atoms only; <xr id="rmsd"/> shows the values for all atoms. The images below show 2PAH superimposed on each of the proteins listed in the dataset generation section.
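<figure id="2PAH_1LRM">[Figure 2PAH_1LRM.png] 3D structure of the aligned 2PAH (green) and 1LRM (purple) proteins in cartoon style. The two iron (FE) ligands of 2PAH (one of which is also found in 1LRM) are shown as grey spheres, and the 7,8-Dihydrobiopterin (HBI) ligand of 1LRM is shown as red sticks.</figure>
<figure id="2PAH_3LUY">[Figure 2PAH_3LUY.png] 3D structure of the aligned 2PAH (green) and 3LUY (purple) proteins in cartoon style. The two iron (FE) ligands of 2PAH are shown as grey spheres, and the 3-Phenylpyruvic acid (PPY) ligand is shown as red sticks.</figure>
<figure id="2PAH_2PHM">[Figure 2PAH_2PHM.png] 3D structure of the aligned 2PAH (green) and 2PHM (purple) proteins in cartoon style. The two iron (FE) ligands of 2PAH (one of which is also found in 2PHM) are shown as grey spheres.</figure>
<figure id="2PAH_1J8U">[Figure 2PAH_1J8U.png] 3D structure of the aligned 2PAH (green) and 1J8U (purple) proteins in cartoon style. The two iron (FE) ligands of 2PAH (one FE2 is also found in 1J8U) are shown as grey spheres, and the 5,6,7,8-Tetrahydrobiopterin (H4B) ligand is shown as red sticks.</figure>
<figure id="2PAH_2B5U">[Figure 2PAH_2B5U.png] 3D structure of the aligned 2PAH (green) and 2B5U (purple) proteins in cartoon style. The two iron (FE) ligands of 2PAH are shown as grey spheres, and the citric acid (CIT) ligand is shown as red sticks.</figure>
<figure id="2PAH_3BQO">[Figure 2PAH_3BQO.png] 3D structure of the aligned 2PAH (green) and 3BQO (purple) proteins in cartoon style. The two iron (FE) ligands of 2PAH are shown as grey spheres. 3BQO contains no ligand.</figure>
<figure id="2PAH_1V8H">[Figure 2PAH_1V8H.png] 3D structure of the aligned 2PAH (green) and 1V8H (purple) proteins in cartoon style. The two iron (FE) ligands of 2PAH are shown as grey spheres. 1V8H contains no ligand.</figure>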
<xr id="2PAH_1LRM"/>, <xr id="2PAH_2PHM"/> and <xr id="2PAH_1J8U"/> show that 1LRM, 2PHM and 1J8U are highly similar to 2PAH, whereas the other proteins align poorly with our reference protein.
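The superpositions could be reproduced with a short Pymol command script. The sketch below only generates such a script as text, so it runs without Pymol installed. It assumes the `fetch` and `align` commands of the Pymol command language; whether `align` or another command (such as `super`) was actually used for the figures is not documented here, and the helper name `build_pml` is our own.

```python
REFERENCE = "2PAH"
OTHERS = ["1LRM", "3LUY", "2PHM", "1J8U", "2B5U", "3BQO", "1V8H"]


def build_pml(reference, others, ca_only=False):
    """Return the lines of a Pymol command script (.pml).

    Each structure is fetched and aligned onto the reference only,
    never onto the other structures. With ca_only=True the alignment
    is restricted to C-alpha atoms via the selection 'name CA'.
    """
    suffix = " and name CA" if ca_only else ""
    # async=0 asks Pymol to finish the download before continuing
    # (assumption: the option is accepted by the installed version).
    lines = [f"fetch {reference}, async=0"]
    for pdb_id in others:
        lines.append(f"fetch {pdb_id}, async=0")
        # 'align mobile, target' superimposes and reports an RMSD.
        lines.append(f"align {pdb_id}{suffix}, {reference}{suffix}")
    return lines


if __name__ == "__main__":
    print("\n".join(build_pml(REFERENCE, OTHERS)))
```

Saving the printed lines to a file and running it inside Pymol should report one RMSD per structure pair; the values need not match the table exactly, since they depend on the command and its refinement settings.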
LGA
The LGA (Local-Global Alignment) method can compare fragments or whole protein structures in sequence-dependent and sequence-independent modes <ref name="lga"> Adam Zemla (2003): "LGA: a method for finding 3D similarities in protein structures". Nucleic Acids Research Vol.31(13):3370-3374. doi:10.1093/nar/gkg571 </ref>. It uses two measures, LCS (longest continuous segments) and GDT (global distance test), to detect regions of local and global structural similarity <ref name="slides"> File:Presentation structuralAlignments.pdf: Slides of Katharina's presentation. </ref>. The resulting data can be used in a scoring function that ranks pairs of structures by their level of similarity. This allows structure classification when many proteins are analyzed, as well as clustering of similar protein structure fragments <ref name="lga"/>.
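To give an idea of what a global distance test measures, here is a strongly simplified sketch. It assumes two equally long Cα traces that are already superimposed and averages the fraction of residue pairs within the cutoffs 1, 2, 4 and 8 Å (the commonly used GDT_TS cutoffs). The real LGA program searches over many superpositions and combines GDT with LCS, which this sketch does not attempt.

```python
import math


def gdt_ts(coords_a, coords_b, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0..100) for two superimposed coordinate lists.

    coords_a and coords_b are lists of (x, y, z) tuples where position i
    in one list corresponds to position i in the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    distances = [math.dist(a, b) for a, b in zip(coords_a, coords_b)]
    # Fraction of residue pairs that lie within each distance cutoff.
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    print(gdt_ts(a, b))
```

Unlike the RMSD, such a score is bounded and less dominated by a few badly fitting residues, which is one reason to report both.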
SSAP / CATHEDRAL (used by CATH)
For the alignment method used by CATH, we used the SSAP Server. The sequential structure alignment program (SSAP) compares protein structures based on distance plots. For each residue, it computes a "residue view": the set of vectors from the Cβ atom of that residue to the Cβ atoms of all other residues. <ref name="ssap"> Christine A. Orengo and William R. Taylor (1996): "SSAP: Sequential Structure Alignment Program for Protein Structure Comparison". Methods in Enzymology Vol.266:617–635. PMID:8743709 </ref>
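The residue view described above can be illustrated with a few lines of code. This is only a sketch under simplifying assumptions: the vectors are taken in the global coordinate frame of the structure, whereas SSAP, as far as we understand, expresses them in a local frame per residue so that views of different proteins become comparable. The function name is our own.

```python
def residue_view(cb_coords, i):
    """Vectors from the C-beta atom of residue i to all other C-beta atoms.

    cb_coords is a list of (x, y, z) tuples, one per residue.
    The result keeps the residue order and skips residue i itself.
    """
    origin = cb_coords[i]
    return [
        tuple(c - o for c, o in zip(other, origin))
        for j, other in enumerate(cb_coords)
        if j != i
    ]


if __name__ == "__main__":
    toy = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
    for index in range(len(toy)):
        print(index, residue_view(toy, index))
```

Comparing the views of two residues from different proteins then indicates how similar their structural environments are.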
TopMatch
TopMatch is a structure comparison tool and the successor of ProSup. It is useful for protein structure alignment, for visualizing structural similarities and for highlighting relationships between proteins. <ref name="topmatch"> Manfred J. Sippl and Markus Wiederstein (2008): "A note on difficult structure alignment problems". Bioinformatics Vol.24(3): 426-427 doi:10.1093/bioinformatics/btm622 </ref> The method represents structures by their Cα atoms and joins multiple chains into a single one. <ref name="slides"/>
SAP or CE
We used the CE server to build the structural alignment. CE builds an alignment between two protein structures by combinatorial extension (CE) of an alignment path defined by aligned fragment pairs (AFPs). An AFP is a pair of fragments, one from each protein, that are structurally similar according to their local geometry. The algorithm is described as fast and accurate in finding an optimal alignment. <ref> Ilya N. Shindyalov and Philip E. Bourne (1998): "Protein Structure Alignment by Incremental Combinatorial Extension (CE) of the Optimal Path". Protein Engineering Vol.11(9): 739-747. </ref> CE is also directly available at RCSB-CE (RCSB PDB Protein Comparison Tool), where only the algorithm jCEalgorithm has to be selected. That tool additionally offers a variety of other methods for generating sequence and structural alignments.
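The idea of judging an AFP by local geometry can be sketched as follows. The code compares the internal Cα-Cα distances of two fragments of length m and returns their mean absolute difference, so that a small value marks a good candidate AFP. This is a simplified illustration, not the CE implementation: the fragment length of 8, the function name and the exact scoring are assumptions made for this example, and the extension of AFPs into an alignment path is not shown.

```python
import math


def afp_distance(ca_a, start_a, ca_b, start_b, m=8):
    """Mean absolute difference of internal distances of two fragments.

    ca_a and ca_b are lists of (x, y, z) C-alpha coordinates; the
    fragments start at start_a and start_b and have length m.
    The measure is invariant to rotation and translation because it
    only uses distances within each fragment.
    """
    frag_a = ca_a[start_a:start_a + m]
    frag_b = ca_b[start_b:start_b + m]
    if len(frag_a) != m or len(frag_b) != m:
        raise ValueError("fragment runs past the end of a chain")
    total = 0.0
    for k in range(m):
        for l in range(m):
            d_a = math.dist(frag_a[k], frag_a[l])
            d_b = math.dist(frag_b[k], frag_b[l])
            total += abs(d_a - d_b)
    return total / (m * m)


if __name__ == "__main__":
    straight = [(float(k), 0.0, 0.0) for k in range(8)]
    shifted = [(k + 5.0, 5.0, 5.0) for k in range(8)]
    stretched = [(2.0 * k, 0.0, 0.0) for k in range(8)]
    print(afp_distance(straight, 0, shifted, 0))
    print(afp_distance(straight, 0, stretched, 0))
```

A translated copy of a fragment scores zero, while a stretched one scores clearly above zero.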
Modelling scores
To compare the different models, we compare their RMSDs (root-mean-square deviations). TopMatch uses the same formula but calls it Er (root-mean-square error). The RMSD is the square root of the mean squared distance between corresponding positions of two superimposed proteins and is given in Ångström. The results are shown in <xr id="rmsd"/>. <figtable id="rmsd">
RMSD results: root-mean-square deviation/error in Ångström for Pymol and the four protein structure alignment methods LGA, SSAP, TopMatch and CE.
Method | 1LRM | 3LUY | 2PHM | 1J8U | 2B5U | 3BQO | 1V8H |
---|---|---|---|---|---|---|---|
Pymol | 0.49 | 21.08 | 0.57 | 0.50 | 22.70 | 14.66 | 22.03 |
LGA-RMSD | 0.81 | 3.30 | 0.88 | 0.73 | 3.07 | 3.59 | 3.42 |
SSAP-RMSD | 0.99 | 18.77 | 1.24 | 1.02 | 39.16 | 22.39 | 7.27 |
TopMatch-Er | 0.60 | 1.98 | 0.81 | 0.63 | 1.21 | 1.12 | 3.25 |
CE-RMSD | 0.65 | 5.13 | 0.95 | 0.68 | 4.06 | 4.68 | 5.92 |
</figtable>
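For reference, the RMSD of two superimposed structures can be computed as in the following minimal sketch. It assumes that both coordinate lists have the same length and that position i in one list corresponds to position i in the other; finding the optimal superposition and the residue correspondence is the job of the alignment tools and is not shown. The function name is our own.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation of two superimposed coordinate lists.

    Each list holds (x, y, z) tuples in Angstrom; the result is the
    square root of the mean squared distance between paired positions.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = [math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    b = [(3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
    print(rmsd(a, b))
```

Because each tool selects a different set of aligned positions (all atoms, Cα or Cβ atoms, local or global alignment), the same formula can yield very different values for the same pair of structures.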
Among the four alignment methods, TopMatch yielded the lowest values. However, because it uses local alignments, TopMatch always finds something to align, even for unrelated structures. For example, 3LUY, which has a low sequence identity, still reaches an Er of 1.98, although the structure and the alignment itself show only small matching regions, which may well occur by chance. Nevertheless, for 1V8H, which is completely unrelated to our protein PAH, the value of 3.25 does reflect this distance.
For LGA and CE, sometimes one and sometimes the other gives the lower RMSD. CE seems to distinguish slightly better between related and unrelated structures, as the gap between its high and low RMSDs is larger than for LGA. TopMatch, LGA and CE never produced an RMSD above 6 Å for the proteins used in this task. In contrast, SSAP reached an RMSD of 39.16 Å for 2B5U, which shares only class and architecture with our protein. For very similar structures, the SSAP values differ far less from those of the other tools than they do for distant structures. Still, for the completely unrelated structure 1V8H, an RMSD of 7.27 Å seems too low compared with the value for 2B5U. These differences may originate from the fact that SSAP uses Cβ atoms rather than Cα atoms or all atoms for its calculations.
Overall, it is not easy to compare the tools even though they report the same parameter, because they use different approaches to calculate the distances.
Evaluate sequence alignments
Lab journal
In this part, the structural alignments are used to evaluate the sequence alignments. To this end, we compare the values reported by LGA with the sequence alignments made by hhsearch.
<xr id="model_rmsd"/> shows the results of LGA and hhsearch for eight example proteins.
<figtable id="model_rmsd">
LGA and hhsearch results: LGA comparison of our protein against proteins found with hhsearch, with the Cα atoms placed by hhmakemodel.pl. Only eight example proteins are shown. The columns RMSD, LGA_S, LGA_Q and seq_id come from LGA; probability, e-value and identities come from hhsearch.
PDB ID | RMSD | LGA_S | LGA_Q | seq_id | probability | e-value | identities (%) |
---|---|---|---|---|---|---|---|
1PHZ | 0.83 | 90.65 | 32.44 | 99.34 | 100.00 | 6.9e-165 | 92 |
1J8U | 0.73 | 90.29 | 35.83 | 99.67 | 100.00 | 3.1e-135 | 100 |
2V27 | 1.70 | 62.77 | 12.55 | 96.02 | 100.00 | 3.6e-74 | 32 |
2QMX | 3.18 | 7.46 | 1.25 | 4.88 | 98.20 | 1.1e-09 | 36 |
3LUY | 2.82 | 7.17 | 1.24 | 13.89 | 98.07 | 3.3e-09 | 22 |
1QEY | 0.64 | 3.65 | 1.63 | 0.00 | 54.00 | 3.4 | 67 |
1WYP | 2.67 | 8.43 | 1.37 | 0.00 | 29.42 | 15 | 19 |
1A6S | 3.15 | 6.93 | 1.08 | 11.43 | 20.59 | 29 | 36 |
</figtable>
As LGA_S and LGA_Q are closely linked to the RMSD calculation, the three values depend strongly on each other; therefore only the RMSD is compared with the sequence identities and e-values of the sequence alignments. The probabilities for a true relationship are below 30% for the last two example proteins, but very high for the first five. In particular, probabilities of 100% come with low e-values and low RMSDs. Only the third protein has a somewhat worse RMSD, but it also has a lower sequence identity.
Since there seems to be some connection between these values, we examined whether the e-values or the sequence identities of the hhsearch results are related to the RMSDs of the LGA calculations. For this we used the Pearson correlation coefficient, which measures the strength and direction of a linear relationship between two variables (see also http://en.wikipedia.org/wiki/Pearson%27s_correlation).
Pearson's correlation coefficient was calculated over all 26 proteins:
RMSD against e-value: | 0.4183518 |
RMSD against logarithmic e-value: | 0.7256793 |
RMSD against sequence identity: | -0.8315063 |
As already seen in <xr id="model_rmsd"/>, there appears to be a positive correlation between e-value and RMSD: the lower the e-value, the lower the RMSD. However, on the Pearson scale from -1.0 to 1.0, a value of about 0.42 is only moderate. With the logarithm of the e-value, the correlation is much more pronounced, at about 0.73. For sequence identity, we observe a strong negative correlation of about -0.83. It seems reasonable that a higher sequence identity goes along with a more similar structure and therefore a lower RMSD.
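The correlation analysis can be sketched as follows. The example uses only the eight proteins shown in the table, so its output will differ from the coefficients reported above, which were calculated over all 26 proteins. The function name and the use of the base-10 logarithm are assumptions made for this illustration; the text does not state which logarithm was used.

```python
import math

# (PDB ID, LGA RMSD, hhsearch e-value, hhsearch identities in %),
# copied from the eight example rows of the table.
ROWS = [
    ("1PHZ", 0.83, 6.9e-165, 92),
    ("1J8U", 0.73, 3.1e-135, 100),
    ("2V27", 1.70, 3.6e-74, 32),
    ("2QMX", 3.18, 1.1e-09, 36),
    ("3LUY", 2.82, 3.3e-09, 22),
    ("1QEY", 0.64, 3.4, 67),
    ("1WYP", 2.67, 15.0, 19),
    ("1A6S", 3.15, 29.0, 36),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    rmsds = [row[1] for row in ROWS]
    evalues = [row[2] for row in ROWS]
    log_evalues = [math.log10(e) for e in evalues]
    identities = [row[3] for row in ROWS]
    print("RMSD vs e-value:       ", pearson(rmsds, evalues))
    print("RMSD vs log10(e-value):", pearson(rmsds, log_evalues))
    print("RMSD vs identity:      ", pearson(rmsds, identities))
```

Taking the logarithm matters because e-values span many orders of magnitude, so a few large values would otherwise dominate a linear correlation.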
References
<references/>