Gaucher Disease: Task 04 - Lab Journal

From Bioinformatikpedia


table.colBasic2 { margin-left: auto; margin-right: auto; border: 1px solid black; border-collapse:collapse; }

.colBasic2 th,td { padding: 3px; border: 1px solid black; }

.colBasic2 td { text-align:left; }

/* for orange try #ff7f00 and #ffaa56 for blue try #005fbf and #aad4ff

maria's style blue: #adceff grey: #efefef

  • /

.colBasic2 tr th { background-color:#efefef; color: black;} .colBasic2 tr:first-child th { background-color:#adceff; color:black;}


Exploring Structural Alignments

The structure of the Pyridoxal kinase (2F7K) is not from our set, because we did not have any hits with sequence identity <30% to our reference protein. So we extracted 2F7K randomly from COPS entries with a different L30 group. The sequence identity to the reference structure 1ogs_A is determined by CATH (SSAP).

CATH description of different levels

  • 3.20.20 TIM Barrel
  • 3.20 Alpha-Beta Barrel
  • 3 Alpha Beta
  • 1 mostly Alpha


On the LGA server the PDB IDs of the proteins were used. The method automatically chose chain A.

1. default parameters:

-4 -o2 -gdc

2. with aligned CA atoms:

-4 -o2 -gdc -atom:CA -lga_m

3. with aligned all atoms

-4 -o2 -gdc -ah:0 -lga_m

In the end we decided to use the default parameters (1.), as this results led to better RMSD values than the other parameters (2. und 3.).


In Pymol we aligned each structure of our set, shown in table1, to our disease causing protein structure. For this we only used chain A, as we also used only chain A for calculating the RMSD with other methods. On the other hand the steric configuration of both identical chains are arranged different than two chains of another structure. So, even if each chain has a low RMSD to one chain of our protein, the steric configuration can lead to a high RMSD anyway.

To align all atoms:

align 1ogs_A and resi 1-497, structure2 and resi 1-n

To align all C-alpha atoms:

align 1ogs_A and resi 1-497 and name ca, structure2 and resi 1-n and name ca

whereas n is the sequence length of structure2.


SSAP is the structural alignment method used by CATH. The structures were entered via their PDB ID. In case of several chains, we always used Chain A.


For TopMatch, we used the TopMatch webservice. Only the chains A of the structures are aligned.


SAP is a pairwise protein structure alignment method which uses double dynamic programming. We used the SAP webservice with the second option: "uploading the PDB files". Other possibility is to enter the PDB IDs. There is no possibility to restrain the alignments to a specific chain, therefore the whole structures were used for superposition. However, because the two chains in each protein are identical, duplicate alignments were eliminated implicitly by the program.

Evaluation of structural alignments and sequence alignments

The script requires a .hhr file of HHsearch or HHblits with hit list and alignments as an input. An .hhr file created in task 2 was used, namely by running 2 iterations HHblits against uniprot_20 followed by one iteration against pdb_full with the UniProt sequence of Glucocerebrosidase, P04062, as a query. The number of shown hits and alignments was set to 10000. Otherwise default parameters were used. Overall 220 hits were found, from which the first 82 have E-values lower than 0 and can be regarded as more or less significant. From those hits, we selected the following 7 that span different E-values (lower than 0), scores, alignment lengths and sequence identities:

<figtable id="selected_hits">

No_hit PDB_ID Probability E-value Score Aligned_cols Identities
1 2v3f_A 100.00 2.4e-132 1078.26 497 100%
4 2nt0_A 100.00 4.4e-132 1074.98 496 100%
5 2wnw_A 100.00 3.7e-107 870.15 439 29%
8 3kl0_A 100.00 1.1e-77 633.96 356 19%
17 3s2c_A 98.79 7.7e-13 139.02 246 14%
31 1fob_A 97.76 1.1e-08 101.53 197 16%
82 1ur1_A 95.65 9.8e-05 73.33 175 11%
Selected hits with different E-values, HHblits scores, alignment lengths and sequence identities to the query sequence P04062.


Four of the structures that we used in our set in the "Exploring Structural Alignments" task are found in the hit list and were selected: 2XWD is contained in the cluster of hit 1 and the sequence identical 1OGS, 2NSX and 2NT1 are contained in the cluster of hit 4.

Then, we created models of P04062 using one of the selected hits at a time with as follows (line breaks inserted for a better overview):

perl /usr/share/hhsuite/scripts/ 
-i /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/output/task1/hhblits-2iter-uniprot20--1iter-pdb.hhr 
-d /mnt/project/pracstrucfunc13/data/pdb/20120401/entries/* 
-m 1 4 5 8 17 31 82 
-s -39
-ts /mnt/home/student/kalemanovm/master_practical/Assignment4_StructuralAlignments/hhmakemodel/P04062_models.pdb

In the command above, the parameter "-i" is the input .hhr file, "-d" is the directory with the PDB structures of the hits and "-m" is a list of indices of the hits to be used for modelling (default='-m 1'). The parameter "-s" shifts the residue indices up/down by an integer (default=0). This is needed for LGA to work properly, because the residue numbering in the models should be the same as in the experimental structure. As the PDB sequence, 1OGS, starts 39 residues later than the UniProt sequence, P04062, all residues numbers must be reduced by 39. Finally, "-ts" is the file, where the PDB-formatted models based on pairwise alignments should be written.

To make LGA alignment between each model and the reference structure, 1OGS, the models were first divided into separate files with a small script we wrote ( Then, the models were compared to the experimental structure, using LGA (see results). For this, the LGA server with the default parameters was used as described above. This time the input structures were given with the second option: "Upload your local files containing structure data or name the PDB codes". As "Molecule1 to rotate" a model PDB file was loaded and the PDB code with chain was given (for example "3s2c_A"). As "Molecule2 no change" the PDB code of the reference structure, "1ogs_A", was given.

Finally, we calculated Pearson's correlation coefficients between the HHblits and LGA scores with R.


R cor() function

How to compute Pearson's correlation coefficients