Gaucher Disease - Task 06 - Lab Journal

From Bioinformatikpedia

1. Multiple sequence alignment

For HRas, we downloaded the full MSA in FASTA format of the Pfam family Ras (PF00071). It contains 21,243 sequences. For the calculation of correlated mutations using freecontact, the MSA (ras.txt) had to be reformatted into the 'aln' format (ras.aln) with /usr/share/freecontact/a2m2aln like this:

/usr/share/freecontact/a2m2aln --query '^RASH_HUMAN/(\d+)' --quiet < ras.txt > ras.aln

The alignment is located here on the student server: /mnt/home/student/kalemanovm/master_practical/Assignment6_Evolutionary_sequence_variation/pfam_Ras_ali/ras.aln

For our protein, we also used the full MSA in FASTA format of the Pfam family Glyco_hydro_30 (PF02055), which contains 1151 sequences. The MSA was reformatted analogously:

/usr/share/freecontact/a2m2aln --query '^GLCM_HUMAN/(\d+)' --quiet < PF02055_full.txt > PF02055_full.aln

The alignment is located here on the student server: /mnt/home/student/kalemanovm/master_practical/Assignment6_Evolutionary_sequence_variation/pfam_Glyco_hydro_30_ali/PF02055_full.aln


2. Calculate and analyze correlated mutations

The following steps below are all described on HRas, but were done for glucocerbrosidase the same. All used programs can be found in this directory /mnt/home/student/gerkej/gaucher/task6/ . The results for each protein in the corresponding subdirectories pfam_Ras/ and P04062/.

1. With the reformatted alignments the residue contacts where predicted with freecontact:

 freecontact --parprof evfold < ras.aln >  ras_contacts.out

2. All pairs with an smaller distance than 5 residues to its sequence neighbors were removed. The remaining pairs were ranked according to their CN values with rank_contacts.py.

3. An analysis of the distribution and range of scores was done by R.

quantile(cn_scores)

4. All pairs of predicted and filtered contacts with a CN>1 were taken as high scoring pairs. These high scoring pairs were checked against the real contacts of the PDB file (HRas: 121p.pdb [1], glucocerebrosidase: 1OGS.pdb). The program distance_check.py needs a file formatted like a freecontact output that contains the high scoring pais, as well as a pdb file of the reference structure.

The program stores the coordinates of all atoms documented in the PDB file. Then it calculates the distance of all high scoring pairs by using the euclidean distance. In case any atoms of two amino acids have a distance less than 5Å the contact is correctly predicted. Otherwise, it is classified as FP. The resulting file contains all information about the high scoring pairs including their states (TP or FP).

For glucocerebrosidase we had to map the positions of the predicted contacts to the residue positions of 1OGS.pdb. The first amino acid of the pdb file has the position 40 in the predicted contacts. We considered this shift in the comparison between predicted and real contacts as well as in the following analyses.

5. To identify a possible correlation a Pearson Correlation was calculated between the scores and their state (TP/FP) in R.

values=data.frame(c(CN,state))
cor(values,method="pearson")

6. All high scoring pairs were visualized in a contact map with R (contact_map.R). First of all, the distances of the PDB file that are lower than 5Å were determined (by pdb_distance_check.py). The pdb contacts which were not predicted as high scoring pairs (FN) are colored lightblue in the contact map. The calculated high scoring pairs and their state can be seen in darkblue (TP) and red (FP).

7. After that the file containing all contacts filtered in step 2. was parsed with calc_hotspot.py to calculate the hotspot residues. The program gets L (sequence length) high scoring pairs with the highest scores as input. From these pairs the program sums the scores of each residue and normalizes it with the average of all top L scores. All normalized residue scores are summarized in hot_spots.txt.

The EVcouplings server was run for both proteins to compare the 50 hot spot. The number of residues predicted as hotspots by both programs were counted to get the overlap of the predictions.


For running EVcouplings with glucocerebrosidase we took the residues 40-536 which correspond to the residues 1-497 in 1OGS.pdb. Another reason for that was the limited sequence size. EVcouplings only calculates domains with size <500 amino acids.

For searching after disease causing mutations of glucocerbrosidase we used the HGMD database.

Calculate Structural Models

The following excercise was only done with HRas. We tried also to run EVfold for our protein, but due to the length of 497 residues the scoring with PLM took to much time. We run EVcouplings for the uniprot sequence P01112 with different scores and different number of contacts. We also included the pdb structure (121.pdb) to improve the prediction and chose the alignment from the Pfam database (PF00071).

Sores:

  • DI
  • PLM

Number of EC constrains:

  • 161 residues (100%)
  • 129 residues (80%)
  • 97 residues (60%)
  • 64 residues (40%)
  • 32 residues (20%)

Configurations EVfold with DI

Configurations EVfold with PLM

Evfold generated five models for every score and constraint combination. For all these models we calculated the RMSD of the C-alpha atoms with pymol. All RMSD were visualizd in a boxplot. We took the average of all five models with the same score and constraint number to compare the performances. Of each score we aligned the best prediction to our orignial structure of HRas 121p.

All RMSDs

model   di_32   di_64  di_97  di_129  di_161
    1   6.815  14.260  4.077   4.246   4.734
    2   5.631   5.201  4.280   5.369  15.169
    3  15.427   4.736  4.158   5.754   5.155
    4   6.360   4.874  3.955   5.286   4.677
    5   7.903   5.130  5.318   4.805   5.568
model   plm32  plm64  plm97  plm129  plm161
    1  14.546  4.625  2.834   2.794   2.926
    2   6.291  3.861  3.303   2.896   3.951
    3   6.885  4.149  3.843   3.254   3.387
    4   7.082  4.326  2.952   2.623   3.165
    5  14.797  4.168  3.255   3.569   3.683