Difference between revisions of "Gaucher Disease - Task 06 - Lab Journal"

From Bioinformatikpedia
(2. Calculate and analyze correlated mutations)

Revision as of 11:53, 24 August 2013

Multiple sequence alignment

For HRas, we downloaded the full MSA in FASTA format (=a2m?) from Pfam: Ras (PF00071). It contains 21,243 sequences. For the calculation of correlated mutations using freecontact, the MSA (ras.txt) had to be reformatted (to ras.aln) with /usr/share/freecontact/a2m2aln like this:

/usr/share/freecontact/a2m2aln --query '^RASH_HUMAN/(\d+)' --quiet < ras.txt > ras.aln

The alignments can be found in: /mnt/home/student/kalemanovm/master_practical/Assignment6_Evolutionary_sequence_variation/pfam_Ras_ali/

For our protein, the Pfam alignment of the only family found, Glyco_hydro_30 (PF02055), contained only 1,151 sequences, which is not enough for freecontact. We therefore used our own alignments from task 2: HHblits found 17,538 hits with the default E-value cutoff and 3,189 hits with an E-value cutoff of 10E-1 after 10 iterations (actually after 5). We used both alignments to compare the results: the search with the default E-value produced more aligned sequences, which is an advantage for freecontact, but some of those hits could be false positives, as discussed.

2. Calculate and analyze correlated mutations

The steps below are described for HRas, but were carried out in the same way for Glucocerebrosidase. All programs used can be found in the directory /mnt/home/student/gerkej/gaucher/task6/ ; the results for each protein are in the corresponding subdirectories pfam_Ras/ and P04062/.

1. With the reformatted alignments, the residue contacts were predicted with freecontact:

 freecontact --parprof evfold < ras.aln >  ras_contacts.out

2. All pairs whose two residues are fewer than 5 positions apart in the sequence were removed. The remaining pairs were ranked by their CN values.

python rank_contacts.py ras_contacts.out filtered_ras_contact.out
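The filtering and ranking in step 2 can be sketched as follows. This is not the original rank_contacts.py, only a minimal illustration. It assumes that each line of the freecontact output has six whitespace-separated fields, with the residue numbers in fields 1 and 3 and the CN score in the last field; the function names parse_contacts and filter_and_rank are hypothetical.

```python
def parse_contacts(lines):
    """Parse (i, j, cn) triples from contact lines.

    ASSUMED layout per line: i aa_i j aa_j mi cn
    (residue numbers in fields 1 and 3, CN score in field 6).
    """
    contacts = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        contacts.append((int(fields[0]), int(fields[2]), float(fields[5])))
    return contacts


def filter_and_rank(contacts, min_separation=5):
    """Drop pairs closer than min_separation in sequence, then sort by CN (highest first)."""
    kept = [c for c in contacts if abs(c[0] - c[1]) >= min_separation]
    return sorted(kept, key=lambda c: c[2], reverse=True)
```

With an input file, this could be called as filter_and_rank(parse_contacts(open("ras_contacts.out"))).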

3. The distribution and range of the scores were analysed in R.

4. All pairs of predicted and filtered contacts with a CN>1 were taken as high scoring pairs. These high scoring pairs were checked against the real contacts of the pdb file (HRas: 121p.pdb [1], Glucocerebrosidase: 1OGS.pdb).

python distance_check.py filtered_ras_contact.out 121p.pdb

The program stores the coordinates of all atoms documented in the pdb file. It then calculates the Euclidean distance for every high scoring pair. If any atom of one residue is less than 5 Å away from any atom of the other residue, the contact counts as correctly predicted (TP). Otherwise, it is classified as FP. The resulting file contains all information about the high scoring pairs, including their state (TP or FP).
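The check described above can be sketched roughly as follows. This is not the original distance_check.py. It assumes that the residue numbers of the predicted pairs match the residue numbers in the pdb file, and it skips pairs with a residue that is missing from the structure (both are assumptions). The names read_atoms, min_distance and classify_pairs are hypothetical.

```python
import math
from collections import defaultdict


def read_atoms(pdb_lines, chain=None):
    """Collect atom coordinates per residue number from ATOM records.

    Uses the fixed columns of the PDB format: chain ID in column 22,
    residue number in columns 23-26, x/y/z in columns 31-54.
    """
    atoms = defaultdict(list)
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if chain is not None and line[21] != chain:
            continue
        res_num = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms[res_num].append(xyz)
    return atoms


def min_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between any atom of A and any atom of B."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def classify_pairs(contacts, atoms, cn_cutoff=1.0, dist_cutoff=5.0):
    """Label each high scoring pair (CN > cn_cutoff) as TP or FP."""
    results = []
    for i, j, cn in contacts:
        if cn <= cn_cutoff:
            continue  # not a high scoring pair
        if i not in atoms or j not in atoms:
            continue  # ASSUMPTION: residues absent from the structure are skipped
        state = "TP" if min_distance(atoms[i], atoms[j]) < dist_cutoff else "FP"
        results.append((i, j, cn, state))
    return results
```

Comparing all atoms rather than only C-alpha atoms follows the description above; it is slower, but for a few hundred high scoring pairs the cost is negligible.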

5. All high scoring pairs were visualised in a contact map with R (contact_map.R). First, all residue pairs of the pdb file with a distance below 5 Å were determined (by /pdb_distance_check.py). The pdb contacts that were not predicted as high scoring pairs (FN) are coloured lightblue in the contact map. The high scoring pairs are coloured according to their state: darkblue (TP) and red (FP).
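The three classes shown in the contact map can be expressed as simple set operations. The sketch below is in Python for illustration only (the actual plot was made with contact_map.R); the function name contact_map_classes and the COLOURS mapping are hypothetical, and both inputs are assumed to be collections of residue-number pairs.

```python
# Colours as used in the contact map described above
COLOURS = {"TP": "darkblue", "FP": "red", "FN": "lightblue"}


def contact_map_classes(predicted_pairs, structure_contacts):
    """Split residue pairs into TP, FP and FN sets.

    predicted_pairs:    high scoring pairs (i, j)
    structure_contacts: pairs (i, j) that are below 5 Å in the pdb file
    Pairs are normalised so that (i, j) and (j, i) count as the same contact.
    """
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    observed = {tuple(sorted(p)) for p in structure_contacts}
    return {
        "TP": predicted & observed,  # predicted and present in the structure
        "FP": predicted - observed,  # predicted but not in the structure
        "FN": observed - predicted,  # in the structure but not predicted
    }
```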

6. After that, the contact.out file was parsed with calc_hotspot.py to calculate the hotspot residues. The script can be found in the directory /mnt/home/student/gerkej/gaucher/task6/, and the contact.out files are stored in the subdirectories pfam_Ras/ and P04062/.