Difference between revisions of "Lab Journal - Task 6 (PAH)"

From Bioinformatikpedia
Latest revision as of 15:07, 17 August 2013

All files are stored in /mnt/home/student/waldraffs/masterpractical/Task6

  • /ras: contains all files for H-RAS
  • /pah: contains all files for PAH

Multiple Sequence Alignment

The multiple alignments are downloaded from the Pfam server and converted into a freecontact-readable format using a2m2aln.

  1. Protein H-RAS:
    /usr/share/freecontact/a2m2aln -q '^RASH_HUMAN/(\d+)' --quiet < PF00071_full.txt > PF00071.aln
  2. For our protein PAH, we have two domains (ACT: PF01842, Biopterin: PF00351) and therefore used the hhblits result of Task2. The .a3m file is converted into Stockholm format using:
    perl /usr/share/hhsuite/scripts/reformat.pl a3m sto PAH_2000.a3m PAH_2000.stockholm


After that, the header is changed into # query=" and all positions that have a gap in the query sequence are removed. The result is PAH.aln.
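
The gap-removal step can be sketched as follows. This is only a minimal illustration in Python, not the procedure that was actually used to produce PAH.aln. The function name, the assumption that the query is the first sequence, and the use of '-' and '.' as gap characters are all hypothetical.

```python
def drop_query_gap_columns(sequences, gap_chars="-."):
    """Keep only the alignment columns in which the query has a residue.

    Assumes (hypothetically) that the query is the first sequence and that
    all sequences have the same aligned length.
    """
    query = sequences[0]
    keep = [i for i, c in enumerate(query) if c not in gap_chars]
    return ["".join(seq[i] for i in keep) for seq in sequences]


# Toy alignment, not real PAH data.
alignment = [
    "MK-TL.A",  # query
    "MRGTLQA",
    "M--SLQ-",
]
trimmed = drop_query_gap_columns(alignment)
print(trimmed)
```

After trimming, every column index corresponds directly to a residue position of the query, which is what the later residue-pair numbering relies on.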

Calculate and analyze correlated mutations

  1. Freecontact is used to calculate the CN-scores for the multiple alignments:
    freecontact -o evfold < '<FILE>.aln' > <FILE>.evfold
  2. contact_map.pl extracts all residue pairs with less than 5 Ångstrom minimum atom distance:
    perl contact_map.pl -pdb <pdb-file> -out <output-file>
  3. extract_pairs.pl extracts all residue pairs with distance >5. If such a pair is also included in the output of contact_map.pl, it is marked with 'TP' (true positive), otherwise with 'FP' (false positive):
    perl extract_pairs.pl -inp <FILE>.evfold -map <contact_map.pl output-file> -out <output-file>
  4. The results are sorted (CN-score descending) for both all and extracted residue pairs:
    sort -k 6 -g -r <FILE> >sort_<FILE>
  5. CN_dist2.R makes histograms of the CN-score distribution (for all and extracted pairs). Furthermore, it calculates the top L-Score (L = protein length) for each residue i that belongs to the top L:
    top L-Score(i) = (sum of CN scores for residue i)/mean(CN-Scores of top L)
  6. contact_map.R creates a contact map with the output-files of the two perl scripts above (pdb = reference structure, extracted = predicted).
  7. Evcouplings
    Reference structure for Ras is 121p.
    For the biopterin family we have to set the starting position to 106 to get a multiple alignment.
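
Steps 2 and 3 above can be illustrated with a short Python sketch. It is not the code of contact_map.pl or extract_pairs.pl. The function names and the toy coordinates are hypothetical, and the input is assumed to be a dictionary that maps residue numbers to lists of atom coordinates.

```python
import math
from itertools import combinations


def min_atom_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between any atom of A and any atom of B."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def contact_pairs(residues, cutoff=5.0):
    """Residue pairs whose minimum atom distance is below the cutoff (in Angstrom).

    residues: dict mapping residue number -> list of (x, y, z) coordinates.
    """
    contacts = set()
    for i, j in combinations(sorted(residues), 2):
        if min_atom_distance(residues[i], residues[j]) < cutoff:
            contacts.add((i, j))
    return contacts


def label_predictions(predicted, contacts):
    """Mark each predicted pair 'TP' if it is a structural contact, else 'FP'."""
    return [
        (i, j, "TP" if (min(i, j), max(i, j)) in contacts else "FP")
        for i, j in predicted
    ]


# Toy coordinates, not taken from a real PDB file.
residues = {
    1: [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    2: [(4.0, 0.0, 0.0)],
    10: [(20.0, 0.0, 0.0)],
}
contacts = contact_pairs(residues)
labels = label_predictions([(2, 1), (1, 10)], contacts)
print(contacts, labels)
```

Pairs are stored with the smaller residue number first, so a predicted pair is matched regardless of the order in which its two residues are listed.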

The perl and R scripts can be found in /mnt/home/student/waldraffs/masterpractical/Task06.
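
The top L-Score from step 5 can be sketched in Python as follows. This is one reading of the formula, not the code of CN_dist2.R. It assumes that the sum for residue i runs over the top L pairs only. The function name and the toy scores are hypothetical.

```python
from collections import defaultdict


def top_l_scores(pairs, length):
    """Top L-Score per residue.

    pairs:  list of (i, j, cn_score) tuples.
    length: protein length L; only the L highest-scoring pairs are used.
    Assumption: the per-residue sum is taken over the top L pairs only.
    """
    top = sorted(pairs, key=lambda p: p[2], reverse=True)[:length]
    mean_cn = sum(p[2] for p in top) / len(top)
    sums = defaultdict(float)
    for i, j, cn in top:
        sums[i] += cn
        sums[j] += cn
    return {res: total / mean_cn for res, total in sums.items()}


# Toy scores, not real freecontact output; L is set to 2 for illustration.
pairs = [(1, 8, 2.0), (1, 12, 1.0), (3, 9, 0.5), (4, 11, 0.1)]
scores = top_l_scores(pairs, length=2)
print(scores)
```

A residue that appears in several high-scoring pairs receives a score well above 1, while residues outside the top L pairs receive no score at all.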

Calculate structural model

The Pfam alignment of H-Ras has a length of 160, so we use the following numbers of contacts: 64, 104 and 160.
For biopterin, the protein length is 346, as we only align amino acids 106 to 452, so we take 138, 225 and 346 as the numbers of contacts.
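
The contact numbers above are consistent with taking 0.4, 0.65 and 1.0 times the alignment length and rounding to the nearest integer. These fractions are inferred from the numbers and are not stated in the journal. The function name in this small Python check is hypothetical.

```python
def contact_counts(length, fractions=(0.4, 0.65, 1.0)):
    """Number of contacts for each fraction of the length (fractions are inferred)."""
    return [round(length * f) for f in fractions]


ras_counts = contact_counts(160)
pah_counts = contact_counts(346)
print(ras_counts)  # [64, 104, 160]
print(pah_counts)  # [138, 225, 346]
```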