Difference between revisions of "Researching SNPs TSD Journal"

From Bioinformatikpedia
(dbSNP)
Line 31: Line 31:
 
== dbSNP ==
 
== dbSNP ==
 
At first the SNPs were queried with "synonymous-codon"[Function_Class] AND HEXA[GENE] AND "human"[ORGN] AND "snp"[SNP_CLASS] in [http://www.ncbi.nlm.nih.gov/sites/entrez dbSNP].
 
At first the SNPs were queried with "synonymous-codon"[Function_Class] AND HEXA[GENE] AND "human"[ORGN] AND "snp"[SNP_CLASS] in [http://www.ncbi.nlm.nih.gov/sites/entrez dbSNP].
Those results were saved and then the mutations extracted with bash getdbSNP.sh entrez.html.
+
Those results were saved and then the mutations extracted with
  +
bash getdbSNP.sh entrez.html
 
It is important to note that in the dbSNP results the rs ids can be redundant.
 
It is important to note that in the dbSNP results the rs ids can be redundant.
 
getdbSNP.sh:
 
getdbSNP.sh:
Line 46: Line 47:
 
</source>
 
</source>
   
The mutations were then checked with the reference sequence (!) and brought into the right format: perl formatSNPs.pl -i Mut_entrez.html -f P06865.fasta
+
The mutations were then checked with the reference sequence (!) and brought into the right format:
  +
perl formatSNPs.pl -i Mut_entrez.html -f P06865.fasta
 
<source lang="perl">
 
<source lang="perl">
 
#!usr/bin/perl
 
#!usr/bin/perl
Line 52: Line 54:
 
use Getopt::Long;
 
use Getopt::Long;
 
our $opt_f = ();
 
our $opt_f = ();
  +
 
sub usage {
 
sub usage {
 
printf "Usage: $0 [flags] file...\n";
 
printf "Usage: $0 [flags] file...\n";
Line 58: Line 61:
 
printf " -i Infile raw mutations from getdbSNP.sh \n";
 
printf " -i Infile raw mutations from getdbSNP.sh \n";
 
printf " -f Referenzfile fastafile containing referenz sequenz ";
 
printf " -f Referenzfile fastafile containing referenz sequenz ";
  +
printf " -missense Missesne mutations are considered, default = synonymous ";
 
exit 1;
 
exit 1;
 
}
 
}
GetOptions( "i=s", "f=s" );
+
GetOptions( "i=s", "f=s", "missense" );
 
usage() if !( $opt_i && $opt_f );
 
usage() if !( $opt_i && $opt_f );
  +
 
open( REF, "<$opt_f" ) or die "Referenzfile not found";
 
open( REF, "<$opt_f" ) or die "Referenzfile not found";
   
Line 73: Line 78:
 
}
 
}
 
open (IN, "<$opt_i") or die "Inputfile not found";
 
open (IN, "<$opt_i") or die "Inputfile not found";
  +
%muts = ();
 
while(<IN>){
 
while(<IN>){
 
chomp;
 
chomp;
 
if ($_ eq "---"){
 
if ($_ eq "---"){
if ($reference[$pos-1] eq $aa){
+
if (!$opt_missense){
print "$aa$pos$aa\n";
+
if ($reference[$pos-1] eq $aa){
  +
print "$aa$pos$aa\n";
  +
} else {
  +
print "! not matching reference $reference[$pos-1]: $aa$pos$aa\n";
  +
}
 
} else {
 
} else {
print "! not matching reference $reference[$pos-1]: $aa$pos$aa\n";
+
foreach my $key ( keys %muts ){
  +
if ($key ne $aa){
  +
if ($reference[$pos-1] eq $aa){
  +
print "$aa$pos$key\n";
  +
} else {
  +
print "! not matching reference $reference[$pos-1]: $aa$pos$key\n";
  +
}
  +
}
  +
}
 
}
 
}
 
$count=0;
 
$count=0;
 
$aa="";
 
$aa="";
  +
%muts = ();
 
}
 
}
 
elsif ($count==0){
 
elsif ($count==0){
Line 89: Line 108:
 
}
 
}
 
elsif ($count>=1){
 
elsif ($count>=1){
  +
$muts{$_}=1;
 
$aa=$_;
 
$aa=$_;
 
$count++;
 
$count++;
 
}
 
}
 
}
 
}
  +
 
</source>
 
</source>
  +
  +
The same procedure can be performed with the '''missense''' Mutations from dbSNP with
  +
bash getdbSNP.sh entrezMissense.html
  +
perl formatSNPs.pl -i Mut_entrezMissense.html -f P06865.fasta -missense
   
 
== SNPdbe ==
 
== SNPdbe ==

Revision as of 11:56, 9 June 2012

HGMD

Sequence retrieval and comparison

Retrieve the protein sequence from HGMD by going to the sequence view and applying switch view, as described in the task description. Download the page manually and save to a file (cdna_new.php). Then execute the following command: <source lang="bash">

grep -P --only-matching "\w{3}
\w{3}" ./cdna_new.php | cut -c99-101 | sed -n '$!p' | tr -d '\n' | tr '[:lower:]' '[:upper:]' |

./ThreeToOneLetterCode.py | tr -d '\n' > hgmdSequence_aa </source> Where 'ThreeToOneLetterCode.py' is the following script <source lang="python">

  1. !/usr/bin/env python

from Bio.PDB import to_one_letter_code as one_letter import sys

seq = sys.stdin.readline()

for aaa in range(0, len(seq), 3) :

 print one_letter[seq[aaa:aaa+3]]

</source> To check whether the protein sequence from HGMD is the same than the one in Uniprot perform the following operations: <source lang="bash"> curl -s http://www.uniprot.org/uniprot/P06865.fasta | sed '1d' | tr -d '\n' > HEXA_HUMAN #Get Uniprot sequence without header perl -p -e 's|(.)|$1\n|g' HEXA_HUMAN > temp1 #Insert linebreak after every character in both sequences, for easy diff'ing perl -p -e 's|(.)|$1\n|g' hgmdSequence_aa > temp2 diff temp1 temp2 </source>

Mutation retrieval and parsing

Navigate to the subsite with the missense/nonsense mutations and manually save it to 'hgmd_hexa_misssense_nonsense.php'. This yields that position 436 differs with an Ile in Uniprot and a Val in HGMD.

dbSNP

At first the SNPs were queried with "synonymous-codon"[Function_Class] AND HEXA[GENE] AND "human"[ORGN] AND "snp"[SNP_CLASS] in dbSNP. Those results were saved and then the mutations extracted with

bash getdbSNP.sh entrez.html

It is important to note that in the dbSNP results the rs ids can be redundant. getdbSNP.sh: <source lang="bash">

  1. !/bin/bash
${1?"Usage: $0 textfile"}

website=$1

for rs in `egrep -o ">rs[0-9]+<" $website |uniq | egrep -o "rs[0-9]+"`; do

       curl -s "http://www.ncbi.nlm.nih.gov/projects/SNP/snp_gene.cgi?rs=$rs" | egrep "aaCode|proteinPos" | cut -d \" -f 4 >> Mut_$website
       echo "---" >> Mut_$website

done </source>

The mutations were then checked with the reference sequence (!) and brought into the right format:

perl formatSNPs.pl -i Mut_entrez.html -f P06865.fasta

<source lang="perl">

  1. !usr/bin/perl

use Getopt::Long; our $opt_f = ();

sub usage {

       printf "Usage: $0 [flags] file...\n";
       printf "flags:\n";
       printf "    -i Infile   raw mutations from getdbSNP.sh \n";
       printf "    -f  Referenzfile fastafile containing referenz sequenz ";
       printf "    -missense  Missesne mutations are considered, default = synonymous ";
       exit 1;

} GetOptions( "i=s", "f=s", "missense" ); usage() if !( $opt_i && $opt_f );

open( REF, "<$opt_f" ) or die "Referenzfile not found";

@reference = (); while (<REF>){ chomp; @line = split( , $_ );

       if ($line[0] ne ">") {
       @reference = (@reference,@line);
       }

} open (IN, "<$opt_i") or die "Inputfile not found"; %muts = (); while(<IN>){

       chomp;
       if ($_ eq "---"){
               if (!$opt_missense){
                       if ($reference[$pos-1] eq $aa){
                               print "$aa$pos$aa\n";
                       } else {
                               print "! not matching reference $reference[$pos-1]:  $aa$pos$aa\n";
                       }
               } else {
                       foreach my $key ( keys %muts ){
                               if ($key ne $aa){
                               if ($reference[$pos-1] eq $aa){
                                       print "$aa$pos$key\n";
                               } else {
                                       print "! not matching reference $reference[$pos-1]:  $aa$pos$key\n";
                               }
                               }
                       }
               }
               $count=0;
               $aa="";
               %muts = ();
       }
       elsif ($count==0){
               $pos=$_;
               $count++;
       }
       elsif ($count>=1){
               $muts{$_}=1;
               $aa=$_;
               $count++;
               }

}

</source>

The same procedure can be performed with the missense Mutations from dbSNP with

bash getdbSNP.sh entrezMissense.html
perl formatSNPs.pl -i Mut_entrezMissense.html -f P06865.fasta -missense

SNPdbe

Mutation map