Researching SNPs TSD Journal

From Bioinformatikpedia
Revision as of 12:59, 10 June 2012 by Reeb (talk | contribs) (OMIM)

Back to results.

HGMD

Sequence retrieval and comparison

Retrieve the protein sequence from HGMD by going to the sequence view and applying switch view, as described in the task description. Download the page manually and save to a file (cdna_new.php). Then execute the following command: <source lang="bash">

grep -P --only-matching "\w{3}
\w{3}" ./cdna_new.php | cut -c99-101 | sed -n '$!p' | tr -d '\n' | tr '[:lower:]' '[:upper:]' |

./ThreeToOneLetterCode.py | tr -d '\n' > hgmdSequence_aa </source> Where 'ThreeToOneLetterCode.py' is the following script <source lang="python">

  1. !/usr/bin/env python

from Bio.PDB import to_one_letter_code as one_letter import sys

seq = sys.stdin.readline()

for aaa in range(0, len(seq), 3) :

 print one_letter[seq[aaa:aaa+3]]

</source> To check whether the protein sequence from HGMD is the same than the one in Uniprot perform the following operations: <source lang="bash"> curl -s http://www.uniprot.org/uniprot/P06865.fasta | sed '1d' | tr -d '\n' > HEXA_HUMAN #Get Uniprot sequence without header perl -p -e 's|(.)|$1\n|g' HEXA_HUMAN > temp1 #Insert linebreak after every character in both sequences, for easy diff'ing perl -p -e 's|(.)|$1\n|g' hgmdSequence_aa > temp2 diff temp1 temp2 </source> This yields that position 436 differs with an Ile in Uniprot and a Val in HGMD.

Mutation retrieval and parsing

Navigate to the subsite with the missense/nonsense mutations and manually save it to 'hgmd_hexa_misssense_nonsense.php'. Then apply the following command: <source lang="bash"> parseHGMDMutations.pl hgmd_hexa_misssense_nonsense.php > missenseSNPHGMD.txt </source> Where 'parseHGMDMutations.pl' is the following script: <source lang="perl">

  1. !/usr/bin/perl

use strict; use warnings; use sigtrap; use autodie; use Bio::SeqUtils;

my $hgmd_page = shift;

my $fh; open($fh, '<', $hgmd_page); while(my $line = <$fh>) {

 #This pattern matching will not find nonsense mutations, therefore already implicitly excluding these

if($line =~ m|<a name="\w{8}">(\w{8})</a>\w+-\w+(\w{3})-(\w{3})(\d+) <a href="http://www.biobase-international.com/product/hgmd"><img src=".+" alt="BioBase" border="0"></a>(.+).+|)

 {
   my $id = $1;
   my $aa_wt = $2;
   my $aa_mt = $3;
   my $pos = $4;
   my $pheno = $5;
   print $Bio::SeqUtils::ONECODE{$aa_wt} . $pos . $Bio::SeqUtils::ONECODE{$aa_mt} . "\tHGMD\t$id\t$pheno\n";
 }

} close($fh); </source>

dbSNP

At first the SNPs were queried with "synonymous-codon"[Function_Class] AND HEXA[GENE] AND "human"[ORGN] AND "snp"[SNP_CLASS] in dbSNP. Those results were saved and then the mutations extracted with

bash getdbSNP.sh entrez.html

It is important to note that in the dbSNP results the rs ids can be redundant. getdbSNP.sh: <source lang="bash">

  1. !/bin/bash
${1?"Usage: $0 textfile"}

website=$1

rm -f Mut_$website

for rs in `egrep -o ">rs[0-9]+<" $website |uniq | egrep -o "rs[0-9]+"`; do echo $rs >> Mut_$website curl -s "http://www.ncbi.nlm.nih.gov/projects/SNP/snp_gene.cgi?rs=$rs" | egrep "aaCode|proteinPos" | cut -d \" -f 4 >> Mut_$website echo "---" >> Mut_$website done </source>

The mutations were then checked with the reference sequence (!) and brought into the right format:

perl formatSNPs.pl -i Mut_entrez.html -f P06865.fasta

<source lang="perl">

  1. !usr/bin/perl

use Getopt::Long; our $opt_f = (); our $opt_d = "dbSNP"; sub usage {

       printf "Usage: $0 [flags] file...\n";
       printf "flags:\n";
       printf "    -i Infile   raw mutations from getdbSNP.sh \n";
       printf "    -f  Referenzfile fastafile containing referenz sequenz ";
       printf "    -missense  Missesne mutations are considered, default = synonymous ";
       printf "    -d name of database to write in output file, default = dbSNP ";
       exit 1;

} GetOptions( "i=s", "f=s", "missense", "d=s" ); usage() if !( $opt_i && $opt_f );

open( REF, "<$opt_f" ) or die "Referenzfile not found";

@reference = (); while (<REF>){ chomp; @line = split( , $_ ); if ($line[0] ne ">") { @reference = (@reference,@line); } }

open (IN, "<$opt_i") or die "Inputfile not found"; %muts = (); my $rs; while(<IN>){ chomp; if ($_ eq "---"){ if (!$opt_missense){ if ($reference[$pos-1] eq $aa){ print "$aa$pos$aa\t$opt_d\t$rs\t\t". (($opt_missense) ? 'missense' : 'synonymous') ."\n"; } else { print "! not matching reference $reference[$pos-1]: $aa$pos$aa\n"; } } else { foreach my $key ( keys %muts ){ if ($key ne $aa){ if ($reference[$pos-1] eq $aa){ print "$aa$pos$key\t$opt_d\t$rs\t\t". (($opt_missense) ? 'missense' : 'synonymous') . "\n"; } else { print "! not matching reference $reference[$pos-1]: $aa$pos$key\n"; } } } } $count=0; $aa=""; %muts = ();

       $rs = ;

} elsif ($count==0){ $rs=$_; $count++; } elsif ($count == 1)

   {
     $pos = $_;
     $count++;
   }

elsif ($count>=2){ $muts{$_}=1; $aa=$_; $count++; } } </source>

The same procedure can be performed with the missense Mutations from dbSNP with the dbSNP search "missense"[Function_Class] AND HEXA[GENE] AND "human"[ORGN] AND "snp"[SNP_CLASS] followed by

bash getdbSNP.sh entrezMissense.html
perl formatSNPs.pl -i Mut_entrezMissense.html -f P06865.fasta -missense

SNPdbe

The results are provided on the webpage in a tabular format, where a lot of information can be obtained very easily just by sorting the columns (e.g. conservation scores, exp. evidence). A text file with the results snps.txt can also be downloaded directly from the webpage. Within the text file the fields are separated by "|". In the first field the mutations is displayed. To format the SNPs according to the previously used conventions SNP ORIGIN ID this script can be used.

<source lang="perl">

  1. !usr/bin/perl

open(IN,"<$ARGV[0]") or die "no input file"; while (<IN>){

       $sep="";
       $iddbsnp="";
       $idswissvar="";
       $Dsep="";
       @line = split('\|', $_);
       print $line[0]."\tsnpdbe: ";
       if ($line[1] ne "N/A"){
               print "dbSNP";
               $sep=",";
               $Dsep=",";
               $iddbsnp = $line[1];
       }
       if ($line[2] ne "N/A"){
               print $sep."SwissProt";
               $sep=",";
       }
       if ($line[3] ne "N/A"){
               print $sep."SwissVar";
               $sep=",";
               $idswissvar = $line[3];
       } else { $Dsep="";}
       if ($line[4] ne "N/A"){
               print $sep."PMD";
               $sep=",";
       }
       if ($line[5] ne "N/A"){
               print $sep."1000genomes";
       }
       print "\t $iddbsnp$Dsep$idswissvar \n";

} </source>

The conservation scores are stored in the field 21 and 24, perc_wt and perc_mt:

cut -d "|" -f 1,22,25 snps.txt | sed s/\|/" "/g

OMIM

Retrieve all entries that have rs-Identifiers: <source lang="bash"> curl -s http://omim.org/allelicVariant/606869 | grep -P --only-matching "target=\"_blank\">rs\d+</a>" | sed 's/\s//' | grep -P --only-matching "rs\d+" > rs_OMIM.txt </source>

SNPedia

Retrieve all rs-identifiers from SNPedia (and already exclude one entry without protein information), using the following commandline: <source lang="bash"> curl -s http://www.snpedia.com/index.php\?title\=Special:Search\&ns0\=1\&redirs\=0\&search\=Gene%3DHEXA\&limit\=500\&offset\=0 | grep -P --only-matching "title=\"Rs\d+\"" | sed 's|title=||' | tr -d '"' | tr '[:upper:]' '[:lower:]' | grep -v 'rs121907980' > rs_SNPedia.txt </source> Then get more information on the rs-identifiers from dbSNP with the following comamndline <source lang="bash"> ./getSNPData.sh rs_SNPedia.txt </source> where getSNPData.sh ist the following script; <source lang="bash">

  1. !/usr/bin/bash

in=$1 rm -f enriched_syn_$in rm -f enriched_mis_$in

for rs in `cat $in`; do

 curl -s "http://www.ncbi.nlm.nih.gov/projects/SNP/snp_gene.cgi?rs=$rs"  > temp
 if [ `egrep 'missense' ./temp |wc -l` -gt 0 ]
 then
   echo $rs >> enriched_mis_$in
   cat temp | egrep "aaCode|proteinPos" | cut -d \" -f 4 > temp2
   #In theory, do special parsing of entries that consist of two or more mutations. Luckily for our case there are none ;)
   cat temp2 >> enriched_mis_$in
   echo "---" >> enriched_mis_$in
 else
     if [ `egrep 'synon' ./temp |wc -l` -gt 0 ]
     then
       echo $rs >> enriched_syn_$in
       cat temp | egrep "aaCode|proteinPos" | cut -d \" -f 4 > temp2
       #In theory, do special parsing of entries that consist of two or more mutations. Luckily for our case there are none ;)
       cat temp2 >> enriched_syn_$in
       echo "---" >> enriched_syn_$in
     else
       echo "Ignoring $rs because neither missense nor synon"
     fi
   fi

done

rm -f ./temp rm -f ./temp2 </source> With the output of this, we can use the script TODO from above to get the final formatted SNPs: <source lang="bash"> perl ../dbSNP/formatSNPs.pl -i ./enriched_mis_rs_SNPedia.txt -f ../dbSNP/P06865.fasta -missense -d SNPedia > missenseSNPSNPedia.txt perl ../dbSNP/formatSNPs.pl -i ./enriched_syn_rs_SNPedia.txt -f ../dbSNP/P06865.fasta -d SNPedia > > synonymousSNPSNPedia.txt </source>

Overlap to dbSNP

<source lang="bash"> cut -f 3 missenseSNPSNPedia.txt > temp1 cut -f 3 synonymousSNPSNPedia.txt > temp2 cat temp1 temp2 > tempc cat ../dbSNP/synonymous_rsids.txt ../dbSNP/missense_rsids.txt > dbsnpc grep -Fx -f tempc dbsnpc | wc -l #Yields: 30 </source>

Mutation map