Researching SNPs TSD Journal
Contents
HGMD
Sequence retrieval and comparison
Retrieve the protein sequence from HGMD by going to the sequence view and applying switch view, as described in the task description. Download the page manually and save to a file (cdna_new.php). Then execute the following command: <source lang="bash">
grep -P --only-matching "\w{3}
\w{3}" ./cdna_new.php | cut -c99-101 | sed -n '$!p' | tr -d '\n' | tr '[:lower:]' '[:upper:]' |
./ThreeToOneLetterCode.py | tr -d '\n' > hgmdSequence_aa </source> Where 'ThreeToOneLetterCode.py' is the following script <source lang="python">
- !/usr/bin/env python
from Bio.PDB import to_one_letter_code as one_letter import sys
seq = sys.stdin.readline()
for aaa in range(0, len(seq), 3) :
print one_letter[seq[aaa:aaa+3]]
</source> To check whether the protein sequence from HGMD is the same than the one in Uniprot perform the following operations: <source lang="bash"> curl -s http://www.uniprot.org/uniprot/P06865.fasta | sed '1d' | tr -d '\n' > HEXA_HUMAN #Get Uniprot sequence without header perl -p -e 's|(.)|$1\n|g' HEXA_HUMAN > temp1 #Insert linebreak after every character in both sequences, for easy diff'ing perl -p -e 's|(.)|$1\n|g' hgmdSequence_aa > temp2 diff temp1 temp2 </source> This yields that position 436 differs with an Ile in Uniprot and a Val in HGMD.
Mutation retrieval and parsing
Navigate to the subsite with the missense/nonsense mutations and manually save it to 'hgmd_hexa_misssense_nonsense.php'. Then apply the following command: <source lang="bash"> parseHGMDMutations.pl hgmd_hexa_misssense_nonsense.php > missenseSNPHGMD.txt </source> Where 'parseHGMDMutations.pl' is the following script: <source lang="perl">
- !/usr/bin/perl
use strict; use warnings; use sigtrap; use autodie; use Bio::SeqUtils;
my $hgmd_page = shift;
my $fh; open($fh, '<', $hgmd_page); while(my $line = <$fh>) {
#This pattern matching will not find nonsense mutations, therefore already implicitly excluding these
if($line =~ m|<a name="\w{8}">(\w{8})</a>\w+-\w+(\w{3})-(\w{3})(\d+) <a href="http://www.biobase-international.com/product/hgmd"><img src=".+" alt="BioBase" border="0"></a>(.+).+|)
{ my $id = $1; my $aa_wt = $2; my $aa_mt = $3; my $pos = $4; my $pheno = $5;
print $Bio::SeqUtils::ONECODE{$aa_wt} . $pos . $Bio::SeqUtils::ONECODE{$aa_mt} . "\tHGMD\t$id\t$pheno\n"; }
} close($fh); </source>
dbSNP
At first the SNPs were queried with "synonymous-codon"[Function_Class] AND HEXA[GENE] AND "human"[ORGN] AND "snp"[SNP_CLASS] in dbSNP. Those results were saved and then the mutations extracted with
bash getdbSNP.sh entrez.html
It is important to note that in the dbSNP results the rs ids can be redundant. getdbSNP.sh: <source lang="bash">
- !/bin/bash
- ${1?"Usage: $0 textfile"}
website=$1
for rs in `egrep -o ">rs[0-9]+<" $website |uniq | egrep -o "rs[0-9]+"`; do
curl -s "http://www.ncbi.nlm.nih.gov/projects/SNP/snp_gene.cgi?rs=$rs" | egrep "aaCode|proteinPos" | cut -d \" -f 4 >> Mut_$website echo "---" >> Mut_$website
done </source>
The mutations were then checked with the reference sequence (!) and brought into the right format:
perl formatSNPs.pl -i Mut_entrez.html -f P06865.fasta
<source lang="perl">
- !usr/bin/perl
use Getopt::Long; our $opt_f = ();
sub usage {
printf "Usage: $0 [flags] file...\n"; printf "flags:\n";
printf " -i Infile raw mutations from getdbSNP.sh \n"; printf " -f Referenzfile fastafile containing referenz sequenz "; printf " -missense Missesne mutations are considered, default = synonymous "; exit 1;
} GetOptions( "i=s", "f=s", "missense" ); usage() if !( $opt_i && $opt_f );
open( REF, "<$opt_f" ) or die "Referenzfile not found";
@reference = (); while (<REF>){ chomp; @line = split( , $_ );
if ($line[0] ne ">") { @reference = (@reference,@line); }
} open (IN, "<$opt_i") or die "Inputfile not found"; %muts = (); while(<IN>){
chomp; if ($_ eq "---"){ if (!$opt_missense){ if ($reference[$pos-1] eq $aa){ print "$aa$pos$aa\n"; } else { print "! not matching reference $reference[$pos-1]: $aa$pos$aa\n"; } } else { foreach my $key ( keys %muts ){ if ($key ne $aa){ if ($reference[$pos-1] eq $aa){ print "$aa$pos$key\n"; } else { print "! not matching reference $reference[$pos-1]: $aa$pos$key\n"; } } } } $count=0; $aa=""; %muts = (); } elsif ($count==0){ $pos=$_; $count++; } elsif ($count>=1){ $muts{$_}=1; $aa=$_; $count++; }
}
</source>
The same procedure can be performed with the missense Mutations from dbSNP with the dbSNP search "missense"[Function_Class] AND HEXA[GENE] AND "human"[ORGN] AND "snp"[SNP_CLASS] followed by
bash getdbSNP.sh entrezMissense.html perl formatSNPs.pl -i Mut_entrezMissense.html -f P06865.fasta -missense