Task 5: Mapping point mutations

From Bioinformatikpedia
Revision as of 10:20, 20 June 2011 by Meier (HGMD)

Task description

A detailed task description can be found here: Mapping point mutations

SNP databases

HGMD

  • HGMD
  • Searched for PAH
  • 429 Missense/Nonsense mutations known by HGMD Professional

There are several mutation types known for PAH:

  • Missense - A single-nucleotide point mutation in a codon that changes the encoded amino acid
  • Nonsense - A single-nucleotide point mutation in a codon that turns it into a stop codon, i.e. a signal to terminate translation
  • Splicing - A mutation that affects the splicing of the gene
  • Regulatory - A mutation that affects the regulation of the gene
  • Small/Gross deletions - A mutation that deletes a few/many nucleotides in the gene
  • Small/Gross insertions - A mutation that inserts a few/many nucleotides into the gene
  • Small indels - A deletion followed by an insertion at the site of the affected nucleotides
  • Gross duplications - A mutation that duplicates a piece of the DNA
  • Complex rearrangements - A mutation that changes the order of the sequence parts of a gene

One additional category of mutation is known, but is not recorded for PAH:

  • Repeat variations - A mutation that affects a repeated sequence in the gene

Reference Sequence

The reference sequence is given by the accession number NM_000277.1, whose entry contains the following amino acid sequence:

MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
IEVLDNTQQLKILADSINSEIGILCSALQKIK

SNPs

dbSNP

Clarifications on insertions and deletions

Insertions and deletions in coding regions can have very different effects. Those that introduce a frameshift are usually fatal for the protein. A frameshift occurs when the length of the insertion or deletion is not divisible by 3, in other words when Len mod 3 != 0, where Len is the length of the insertion or deletion in nucleotides. Insertions or deletions whose length is a multiple of 3 preserve the reading frame downstream of the mutation.
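The rule Len mod 3 != 0 can be written as a one-line check. The following is a minimal sketch; the function name `causes_frameshift` is our own choice for illustration and is not part of the parser described below.

```python
def causes_frameshift(indel_length: int) -> bool:
    """Return True if an insertion/deletion of this length shifts the reading frame.

    indel_length is the number of inserted or deleted nucleotides (Len in the text).
    """
    if indel_length <= 0:
        raise ValueError("indel length must be a positive number of nucleotides")
    # A codon has 3 nucleotides, so only multiples of 3 keep the frame intact.
    return indel_length % 3 != 0


if __name__ == "__main__":
    for length in (1, 2, 3, 4, 6):
        label = "frameshift" if causes_frameshift(length) else "in-frame"
        print(f"Len = {length}: {label}")
```

The check only looks at the length. It says nothing about where in the coding region the insertion or deletion occurs.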

Methodology

Retrieving Data
Figure 1: Query and result of our silent mutation search

We searched dbSNP for silent mutations in coding regions, i.e. we only considered SNPs that alter the triplet but not the encoded amino acid.

To do so, we used NCBI's Entrez interface, which is accessible at this URL:

  •  ://www.ncbi.nlm.nih.gov/sites/entrez?Db=snp

The advantage of the Entrez interface is that arbitrarily complex queries can be constructed to restrict the result set.

We constructed the following query to search for SNPs which are considered silent in the coding regions of the human PAH gene (see figure 1):

  • "synonymous-codon"[Function_Class] AND PAH[GENE] AND "human"[ORGN] AND "snp"[SNP_CLASS]
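A query of this form is a set of value[FIELD] filters joined with AND, so it can also be assembled programmatically. The sketch below rebuilds the query string shown above and wraps it into a URL for NCBI's E-utilities ESearch service. Using ESearch is our assumption; the search described here was done through the web interface, and the field names may have changed since this page was written. The helper `build_term` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint; the page itself only uses the web form.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(filters):
    """Join (value, field) pairs into a single Entrez term connected by AND."""
    return " AND ".join(f"{value}[{field}]" for value, field in filters)


# Filters copied from the query above; quoted values keep their quotes.
filters = [
    ('"synonymous-codon"', "Function_Class"),
    ("PAH", "GENE"),
    ('"human"', "ORGN"),
    ('"snp"', "SNP_CLASS"),
]

term = build_term(filters)
# urlencode takes care of escaping quotes, brackets and spaces.
url = ESEARCH_URL + "?" + urlencode({"db": "snp", "term": term})

if __name__ == "__main__":
    print(term)
    print(url)
```

Keeping the filters in a list makes it easy to add or drop a restriction without editing the query string by hand.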

Results of this query can be accessed directly via the following URL:

We decided to download the results as a FlatFile. This seemed to be the simplest format to process, and it contains almost all the information we need.

Processing Data

A self-written Perl script parses the relevant information out of the FlatFile downloaded in the previous step. The current version extracts the following fields: identifier, reference/mutated triplet, reference/mutated allele, frame of the mutation, reference/mutated residue, and residue position.

All retrieved annotations refer to the mRNA sequence with the GenBank accession number NM_000277.1. It contains the coding sequence of the PAH protein with the accession number NP_000268.1, which has exactly the same sequence as the UniProt entry for PAH (accession number P00439). Thus, no conversion to other residue coordinates was required to place these mutations on our mutation map.

With our Perl script (the code is accessible here: dbSNP Silent Mutations Parser) and the results in FlatFile format at hand, the only thing missing is the CDS sequence of NM_000277.1, which is needed to determine the triplet used for each residue. This CDS sequence can be found in NCBI's CCDS database under the accession number CCDS9092.1.

With all data at hand, we can run our Perl script in different modes to produce different outputs for different purposes. Currently, the following three outputs can be generated with the following commands:
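To illustrate why the CDS is needed, the sketch below shows how a reference triplet and its mutated counterpart could be derived from a CDS string, a residue position, the frame and the mutated allele. It is written in Python for brevity and is not the Perl parser linked above. It assumes that the residue position is 1-based and that the frame is the 1-based position of the SNP within the codon. The function names and the toy CDS are made up for the example.

```python
def codon_at(cds: str, aa_position: int) -> str:
    """Return the triplet that encodes residue aa_position (1-based) in the CDS."""
    start = (aa_position - 1) * 3
    codon = cds[start:start + 3].upper()
    if aa_position < 1 or len(codon) != 3:
        raise ValueError(f"residue position {aa_position} is outside the CDS")
    return codon


def mutate_codon(codon: str, frame: int, allele: str) -> str:
    """Replace the base at position frame (1, 2 or 3) of the codon with allele."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    return codon[:frame - 1] + allele.upper() + codon[frame:]


if __name__ == "__main__":
    # Toy CDS for illustration only (not the PAH sequence): ATG GAA CTG
    toy_cds = "ATGGAACTG"
    reference = codon_at(toy_cds, 2)
    mutated = mutate_codon(reference, 3, "G")
    print(reference, "->", mutated)
```

With the real CDS from CCDS9092.1 in place of the toy string, the same two steps would yield the reference and mutated triplet for each SNP.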

  • Generates a WikiTable of found silent mutations: ./parse.pl wiki snp_result.txt ccds_pah.fasta
  • Generates CSVs which can be used by our mapping tool to generate the mutation map: ./parse.pl map snp_result.txt ccds_pah.fasta
  • Generates a human-readable one-liner for each silent mutation: ./parse.pl list snp_result.txt ccds_pah.fasta

Results

We found the following silent mutations in dbSNP:

Identifier   AA-Position  Reference Triplet  Mutated Triplet  Reference Allele  Mutated Allele  Frame  Reference Residue  Mutated Residue
rs117308669  66           GAA                GAG              A                 G               3      E                  E
rs75065106   258          CTG                TTG              C                 T               1      L                  L
rs62651567   323          ACA                ACG              A                 G               3      T                  T
rs62508648   367          CTG                CTA              G                 A               3      L                  L
rs61747292   321          CTC                CTT              C                 T               3      L                  L
rs59326968   426          AAT                AAC              T                 C               3      N                  N
rs17852374   36           TCA                TCG              A                 G               3      S                  S
rs1801152    414          TAC                TAT              C                 T               3      Y                  Y
rs1801151    400          AGG                CGG              A                 C               1      R                  R
rs1801150    399          GTA                GTT              A                 T               3      V                  V
rs1801147    203          TGC                TGT              C                 T               3      C                  C
rs1801146    137          AGC                AGT              C                 T               3      S                  S
rs1801145    10           GGC                GGG              C                 G               3      G                  G
rs1126758    232          CAG                CAG              A                 G               3      Q                  Q
rs1042503    245          GTG                GTA              G                 A               3      V                  V
rs772897     385          CTG                CTC              G                 C               3      L                  L
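A simple sanity check on such a table is to translate the reference and the mutated triplet with the standard genetic code and confirm that both give the same residue. The sketch below does this for a few rows copied from the table above. It is an illustration in Python and not part of the Perl parser; the compact way of building the codon table and the name `is_synonymous` are our own choices.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(codon: str) -> str:
    """Translate one DNA triplet into a one-letter residue ('*' for stop)."""
    return CODON_TABLE[codon.upper()]


def is_synonymous(reference_triplet: str, mutated_triplet: str) -> bool:
    """Return True if both triplets encode the same residue."""
    return translate(reference_triplet) == translate(mutated_triplet)


# (identifier, reference triplet, mutated triplet, annotated residue) from the table.
ROWS = [
    ("rs117308669", "GAA", "GAG", "E"),
    ("rs75065106", "CTG", "TTG", "L"),
    ("rs1801151", "AGG", "CGG", "R"),
    ("rs1801152", "TAC", "TAT", "Y"),
]

if __name__ == "__main__":
    for identifier, ref, mut, residue in ROWS:
        ok = is_synonymous(ref, mut) and translate(ref) == residue
        print(identifier, ref, mut, residue, "ok" if ok else "MISMATCH")
```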

Comparing the annotation of HGMD and dbSNP

Alignment of the reference sequences

We decided to use the PAH sequence from UniProt (see UniProt).

  • MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
    NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
    FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
    EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
    RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
    QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
    KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
    IEVLDNTQQLKILADSINSEIGILCSALQKIK

Alignment with the reference sequence used in HGMD

The resulting alignment shows 100% identity without any gaps. It is therefore effectively a "self-alignment".

Alignment with the reference sequence used in dbSNP

Mapping

Discussion

dbSNP

At first, we were quite surprised to find only 16 silent mutations for the PAH gene. Normally, silent mutations are expected to occur more frequently in coding regions than missense/nonsense mutations, because they do not change the protein and are therefore largely exempt from purifying (negative) selection. We assume that the small number of known silent mutations in PAH is probably the result of a lack of data.

In the next step, we analyzed the frequencies of the different possible allele mutations, with the following results:

Allele Mutation         Absolute Frequency  Relative Frequency
A->T                    1                   0.0625
A->C                    1                   0.0625
A->G                    4                   0.25
T->A                    0                   0
T->C                    1                   0.0625
T->G                    0                   0
C->A                    0                   0
C->T                    5                   0.3125
C->G                    1                   0.0625
G->A                    2                   0.125
G->T                    0                   0
G->C                    1                   0.0625
Purine->Purine          6                   0.375
Purine->Pyrimidine      3                   0.1875
Pyrimidine->Pyrimidine  6                   0.375
Pyrimidine->Purine      1                   0.0625
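The counts in this table follow directly from the reference and mutated alleles of the 16 silent mutations listed in the Results section. A minimal Python sketch of the tally is given below. The allele pairs are copied from that table in row order; the helper names are our own.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# (reference allele, mutated allele) for the 16 silent mutations, in table order.
ALLELE_PAIRS = [
    ("A", "G"), ("C", "T"), ("A", "G"), ("G", "A"),
    ("C", "T"), ("T", "C"), ("A", "G"), ("C", "T"),
    ("A", "C"), ("A", "T"), ("C", "T"), ("C", "T"),
    ("C", "G"), ("A", "G"), ("G", "A"), ("G", "C"),
]


def base_class(base: str) -> str:
    """Classify a nucleobase as 'Purine' or 'Pyrimidine'."""
    if base in PURINES:
        return "Purine"
    if base in PYRIMIDINES:
        return "Pyrimidine"
    raise ValueError(f"unknown base: {base!r}")


# Count each specific substitution and each purine/pyrimidine class change.
substitutions = Counter(f"{ref}->{mut}" for ref, mut in ALLELE_PAIRS)
class_changes = Counter(
    f"{base_class(ref)}->{base_class(mut)}" for ref, mut in ALLELE_PAIRS
)
total = len(ALLELE_PAIRS)

if __name__ == "__main__":
    for name, count in list(substitutions.items()) + list(class_changes.items()):
        print(f"{name}\t{count}\t{count / total}")
```

Substitutions that never occur (for example T->A) simply do not appear in the counters; looking them up returns 0.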

We observed that C -> T and A -> G are the most frequent mutations, with relative frequencies of 0.3125 and 0.25, respectively. The most frequent mutation (C -> T) replaces a pyrimidine base (C) with another pyrimidine base (T), and the second most frequent mutation (A -> G) replaces a purine base (A) with another purine base (G). In general, purine -> purine and pyrimidine -> pyrimidine mutations are the most frequent ones, each with a relative frequency of 0.375.

This observation roughly matches our expectations, since a base is more likely to be replaced during replication by a base of the same type. The reason is that the DNA polymerase distinguishes the nucleobases by shape. Adenine and guanine have almost the same shape, since both have two rings. The same applies to cytosine and thymine, which also hardly differ in shape (both have only one ring). Hence, a DNA polymerase is more likely to mistake a purine base for another purine base, or a pyrimidine base for another pyrimidine base, than to mistake a purine base for a pyrimidine base or vice versa.