Difference between revisions of "Task 5: Mapping point mutations"

From Bioinformatikpedia

Revision as of 22:46, 16 June 2011

Task description

A detailed task description can be found here: Mapping point mutations

SNP databases

HGMD

  • HGMD
  • Searched for PAH
  • 429 Missense/Nonsense mutations known by HGMD Professional

There are several mutation types known for PAH:

  • Missense/nonsense
  • Splicing
  • Regulatory
  • Small deletions
  • Small insertions
  • Small indels
  • Gross deletions
  • Gross insertions/duplications
  • Complex rearrangements

HGMD defines one additional mutation category, for which no PAH entries are recorded:

  • Repeat variations

Reference Sequence

The reference sequence is given by the accession number NM_000277.1, whose entry contains the following amino acid sequence:

MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYMEEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFAQFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQRIEVLDNTQQLKILADSINSEIGILCSALQKIK

SNPs

SNPdb

Methodology

Retrieving Data
Figure 1: Shows the query and result for our silent mutation search

We searched dbSNP for silent (synonymous) mutations in coding regions, i.e. we only considered SNPs that change the codon (triplet) but not the encoded amino acid.
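The definition of a silent mutation can be made concrete with a minimal Python sketch. It assumes the standard genetic code and assumes that "frame" means the position (1 to 3) of the changed base within the codon; the function name `describe_snp` is hypothetical and not part of our Perl script.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def describe_snp(ref_codon, mut_codon):
    """Describe a single-base change between two codons.

    Assumption: 'frame' is the 1-based position of the changed base
    inside the codon.
    """
    diffs = [i for i in range(3) if ref_codon[i] != mut_codon[i]]
    if len(diffs) != 1:
        raise ValueError("expected exactly one differing base")
    i = diffs[0]
    ref_aa = CODON_TABLE[ref_codon]
    mut_aa = CODON_TABLE[mut_codon]
    return {
        "frame": i + 1,
        "ref_allele": ref_codon[i],
        "mut_allele": mut_codon[i],
        "ref_residue": ref_aa,
        "mut_residue": mut_aa,
        "silent": ref_aa == mut_aa,  # same amino acid: synonymous
    }
```

For example, `describe_snp("GAA", "GAG")` reports a change in the third codon position that leaves the residue E unchanged, whereas `describe_snp("GAA", "GAC")` changes E to D and is therefore not silent.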

To do so we used the Entrez interface of NCBI which is accessible under this URL:

  •  ://www.ncbi.nlm.nih.gov/sites/entrez?Db=snp

The advantage of this Entrez interface is that we can construct arbitrarily complex queries to restrict our result set.

We constructed the following query to search for SNPs which are considered silent in the coding regions of the human PAH gene (see figure 1):

  • "synonymous-codon"[Function_Class] AND PAH[GENE] AND "human"[ORGN] AND "snp"[SNP_CLASS]
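The same query could in principle also be sent programmatically. The following sketch only builds a request URL for the NCBI E-utilities `esearch` endpoint and does not contact the server. This is an assumption about an alternative route: we used the web interface, and the field names in the query may have changed since 2011. The helper name `build_esearch_url` and the `retmax` default are illustrative.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an alternative to the web interface).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The query exactly as entered in the Entrez web interface.
QUERY = (
    '"synonymous-codon"[Function_Class] AND PAH[GENE] '
    'AND "human"[ORGN] AND "snp"[SNP_CLASS]'
)


def build_esearch_url(term, db="snp", retmax=100):
    """Return a URL-encoded esearch request for the given query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


url = build_esearch_url(QUERY)
```

Building the URL with `urlencode` avoids hand-escaping the quotes and square brackets of the Entrez field tags.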

Results of this query can be accessed directly via the following URL:

We downloaded the results in FlatFile format. This seemed to be the simplest format to process, and it contains almost all the information we need.

Processing Data

To extract the relevant information from the FlatFile downloaded in the previous step, we wrote a Perl script. The current version parses the following fields: identifier, reference/mutated triplet, reference/mutated allele, frame of the mutation, reference/mutated residue, and residue position.

All annotations we retrieved refer to the mRNA sequence with the GenBank accession number NM_000277.1, which contains the coding sequence of our PAH protein (accession number NP_000268.1).
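The parsing step can be sketched in Python as follows. This is not our Perl script, and the record layout in `SAMPLE` is a simplified, hypothetical stand-in (pipe-delimited `key=value` fields on `LOC` lines), not a verified copy of the dbSNP FlatFile layout. The values in the sample mirror the first row of the results table.

```python
def parse_fields(line):
    """Split a pipe-delimited line into a dict of key=value fields."""
    fields = {}
    for part in line.split("|"):
        key, sep, value = part.strip().partition("=")
        if sep:  # keep only parts that really look like key=value
            fields[key] = value
    return fields


def parse_record(text):
    """Collect the identifier and the reference/mutated annotation of one record."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    record = {"identifier": lines[0].split("|")[0].strip()}
    for line in lines[1:]:
        if not line.startswith("LOC"):
            continue
        f = parse_fields(line)
        role = "reference" if f.get("fxn-class") == "reference" else "mutated"
        record[role] = {
            "allele": f["allele"],
            "frame": int(f["frame"]),
            "residue": f["residue"],
            "position": int(f["aa_position"]),
        }
    return record


# Hypothetical record layout, for illustration only.
SAMPLE = """
rs117308669 | human | snp
LOC | PAH | fxn-class=synonymous-codon | allele=G | frame=3 | residue=E | aa_position=66
LOC | PAH | fxn-class=reference | allele=A | frame=3 | residue=E | aa_position=66
"""

record = parse_record(SAMPLE)
```

Keeping the reference and the mutated annotation in separate sub-dictionaries makes it easy to compare the two residues afterwards.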

Results

We found the following silent mutations in dbSNP:

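| Identifier | AA-Position | Reference Triplet | Mutated Triplet | Reference Allele | Mutated Allele | Frame | Reference Residue | Mutated Residue |
|---|---|---|---|---|---|---|---|---|
| rs117308669 | 66 | GAA | GAG | A | G | 3 | E | E |
| rs75065106 | 258 | CTG | TTG | C | T | 1 | L | L |
| rs62651567 | 323 | ACA | ACG | A | G | 3 | T | T |
| rs62508648 | 367 | CTG | CTA | G | A | 3 | L | L |
| rs61747292 | 321 | CTC | CTT | C | T | 3 | L | L |
| rs59326968 | 426 | AAT | AAC | T | C | 3 | N | N |
| rs17852374 | 36 | TCA | TCG | A | G | 3 | S | S |
| rs1801152 | 414 | TAC | TAT | C | T | 3 | Y | Y |
| rs1801151 | 400 | AGG | CGG | A | C | 1 | R | R |
| rs1801150 | 399 | GTA | GTT | A | T | 3 | V | V |
| rs1801147 | 203 | TGC | TGT | C | T | 3 | C | C |
| rs1801146 | 137 | AGC | AGT | C | T | 3 | S | S |
| rs1801145 | 10 | GGC | GGG | C | G | 3 | G | G |
| rs1126758 | 232 | CAA | CAG | A | G | 3 | Q | Q |
| rs1042503 | 245 | GTG | GTA | G | A | 3 | V | V |
| rs772897 | 385 | CTG | CTC | G | C | 3 | L | L |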
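As a sanity check on the mapping, the AA positions listed above can be compared with the PAH protein sequence given on this page. The sketch below assumes that positions are 1-based; the function name `mismatches` is illustrative.

```python
# PAH protein sequence as listed on this page (452 residues).
PAH = (
    "MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV"
    "NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW"
    "FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM"
    "EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF"
    "RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA"
    "QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE"
    "KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR"
    "IEVLDNTQQLKILADSINSEIGILCSALQKIK"
)

# (identifier, AA position, reference residue) taken from the results table.
SILENT_SNPS = [
    ("rs117308669", 66, "E"), ("rs75065106", 258, "L"),
    ("rs62651567", 323, "T"), ("rs62508648", 367, "L"),
    ("rs61747292", 321, "L"), ("rs59326968", 426, "N"),
    ("rs17852374", 36, "S"), ("rs1801152", 414, "Y"),
    ("rs1801151", 400, "R"), ("rs1801150", 399, "V"),
    ("rs1801147", 203, "C"), ("rs1801146", 137, "S"),
    ("rs1801145", 10, "G"), ("rs1126758", 232, "Q"),
    ("rs1042503", 245, "V"), ("rs772897", 385, "L"),
]


def mismatches(sequence, snps):
    """Return the SNPs whose reference residue disagrees with the sequence.

    Assumption: positions are 1-based, so position p maps to sequence[p - 1].
    """
    return [(rs, pos, aa) for rs, pos, aa in snps if sequence[pos - 1] != aa]


problems = mismatches(PAH, SILENT_SNPS)
```

An empty `problems` list means that every listed position carries the stated reference residue; an off-by-one error in the position column would show up here as a list of mismatching entries.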

Comparing the annotation of HGMD and SNPdb

Alignment of the reference sequences

We decided to use the PAH sequence from UniProt as the common reference (see UniProt).

  • MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
    NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
    FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
    EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
    RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
    QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
    KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
    IEVLDNTQQLKILADSINSEIGILCSALQKIK

Alignment with the reference sequence used in HGMD

The resulting alignment shows 100% identity without any gaps. The two reference sequences are therefore identical, and the alignment is effectively a "self-alignment".
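For two sequences of equal length, a gap-free 100% identity is equivalent to plain string equality once whitespace is removed. The sketch below illustrates this on the first 102 residues only. It is not the alignment tool we used, and the function name `percent_identity` is illustrative.

```python
def percent_identity(a, b):
    """Percent identity of two ungapped sequences, ignoring whitespace."""
    a = "".join(a.split())
    b = "".join(b.split())
    if not a or len(a) != len(b):
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# First 102 residues of the NM_000277.1 translation, keeping the blank
# that separates the two fragments on this page.
NM_FRAGMENT = (
    "MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEEN "
    "DVNLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDI"
)

# The same 102 residues from the UniProt listing.
UNIPROT_FRAGMENT = (
    "MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV"
    "NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDI"
)

identity = percent_identity(NM_FRAGMENT, UNIPROT_FRAGMENT)
```

Sequences of different length would need a real alignment with gaps, which this shortcut deliberately does not attempt.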

Alignment with the reference sequence used in SNPdb

Mapping

Discussion