Difference between revisions of "Task 5: Mapping point mutations"

Revision as of 11:42, 19 June 2011

1 Task description
2 SNP databases
- 2.1 HGMD
  - 2.1.1 Reference Sequence
  - 2.1.2 SNPs
- 2.2 dbSNP
3 Comparing the annotation of HGMD and SNPdb
- 3.1 Alignment of the reference sequences
  - 3.1.1 Alignment with the reference sequence used in HGMD
  - 3.1.2 Alignment with the reference sequence used in SNPdb
- 3.2 Mapping
4 Discussion
- 4.1 dbSNP

Task description

A detailed task description can be found here: Mapping point mutations

SNP databases

HGMD

HGMD
We searched HGMD for PAH.
HGMD Professional lists 429 missense/nonsense mutations for this gene.

There are several mutation types known for PAH:

Missense/nonsense
Splicing
Regulatory
Small deletions
Small insertions
Small indels
Gross deletions
Gross insertions/duplications
Complex rearrangements

HGMD knows one additional mutation category, which is not recorded for PAH:

Repeat variations

Reference Sequence

The reference sequence is given by the accession number NM_000277.1, whose entry contains the following amino acid sequence:

MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEEN DVNLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDI GATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQ FADIAYNYRHGQPIPRVEYMEEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCG FHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPM YTPEPDICHELLGHVPLFSDRSFAQFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLC KQGDSIKAYGAGLLSSFGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESF NDAKEKVRNFAATIPRPFSVRYDPYTQRIEVLDNTQQLKILADSINSEIGILCSALQK IK

SNPs

dbSNP

Clarifications on insertions and deletions

The impact of insertions and deletions in coding regions can differ. Particularly damaging for a protein are insertions or deletions that introduce a frame shift. This happens when the length of the insertion or deletion is not divisible by 3, in other words when Len mod 3 != 0, where Len is the length of the insertion or deletion in nucleotides.
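
The Len mod 3 rule can be written down as a minimal sketch. The function name causes_frameshift is hypothetical and not part of our Perl script; it only restates the rule above.

```python
def causes_frameshift(indel_length: int) -> bool:
    """Return True if an indel of this many nucleotides shifts the reading frame.

    Hypothetical helper: it encodes only the rule Len mod 3 != 0.
    """
    if indel_length <= 0:
        raise ValueError("indel length must be a positive number of nucleotides")
    return indel_length % 3 != 0


# Lengths 3 and 6 keep the reading frame, all others shift it.
for length in (1, 2, 3, 4, 6):
    print(length, causes_frameshift(length))
```

An in-frame insertion or deletion (Len mod 3 = 0) still adds or removes whole residues, so it is not necessarily harmless; the check only tells whether the downstream reading frame is preserved.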

Methodology

Retrieving Data

Figure 1: Query and result of our silent mutation search

We searched in dbSNP for silent mutations in coding regions. This means we only considered those SNPs which alter the triplet but not the amino acid.

To do so, we used NCBI's Entrez interface, which is accessible at this URL:

http://www.ncbi.nlm.nih.gov/sites/entrez?Db=snp

The advantage of this Entrez interface is that we can construct arbitrarily complex queries to restrict our result set.

We constructed the following query to search for SNPs that are considered silent in the coding regions of the human PAH gene (see Figure 1):

"synonymous-codon"[Function_Class] AND PAH[GENE] AND "human"[ORGN] AND "snp"[SNP_CLASS]

Results of this query can be accessed directly via the following URL:

http://www.ncbi.nlm.nih.gov/sites/entrez?Db=snp&Cmd=DetailsSearch&Term=%22synonymous-codon%22%5BFunction_Class%5D+AND+PAH%5BGENE%5D+AND+%22human%22%5BORGN%5D+AND+%22snp%22%5BSNP_CLASS%5D
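
The result URL is simply the query above, URL-encoded and passed as the Term parameter. As a minimal sketch (the function name build_search_url is hypothetical; the parameter names Db, Cmd and Term are taken from the URL above, and whether this endpoint still responds is not checked here), it could be rebuilt like this:

```python
from urllib.parse import urlencode

BASE_URL = "http://www.ncbi.nlm.nih.gov/sites/entrez"
QUERY = (
    '"synonymous-codon"[Function_Class] AND PAH[GENE] '
    'AND "human"[ORGN] AND "snp"[SNP_CLASS]'
)


def build_search_url(term: str) -> str:
    """Encode an Entrez query term into a dbSNP search URL (hypothetical helper)."""
    # urlencode escapes quotes and brackets and turns spaces into '+'.
    params = {"Db": "snp", "Cmd": "DetailsSearch", "Term": term}
    return BASE_URL + "?" + urlencode(params)


print(build_search_url(QUERY))
```

Changing the gene name or the function class in QUERY is then enough to generate the corresponding URL for another search.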

We downloaded the results in FlatFile format, which seemed to be the simplest format to process and contains almost all the information we need.

Processing Data

We wrote a Perl script that parses the relevant information out of the FlatFile downloaded in the previous step. The current version extracts the following fields: identifier, reference and mutated triplet, reference and mutated allele, frame of the mutation, reference and mutated residue, and residue position.

All annotations we retrieved refer to the mRNA sequence with the GenBank accession number NM_000277.1. This mRNA contains the coding sequence of our PAH protein (accession number NP_000268.1), which is exactly the same protein, and therefore has the same sequence, as the UniProt entry for PAH (accession number P00439). Thus, no conversion to other residue coordinates was required to place these mutations on our mutation map.

With our Perl script at hand (the code is accessible here: dbSNP Silent Mutations Parser) and the results in FlatFile format, the only thing still missing is the CDS sequence of NM_000277.1, which we need to look up the triplet used for each residue. This CDS sequence can be found in the CCDS database of NCBI under the accession number CCDS9092.1.
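
How the CDS is used to obtain the triplet of a residue and to check that a SNP is silent can be illustrated with a short Python sketch. This is not our Perl parser: all names (codon_at, apply_snp, is_silent, toy_cds) are hypothetical, the standard genetic code is assumed, and "frame" is assumed to be the 1-based position of the SNP within the codon.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codon_at(cds: str, residue_position: int) -> str:
    """Return the triplet encoding the residue at a 1-based position."""
    start = 3 * (residue_position - 1)
    codon = cds[start:start + 3].upper()
    if residue_position < 1 or len(codon) != 3:
        raise ValueError("residue position lies outside the CDS")
    return codon


def apply_snp(codon: str, frame: int, mutated_allele: str) -> str:
    """Replace the base at codon position `frame` (1, 2 or 3, an assumption)."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    index = frame - 1
    return codon[:index] + mutated_allele.upper() + codon[index + 1:]


def is_silent(reference_codon: str, mutated_codon: str) -> bool:
    """A SNP is silent if both triplets encode the same amino acid."""
    return CODON_TABLE[reference_codon] == CODON_TABLE[mutated_codon]


# Toy CDS (not the PAH sequence): ATG GAA CTG encodes M, E, L.
toy_cds = "ATGGAACTG"
reference = codon_at(toy_cds, 2)
mutated = apply_snp(reference, 3, "G")
print(reference, mutated, is_silent(reference, mutated))
```

With the real data, toy_cds would be replaced by the CDS read from the CCDS FASTA file, and position, frame and allele by the values parsed from the FlatFile.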

With all data at hand, we can run our Perl script in different modes to produce different outputs. Currently, the following three outputs can be generated with the following commands:

Generates a WikiTable of found silent mutations: ./parse.pl wiki snp_result.txt ccds_pah.fasta
Generates CSVs which can be used by our mapping tool to generate the mutation map: ./parse.pl map snp_result.txt ccds_pah.fasta
Generates a human readable one liner for each silent mutation: ./parse.pl list snp_result.txt ccds_pah.fasta

Results

We found the following silent mutations in dbSNP:

Identifier	AA-Position	Reference Triplet	Mutated Triplet	Reference Allele	Mutated Allele	Frame	Reference Residue	Mutated Residue
rs117308669	66	GAA	GAG	A	G	3	E	E
rs75065106	258	CTG	TTG	C	T	1	L	L
rs62651567	323	ACA	ACG	A	G	3	T	T
rs62508648	367	CTG	CTA	G	A	3	L	L
rs61747292	321	CTC	CTT	C	T	3	L	L
rs59326968	426	AAT	AAC	T	C	3	N	N
rs17852374	36	TCA	TCG	A	G	3	S	S
rs1801152	414	TAC	TAT	C	T	3	Y	Y
rs1801151	400	AGG	CGG	A	C	1	R	R
rs1801150	399	GTA	GTT	A	T	3	V	V
rs1801147	203	TGC	TGT	C	T	3	C	C
rs1801146	137	AGC	AGT	C	T	3	S	S
rs1801145	10	GGC	GGG	C	G	3	G	G
rs1126758	232	CAG	CAG	A	G	3	Q	Q
rs1042503	245	GTG	GTA	G	A	3	V	V
rs772897	385	CTG	CTC	G	C	3	L	L

Comparing the annotation of HGMD and SNPdb

Alignment of the reference sequences

We decided to use the PAH sequence from UniProt (see UniProt) as the reference:

MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR IEVLDNTQQLKILADSINSEIGILCSALQKIK

Alignment with the reference sequence used in HGMD

The resulting alignment shows 100% identity without any gaps. It is therefore effectively a "self-alignment".
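
For two sequences that align without gaps, the identity can be recomputed by a simple position-by-position comparison. The following is a minimal sketch under that assumption; it does not replace an alignment tool, and the function name percent_identity is hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity of two gapless, equal-length sequences.

    Whitespace is removed first, because the sequences on this page are
    wrapped into blocks separated by spaces.
    """
    a = "".join(seq_a.split()).upper()
    b = "".join(seq_b.split()).upper()
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# First ten residues of the two reference sequences, wrapped differently.
print(percent_identity("MSTAV LENPG", "MSTAVLENPG"))
```

Sequences of different length raise an error on purpose, since they would need a real alignment with gaps.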

Alignment with the reference sequence used in SNPdb

Mapping

Discussion

dbSNP

At first, we were quite surprised that we found only 16 silent mutations for the PAH gene. Normally, silent mutations are expected to occur more frequently in coding regions than missense/nonsense mutations, because they do not change the protein and are therefore largely exempt from purifying selection. We suspect that the small number of known silent mutations in PAH is a result of a lack of data.

Allele Mutation	Absolute Frequency	Relative Frequency
A->T	1	0.0625
A->C	1	0.0625
A->G	4	0.25
T->A	0	0
T->C	1	0.0625
T->G	0	0
C->A	0	0
C->T	5	0.3125
C->G	1	0.0625
G->A	2	0.125
G->T	0	0
G->C	1	0.0625
Purine->Purine	6	0.375
Purine->Pyrimidine	3	0.1875
Pyrimidine->Pyrimidine	6	0.375
Pyrimidine->Purine	1	0.0625
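
The frequencies in this table can be recomputed from the reference and mutated alleles listed in the Results table. The following is a minimal sketch; the allele pairs are copied from that table in the same row order, while the names SILENT_SNPS, base_class and tally are hypothetical.

```python
from collections import Counter

PURINES = {"A", "G"}

# (reference allele, mutated allele) for the 16 silent SNPs, in table order.
SILENT_SNPS = [
    ("A", "G"), ("C", "T"), ("A", "G"), ("G", "A"),
    ("C", "T"), ("T", "C"), ("A", "G"), ("C", "T"),
    ("A", "C"), ("A", "T"), ("C", "T"), ("C", "T"),
    ("C", "G"), ("A", "G"), ("G", "A"), ("G", "C"),
]


def base_class(base: str) -> str:
    """Classify a nucleotide as purine (A, G) or pyrimidine (C, T)."""
    return "Purine" if base in PURINES else "Pyrimidine"


def tally(pairs):
    """Count substitutions per allele pair and per base class."""
    by_allele = Counter(f"{ref}->{mut}" for ref, mut in pairs)
    by_class = Counter(
        f"{base_class(ref)}->{base_class(mut)}" for ref, mut in pairs
    )
    return by_allele, by_class


by_allele, by_class = tally(SILENT_SNPS)
total = len(SILENT_SNPS)
for name, count in list(by_allele.items()) + list(by_class.items()):
    print(f"{name}\t{count}\t{count / total}")
```

Substitutions that never occur (for example T->A) are simply absent from the counters and correspond to the rows with frequency 0.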

