Mapping SNPs HEXA

From Bioinformatikpedia
Revision as of 17:36, 17 June 2011 by Uskat (talk | contribs) (Statistical comparison of HGMD and SNP-DB)

Methods

First of all, we had to parse the HGMD database and the DB-SNP database.

HGMD

We logged in, searched for Tay-Sachs disease and chose one entry of HEXA. In our case there were two entries with identical content. We only looked at the missense/nonsense mutations, of which 68 are annotated in HGMD. We copied the web page into a text file and then wrote a short parser, which extracts the codon change, the amino acid change and the codon number.
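The parsing step can be sketched as follows. This is only a minimal illustration, not the parser we actually used: the row layout (`<codon>-<codon> <AA>-<AA> <codon number>`), the name `parse_hgmd` and the `Mutation` record are assumptions made for this example, and the real page text may need a different pattern.

```python
import re
from typing import List, NamedTuple


class Mutation(NamedTuple):
    codon_from: str    # wild-type codon, may carry a lowercase neighbouring base
    codon_to: str      # mutated codon
    aa_from: str       # wild-type amino acid (three-letter code)
    aa_to: str         # mutated amino acid, or "Term" for a stop
    codon_number: int


# ASSUMED row layout (hypothetical): "TGGc-TGA Trp-Term 26"
ROW = re.compile(
    r"(?P<cf>[acgtACGT]{3,4})-(?P<ct>[ACGT]{3})\s+"
    r"(?P<af>[A-Z][a-z]{2})-(?P<at>[A-Z][a-z]{2,3})\s+"
    r"(?P<num>\d+)"
)


def parse_hgmd(text: str) -> List[Mutation]:
    """Collect every line of the copied page that matches the assumed row layout."""
    mutations = []
    for line in text.splitlines():
        match = ROW.search(line)
        if match:  # header lines and free text simply do not match
            mutations.append(
                Mutation(match["cf"], match["ct"], match["af"],
                         match["at"], int(match["num"]))
            )
    return mutations


sample = """Codon change Amino acid change Codon number
TGGc-TGA Trp-Term 26
CTT-CGT Leu-Arg 39
"""
for mutation in parse_hgmd(sample):
    print(mutation)
```

Lines that do not fit the pattern are skipped silently, so it is worth checking that the number of parsed entries matches the number of mutations shown on the page.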

DBSNP

Parsing the DBSNP output was more complicated than parsing the HGMD output. First, we searched for HEXA in this database and kept only the SNPs that occur in human. We used the graphical output and again copied and pasted the page into a text file. An entry in DBSNP has the following structure: first comes the name of the SNP, then the sequence and the graphical representation of the sequence. The next line contains the allele origin, the clinical relevance and finally the annotations of the mutation. TBA...

Comparison of mutations in HGMD and SNP-DB

Mutations annotated in both databases

Mutations that are not silent and cause a phenotype

SNP-DB Identifier  Codon position  Mutation position  Amino acids  Codons
rs121907964        26              3                  Trp -> TER   TGGc -> TGA
rs121907979        39              2                  Leu -> Arg   CTT -> CGT
rs121907975        127             1                  Leu -> Phe   aCTC -> TTC
rs121907975        127             2                  Leu -> Arg   CTC -> CGC
rs121907962        137             1                  Arg -> TER   cCGA -> TGA
rs121907972        170             1                  Arg -> Trp   cCGG -> TGG
rs121907957        170             2                  Arg -> Gln   CGG -> CAG
rs28941770         178             2                  Arg -> His   CGC -> CAC
rs28941770         178             2                  Arg -> Leu   CGC -> CTC
rs121907953        178             1                  Arg -> Cys   tCGC -> TGC
rs121907969        180             3                  Tyr -> TER   TACc -> TAG
rs28941771         180             1                  Tyr -> His   tTAC -> CAC
rs121907973        197             2                  Lys -> Thr   AAA -> ACA
rs1800429          200             1                  Val -> Met   cGTG -> ATG
rs121907976        204             2                  His -> Arg   CAT -> CGT
rs121907961        210             2                  Ser -> Phe   TCC -> TTC
rs121907974        211             2                  Phe -> Ser   TTC -> TCC
rs121907970        247             1                  Arg -> Trp   aCGG -> TGG
rs121907959        250             1                  Gly -> Ser   gGGT -> AGT
rs121907959        250             2                  Gly -> Asp   GGT -> GAT
rs121907959        250             2                  Gly -> Val   GGT -> GTT
rs121907971        258             1                  Asp -> His   tGAC -> CAC
rs121907954        269             1                  Gly -> Ser   aGGT -> AGT
rs121907954        269             2                  Gly -> Asp   GGT -> GAT
rs121907977        301             2                  Met -> Arg   ATG -> AGG
rs121907967        329             2                  Trp -> TER   TGG -> TAG
rs121907963        393             1                  Arg -> TER   gCGA -> TGA
rs121907958        420             3                  Trp -> Cys   TGGt -> TGC
rs121907958        420             3                  Trp -> Cys   TGGt -> TGT
rs28940871         451             1                  Leu -> Val   tCTG -> GTG
rs121907978        454             2                  Gly -> Asp   GGT -> GAT
rs121907978        454             1                  Gly -> Ser   tGGT -> AGT
rs121907981        474             3                  Trp -> Cys   TGGc -> TGC
rs121907952        482             1                  Glu -> Lys   cGAA -> AAA
rs121907968        485             1                  Trp -> Arg   gTGG -> CGG
rs121907966        499             1                  Arg -> Cys   aCGT -> TGT
rs121907956        499             2                  Arg -> His   CGT -> CAT
rs28942071         504             1                  Arg -> Cys   cCGC -> TGC
rs121907955        504             2                  Arg -> His   CGC -> CAC
rs4777502          506             3                  Glu -> Asp   GAA -> GAC
rs4777502          506             3                  Glu -> Asp   GAA -> GAT
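The columns of this table are linked: the codon pair determines both the mutation position and the amino acid change. The following sketch shows one way to derive them from a codon pair. It assumes that the uppercase letters form the codon and that a lowercase letter is a neighbouring base outside the codon (our reading of the notation); the function name `describe` is made up for this example. Amino acids are given as one-letter codes, with `*` for a stop (TER in the table).

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def describe(codon_from: str, codon_to: str):
    """Return (mutation position 1..3, wild-type AA, mutant AA, is_silent)."""
    # ASSUMPTION: lowercase letters are flanking bases and not part of the codon.
    ref = "".join(base for base in codon_from if base.isupper())
    alt = "".join(base for base in codon_to if base.isupper())
    if len(ref) != 3 or len(alt) != 3:
        raise ValueError("expected exactly three uppercase bases per codon")
    differences = [i + 1 for i, (x, y) in enumerate(zip(ref, alt)) if x != y]
    if len(differences) != 1:
        raise ValueError("expected exactly one changed base")
    aa_ref, aa_alt = CODON_TABLE[ref], CODON_TABLE[alt]
    return differences[0], aa_ref, aa_alt, aa_ref == aa_alt


print(describe("TGGc", "TGA"))  # rs121907964: position 3, Trp -> stop
print(describe("CTT", "CGT"))   # rs121907979: position 2, Leu -> Arg
```

A check like this also flags rows whose annotated amino acids do not agree with their codons.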

Graphical representation:

 MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLD
MTSSRLWFSLLLAAAFAGRATALWP!PQNFQTSDQRYVRYPNNFQFQYDVSSAAQPGCSVLD


EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD
EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD


QCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLS
QCLFLSETVWGAL!GLETFSQLVWKSAEGTFFINKTEIEDFPRFPHWGLLLDTSHH!LPLS


SILDTLDVMAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY
SILDTLDVMAYNTLNMFHWRLVDDPFSPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY


ARLRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEFMSTFFL
AWLRSIRVLAEFHTPGHTLSWGPSIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEFRSTFFL


EVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG
EVSSVFPDFYLHLGGDEVDFTC!KSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG


KGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNYMKELELVTKAGFRALLSAPWYLNRISYG
KGYVVWQEVFDNKVKIQPDTIIQVW!EDIPVNYMKELELVTKAGFRALLSAPCYLNRISYG


PDWKDFYIVEPLAFEGTPEQKALVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKL
PDWKDFYIVEPLAFEGTPEQKAVVIDGEACMWGEYVDNTNLVPRLCPRAGAVAKRLRSNKL


TSDLTFAYERLSHFRCELLRRGVQAQPLNVGFCEQEFEQT
TSDLTFAYECLSHFCCDLLRRGVQAQPLNVGFCEQEFEQT

Legend: Non-silent mutation, Silent mutation, Wrong AA in mutation annotation
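In each pair of lines above, the first line is the wild-type protein sequence and the second line carries the mutations, with `!` marking a stop. A minimal sketch of how such a mutated line can be produced is given below. The function name `apply_substitutions` and the dictionary layout are assumptions for this example, and only the first 40 residues are used to keep it short.

```python
# First 40 residues of the wild-type sequence shown above.
WILD_TYPE_START = "MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLY"


def apply_substitutions(protein: str, substitutions: dict) -> str:
    """Apply {position: (expected_residue, new_residue)} to a protein sequence.

    Positions are 1-based, as in the tables. The expected residue is checked
    so that a wrong amino acid in an annotation is noticed, not applied.
    """
    residues = list(protein)
    for position, (expected, new) in substitutions.items():
        found = residues[position - 1]
        if found != expected:
            raise ValueError(
                f"position {position}: expected {expected}, found {found}"
            )
        residues[position - 1] = new
    return "".join(residues)


# Trp26 -> stop (shown as '!') and Leu39 -> Arg, taken from the table above.
mutated = apply_substitutions(WILD_TYPE_START, {26: ("W", "!"), 39: ("L", "R")})
print(WILD_TYPE_START)
print(mutated)
```

Because the dictionary holds one entry per position, a position with several annotated substitutions (for example codon 178) can show only one of them, just as in the representation above.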




Mutations annotated only in HGMD

not found

Mutations annotated only in SNP-DB

Mutations annotated only in SNP-DB and not silent (the position in the codon is annotated and certain):

Here we list all mutations which are annotated in the SNP-DB, are not silent and are not annotated in the HGMD database. Some of these mutations have a very detailed NP annotation; others are annotated in less detail, so we had to map them ourselves. The detailed list of the mutations that we mapped can be found [here].

SNP-DB Identifier  Codon position  Mutation position  Amino acids  Codons
rs4777505          29              2                  Asn -> Ser   AAC -> AGC
rs61731240         179             1                  His -> Asp   CAT -> GAT
rs3743230          208             1                  Asn -> Asp   AAC -> GAC
rs61747114         248             1                  Leu -> Phe   CTT -> TTT
rs1054374          293             2                  Ser -> Ile   AGT -> ATT
rs1800430          399             1                  Asn -> Asp   AAC -> GAC
rs1800431          436             1                  Ile -> Val   ATA -> GTA
rs121907982        456             2                  Tyr -> Ser   TAT -> TCT
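Splitting the mutations into "both databases", "only HGMD" and "only SNP-DB" amounts to set operations on a common key. The sketch below uses the tuple (codon number, wild-type amino acid, mutant amino acid) as that key, which is an assumption made for this example; it also assumes that both sources already use the same spelling for a stop codon. The four entries are taken from the tables on this page, and the function name `compare` is hypothetical.

```python
def compare(hgmd: set, dbsnp: set) -> dict:
    """Partition mutations keyed by (codon number, AA from, AA to)."""
    only_dbsnp = dbsnp - hgmd
    # A mutation is silent when the amino acid does not change.
    non_silent = {m for m in only_dbsnp if m[1] != m[2]}
    return {
        "both": hgmd & dbsnp,
        "only_hgmd": hgmd - dbsnp,
        "only_dbsnp_non_silent": non_silent,
        "only_dbsnp_silent": only_dbsnp - non_silent,
    }


# Small excerpt of the data; ASSUMES stop codons are spelled "TER" in both sets.
hgmd = {(26, "Trp", "TER"), (39, "Leu", "Arg")}
dbsnp = {(26, "Trp", "TER"), (39, "Leu", "Arg"),
         (29, "Asn", "Ser"), (3, "Ser", "Ser")}

for name, group in compare(hgmd, dbsnp).items():
    print(name, sorted(group))
```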

Graphical representation:

 MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLD
MTSSRLWFSLLLAAAFAGRATALWPWPQSFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLD


EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD
EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD


QCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLS
QCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRDYLPLS


SILDTLDVMAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY
SILDTLDVMAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY


ARLRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEFMSTFFL
ARFRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPILNNTYEFMSTFFL


EVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG
EVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG


KGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNYMKELELVTKAGFRALLSAPWYLNRISYG
KGYVVWQEVFDNKVKIQPDTIIQVWREDIPVDYMKELELVTKAGFRALLSAPWYLNRISYG


PDWKDFYIVEPLAFEGTPEQKALVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKL
PDWKDFYVVEPLAFEGTPEQKALVIGGSACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKL


TSDLTFAYERLSHFRCELLRRGVQAQPLNVGFCEQEFEQT
TSDLTFAYERLSHFRCELLRRGVQAQPLNVGFCEQEFEQT

Legend: Non-silent mutation, Silent mutation, Wrong AA in mutation annotation

silent mutation

Because these mutations are not annotated in the HGMD database, they must be silent and do not lead to a phenotype.
These mutations are poorly annotated in the SNP-DB. Because the original codon has to code for the amino acid that occurs in the protein sequence, we rotated the codons we found: we tried the codon with the mutation at position one, at position two and at position three. We also reversed each codon and built the complementary sequence of both orientations. The detailed result can be seen [here].
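The rotation idea can be sketched as follows: place the SNP at codon position one, two and three on the given strand and on the reverse complement, and keep every codon that translates to the amino acid found in the protein sequence. This is an illustration under assumptions, not our original script. The function name `candidate_codons` is made up, only the reverse complement is shown as the second orientation, and the example context `GACCTGA` is invented and is not real flanking sequence of any HEXA SNP.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def candidate_codons(context: str, snp_index: int, expected_aa: str):
    """List (strand, position in codon, codon) that encode expected_aa.

    context      -- nucleotides around the SNP (wild-type allele, uppercase)
    snp_index    -- 0-based index of the SNP within context
    expected_aa  -- one-letter amino acid at this site in the protein
    """
    strands = {
        "forward": (context, snp_index),
        # On the reverse complement the SNP index is mirrored.
        "reverse complement": (context.translate(COMPLEMENT)[::-1],
                               len(context) - 1 - snp_index),
    }
    hits = []
    for strand, (sequence, index) in strands.items():
        for position in (1, 2, 3):
            start = index - (position - 1)
            if start < 0 or start + 3 > len(sequence):
                continue  # not enough flanking sequence for this frame
            codon = sequence[start:start + 3]
            if CODON_TABLE[codon] == expected_aa:
                hits.append((strand, position, codon))
    return hits


# Invented context with the SNP at the 'T' (index 4); proline is expected.
print(candidate_codons("GACCTGA", 4, "P"))
```

If several frames or strands give the expected amino acid, all of them are returned, which corresponds to listing more than one possible codon for the same SNP.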

If more than one nucleotide combination codes for the amino acid in the protein sequence, we list all of them in the following table, provided they are silent mutations. Non-silent mutations are listed only if no other possible mutation exists for this position.

The following combinations are possible:

SNP-DB Identifier  Codon position  Mutation position  Amino acids  Codons       Translation
rs1800428          3               3                  Ser -> Ser   AGC -> AGT   forward
rs11551324         109             3                  Thr -> Thr   ACC -> ACT   forward
rs28942072         324             3                  Val -> Val   GTT -> GGA   forward
rs28942072         324             3                  Val -> Val   GTC -> GTT   forward
rs34085965         446             1                  Pro -> Pro   CCT -> CCC   reverse complement
rs4777502          506             3                  Glu -> Glu   GAG -> GAA   forward

Graphical representation:

 MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLD
MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLD


EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD
EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD


QCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLS
QCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLS


SILDTLDVMAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY
SILDTLDVMAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY


ARLRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEFMSTFFL
ARLRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEFMSTFFL


EVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG
EVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG


KGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNYMKELELVTKAGFRALLSAPWYLNRISYG
KGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNYMKELELVTKAGFRALLSAPWYLNRISYG


PDWKDFYIVEPLAFEGTPEQKALVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKL
PDWKDFYIVEPLAFEGTPEQKALVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKL


TSDLTFAYERLSHFRCELLRRGVQAQPLNVGFCEQEFEQT
TSDLTFAYERLSHFRCELLRRGVQAQPLNVGFCEQEFEQT

Legend: Non-silent mutation, Silent mutation, Wrong AA in mutation annotation


Summary

Graphical representation:

 MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLD
MTSSRLWFSLLLAAAFAGRATALWP!PQSFQTSDQRYVRYPNNFQFQYDVSSAAQPGCSVLD


EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD
EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD


QCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLS
QCLFLSETVWGAL!GLETFSQLVWKSAEGTFFINKTEIEDFPRFPHWGLLLDTSHD!LPLS


SILDTLDVMAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY
SILDTLDVMAYNTLNMFHWRLVDDPFSPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY


ARLRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEFMSTFFL
AWFRSIRVLAEFHTPGHTLSWGPSIPGLLTPCYSGSEPSGTFGPVNPILNNTYEFRSTFFL


EVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG
EVSSVFPDFYLHLGGDEVDFTC!KSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG


KGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNYMKELELVTKAGFRALLSAPWYLNRISYG
KGYVVWQEVFDNKVKIQPDTIIQVW!EDIPVDYMKELELVTKAGFRALLSAPCYLNRISYG


PDWKDFYIVEPLAFEGTPEQKALVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKL
PDWKDFYVVEPLAFEGTPEQKAVVIDGSACMWGEYVDNTNLVPRLCPRAGAVAKRLRSNKL


TSDLTFAYERLSHFRCELLRRGVQAQPLNVGFCEQEFEQT
TSDLTFAYECLSHFCCELLRRGVQAQPLNVGFCEQEFEQT

Legend: Non-silent mutation, Silent mutation, Wrong AA in mutation annotation

Statistical comparison of HGMD and SNP-DB

To analyse the results from the two databases, we carried out a statistical comparison.

First, we compared the resulting tables (see above) by the position of the mutation within the triplet. To this end we created a bar plot showing, for each mutation position, its relative frequency in percent.
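The percentages behind such a bar plot can be computed as in the sketch below. The position lists are tallied from the mutation-position column of the two non-silent tables above (17, 17 and 7 entries for the mutations found in both databases; 5, 3 and 0 for the mutations found only in SNP-DB). The function name `position_percentages` is an assumption for this example, and the plotting itself is left out.

```python
from collections import Counter


def position_percentages(positions):
    """Return {codon position: percentage of mutations} for positions 1 to 3."""
    if not positions:
        raise ValueError("no mutations given")
    counts = Counter(positions)
    return {p: 100.0 * counts[p] / len(positions) for p in (1, 2, 3)}


# Tallied from the tables above (one entry per table row).
both_databases = [1] * 17 + [2] * 17 + [3] * 7
only_snp_db = [1] * 5 + [2] * 3

for label, positions in (("both databases", both_databases),
                         ("only SNP-DB", only_snp_db)):
    percentages = position_percentages(positions)
    print(label, {p: round(value, 1) for p, value in percentages.items()})
```

Counting table rows means that a SNP with two alternative alleles at the same position contributes twice; counting each identifier once would be an equally reasonable choice and gives slightly different percentages.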

The first case covers the mutations found in both databases. There are as many mutations at the first position as at the second position, while mutations at the third position are much rarer. The reason is that HGMD contains no silent mutations, so the overlap of both databases contains none either. A change at the third position of a triplet often results in a silent mutation, so only a few third-position mutations change the amino acid. This explains why the third position is rare while the other two positions are equally frequent.

The second case shows the position frequencies of the mutations that are found only in DBSNP. Here mutations at position one are more frequent than at position two, and there are no mutations at position three. These DBSNP-only mutations are not silent, which can explain the absence of position three (mutations at the third position of a triplet are very often silent). There is probably no special reason why the first position is more common than the second.


Figure 1: Bar plot of the mutation position for the different tables