Glucocerebrosidase mapping snps

From Bioinformatikpedia
Revision as of 23:54, 25 September 2011 by Braunt

General

HGMD

[http://www.hgmd.cf.ac.uk/ac/ HGMD], the Human Gene Mutation Database, contains germline mutations that are linked to human diseases. It distinguishes several types of mutations:

  • missense/nonsense: a codon is changed so that it codes for a different amino acid (missense) or for a premature stop codon (nonsense)
  • splicing: mutation that affects mRNA splicing
  • regulatory: mutation affecting the regulation of gene expression
  • small/gross deletions: mutation that deletes nucleotides
  • small/gross insertions: mutation that inserts nucleotides
  • small indels: combined insertion and deletion at the same site
  • duplications: parts of the sequence are duplicated
  • complex rearrangements: part of the sequence is placed somewhere else
  • repeat variations: the number of copies of a repeated sequence motif changes

dbSNP

The Single Nucleotide Polymorphism Database (dbSNP) from NCBI and the National Human Genome Research Institute (NHGRI) was established in 1998. <ref>http://en.wikipedia.org/wiki/DbSNP</ref> It contains several types of variation for 55 organisms, including Homo sapiens:

  • SNPs (single nucleotide polymorphisms)
  • MNPs (multinucleotide polymorphisms)
  • small deletions
  • small insertions
  • small indels
  • short tandem repeats (STRs)

HGMD: Mutations for GBA

Overview

To obtain the different mutation types for GBA, the gene whose mutations cause Gaucher disease, HGMD was searched for GBA. The number of mutations of each type is shown in the table below.

Mutation type                                    Number of mutations
missense/nonsense                                236
splicing                                         13
regulatory                                       0
small deletions                                  23
small insertions                                 13
small indels                                     2
gross deletions                                  3
gross insertions/duplications                    0
complex rearrangements                           13
repeat variations                                0
public total (HGMD Professional 2011.1 total)    303 (353)



Here the missense/nonsense mutations are of interest, as they change the amino acid sequence. Such single point mutations seem to be responsible for Gaucher disease, so the analysis focuses on them.

Missense/nonsense mutations given for GBA

The following table provides a detailed overview of the 236 missense/nonsense mutations found in GBA:

Codon change Amino acid change Codon number
AAG-AGG Lys-Arg -27
TGGg-TGA Trp-Term -4
cGGC-AGC Gly-Ser 10
AGC-ATC Ser-Ile 12
gGTG-ATG Val-Met 15
gGTG-CTG Val-Leu 15
TGT-TCT Cys-Ser 16
tGAC-AAC Asp-Asn 24
tGGT-AGT Gly-Ser 35
cTTC-GTC Phe-Val 37
tGAG-AAG Glu-Lys 41
AGT-AAT Ser-Asn 42
ACA-ATA Thr-Ile 43
GGG-GAG Gly-Glu 46
gCGA-TGA Arg-Term 47
aCGG-TGG Arg-Trp 48
CGG-CAG Arg-Gln 48
CTA-CCA Leu-Pro 66
aCAG-TAG Gln-Term 73
gAAG-TAG Lys-Term 74
GTG-GCG Val-Ala 78
AAGg-AAC Lys-Asn 79
ATG-ACG Met-Thr 85
tGCT-ACT Ala-Thr 90
CTT-CGT Leu-Arg 105
TCG-TTG Ser-Leu 107
cTTC-GTC Phe-Val 109
tGAA-AAA Glu-Lys 111
GGA-GAA Gly-Glu 113
tAAC-GAC Asn-Asp 117
ATC-ACC Ile-Thr 119
ATC-AGC Ile-Ser 119
cCGG-TGG Arg-Trp 120
CGG-CAG Arg-Gln 120
GTA-GCA Val-Ala 121
aCCC-TCC Pro-Ser 122
CCC-CTC Pro-Leu 122
ATG-ACG Met-Thr 123
cATG-GTG Met-Val 123
GAC-GTC Asp-Val 127
cCGC-TGC Arg-Cys 131
CGC-CTC Arg-Leu 131
ACC-ATC Thr-Ile 134
cACC-CCC Thr-Pro 134
TATg-TAG Tyr-Term 135
GCA-GAA Ala-Glu 136
tGAT-CAT Asp-His 140
GAA-GCA Glu-Ala 152
AAGa-AAT Lys-Asn 157
cAAG-CAG Lys-Gln 157
aCCC-ACC Pro-Thr 159
CCC-CTC Pro-Leu 159
ATT-AAT Ile-Asn 161
ATT-AGT Ile-Ser 161
CAC-CCC His-Pro 162
cCGA-TGA Arg-Term 163
cCAG-TAG Gln-Term 169
CGT-CCT Arg-Pro 170
gCGT-TGT Arg-Cys 170
TCA-TGA Ser-Term 173
aCTC-TTC Leu-Phe 174
CTC-CCC Leu-Pro 174
GCC-GAC Ala-Asp 176
cCCC-TCC Pro-Ser 178
TGG-TAG Trp-Term 179
gACA-CCA Thr-Pro 180
aCCC-ACC Pro-Thr 182
CCC-CTC Pro-Leu 182
tTGG-CGG Trp-Arg 184
gCTC-TTC Leu-Phe 185
AAT-AGT Asn-Ser 188
AATg-AAG Asn-Lys 188
GGA-GTA Gly-Val 189
aGCG-ACG Ala-Thr 190
GCG-GAG Ala-Glu 190
GTG-GAG Val-Glu 191
GTG-GGG Val-Gly 191
GGG-GAG Gly-Glu 195
gGGG-TGG Gly-Trp 195
gTCA-CCA Ser-Pro 196
aCTC-TTC Leu-Phe 197
CTC-CCC Leu-Pro 197
AAG-ACG Lys-Thr 198
cAAG-GAG Lys-Glu 198
cGGA-AGA Gly-Arg 202
GGA-GAA Gly-Glu 202
TAC-TGC Tyr-Cys 205
cTGG-CGG Trp-Arg 209
GCC-GTC Ala-Val 210
aTAC-CAC Tyr-His 212
cTTT-ATT Phe-Ile 213
TTT-TGT Phe-Cys 213
gTTC-GTC Phe-Val 216
TTC-TAC Phe-Tyr 216
TAT-TGT Tyr-Cys 220
ACA-AGA Thr-Arg 231
GAAa-GAC Glu-Asp 233
tGAA-TAA Glu-Term 233
tTCT-CCT Ser-Pro 237
GGG-GTG Gly-Val 239
GGA-GTA Gly-Val 243
aTAC-CAC Tyr-His 244
CCC-CAC Pro-His 245
TTCa-TTA Phe-Leu 251
CATc-CAG His-Gln 255
CGA-CAA Arg-Gln 257
gCGA-TGA Arg-Term 257
TTCa-TTA Phe-Leu 259
ATT-ACT Ile-Thr 260
GGT-GAT Gly-Asp 265
CCT-CGT Pro-Arg 266
CCT-CTT Pro-Leu 266
tCCT-GCT Pro-Ala 266
AGT-AAT Ser-Asn 271
CTC-CCC Leu-Pro 279
aCGC-TGC Arg-Cys 285
CGC-CAC Arg-His 285
CCC-CTC Pro-Leu 289
aCTG-TTG Leu-Leu 296
AAA-ATA Lys-Ile 303
TAT-TGT Tyr-Cys 304
TATg-TAG Tyr-Term 304
tGTT-CTT Val-Leu 305
GCT-GTT Ala-Val 309
CAT-CGT His-Arg 311
TGGt-TGT Trp-Cys 312
tTGG-CGG Trp-Arg 312
gTAC-CAC Tyr-His 313
gGAC-CAC Asp-His 315
GCT-GAT Ala-Asp 318
tCCA-GCA Pro-Ala 319
ACC-ATC Thr-Ile 323
CTA-CAA Leu-Gln 324
CTA-CCA Leu-Pro 324
aGGG-AGG Gly-Arg 325
aGGG-TGG Gly-Trp 325
gGAG-AAG Glu-Lys 326
cCGC-TGC Arg-Cys 329
TTC-TCC Phe-Ser 331
CTC-CCC Leu-Pro 336
gGCC-ACC Ala-Thr 341
cTGT-CGT Cys-Arg 342
cTGT-GGT Cys-Gly 342
TGT-TAT Cys-Tyr 342
cTGG-GGG Trp-Gly 348
gGAG-AAG Glu-Lys 349
gCAG-TAG Gln-Term 350
tGTG-CTG Val-Leu 352
gCGG-GGG Arg-Gly 353
gCGG-TGG Arg-Trp 353
GGC-GAC Gly-Asp 355
TCC-TTC Ser-Phe 356
TGG-TAG Trp-Term 357
CGA-CAA Arg-Gln 359
tCGA-TGA Arg-Term 359
ATGc-ATA Met-Ile 361
TAC-TGC Tyr-Cys 363
AGC-AAC Ser-Asn 364
AGC-ACC Ser-Thr 364
cAGC-CGC Ser-Arg 364
AGC-AAC Ser-Asn 366
AGC-ACC Ser-Thr 366
cAGC-GGC Ser-Gly 366
ACG-ATG Thr-Met 369
AAC-AGC Asn-Ser 370
AACc-AAA Asn-Lys 370
cCTC-GTC Leu-Val 371
tGTG-TTG Val-Leu 375
cGGC-AGC Gly-Ser 377
cTGG-GGG Trp-Gly 378
TGG-TAG Trp-Term 378
cGAC-AAC Asp-Asn 380
cGAC-CAC Asp-His 380
GAC-GCC Asp-Ala 380
TGG-TAG Trp-Term 381
AACc-AAA Asn-Lys 382
CTT-CGT Leu-Arg 383
CTG-CCG Leu-Pro 385
CCC-CTC Pro-Leu 387
cGAA-TAA Glu-Term 388
GGA-GAA Gly-Glu 389
aGGA-AGA Gly-Arg 390
CCC-CTC Pro-Leu 391
AAT-ATT Asn-Ile 392
TGG-TTG Trp-Leu 393
tTGG-AGG Trp-Arg 393
gGTG-TTG Val-Leu 394
CGT-CCT Arg-Pro 395
gCGT-TGT Arg-Cys 395
AAC-ACC Asn-Thr 396
TTT-TCT Phe-Ser 397
tGTC-ATC Val-Ile 398
tGTC-CTC Val-Leu 398
tGTC-TTC Val-Phe 398
cGAC-AAC Asp-Asn 399
cGAC-TAC Asp-Tyr 399
CCC-CTC Pro-Leu 401
ATC-ACC Ile-Thr 402
cATC-TTC Ile-Phe 402
GAC-GGC Asp-Gly 409
GAC-GTC Asp-Val 409
gGAC-CAC Asp-His 409
gTTT-ATT Phe-Ile 411
tTAC-CAC Tyr-His 412
cAAA-CAA Lys-Gln 413
aCAG-TAG Gln-Term 414
CAG-CGG Gln-Arg 414
CCC-CGC Pro-Arg 415
cATG-GTG Met-Val 416
gTTC-GTC Phe-Val 417
TAC-TGC Tyr-Cys 418
GGC-GAC Gly-Asp 421
cAAG-GAG Lys-Glu 425
AGAg-AGT Arg-Ser 433
gAGA-GGA Arg-Gly 433
CTG-CCG Leu-Pro 444
CTG-CGG Leu-Arg 444
cGCA-CCA Ala-Pro 446
CAT-CGT His-Arg 451
tGCT-CCT Ala-Pro 456
cGTG-ATG Val-Met 460
CTA-CCA Leu-Pro 461
AAC-AGC Asn-Ser 462
AACc-AAG Asn-Lys 462
cCGC-TGC Arg-Cys 463
CGC-CAC Arg-His 463
CGC-CCC Arg-Pro 463
gGAT-TAT Asp-Tyr 474
gGGC-AGC Gly-Ser 478
CTG-CCG Leu-Pro 480
cTCC-CCC Ser-Pro 488
ATT-ACT Ile-Thr 489
ACC-ATC Thr-Ile 491
CGC-CAC Arg-His 496
tCGC-TGC Arg-Cys 496
CAG-CGG Gln-Arg 497
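
The codon-change notation in the table can be checked mechanically. The following Python sketch assumes that the uppercase letters form the codon and that lowercase letters are flanking bases shown only for context; this is an interpretation of the notation, not something stated on this page, and the function names are hypothetical. It translates both codons with the standard genetic code and reports which codon position changed.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def parse_codon_change(entry):
    """Split an entry such as 'cGGC-AGC' into (wild-type codon, mutant codon)."""
    wild, mutant = entry.split("-")
    # Assumption: lowercase letters are flanking bases, not part of the codon.
    wild_codon = "".join(ch for ch in wild if ch.isupper())
    return wild_codon, mutant.upper()


def describe_change(entry):
    """Return (changed codon position 1-3, wild-type residue, mutant residue)."""
    wild, mutant = parse_codon_change(entry)
    changed = [i + 1 for i in range(3) if wild[i] != mutant[i]]
    if len(changed) != 1:
        raise ValueError(f"{entry}: expected exactly one changed base")
    return changed[0], CODON_TABLE[wild], CODON_TABLE[mutant]


for entry in ("cGGC-AGC", "TGGg-TGA", "aCTG-TTG"):
    print(entry, describe_change(entry))
```

For the third entry both codons translate to leucine, which agrees with the Leu-Leu row listed at codon 296.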

Sequence

To map the mutations onto the sequence, we used the sequence of the given accession number NM_001005741.1, which is the same sequence we used in our earlier analyses. With the help of a Perl script we generated the following sequences and marked the given mutations.
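
The Perl script itself is not shown here. As a rough Python sketch of the numbering step, the codon numbers of the table can be converted to positions in the precursor sequence by adding the 39-residue signal peptide mentioned in the mutation map section. That negative codon numbers count backwards from the last signal-peptide residue, with no codon 0, is an assumption that happens to agree with the first table rows; the function names are hypothetical.

```python
SIGNAL_PEPTIDE_LENGTH = 39  # signal peptide length as stated for the mutation map

# First 81 residues of the precursor sequence P04062 (enough for the examples below).
PRECURSOR_PREFIX = (
    "MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYES"
)


def precursor_position(codon_number):
    """Convert a codon number from the table to a 1-based precursor position."""
    if codon_number == 0:
        raise ValueError("the assumed numbering has no codon 0")
    if codon_number > 0:
        return codon_number + SIGNAL_PEPTIDE_LENGTH
    # Assumption: codon -1 is the last signal-peptide residue (position 39).
    return codon_number + SIGNAL_PEPTIDE_LENGTH + 1


def residue_at(codon_number, sequence=PRECURSOR_PREFIX):
    """Return the wild-type residue found at the given codon number."""
    return sequence[precursor_position(codon_number) - 1]


# Codon numbers with the wild-type residue listed in the table (Lys, Trp, Gly).
for codon, expected in ((-27, "K"), (-4, "W"), (10, "G")):
    print(codon, precursor_position(codon), residue_at(codon), expected)
```

Comparing the residue found at the computed position with the wild-type residue in the table is a cheap consistency check before any position is marked.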

Positions where mutations occur

>sp|P04062|GLCM_HUMAN Glucosylceramidase OS=Homo sapiens GN=GBA PE=1 SV=3
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYES
TRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPM
ASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDI
YHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQR
LLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHS
IITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAV
ALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ

Positions for possible missense/nonsense mutations are marked red.

Possible mutated amino acid residues

The following sequence shows the different possibilities for mutated residues. As there can be several mutations at the same position, all changed residues are shown, each in a separate line. The first line of each block shows the original sequence; the second, third and fourth lines show mutated residues, with "!" denoting a termination codon. Positions for possible missense/nonsense mutations are marked red.

>sp|P04062|GLCM_HUMAN Glucosylceramidase OS=Homo sapiens GN=GBA PE=1 SV=3

MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYES
MEFSSPSREECPRPLSRVSIMAGSLTGLLLLQAVS!ASGARPCIPKSFSYISVMSVCNATYCNSFDPPTFPALSTVSRYKN
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVLCVCNATYCDSFDPPTFPALGTFSRYES
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYES

TRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPM
IRSE!WMELSMGPIQANHTGTGLPLTLQPE!!FQKANGFGGATTDAATLNILALSPPAQNLLRKLYVSKEEIGYDITWAST
TRSGRQMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNISQVLV
TRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPM

ASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDI
ASCVFSICTYI!EDTPHDFQLHNFSLPEADTKLNITLNP!ALQLA!PPV!FLDSS!PSTTRFKTSVTENGKEPFTGQPRDI
ASCDFSILTYPYADTPDDFQLHNFSLPEEDTKLQILLSHRALQLAQCPVSPLASPWTSLTWLKTKGEGNGKWSPEGQPEDI
ASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDI

YHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQR
CHQTRVRHIVKVLDACAEHKLQFWAVRADNEPPAVLLSVHHFQCLGLTPEQQQDLTARDLDRTLANNTHHNVRLPMLDDQC
YHQTWARYCVKYLDAYAEHKLQFWAVTA!NEPSAGLLSGYPFQCLGFTPEHQ!DFIARDLGLTLANSTHHNVRLLMLDDQH
YHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIARDLGATLANSTHHNVRLLMLDDQR

LLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHS
LLLLHWAKVVLTDPEAAICLHGIVVRCHLHFLDAAKAIQRKTHCLSPNTMPFASETRVGSKFGK!SLGLDF!DQGIQCNHN
LLLPHWAKVVLTDPEAAK!VHGIAVHRYLDFLAPAKATPWETHRLFPNTMLFASEAGVGSKFWEQSVWLGSWD!GMQYTHT
LLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEAYVGSKFWEQSVRLGSWDRGMQYRHG

IITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAV
IIMSVLYHLVSGTN!KRAPNL!ERLILLPTSINSLTIVDITKGTIHQ!RVVCHLDHFSEFIPEGSQSVGLVASQKNDPDPV
IITKLLYHVVG!THWNLALNPEGGPNRVCNFLYSPFIVDITKVTFYKRPMFYHLGHFSKFIPEGSQGVGLVASQKNDRDAV
IITNLLYHVVGWTAWNLALNPEGGPNWVRNFFDSPIIVDITKHTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAV

ALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ
ALMRPDGSPVVVMPSCSSKDVPLTIKYPAVSFPETISPGYPTHIYLWRHR
ALMHPDGSAVVVVLKHSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRCQ
ALMHPDGSAVVVVLNPSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ
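
A stacked layout like the one above could be produced along the following lines. This is a minimal Python sketch, not the script that was actually used; `stack_variants` is a hypothetical helper, and the example variants are the first table rows converted to precursor positions under the assumption of a 39-residue signal peptide.

```python
def stack_variants(sequence, variants):
    """Build display rows: the wild type plus one row per alternative residue.

    `variants` is a list of (1-based position, mutant residue) pairs; the k-th
    alternative seen at a position is written into the k-th mutated row.
    """
    rows = [list(sequence)]
    seen = {}  # position -> number of alternatives already placed
    for position, residue in variants:
        level = seen.get(position, 0) + 1
        seen[position] = level
        while len(rows) <= level:
            rows.append(list(sequence))  # new rows start as copies of the wild type
        rows[level][position - 1] = residue
    return ["".join(row) for row in rows]


# First 57 residues of the precursor sequence.
wild_type = "MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVC"
# '!' is used for a termination codon, as in the display above.
variants = [(13, "R"), (36, "!"), (49, "S"), (51, "I"), (54, "M"), (54, "L"), (55, "S")]
for row in stack_variants(wild_type, variants):
    print(row)
```

Position 54 carries two alternatives (Met and Leu), so the second one moves to an additional row while all other positions stay wild type there.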

dbSNP: Mutations for GBA

dbSNP was searched for synonymous as well as missense mutations. Synonymous mutations do not change the encoded amino acid, so the residue remains the same after the mutation. The output of dbSNP was also parsed with a Perl script, using the FlatFile output and the gene map. The positions were the same as in our reference sequence, so we could use them directly.

Synonymous Mutations in GBA

In the following table the synonymous mutations for GBA are listed:

ID mutated allele amino acid codon position amino acid position
rs78297361 T R 3 535
rs77130994 A G 3 517
rs1135675 C V 3 499
rs12747811 A Q 3 471
rs79226895 A K 3 464
rs78346899 T Y 3 451
rs75034092 A G 3 416
rs74498117 G L 3 410
rs1141826 A T 3 408
rs75391747 A E 3 388
rs80317710 A E 3 365
rs79311125 T Y 3 352
rs1064647 T G 3 346
rs1064646 G K 3 342
rs74486098 A K 3 237
rs76158190 C S 3 235
rs75370695 A A 3 229
rs76682322 T L 3 224
rs76727497 A P 3 221
rs76717906 T P 3 217
rs78659905 T L 3 213
rs77916306 A P 3 198
rs77191198 A T 3 173
rs74572011 T R 3 170
rs79767521 T P 3 161
rs75249684 C R 3 159
rs79175920 A A 3 129
rs1141821 C T 3 100
rs1141816 A G 3 93
rs78669556 C R 3 87
rs1141810 C S 3 81
rs76337315 A E 3 80
rs1141807 C Y 3 79

Sequence

>sp|P04062|GLCM_HUMAN Glucosylceramidase OS=Homo sapiens GN=GBA PE=1 SV=3
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYES
TRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPM
ASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDI
YHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQR
LLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHS
IITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAV
ALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ

Positions for possible synonymous mutations are marked blue.

Missense Mutations in GBA

In the following table the missense mutations are listed:

ID mutated allele amino acid codon position amino acid position
rs75822236 A H 2 535
rs78016673 T I 2 530
rs77409925 G E 3 513
rs113825752 C P 2 509
rs76071730 G R 2 490
rs74752878 G C 2 457
rs79185870 G L 3 456
rs80020805 A I 3 455
rs77035024 A L 3 450
rs78802049 G E 3 448
rs75564605 C T 2 441
rs75090908 G E 3 438
rs75243000 C S 2 436
rs75385858 C T 2 435
rs77738682 T I 2 431
rs76910485 T L 2 430
rs78715199 A E 3 419
rs77284004 C A 2 419
rs76014919 T C 3 417
rs2230289 T M 2 408
rs75528494 A R 3 405
rs76228122 G C 2 402
rs74979486 A Q 2 398
rs11558184 A Q 2 392
rs1064648 A H 2 368
rs78188205 A D 2 357
rs77321207 G C 2 343
rs77714449 T I 2 342
rs79696831 A H 2 324
rs74731340 A N 2 310
rs79215220 G R 2 305
rs80116658 A D 2 304
rs76725886 G R 2 270
rs79945741 A L 3 252
rs76026102 G C 2 244
rs77451368 A E 2 241
rs74462743 A E 2 234
rs75636769 A E 2 229
rs78911246 T V 2 228
rs80205046 T L 2 221
rs76500263 C P 2 201
rs80222298 T L 2 198
rs78446355 C N 3 196
rs79660787 A E 2 175
rs78657146 T I 2 173
rs75690705 T L 2 170
rs79796061 T V 2 166
rs77959976 A I 3 162
rs79637617 T L 2 161
rs77834747 G S 2 158
rs77019233 A K 3 156
rs1141820 G R 2 99
rs1141818 T Y 1 99
rs78769774 A Q 2 87
rs1141812 A S 1 83
rs1141808 A K 1 80
rs75954905 G L 3 76
rs74953658 A E 3 63

Sequence

>sp|P04062|GLCM_HUMAN Glucosylceramidase OS=Homo sapiens GN=GBA PE=1 SV=3
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYES
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCESFDPPTFPALGTLSRYKS

TRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPM
TSSGRQMELSMGPIQANYTGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYKISRVLI

ASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDI
ASCVFSILTYIYEDTPDDFQLHNFSLPEEDTKLNILLIPRALQLAQRPVSLLASPWTSLTWLKTNVEVNGKESLKGQPEDI

YHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQR
CHQTWARYLVKFLDAYAEHKLQFWAVRAENEPSAGLLSGYPFQCLGFTPEHQRDFIARDLDRTLANNTHHNVRLLMLDDQH

LLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHS
LLLPHWAKVVLTDPEAAICVHGIAVHWYLDFLDPAKATLGETHHLFPNTMLFASEACVGSKFWEQSVQLGSWDQGMQCSHR

IITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAV
IIMNLLYHVVGCTAWNLALNPEGGLIWVRTSVESPTIVDITKETLYKQPILCHLGHFSKFIPEGSQRVGLVASQKNDLDAV

ALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ
ALMRPDGSAVVVVLNRSSKDVPPTIKEPAVGFLETISPGYSIHIYLWRHQ

Positions for possible missense mutations are marked red.

Sequence with synonymous and missense mutations

The following sequence shows the synonymous and the missense mutations found in dbSNP. As a position can carry a synonymous or a missense mutation, the first line of each block shows the original sequence, the second line the missense mutations and the third line the synonymous ones.

>sp|P04062|GLCM_HUMAN Glucosylceramidase OS=Homo sapiens GN=GBA PE=1 SV=3

MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYES
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCESFDPPTFPALGTLSRYKS
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYES

TRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPM
TSSGRQMELSMGPIQANYTGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYKISRVLI
TRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPM

ASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDI
ASCVFSILTYIYEDTPDDFQLHNFSLPEEDTKLNILLIPRALQLAQRPVSLLASPWTSLTWLKTNVEVNGKESLKGQPEDI
ASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDI

YHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQR
CHQTWARYLVKFLDAYAEHKLQFWAVRAENEPSAGLLSGYPFQCLGFTPEHQRDFIARDLDRTLANNTHHNVRLLMLDDQH
YHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQR

LLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHS
LLLPHWAKVVLTDPEAAICVHGIAVHWYLDFLDPAKATLGETHHLFPNTMLFASEACVGSKFWEQSVQLGSWDQGMQCSHR
LLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHS

IITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAV
IIMNLLYHVVGCTAWNLALNPEGGLIWVRTSVESPTIVDITKETLYKQPILCHLGHFSKFIPEGSQRVGLVASQKNDLDAV
IITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAV

ALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ
ALMRPDGSAVVVVLNRSSKDVPPTIKEPAVGFLETISPGYSIHIYLWRHQ
ALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ

Positions for possible synonymous mutations are marked blue, positions for possible missense mutations are marked red.

Mutation map

To create the mutation map with all missense and synonymous mutations listed in dbSNP and HGMD, the corresponding sequence positions were mapped onto each other. This was straightforward, as both databases use the same sequence and only the numbering differs slightly.

>sp|P04062|GLCM_HUMAN Glucosylceramidase OS=Homo sapiens GN=GBA PE=1 SV=3
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYES
TRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGFGGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPM
ASCDFSIRTYTYADTPDDFQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQPGDI
YHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIARDLGPTLANSTHHNVRLLMLDDQR
LLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAKATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHS
IITNLLYHVVGWTDWNLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQKNDLDAV
ALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ


Positions for possible missense mutations are marked:

  • red, if the mutation is only listed in HGMD
  • blue, if the mutation is only listed in dbSNP
  • green, if the mutation is listed in dbSNP and HGMD

Underlined residues represent active site and residues forming hydrogen bonds with the active site. <ref>Kim et al., Crystal Structure of the Salmonella enterica Serovar Typhimurium Virulence Factor SrfJ, a Glycoside Hydrolase Family Enzyme. Journal of Bacteriology, 2009, p. 6550-6554, Vol. 191, No. 21 </ref>

The mutation map shows that no mutations of the active-site residues Glu235 and Glu340 are known, whereas two of the residues forming hydrogen bonds with the active site (Arg120, Asp282, His311) are listed in the mutation databases. [Note that the positions of these residues refer to the mature protein, which does not contain the signal peptide. The mutation map includes the 39-residue signal peptide.]

Statistical analyses

We now analyse these results statistically to see whether some amino acids mutate more often than others and which amino acids replace them.

Synonymous mutations

Figure 1: Diagram that shows how often which amino acid was mutated for synonymous mutations

We first look at the synonymous mutations. In a synonymous mutation the mutated codon encodes the same amino acid as the original codon, so the protein sequence is not affected. In our data the mutated base was always at the third position of the codon. This is not surprising, as codons that encode the same amino acid mostly differ at their third position.
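
That synonymous changes concentrate at the third codon position follows directly from the standard genetic code. The following Python sketch, which is an illustration and not part of the original analysis, enumerates every single-base substitution in the 61 sense codons and counts the synonymous ones per position.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_changes_by_position():
    """Count synonymous single-base substitutions per codon position."""
    counts = {1: 0, 2: 0, 3: 0}
    for codon, residue in CODON_TABLE.items():
        if residue == "*":
            continue  # only sense codons are mutated
        for index in range(3):
            for base in BASES:
                if base == codon[index]:
                    continue
                mutant = codon[:index] + base + codon[index + 1:]
                if CODON_TABLE[mutant] == residue:
                    counts[index + 1] += 1
    return counts


print(synonymous_changes_by_position())
```

Of the 134 possible synonymous substitutions, 126 are at the third position, 8 at the first (within the leucine and arginine codon families) and none at the second.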

The amino acids with the most synonymous mutations are arginine, glycine and proline with four each, as Figure 1 shows. According to the codon table <ref>http://en.wikipedia.org/wiki/Genetic_code</ref>, arginine is encoded by six codons and glycine and proline by four each. Glutamic acid, leucine, lysine, threonine and tyrosine each have three synonymous mutations and are encoded by two, six, two, four and two codons, respectively. The amino acids with the most synonymous mutations thus also have many codons, although the trend is less clear for the amino acids with three mutations than for those with four.

Asparagine (two codons), aspartic acid (two codons), cysteine (two codons), histidine (two codons), isoleucine (three codons), methionine (one codon), phenylalanine (two codons) and tryptophan (one codon) have no synonymous mutations. As methionine and tryptophan have only one codon each, a synonymous mutation is not possible for them. The other amino acids in this group are encoded by at most three codons, so each of their codons has only one or two synonymous alternatives.

The diagram shows what we expected: amino acids with more codons tend to have more synonymous mutations. The trend is not perfect, however: valine (four codons) has one synonymous mutation, whereas glutamic acid (two codons) has three. There might be a mutation bias, or the number of observed mutations may be too small for the effect to show clearly.
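
The comparison between the number of codons and the number of synonymous mutations can be tabulated with a few lines of Python. In this sketch the observed counts are taken from the amino acid column of the synonymous mutation table above, and the number of codons is derived from the standard genetic code; the variable names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Number of codons per amino acid (stop codons excluded).
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

# One-letter codes copied row by row from the "amino acid" column of the
# synonymous mutation table (33 rows).
SYNONYMOUS_COLUMN = "RGVQKYGLTEEYGKKSALPPLPTRPRATGRSEY"
observed = Counter(SYNONYMOUS_COLUMN)

print("aa codons observed")
for aa in sorted(DEGENERACY, key=lambda a: (-observed[a], a)):
    print(aa, DEGENERACY[aa], observed[aa])
```

With only 33 observations, a listing like this can hint at the trend but cannot separate a mutation bias from sampling noise.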

Non-synonymous mutations

We next analyse the non-synonymous mutations.

Figure 2: Diagram that shows which position in the codon was mutated in non-synonymous mutations

Figure 2 shows how often each codon position was mutated in non-synonymous mutations. The first and the second position are mutated with almost the same frequency, whereas the third position is mutated in fewer than twenty cases. The reason is that a mutation at the third position of a codon is much more likely to be synonymous than a mutation at the first or second position. In contrast, a mutation at the first or second position changes the amino acid in the majority of cases.

Figure 3: Diagram that shows which amino acids were mutated in non-synonymous mutations

Figure 3 shows which amino acids were mutated in non-synonymous mutations and how often. Arginine, glycine, leucine and proline are mutated most frequently, and cysteine, methionine and histidine least frequently. The reason could be a mutation bias towards cytosine and guanine, which occur more frequently in the codons of arginine, glycine, leucine and proline.

Figure 4: Diagram that shows which amino acids occurred because of non-synonymous mutations

Figure 4 shows how often each amino acid was the result of a mutation. Arginine, leucine, proline and also the termination codon occurred most often. Methionine, tyrosine and tryptophan were rarely the result of a mutation. The reason could be that the amino acids that occurred most often are encoded by more codons, whereas methionine, for example, has only one. A mutated codon is therefore more likely to encode an amino acid that has many codons. Interestingly, the termination codon also occurred very often. There are three termination codons, so this number is also high.
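
The argument that residues with many codons are reached more often can be made concrete by counting, under the standard genetic code, how many non-synonymous single-base substitutions lead to each residue or to a stop codon. The following Python sketch is an illustration that weights all substitutions equally, which real mutation data do not; the function name is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def incoming_substitutions():
    """Count non-synonymous single-base substitutions arriving at each residue."""
    counts = Counter()
    for codon, residue in CODON_TABLE.items():
        if residue == "*":
            continue  # mutations start from sense codons only
        for index, base in product(range(3), BASES):
            if base == codon[index]:
                continue
            target = CODON_TABLE[codon[:index] + base + codon[index + 1:]]
            if target != residue:
                counts[target] += 1
    return counts


for residue, count in incoming_substitutions().most_common():
    print(residue, count)
```

In this uniform model the stop codons can be reached by 23 different substitutions, compared with 9 for methionine and 7 for tryptophan.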

Figure 5: Heatmap that shows how often one amino acid was mutated to another

Figure 5 shows a heatmap in which amino acid replacements with high frequency are marked red and replacements which did not occur are marked white. Mutations from an amino acid to itself are not marked in the heatmap because they are discussed in the section above. The replacements which occurred with the highest frequency are:

  • Asn -> Lys (4 times)
  • Gln -> Term (4 times)
  • Glu -> Lys (4 times)
  • Ile -> Thr (4 times)
  • Leu -> Pro (10 times)
  • Phe -> Val (4 times)
  • Pro -> Leu (8 times)
  • Thr -> Ile (4 times)
  • Trp -> Term (5 times)
  • Val -> Leu (6 times)
  • Tyr -> Cys (5 times)
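
Counts like these can be tallied directly from the amino acid change column of the HGMD table. The following Python sketch uses only a small excerpt of the table rows, so its counts are not the totals listed above; the helper names are hypothetical and the matrix is the kind of input a heatmap plot would take.

```python
from collections import Counter

# Excerpt of (amino acid change, codon number) rows from the HGMD table.
ROWS = [
    ("Trp-Term", -4), ("Val-Met", 15), ("Val-Leu", 15), ("Leu-Pro", 66),
    ("Pro-Leu", 122), ("Pro-Leu", 159), ("Leu-Pro", 174), ("Trp-Term", 179),
]


def tally_substitutions(rows):
    """Count how often each (original, replacement) pair occurs."""
    counts = Counter()
    for change, _codon in rows:
        source, target = change.split("-")
        counts[(source, target)] += 1
    return counts


def as_matrix(counts):
    """Arrange the counts as a square matrix (rows: original, columns: replacement)."""
    residues = sorted({residue for pair in counts for residue in pair})
    matrix = [[counts[(a, b)] for b in residues] for a in residues]
    return residues, matrix


counts = tally_substitutions(ROWS)
residues, matrix = as_matrix(counts)
print(residues)
for name, row in zip(residues, matrix):
    print(name, row)
```

Because the matrix keeps both directions separately, asymmetric replacements show up as differences between a cell and its mirror cell.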

The exchange of leucine for proline and of proline for leucine occurs very often. There seems to be a bias towards this mutation. The similarity of the codons could also be a reason.

There are also mutations which occur often in only one direction, for example valine to leucine. Four codons encode valine (GU[AUCG]), and four of the six leucine codons (CU[AUCG]) differ from them only in the first position. The mutation from guanine to cytosine seems to be more common than the mutation from cytosine to guanine.

Tryptophan mutated five times to a termination codon. The reason may again be the similarity of the codons: tryptophan is encoded by UGG, and the termination codons are UGA, UAG and UAA. A single mutation at the third position of UGG gives UGA, and a single mutation at the second position gives UAG.

Another frequent mutation is tyrosine to cysteine. This is another example of a mutation which occurs frequently in only one direction. Here again the mutation of one codon position is sufficient, as tyrosine has the codons UAC and UAU and cysteine the codons UGU and UGC. There seems to be a bias towards the mutation of adenine to guanine.

On the other hand, some replacements do not occur at all, for example alanine to arginine. Every alanine codon (GC[AUCG]) differs from every arginine codon in at least two positions, so two mutations would be necessary and the probability is lower. There is also a correlation between the number of codons and the number of mutations: methionine has only one codon, so fewer mutations are possible.
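
Whether a replacement needs one or two mutations can be checked against the standard genetic code by taking the smallest number of differing bases between any codon of the original residue and any codon of the replacement. The following Python sketch illustrates this; `min_substitutions` is a hypothetical helper and "*" stands for a termination codon.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS_FOR = defaultdict(list)
for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR[residue].append("".join(codon))


def min_substitutions(source, target):
    """Smallest number of base changes turning a `source` codon into a `target` codon."""
    return min(
        sum(a != b for a, b in zip(first, second))
        for first in CODONS_FOR[source]
        for second in CODONS_FOR[target]
    )


for source, target in (("A", "R"), ("Y", "C"), ("L", "P"), ("W", "*"), ("V", "L")):
    print(source, target, min_substitutions(source, target))
```

Alanine to arginine needs two base changes, whereas the frequent replacements discussed above (Tyr to Cys, Leu to Pro, Trp to Term, Val to Leu) each need only one.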

Conclusion

All in all we draw two conclusions. First, the more codons encode an amino acid, the more mutations it has, and the fewer codons encode it, the fewer mutations occur. Second, the more similar the codons of two different amino acids are, the more probable and frequent a mutation between them is. There must also be a mutation bias, however, because the second observation sometimes holds in only one direction.

References

<references/>