Mapping mutations of ARS A

From Bioinformatikpedia
Revision as of 15:16, 14 August 2011 by Kassner

Mutations in general

Mutations are changes in the genomic nucleotide sequence of an organism. They are introduced accidentally, e.g. when wrong bases are incorporated during DNA replication. The common types of mutations are insertions into the DNA, deletions from it, and nucleotide substitutions.
Depending on its type at the DNA level, a mutation can affect the structure or function of a protein to a different extent.

  • Frameshift mutation: If a sequence is inserted or deleted and its length is not divisible by 3, the reading frame of the downstream coding sequence is shifted by one or two nucleotides. The downstream mRNA is then translated into a completely different protein sequence, and if the downstream region is long, the protein is likely to be dysfunctional. If the length of the insertion/deletion is divisible by 3, the reading frame is not disrupted and the structural and functional effects on the protein might not be severe.
  • Nonsense mutation: The codon of an amino acid within the protein is changed into a premature stop codon. The protein sequence downstream of the mutation is lost. The protein is likely to be dysfunctional if the truncated part is long.
  • Missense mutation: The codon is altered such that a different amino acid is incorporated into the protein. Depending on the properties and location of the mutated amino acid, the effect on structure and function can be more or less severe.
  • Silent mutation: The mutated codon still encodes the same amino acid. Thus the amino acid sequence of the protein is not changed and no structural or functional alteration is expected.
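The point-mutation classes listed above can be told apart mechanically by translating the reference codon and the mutated codon. The following is a minimal sketch, assuming the standard genetic code and a reference codon that encodes an amino acid. The function names `classify_substitution` and `shifts_reading_frame` are illustrative and not part of the original analysis.

```python
# Standard genetic code, written in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def classify_substitution(ref_codon, mut_codon):
    """Classify a codon substitution as 'silent', 'missense' or 'nonsense'.

    Assumes that ref_codon encodes an amino acid (not a stop codon).
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    mut_aa = CODON_TABLE[mut_codon.upper()]
    if mut_aa == ref_aa:
        return "silent"
    if mut_aa == "*":
        return "nonsense"
    return "missense"


def shifts_reading_frame(indel_length):
    """An insertion/deletion shifts the frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


print(classify_substitution("CTG", "CCG"))  # Leu -> Pro: missense
print(classify_substitution("TGG", "TGA"))  # Trp -> stop: nonsense
print(classify_substitution("GAC", "GAT"))  # Asp -> Asp: silent
print(shifts_reading_frame(4), shifts_reading_frame(6))
```

Building the table from the compact 64-letter string keeps the sketch short; a library codon table could be used instead.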

In the following, we map known nonsense, missense and silent (= synonymous) mutations from the databases dbSNP and HGMD onto the sequence and the structure of the lysosomal enzyme ARSA.

HGMD

The Human Gene Mutation Database (HGMD) <ref> Krawczak M, Cooper DN: The human gene mutation database (HGMD). Genome Digest 3: 7-8, 1996. </ref> provides a comprehensive collection of mutations within human genes that are associated with diseases. We used the protocol as described here to get all missense and nonsense mutations of ARSA. All mutations found were known to be associated with Metachromatic Leukodystrophy. The table of all 90 missense/nonsense mutations is given at the end of this section. Furthermore, we mapped all 90 mutations onto the sequence of ARSA and colored them in red to get an impression of their distribution (see below). Together with the sequence and the location of the mutations, we marked important binding sites in the graphical illustration below: "*" marks metal binding sites, "." marks substrate binding sites and ":" marks the active site. One can see that each of these functional sites lies near a known mutation; such nearby mutations are likely to impair the function of the enzyme. Furthermore, the disease-causing mutations are distributed rather uniformly along the protein sequence.

>sp|P15289|ARSA_HUMAN
**
MGAPRSLLLALAAGLAVARPPNIVLIFADDLGYGDLGCYGHPSSTTPNLDQLAAGGLRFT
*
DFYVPVSLCTPSRAALLTGRLPVRMGMYPGVLVPSSRGGLPLEEVTVAEVLAARGYLTGM
. : .
AGKWHLGVGPEGAFLPPHQGFHRFLGIPYSHDQGPCQNLTCFPPATPCDGGCDQGLVPIP
.
LLANLSVEAQPPWLPGLEARYMAFAHDLMADAQRQDRPFFLYYASHHTHYPQFSGQSFAE
**
RSGRGPFGDSLMELDAAVGTLMTAIGDLGLLEETLVIFTADNGPETMRMSRGGCSGLLRC
.
GKGTTYEGGVREPALAFWPGHIAPGVTHELASSLDLLPTLAALAGAPLPNVTLDGFDLSP

LLLGTGKSPRQSLFFYPSYPDEVRGVFAVRTGKYKAHFFTQGSAHSDTTADPACHASSSL

TAHEPPLLYDLSKDPGENYNLLGGVAGATPEVLQALKQLQLLKAQLDAAVTFGPSQVARG

EDPALQICCHPGCTPRPACCHCPDPHA


Accession number | Codon change | Amino acid change | Position
CM042298 | cGAC-AAC | Asp-Asn | 29
CM990171 | cGGC-AGC | Gly-Ser | 32
CM065974 | CTG-CCG | Leu-Pro | 52
CM990172 | CTG-CCG | Leu-Pro | 68
CM950092 | CCG-CTG | Pro-Leu | 82
CM940096 | CGG-CAG | Arg-Gln | 84
CM990173 | tCGG-TGG | Arg-Trp | 84
CM940097 | GGC-GAC | Gly-Asp | 86
CM990174 | gCCC-GCC | Pro-Ala | 94
CM970109 | AGC-AAC | Ser-Asn | 95
CM910049 | TCC-TTC | Ser-Phe | 96
CM910050 | GGC-GAC | Gly-Asp | 99
CM990175 | GGC-GTC | Gly-Val | 99
CM970110 | aGGA-AGA | Gly-Arg | 119
CM940098 | cGGC-AGC | Gly-Ser | 122
CM980118 | CTG-CCG | Leu-Pro | 135
CM940099 | CCC-CTC | Pro-Leu | 136
CM990176 | gCCC-TCC | Pro-Ser | 136
CM004461 | tCGA-GGA | Arg-Gly | 143
CM990177 | CCG-CTG | Pro-Leu | 148
CM970111 | cGAC-TAC | Asp-Tyr | 152
CM962419 | CAGg-CAC | Gln-His | 153
CM940100 | GGC-GAC | Gly-Asp | 154
CM940101 | CCC-CGC | Pro-Arg | 155
CM032834 | CCC-CTC | Pro-Leu | 155
CM042299 | cTGC-CGC | Cys-Arg | 156
CM940102 | CCT-CGT | Pro-Arg | 167
CM940103 | cGAC-AAC | Asp-Asn | 169
CM950093 | TGT-TAT | Cys-Tyr | 172
CM910051 | ATC-AGC | Ile-Ser | 179
CM032835 | CTG-CAG | Leu-Gln | 181
CM950094 | CAGc-CAC | Gln-His | 190
CM990178 | gCCC-ACC | Pro-Thr | 191
CM990179 | TGGc-TGA | Trp-Term | 193
CM950095 | TAC-TGC | Tyr-Cys | 201
CM930039 | GCC-GTC | Ala-Val | 212
CM050538 | cTTC-GTC | Phe-Val | 219
CM930040 | GCC-GTC | Ala-Val | 224
CM990180 | cCAC-TAC | His-Tyr | 227
CM940104 | cCCT-ACT | Pro-Thr | 231
CM940105 | cCGC-TGC | Arg-Cys | 244
CM970112 | CGC-CAC | Arg-His | 244
CM930041 | cGGG-AGG | Gly-Arg | 245
CM034715 | TTT-TCT | Phe-Ser | 247
CM970113 | TCC-TAC | Ser-Tyr | 250
CM024340 | gGAG-AAG | Glu-Lys | 253
CM960078 | gGAT-CAT | Asp-His | 255
CM930042 | ACG-ATG | Thr-Met | 274
CM074714 | ACT-ATT | Thr-Ile | 279
CM993444 | aGAC-TAC | Asp-Tyr | 281
CM023013 | gACC-CCC | Thr-Pro | 286
CM990181 | CGT-CAT | Arg-His | 288
CM940106 | gCGT-TGT | Arg-Cys | 288
CM042300 | cGGC-AGC | Gly-Ser | 293
CM044574 | GGC-GAC | Gly-Asp | 293
CM042301 | TGC-TAC | Cys-Tyr | 294
CM930043 | TCC-TAC | Ser-Tyr | 295
CM980119 | TTG-TCG | Leu-Ser | 298
CM990182 | TGT-TTT | Cys-Phe | 300
CM032836 | cTAC-CAC | Tyr-His | 306
HM060041 | cGAG-AAG | Glu-Lys | 307
CM990183 | GGC-GAC | Gly-Asp | 308
CM962420 | GGC-GTC | Gly-Val | 308
CM930044 | cGGT-AGT | Gly-Ser | 309
CM950096 | CGA-CAA | Arg-Gln | 311
CM001061 | GAGc-GAT | Glu-Asp | 312
CM970114 | tGCC-ACC | Ala-Thr | 314
CM004546 | TGGc-TGA | Trp-Term | 318
CM032837 | cGGC-AGC | Gly-Ser | 325
CM990184 | ACC-ATC | Thr-Ile | 327
CM065973 | cGAG-TAG | Glu-Term | 329
CM940107 | GAC-GTC | Asp-Val | 335
CM890013 | AAT-AGT | Asn-Ser | 350
CM970115 | AAGa-AAC | Lys-Asn | 367
CM940108 | CGG-CAG | Arg-Gln | 370
CM940109 | tCGG-TGG | Arg-Trp | 370
CM940110 | CCG-CTG | Pro-Leu | 377
CM930045 | cGAG-AAG | Glu-Lys | 382
CM970116 | cCGT-TGT | Arg-Cys | 384
CM980120 | CGG-CAG | Arg-Gln | 390
CM940111 | gCGG-TGG | Arg-Trp | 390
CM910052 | ACT-AGT | Thr-Ser | 391
CM980121 | tCAC-TAC | His-Tyr | 397
CM065972 | cAGT-GGT | Ser-Gly | 406
CM012065 | ACC-ATC | Thr-Ile | 408
CM940112 | ACT-ATT | Thr-Ile | 409
CM990185 | gCCC-ACC | Pro-Thr | 425
CM940113 | CCG-CTG | Pro-Leu | 426
CM970117 | CTC-CCC | Leu-Pro | 428
CM032838 | TAT-TCT | Tyr-Ser | 429
CM990186 | GCC-GTC | Ala-Val | 464
CM034716 | GCT-GGT | Ala-Gly | 469
CM940114 | gCAG-TAG | Gln-Term | 486
CM044573 | cTGT-GGT | Cys-Gly | 489
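Mapping the tabulated positions onto the sequence can be sketched as follows. This is a minimal illustration, not the procedure actually used for the colored figure: it assumes a plain-text marker ("^") in place of the red coloring, uses only the first 60 residues of the sequence shown above, and the helper names `marker_line` and `annotate` are hypothetical.

```python
# First 60 residues of ARSA (P15289), copied from the sequence listing above.
FIRST_BLOCK = "MGAPRSLLLALAAGLAVARPPNIVLIFADDLGYGDLGCYGHPSSTTPNLDQLAAGGLRFT"

# HGMD positions from the table above that fall into this block (1-based).
POSITIONS = [29, 32, 52]


def marker_line(block_start, block_length, positions, symbol="^"):
    """Return a line with `symbol` under every mutated residue of one block.

    block_start is the 1-based sequence position of the block's first residue.
    """
    marked = set(positions)
    return "".join(
        symbol if block_start + offset in marked else " "
        for offset in range(block_length)
    )


def annotate(sequence, positions, width=60):
    """Split the sequence into blocks and add a marker line below each block."""
    lines = []
    for start in range(0, len(sequence), width):
        block = sequence[start:start + width]
        lines.append(block)
        lines.append(marker_line(start + 1, len(block), positions))
    return lines


for line in annotate(FIRST_BLOCK, POSITIONS):
    print(line)
```

Passing the full sequence and the complete position list produces one marker line per 60-residue block.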

dbSNP

The Single Nucleotide Polymorphism Database (dbSNP) <ref> Wheeler DL, Barrett T, Benson DA, et al. (January 2007). "Database resources of the National Center for Biotechnology Information". Nucleic Acids Res. 35 (Database issue): D5–12. </ref> is an archive of genetic variation within and across different species. Again we used the protocol described here to search the database for known mutations of ARSA.
The "SNP" search for ARSA yielded 123 known human mutations for the protein. When we searched via the Geneview report instead, only 14 mutations appeared. To find out why the first search yielded so many more results, we investigated the "SNP" search results in more detail and noticed that they included different isoforms and sequence versions of ARSA. They also contained many insertions and deletions, which we did not want to consider. Therefore we selected the Geneview report for the isoform that we had used so far (ID=NP_000478) and proceeded with the analysis.
Again, we summarized the results graphically. Unlike in the HGMD section, we also selected synonymous mutations from dbSNP (colored in green). This time we had far fewer mutations than in the HGMD analysis, and as one can see, the mutations are not necessarily near important functional sites in the protein. This may be because dbSNP does not specifically store disease-associated mutations. A table containing all mutations is given at the end of this section.

>sp|P15289|ARSA_HUMAN
**
MGAPRSLLLALAAGLAVARPPNIVLIFADDLGYGDLGCYGHPSSTTPNLDQLAAGGLRFT
*
DFYVPVSLCTPSRAALLTGRLPVRMGMYPGVLVPSSRGGLPLEEVTVAEVLAARGYLTGM
. : .
AGKWHLGVGPEGAFLPPHQGFHRFLGIPYSHDQGPCQNLTCFPPATPCDGGCDQGLVPIP
.
LLANLSVEAQPPWLPGLEARYMAFAHDLMADAQRQDRPFFLYYASHHTHYPQFSGQSFAE
**
RSGRGPFGDSLMELDAAVGTLMTAIGDLGLLEETLVIFTADNGPETMRMSRGGCSGLLRC
.
GKGTTYEGGVREPALAFWPGHIAPGVTHELASSLDLLPTLAALAGAPLPNVTLDGFDLSP

LLLGTGKSPRQSLFFYPSYPDEVRGVFAVRTGKYKAHFFTQGSAHSDTTADPACHASSSL

TAHEPPLLYDLSKDPGENYNLLGGVAGATPEVLQALKQLQLLKAQLDAAVTFGPSQVARG

EDPALQICCHPGCTPRPACCHCPDPHA

SNP ID | SNP type | Nucleotide (mutation) | Amino acid (mutation) | Nucleotide (reference) | Amino acid (reference) | Position
rs6151428 | missense | A | His [H] | G | Arg [R] | 496
rs117341984 | missense | A | Arg [R] | G | Gly [G] | 447
rs6151427 | missense | G | Ser [S] | A | Asn [N] | 440
rs6151425 | synonymous | T | Asp [D] | C | Asp [D] | 381
rs6151422 | missense | G | Val [V] | T | Phe [F] | 356
rs113990230 | synonymous | C | His [H] | T | His [H] | 206
rs62001867 | missense | A | Thr [T] | G | Ala [A] | 205
rs34457249 | synonymous | T | Pro [P] | C | Pro [P] | 195
rs6151415 | missense | T | Cys [C] | G | Trp [W] | 193
rs113209108 | synonymous | T | Ser [S] | C | Ser [S] | 186
rs6151412 | synonymous | T | His [H] | C | His [H] | 151
rs60504011 | missense | G | Ala [A] | C | Pro [P] | 136
rs6151411 | missense | G | Leu [L] | C | Pro [P] | 82
rs6151410 | missense | T | Gly [G] | C | Gly [G] | 79
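Whether the dbSNP positions refer to the same sequence version as the ARSA sequence shown above can be checked by comparing each reference amino acid with the residue at that position. The sketch below assumes 1-based positions and uses the sequence and the reference residues exactly as listed on this page. The function name `reference_mismatches` is illustrative.

```python
# ARSA sequence (P15289) as shown above, 507 residues.
ARSA = (
    "MGAPRSLLLALAAGLAVARPPNIVLIFADDLGYGDLGCYGHPSSTTPNLDQLAAGGLRFT"
    "DFYVPVSLCTPSRAALLTGRLPVRMGMYPGVLVPSSRGGLPLEEVTVAEVLAARGYLTGM"
    "AGKWHLGVGPEGAFLPPHQGFHRFLGIPYSHDQGPCQNLTCFPPATPCDGGCDQGLVPIP"
    "LLANLSVEAQPPWLPGLEARYMAFAHDLMADAQRQDRPFFLYYASHHTHYPQFSGQSFAE"
    "RSGRGPFGDSLMELDAAVGTLMTAIGDLGLLEETLVIFTADNGPETMRMSRGGCSGLLRC"
    "GKGTTYEGGVREPALAFWPGHIAPGVTHELASSLDLLPTLAALAGAPLPNVTLDGFDLSP"
    "LLLGTGKSPRQSLFFYPSYPDEVRGVFAVRTGKYKAHFFTQGSAHSDTTADPACHASSSL"
    "TAHEPPLLYDLSKDPGENYNLLGGVAGATPEVLQALKQLQLLKAQLDAAVTFGPSQVARG"
    "EDPALQICCHPGCTPRPACCHCPDPHA"
)

# (SNP ID, reference amino acid, position) taken from the dbSNP table above.
DBSNP_VARIANTS = [
    ("rs6151428", "R", 496), ("rs117341984", "G", 447), ("rs6151427", "N", 440),
    ("rs6151425", "D", 381), ("rs6151422", "F", 356), ("rs113990230", "H", 206),
    ("rs62001867", "A", 205), ("rs34457249", "P", 195), ("rs6151415", "W", 193),
    ("rs113209108", "S", 186), ("rs6151412", "H", 151), ("rs60504011", "P", 136),
    ("rs6151411", "P", 82), ("rs6151410", "G", 79),
]


def reference_mismatches(sequence, variants):
    """Return (id, position, expected, found) for every variant whose
    reference residue differs from the sequence at that 1-based position."""
    mismatches = []
    for snp_id, ref_aa, position in variants:
        found = sequence[position - 1]
        if found != ref_aa:
            mismatches.append((snp_id, position, ref_aa, found))
    return mismatches


print(reference_mismatches(ARSA, DBSNP_VARIANTS))  # an empty list means all positions agree
```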

Combining dbSNP and HGMD

We combined all 104 mutations (synonymous, missense and nonsense) from both databases. We did not need to align the sequences, because both databases used the same sequence version and their positions corresponded exactly to our sequence of ARSA. The overlap between the two databases is very low. Only 3 positions show up in both results:

  • Position 193: The mutations are different. In HGMD, the mutation results in a premature stop codon, so that the major part of the protein is truncated. In dbSNP, there is an amino acid substitution (W -> C).
  • Position 136: The mutations are different amino acid substitutions. P -> L and P -> S are annotated in HGMD, and P -> A is annotated in dbSNP.
  • Position 82: The mutation is identical in both databases and leads to the substitution P -> L.
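The overlap can be reproduced with a simple set intersection over the positions from the two tables above. This is a minimal sketch; the variable and function names are illustrative, and the position lists are copied from the HGMD and dbSNP tables on this page.

```python
# Positions of all entries in the HGMD table above (repeated positions kept).
HGMD_POSITIONS = [
    29, 32, 52, 68, 82, 84, 84, 86, 94, 95, 96, 99, 99, 119, 122, 135,
    136, 136, 143, 148, 152, 153, 154, 155, 155, 156, 167, 169, 172, 179,
    181, 190, 191, 193, 201, 212, 219, 224, 227, 231, 244, 244, 245, 247,
    250, 253, 255, 274, 279, 281, 286, 288, 288, 293, 293, 294, 295, 298,
    300, 306, 307, 308, 308, 309, 311, 312, 314, 318, 325, 327, 329, 335,
    350, 367, 370, 370, 377, 382, 384, 390, 390, 391, 397, 406, 408, 409,
    425, 426, 428, 429, 464, 469, 486, 489,
]

# Positions of all entries in the dbSNP table above.
DBSNP_POSITIONS = [496, 447, 440, 381, 356, 206, 205, 195, 193, 186, 151, 136, 82, 79]


def shared_positions(first, second):
    """Return the sorted sequence positions that occur in both lists."""
    return sorted(set(first) & set(second))


print(shared_positions(HGMD_POSITIONS, DBSNP_POSITIONS))  # [82, 136, 193]
```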

We again visualized the distribution of the mutations along the sequence. Synonymous substitutions are depicted in green, missense and nonsense mutations in red:

>sp|P15289|ARSA_HUMAN
MGAPRSLLLALAAGLAVARPPNIVLIFADDLGYGDLGCYGHPSSTTPNLDQLAAGGLRFT
DFYVPVSLCTPSRAALLTGRLPVRMGMYPGVLVPSSRGGLPLEEVTVAEVLAARGYLTGM
AGKWHLGVGPEGAFLPPHQGFHRFLGIPYSHDQGPCQNLTCFPPATPCDGGCDQGLVPIP
LLANLSVEAQPPWLPGLEARYMAFAHDLMADAQRQDRPFFLYYASHHTHYPQFSGQSFAE
RSGRGPFGDSLMELDAAVGTLMTAIGDLGLLEETLVIFTADNGPETMRMSRGGCSGLLRC
GKGTTYEGGVREPALAFWPGHIAPGVTHELASSLDLLPTLAALAGAPLPNVTLDGFDLSP
LLLGTGKSPRQSLFFYPSYPDEVRGVFAVRTGKYKAHFFTQGSAHSDTTADPACHASSSL
TAHEPPLLYDLSKDPGENYNLLGGVAGATPEVLQALKQLQLLKAQLDAAVTFGPSQVARG
EDPALQICCHPGCTPRPACCHCPDPHA

Furthermore, we visualized the mutations on the 3-dimensional structure of the protein:

Structure of ARSA. Synonymous mutations are shown in green, missense/nonsense mutations in red. The active site is depicted in yellow.

Summary Statistics

In this section we briefly analyse the mutation frequencies of the amino acids. First, we counted how often each of the 20 amino acids is mutated according to the mutation map generated above. The figure below shows the results.

The figure shows, for each amino acid, the number of mutations in the reference sequence.

Gly, Pro and Arg show the highest mutation frequency. Each of these amino acids has very distinct physico-chemical properties. We expected them to be overrepresented in our map, because we are mostly looking at disease-causing mutations.

  • Glycine is the smallest amino acid. Replacing it by any larger amino acid might cause structural changes to the protein.
  • Proline is unique due to its ring structure, which enables it to disrupt secondary structure elements. Gaining or losing a proline can thus cause structural changes and might therefore affect the function of the protein.
  • Arginine is the most hydrophilic amino acid, with a hydropathy index of -4.5. <ref>Kyte J, Doolittle RF (1982). "A simple method for displaying the hydropathic character of a protein". Journal of Molecular Biology </ref> Here, an amino acid substitution changes the behaviour in an aqueous environment.
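The hydropathy argument can be made concrete with the Kyte-Doolittle scale cited above, on which larger values mean more hydrophobic and arginine has the lowest value (-4.5). The sketch below lists the scale and computes the change in hydropathy for a substitution. The function name `hydropathy_change` is illustrative, and using the plain difference of the two indices as a measure of the change is an assumption made here, not a method from the original analysis.

```python
# Kyte-Doolittle hydropathy indices (one-letter amino acid codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_change(ref_aa, mut_aa):
    """Hydropathy of the mutated residue minus that of the reference residue.

    Negative values mean the substitution makes the position more hydrophilic.
    """
    return KYTE_DOOLITTLE[mut_aa] - KYTE_DOOLITTLE[ref_aa]


# Residues with the lowest and highest hydropathy index on this scale.
print(min(KYTE_DOOLITTLE, key=KYTE_DOOLITTLE.get))  # R (arginine)
print(max(KYTE_DOOLITTLE, key=KYTE_DOOLITTLE.get))  # I (isoleucine)

# Example: arginine replaced by tryptophan, as listed in the HGMD table.
print(round(hydropathy_change("R", "W"), 1))
```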

Next, we looked at the frequencies of all substitutions. To this end, we counted, for each amino acid pair, the number of observed mutations in our combined mutation map. The following figure visualizes these counts:

The figure shows, for each amino acid pair, the number of observed mutations in the combined map generated above (HGMD, dbSNP).
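Both kinds of counts, per reference amino acid and per amino acid pair, can be derived from the "Amino acid change" column of the tables. The sketch below is only an illustration of the counting step: it uses the first 13 entries of the HGMD table as an excerpt, and the function name `count_substitutions` is hypothetical. Running it on all entries of the combined map would give the numbers underlying the two figures.

```python
from collections import Counter

# Excerpt: amino acid changes of the first 13 HGMD entries (positions 29 to 99).
CHANGES = [
    "Asp-Asn", "Gly-Ser", "Leu-Pro", "Leu-Pro", "Pro-Leu", "Arg-Gln",
    "Arg-Trp", "Gly-Asp", "Pro-Ala", "Ser-Asn", "Ser-Phe", "Gly-Asp",
    "Gly-Val",
]


def count_substitutions(changes):
    """Count mutated reference residues and ordered (reference, mutant) pairs."""
    pairs = [tuple(change.split("-")) for change in changes]
    reference_counts = Counter(ref for ref, _ in pairs)
    pair_counts = Counter(pairs)
    return reference_counts, pair_counts


reference_counts, pair_counts = count_substitutions(CHANGES)
print(reference_counts.most_common(3))
print(pair_counts.most_common(3))
```

Keeping the pairs ordered distinguishes, for example, Leu -> Pro from Pro -> Leu.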

Now let us consider the two most frequent mutations in our map. As we are mostly looking at disease-causing mutations, we expect mutations between amino acids with very different physico-chemical behaviour to be most abundant.

  • The most frequently observed substitutions in the map exchange leucine and proline: Leu -> Pro and Pro -> Leu are both very frequent. Leucine is a hydrophobic amino acid, whereas proline is rather hydrophilic. Furthermore, introducing proline into a structure, or removing it, might cause a severe structural change to the whole protein, because proline's unique ring structure can disrupt helical structures. Thus, introducing a proline might disrupt secondary structure, whereas removing one might allow new structural elements to form.
  • Another frequent mutation is Gly -> Asp, which the HGMD table above lists at five positions (86, 99, 154, 293 and 308). Here the very small glycine is replaced by the bulkier aspartic acid. Furthermore, aspartic acid is hydrophilic due to its charged side chain, whereas glycine is classed as aliphatic.

References

<references/>