Difference between revisions of "Mapping SNPs HEXA"
[[Image:barplot_aa.png|center|1200px|Figure 1: Barplot for the frequency of a certain amino acid mutation for the different tables]]
Revision as of 19:31, 17 June 2011
Methods
First, we had to parse the HGMD database and the dbSNP database.
HGMD
We logged in, searched for Tay-Sachs disease and chose one entry for HEXA. In our case there were two entries with identical content. We only looked at the missense/nonsense mutations, of which 68 are annotated in HGMD. We copied the web page into a text file and then wrote a short parser that extracts the codon change, the amino acid change and the codon number.
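A minimal sketch of such a parser is shown below. The exact layout of the copied HGMD page is not documented here, so the line format (`TGGc-TGA Trp-Term 26`, separated by whitespace) and the names `LINE_RE` and `parse_line` are assumptions made for illustration. The lower-case letters are treated as flanking nucleotides, as in the codon column of the tables below.

```python
import re

# Assumed line layout (hypothetical): "<codon change> <amino acid change> <codon number>",
# e.g. "TGGc-TGA Trp-Term 26". Lower-case letters mark flanking nucleotides.
LINE_RE = re.compile(
    r"(?P<before>[ACGTacgt]+)-(?P<after>[ACGT]{3})\s+"
    r"(?P<aa_from>[A-Za-z]{3})-(?P<aa_to>[A-Za-z]{3,4})\s+"
    r"(?P<codon>\d+)"
)

def parse_line(line):
    """Return a dict describing one mutation, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    # Keep only the upper-case letters: these form the wild-type codon.
    before = "".join(ch for ch in m["before"] if ch.isupper())
    after = m["after"]
    if len(before) != 3:
        return None
    # 1-based positions at which the wild-type and mutant codon differ.
    diffs = [i + 1 for i, (a, b) in enumerate(zip(before, after)) if a != b]
    return {
        "codon_number": int(m["codon"]),
        "codon_from": before,
        "codon_to": after,
        "aa_from": m["aa_from"],
        "aa_to": m["aa_to"],
        "position": diffs[0] if len(diffs) == 1 else None,
    }

print(parse_line("TGGc-TGA Trp-Term 26"))
```

Lines that do not match the assumed layout are skipped by returning `None`, so header and footer text from the copied page does not need to be removed by hand.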
DBSNP
Parsing the dbSNP output was more complicated than parsing the HGMD output. First, we searched for HEXA in this database and kept only the SNPs that occur in human. We used the graphical output and again copied and pasted the page into a text file. An entry in dbSNP has the following structure: first comes the name of the SNP, then the sequence and the graphical representation of the sequence. The next line contains the allele origin, the clinical relevance and finally the annotations of the mutation. TBA...
Comparison of mutations in HGMD and SNP-DB
Mutations annotated in both databases
mutations which are not silent and cause a phenotype
SNP-DB Identifier | Codon position | Mutation position | Amino Acids | Codons |
rs121907964 | 26 | 3 | Trp -> TER | TGGc -> TGA |
rs121907979 | 39 | 2 | Leu -> Arg | CTT -> CGT |
rs121907975 | 127 | 1 | Leu -> Phe | aCTC -> TTC |
rs121907975 | 127 | 2 | Leu -> Arg | CTC -> CGC |
rs121907962 | 137 | 1 | Arg -> TER | cCGA -> TGA |
rs121907972 | 170 | 1 | Arg -> Trp | cCGG -> TGG |
rs121907957 | 170 | 2 | Arg -> Gln | CGG -> CAG |
rs28941770 | 178 | 2 | Arg -> His | CGC -> CAC |
rs28941770 | 178 | 2 | Arg -> Leu | CGC -> CTC |
rs121907953 | 178 | 1 | Arg -> Cys | tCGC -> TGC |
rs121907969 | 180 | 3 | Tyr -> TER | TACc -> TAG |
rs28941771 | 180 | 1 | Tyr -> His | tTAC -> CAC |
rs121907973 | 197 | 2 | Lys -> Thr | AAA -> ACA |
rs1800429 | 200 | 1 | Val -> Met | cGTG -> ATG |
rs121907976 | 204 | 2 | His -> Arg | CAT -> CGT |
rs121907961 | 210 | 2 | Ser -> Phe | TCC -> TTC |
rs121907974 | 211 | 2 | Phe -> Ser | TTC -> TCC |
rs121907970 | 247 | 1 | Arg -> Trp | aCGG -> TGG |
rs121907959 | 250 | 1 | Gly -> Ser | gGGT -> AGT |
rs121907959 | 250 | 2 | Gly -> Asp | GGT -> GAT |
rs121907959 | 250 | 2 | Gly -> Val | GGT -> GTT |
rs121907971 | 258 | 1 | Asp -> His | tGAC -> CAC |
rs121907954 | 269 | 1 | Gly -> Ser | aGGT -> AGT |
rs121907954 | 269 | 2 | Gly -> Asp | GGT -> GAT |
rs121907977 | 301 | 2 | Met -> Arg | ATG -> AGG |
rs121907967 | 329 | 2 | Trp -> TER | TGG -> TAG |
rs121907963 | 393 | 1 | Arg -> TER | gCGA -> TGA |
rs121907958 | 420 | 3 | Trp -> Cys | TGGt -> TGC |
rs121907958 | 420 | 3 | Trp -> Cys | TGGt -> TGT |
rs28940871 | 451 | 1 | Leu -> Val | tCTG -> GTG |
rs121907978 | 454 | 2 | Gly -> Asp | GGT -> GAT |
rs121907978 | 454 | 1 | Gly -> Ser | tGGT -> AGT |
rs121907981 | 474 | 3 | Trp -> Cys | TGGc -> TGC |
rs121907952 | 482 | 1 | Glu -> Lys | cGAA -> AAA |
rs121907968 | 485 | 1 | Trp -> Arg | gTGG -> CGG |
rs121907966 | 499 | 1 | Arg -> Cys | aCGT -> TGT |
rs121907956 | 499 | 2 | Arg -> His | CGT -> CAT |
rs28942071 | 504 | 1 | Arg -> Cys | cCGC -> TGC |
rs121907955 | 504 | 2 | Arg -> His | CGC -> CAC |
rs4777502 | 506 | 3 | Glu -> Asp | GAA -> GAC |
rs4777502 | 506 | 3 | Glu -> Asp | GAA -> GAT |
Graphical representation:
MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLD
MTSSRLWFSLLLAAAFAGRATALWP!PQNFQTSDQRYVRYPNNFQFQYDVSSAAQPGCSVLD
EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD
EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD
QCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLS
QCLFLSETVWGAL!GLETFSQLVWKSAEGTFFINKTEIEDFPRFPHWGLLLDTSHH!LPLS
SILDTLDVMAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY
SILDTLDVMAYNTLNMFHWRLVDDPFSPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY
ARLRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEFMSTFFL
AWLRSIRVLAEFHTPGHTLSWGPSIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEFRSTFFL
EVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG
EVSSVFPDFYLHLGGDEVDFTC!KSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG
KGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNYMKELELVTKAGFRALLSAPWYLNRISYG
KGYVVWQEVFDNKVKIQPDTIIQVW!EDIPVNYMKELELVTKAGFRALLSAPCYLNRISYG
PDWKDFYIVEPLAFEGTPEQKALVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKL
PDWKDFYIVEPLAFEGTPEQKAVVIDGEACMWGEYVDNTNLVPRLCPRAGAVAKRLRSNKL
TSDLTFAYERLSHFRCELLRRGVQAQPLNVGFCEQEFEQT
TSDLTFAYECLSHFCCDLLRRGVQAQPLNVGFCEQEFEQT
Legend: Non-silent mutation, Silent mutation, Wrong AA in mutation annotation
Mutations annotated only in HGMD
not found
Mutations annotated only in SNP-DB
Mutations annotated only in SNP-DB and not silent (the position in the codon is annotated and certain):
Here we list all mutations that are annotated in SNP-DB, are not silent and are not annotated in the HGMD database. Some of these mutations have a very detailed NP annotation; others do not, so we had to map them ourselves. The detailed list of the mutations that we mapped can be found [here].
SNP-DB Identifier | Codon position | Mutation position | Amino Acids | Codons |
rs4777505 | 29 | 2 | Asn -> Ser | AAC -> AGC |
rs61731240 | 179 | 1 | His -> Asp | CAT -> GAT |
rs3743230 | 208 | 1 | Asn -> Asp | AAC -> GAC |
rs61747114 | 248 | 1 | Leu -> Phe | CTT -> TTT |
rs1054374 | 293 | 2 | Ser -> Ile | AGT -> ATT |
rs1800430 | 399 | 1 | Asn -> Asp | AAC -> GAC |
rs1800431 | 436 | 1 | Ile -> Val | ATA -> GTA |
rs121907982 | 456 | 2 | Tyr -> Ser | TAT -> TCT |
Graphical representation:
MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLD
MTSSRLWFSLLLAAAFAGRATALWPWPQSFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLD
EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD
EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD
QCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLS
QCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRDYLPLS
SILDTLDVMAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY
SILDTLDVMAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY
ARLRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEFMSTFFL
ARFRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPILNNTYEFMSTFFL
EVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG
EVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG
KGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNYMKELELVTKAGFRALLSAPWYLNRISYG
KGYVVWQEVFDNKVKIQPDTIIQVWREDIPVDYMKELELVTKAGFRALLSAPWYLNRISYG
PDWKDFYIVEPLAFEGTPEQKALVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKL
PDWKDFYVVEPLAFEGTPEQKALVIGGSACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKL
TSDLTFAYERLSHFRCELLRRGVQAQPLNVGFCEQEFEQT
TSDLTFAYERLSHFRCELLRRGVQAQPLNVGFCEQEFEQT
Legend: Non-silent mutation, Silent mutation, Wrong AA in mutation annotation
silent mutation
Because these mutations are not annotated in the HGMD database, we expect them to be silent and not to lead to a phenotype.
These mutations are badly annotated in SNP-DB. We therefore rotated the codons we found, because the original codon has to code for the amino acid that occurs in the protein sequence.
To do this, we built the codon with the mutation at position one, at position two and at position three. We then also reversed the sequence and built the complementary sequence of both.
The detailed result can be seen [here]
If we found more than one nucleotide combination that codes for the amino acid in the protein sequence, we list all of them in the following table, provided they are silent mutations. If no other mutation is possible for a position, we also list the mutations that are not silent.
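The search described above can be sketched as follows. This is an illustration under stated assumptions, not the script that produced the table: the function name `candidate_codons`, the input of two flanking nucleotides on each side of the SNP, and the one-letter amino acid codes are all hypothetical choices. The four orientations follow the text (forward, reverse, and the complement of each).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AA)}

COMP = str.maketrans("ACGT", "TGCA")
ORIENTATIONS = {
    "forward": lambda s: s,
    "reverse": lambda s: s[::-1],
    "complement": lambda s: s.translate(COMP),
    "reverse complement": lambda s: s.translate(COMP)[::-1],
}

def candidate_codons(left, ref, alt, right, expected_aa):
    """List (orientation, position, codon, mutant codon, mutant amino acid)
    for every reading in which the reference codon encodes expected_aa."""
    # Five-nucleotide window with the SNP in the middle (index 2), so the SNP
    # stays at index 2 in all four orientations.
    window_ref = left[-2:] + ref + right[:2]
    window_alt = left[-2:] + alt + right[:2]
    results = []
    for name, transform in ORIENTATIONS.items():
        r, a = transform(window_ref), transform(window_alt)
        for pos in (1, 2, 3):
            start = 2 - (pos - 1)  # codon start so that the SNP sits at `pos`
            codon_ref, codon_alt = r[start:start + 3], a[start:start + 3]
            if CODE[codon_ref] == expected_aa:
                results.append((name, pos, codon_ref, codon_alt, CODE[codon_alt]))
    return results

# Hypothetical flanks around a C/T SNP; the protein has Ser (S) at this codon.
for hit in candidate_codons("AG", "C", "T", "AA", "S"):
    print(hit)
```

A reading is silent when the last element of a result equals `expected_aa`. As in the text, more than one reading can match, in which case all of them are reported.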
The following table shows which combinations are possible:
SNP-DB Identifier | Codon position | Mutation position | Amino Acids | Codons | Translation |
rs1800428 | 3 | 3 | Ser -> Ser | AGC -> AGT | Forward |
rs11551324 | 109 | 3 | Thr -> Thr | ACC -> ACT | Forward |
rs28942072 | 324 | 3 | Val -> Val | GTT -> GGA | Forward |
rs28942072 | 324 | 3 | Val -> Val | GTC -> GTT | Forward |
rs34085965 | 446 | 1 | Pro -> Pro | CCT -> CCC | Reverse complement |
rs4777502 | 506 | 3 | Glu -> Glu | GAG -> GAA | Forward |
Graphical representation:
MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLD
MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLD
EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD
EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD
QCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLS
QCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLS
SILDTLDVMAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY
SILDTLDVMAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY
ARLRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEFMSTFFL
ARLRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEFMSTFFL
EVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG
EVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG
KGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNYMKELELVTKAGFRALLSAPWYLNRISYG
KGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNYMKELELVTKAGFRALLSAPWYLNRISYG
PDWKDFYIVEPLAFEGTPEQKALVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKL
PDWKDFYIVEPLAFEGTPEQKALVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKL
TSDLTFAYERLSHFRCELLRRGVQAQPLNVGFCEQEFEQT
TSDLTFAYERLSHFRCELLRRGVQAQPLNVGFCEQEFEQT
Legend: Non-silent mutation, Silent mutation, Wrong AA in mutation annotation
Summary
Graphical representation:
MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLD
MTSSRLWFSLLLAAAFAGRATALWP!PQSFQTSDQRYVRYPNNFQFQYDVSSAAQPGCSVLD
EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD
EAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVVTPGCNQLPTLESVENYTLTINDD
QCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLS
QCLFLSETVWGAL!GLETFSQLVWKSAEGTFFINKTEIEDFPRFPHWGLLLDTSHD!LPLS
SILDTLDVMAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY
SILDTLDVMAYNTLNMFHWRLVDDPFSPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEY
ARLRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEFMSTFFL
AWFRSIRVLAEFHTPGHTLSWGPSIPGLLTPCYSGSEPSGTFGPVNPILNNTYEFRSTFFL
EVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG
EVSSVFPDFYLHLGGDEVDFTC!KSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYG
KGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNYMKELELVTKAGFRALLSAPWYLNRISYG
KGYVVWQEVFDNKVKIQPDTIIQVW!EDIPVDYMKELELVTKAGFRALLSAPCYLNRISYG
PDWKDFYIVEPLAFEGTPEQKALVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKL
PDWKDFYVVEPLAFEGTPEQKAVVIDGSACMWGEYVDNTNLVPRLCPRAGAVAKRLRSNKL
TSDLTFAYERLSHFRCELLRRGVQAQPLNVGFCEQEFEQT
TSDLTFAYECLSHFCCELLRRGVQAQPLNVGFCEQEFEQT
Legend: Non-silent mutation, Silent mutation, Wrong AA in mutation annotation
Statistical comparison of HGMD and DBSNP
To analyse the results from the two databases, we carried out a statistical comparison.
First, we compared the tables above according to the position of the mutation within the triplet. For this we created a barplot that shows, for each table, the percentage of mutations at each position.
The first case is the overlap of both databases. Here there are as many mutations at the first position as at the second position, while mutations at the third position are much rarer. The reason is that HGMD contains no silent mutations, so the overlap of both databases cannot contain them either. A change at the third position of a triplet is often silent, so only a few third-position mutations result in an amino acid change. This explains why the third position is so rare while the other two positions are equally frequent.
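The claim that third-position changes are most often silent can be checked against the standard genetic code. The sketch below enumerates every single-nucleotide substitution in the 61 sense codons and reports, per codon position, the fraction that leaves the amino acid unchanged. The function name `synonymous_fraction_by_position` is an illustrative choice, and the calculation treats all substitutions as equally likely, which is a simplifying assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AA)}

def synonymous_fraction_by_position():
    """Fraction of single-nucleotide substitutions that are silent,
    for codon positions 1, 2 and 3 (sense codons only)."""
    fractions = {}
    for pos in range(3):
        silent = total = 0
        for codon, aa in CODE.items():
            if aa == "*":
                continue  # start from sense codons only
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                silent += CODE[mutant] == aa
        fractions[pos + 1] = silent / total
    return fractions

for pos, frac in synonymous_fraction_by_position().items():
    print(f"position {pos}: {frac:.1%} of substitutions are silent")
```

In the standard code no substitution at the second position of a sense codon is silent, and only a few at the first position are (some leucine and arginine codons), which is in line with the distribution described above.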
The second case shows the position frequencies of the mutations found only in dbSNP. Here mutations at position one are more frequent than at position two, and there are no mutations at position three. These dbSNP-only mutations are not silent, which can explain the absence of position three: mutations at the third position of a triplet are very often silent. There is probably no particular reason why the first position is more common than the second.
The third case shows the silent mutations from dbSNP. Here the third position is the most frequent, while the other two are equally common. This is the opposite of the other cases and has a similar explanation: mutations at the third position of a triplet often result in a silent mutation, whereas silent mutations at the other positions are very rare.
The last case represents the total distribution of the mutation positions. Here the first position is the most common, followed closely by the second position. The third position is the least frequent. This is the expected result given the other three cases: the first and the second position are almost always the most frequent and the third the rarest. The third case is the exception, which is why the difference in the total distribution is not larger.
In summary, the barplots for the different tables correspond to our expectations and can all be explained logically.
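As a small worked example, the percentages for the second case can be recomputed from the table of non-silent mutations annotated only in SNP-DB. The list below is copied from the mutation position column of that table; the function name `position_percentages` is only illustrative.

```python
from collections import Counter

# Mutation positions from the table of non-silent mutations annotated
# only in SNP-DB, one entry per row.
positions = [2, 1, 1, 1, 2, 1, 1, 2]

def position_percentages(positions):
    """Percentage of mutations at codon positions 1, 2 and 3."""
    counts = Counter(positions)
    total = len(positions)
    return {pos: 100.0 * counts[pos] / total for pos in (1, 2, 3)}

percentages = position_percentages(positions)
for pos, pct in percentages.items():
    print(f"position {pos}: {pct:.1f}%")
```

With only eight mutations in this table, each mutation shifts a bar by 12.5 percentage points, so small differences between positions should not be over-interpreted.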
As a next step we looked at which amino acid mutates most often in each table. For this we created a barplot that displays, for each table, how frequently each amino acid is mutated. The different colors correspond to the different tables and the total distribution.
Looking at the overlap of both databases, we can see that almost every amino acid is mutated. The only amino acids that are not mutated are alanine, asparagine, cysteine, glutamine and threonine. Three of these (Ala, Cys, Gln) do not occur in the other tables either. One possible reason is that cysteine and glutamine are each encoded by only two triplets; alanine, however, is encoded by four, so its absence is more likely due to chance. The same reason probably applies to asparagine, which is also encoded by only two triplets. Threonine is probably missing by chance, because it is encoded by a larger number of triplets (four). The most frequently mutated amino acids in the overlap are arginine and glycine. A possible reason is that arginine is encoded by many triplets (six); glycine is encoded by four.
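The number of triplets per amino acid in the standard genetic code can be looked up with a few lines of code. The sketch below counts them for the amino acids discussed in this paragraph; the variable names are illustrative.

```python
from collections import Counter

# Amino acids of the standard genetic code in TCAG codon order ("*" = stop).
AA_BY_CODON = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codons_per_aa = Counter(AA_BY_CODON)

# Three-letter name and one-letter code of the amino acids discussed above.
discussed = [("Arg", "R"), ("Gly", "G"), ("Ala", "A"), ("Thr", "T"),
             ("Asn", "N"), ("Cys", "C"), ("Gln", "Q")]
for name, letter in discussed:
    print(f"{name}: {codons_per_aa[letter]} triplets")
```

The number of triplets is only one factor. How often an amino acid occurs in the HEXA sequence also influences how often it can be hit by a mutation.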
Looking at the dbSNP mutations that are not in HGMD, we can see that many amino acids are not mutated. A reason for this is that the number of non-silent mutations found only in dbSNP is very low and therefore not really significant. The most frequently mutated amino acid here is asparagine. Because this amino acid is encoded by few triplets, a possible explanation is that it occurs here so often, relative to the other tables, by chance.
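The count behind this observation can be reproduced from the amino acid column of the table of non-silent mutations annotated only in SNP-DB. The snippet below is a small sketch with illustrative variable names.

```python
from collections import Counter

# Amino acid changes copied from the table of non-silent mutations
# annotated only in SNP-DB.
changes = ["Asn -> Ser", "His -> Asp", "Asn -> Asp", "Leu -> Phe",
           "Ser -> Ile", "Asn -> Asp", "Ile -> Val", "Tyr -> Ser"]

# Count the original (wild-type) amino acid of each change.
wild_type_counts = Counter(change.split(" -> ")[0] for change in changes)
print(wild_type_counts.most_common())
```

Asparagine accounts for three of the eight entries, while every other amino acid in this table appears once.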
For the silent mutations we also extracted the amino acid at which the nucleotide exchange takes place...