Researching SNPs TSD
Oh it was gorgeousness and gorgeosity made flesh. The trombones crunched redgold under my bed, and behind my gulliver the trumpets three-wise silverflamed, and there by the door the timps rolling through my guts and out again crunched like candy thunder. Oh, it was wonder of wonders. And then, a bird of like rarest spun heavenmetal, or like silvery wine flowing in a spaceship, gravity all nonsense now, came the violin solo above all the other strings, and those strings were like a cage of silk round my bed. Then flute and oboe bored, like worms of like platinum, into the thick thick toffee gold and silver. I was in such bliss, my brothers.
-A Clockwork Orange
The journal for this task can be found here.
- 1 Sequence mapping
- 2 Mutations
- 3 HGMD
- 4 dbSNP
- 5 SNPdbe
- 6 OMIM
- 7 SNPedia
- 8 Mutation map
- 9 Conclusion
- 10 References
The databases index their SNP data against different reference sequences. In the following, the reference protein sequence remains P06865; however, all databases base their annotations on nucleotide sequences as well. While the final annotations are displayed only as mapped onto the protein sequence, NM_000520.4 is used as the nucleotide reference sequence in the background. This entry describes an mRNA of HEXA and is also referenced by the UniProt entry P06865.
HGMD lists NM_000520.3 as reference, which is a previous version of NM_000520.4, the entry chosen as reference for this task. A Needleman-Wunsch pairwise sequence alignment between the two nucleotide sequences shows that there are two single nucleotide differences in the last third of the sequence and that the more current version of the entry is 117 nucleotides longer at the beginning of the sequence. Since this region is annotated as belonging to an exon, the question remains whether this has an effect on the protein sequence. A short comparison shows that there is a single differing residue at position 436, where a Val in NM_000520.3 is substituted by an Ile in NM_000520.4. However, since HGMD does not list a SNP at this position, this is not an issue.
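The kind of global alignment used for this comparison can be illustrated with a minimal Needleman-Wunsch sketch. The scoring values (match 1, mismatch -1, linear gap penalty -2) and the short toy sequences are assumptions chosen for illustration only; they are not the parameters or the NM_000520 sequences used for the actual comparison.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def differences(aligned_a, aligned_b):
    """Return (column, char_a, char_b) for every non-identical column (1-based)."""
    pairs = enumerate(zip(aligned_a, aligned_b), start=1)
    return [(col, x, y) for col, (x, y) in pairs if x != y]


# Toy example: the second sequence is longer at the start and has one substitution.
old, new, total = needleman_wunsch("ACGTTGCA", "GGACGTAGCA")
print(old)
print(new)
print(differences(old, new))
```

Listing the non-identical columns of the alignment is how both the extra leading nucleotides and the single substitutions would show up in such a comparison.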
dbSNP lists various identifiers, most of them at the nucleotide level. At the protein level, however, a SNP refers to NP_000511, which has the same sequence and position numbering as P06865. As the protein mutation is only a minor annotation within dbSNP, it was additionally verified that every wildtype residue matches the amino acid at the corresponding position of the reference sequence. Since SNPedia only links to dbSNP and does not contain direct mutation information, the same applies there.
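The wildtype check described above can be sketched as follows. The reference string contains only the first 100 residues of P06865 as shown in the mutation map on this page; the function name and the mutation notation (`L39R` style) are assumptions made for this illustration, not the script actually used.

```python
import re

# First 100 residues of P06865, copied from the mutation map on this page.
P06865_1_100 = (
    "MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDV"
    "SSAAQPGCSVLDEAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVV"
)

# Single amino acid substitution, e.g. 'L39R': wildtype, 1-based position, mutant.
MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def wildtype_matches(mutation, reference):
    """Return True if the wildtype residue of the mutation equals the reference residue."""
    parsed = MUTATION.match(mutation)
    if parsed is None:
        raise ValueError(f"not a single amino acid substitution: {mutation!r}")
    wildtype, position = parsed.group(1), int(parsed.group(2))
    if not 1 <= position <= len(reference):
        raise ValueError(f"position {position} lies outside the reference sequence")
    return reference[position - 1] == wildtype


# Substitutions within the first 100 residues, taken from the overlap table.
for snp in ["M1V", "S4P", "L39R", "Q45P", "P81A"]:
    print(snp, wildtype_matches(snp, P06865_1_100))
```

A mismatch would indicate that the annotation refers to a different sequence version or numbering and cannot be mapped directly.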
SNPdbe lists two sequence sources, the dbSNP identifier NP_000511 and UniProt entry P06865, thus all SNPs can be adopted and directly mapped.
Since no database explicitly marks a SNP as not disease causing, an entry was considered non-disease causing when it was not annotated as disease causing. Since dbSNP, and therefore SNPedia as well, contain no information on this, all other databases were queried for each SNP of interest; if it was present there and annotated as disease causing, this annotation was transferred to the dbSNP/SNPedia entry. Clearly this is not a perfect solution, but it is the only one possible, since HGMD and OMIM contain only disease causing entries. A justification might be that it is rather unlikely that samples from a Tay-Sachs Disease patient are analysed without knowledge of the phenotype. On the other hand, SNPs that do not cause the disease are simply of less interest and might be harder to find, since the number of non-affected persons is of course higher.
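The annotation rule described above amounts to a simple lookup, sketched here. The function name and the tiny example sets are assumptions for illustration; the sets are only excerpts and not the complete database contents.

```python
def classify(snp, disease_sets):
    """Label a SNP 'disease' if any curated source lists it as disease causing.

    disease_sets maps a database name to the set of SNPs it annotates as
    disease causing. A SNP found in none of them is labelled 'non-disease',
    which mirrors the convention used here and is not a positive statement.
    """
    sources = sorted(name for name, snps in disease_sets.items() if snp in snps)
    if sources:
        return "disease", sources
    return "non-disease", []


# Excerpt only: a few entries per curated database (illustrative, not complete).
curated = {
    "HGMD": {"R178H", "Y497C", "D322G"},
    "OMIM": {"R178H", "C58Y"},
}

for snp in ["R178H", "Y497C", "I436V"]:
    print(snp, classify(snp, curated))
```

Returning the supporting sources alongside the label keeps it visible where a transferred annotation came from.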
The Human Gene Mutation Database (HGMD) <ref name="hgmd">Stenson,P.D. et al. (2009) The Human Gene Mutation Database: 2008 update. Genome medicine, 1, 13.</ref> is freely available to non-profit organisations and academic users. This free version is updated with a delay of three years after inclusion in the database <ref name="hgmd_inclusiondelay">http://www.hgmd.cf.ac.uk/docs/disclaimer.html</ref>. Indeed, the most recent mutation linked to HEXA was published in 2008. However, across the whole database there seems to be a small number of entries with publication dates from 2010 to 2012 that are also available in the free version <ref name="hgmd_statistics">http://www.hgmd.cf.ac.uk/ac/hahaha.php</ref>; the exact mechanism is therefore not entirely clear.
HGMD is updated semi-automatically, among other means by screening the PubMed database. In contrast to other databases like dbSNP, the same mutation is recorded only once and attributed to the publication that first mentioned it <ref name="hgmd"/>. The entries are not limited to SNPs but also include splice site changes, small and larger insertions and deletions, as well as changes affecting regulation and complex rearrangements like inversions. Synonymous SNPs, however, are not recorded <ref name="hgmd"/>.
Currently HGMD contains 88745 entries.
A 3-day trial of the professional version of HGMD was requested and has since been granted. In the following, the more recent data of the professional version is used and, at some points, compared to the data available from the free version.
HGMD has a distinct subsite for HEXA and its missense/nonsense mutations. After exclusion of all nonsense mutations, 60 SNPs remain in the free version and 65 in the professional version. As expected, the five new entries are based on publications from 2009-2011. At the affected sites 58, 207, 259, 322 and 497, no other mutations were known before, which makes this highly valuable information. In addition, three of these mutations are not found in any other database, while the most recent one, C58Y, is present in OMIM as well, and Y497C from 2009 can also be found in dbSNP and SNPdbe.
The Single Nucleotide Polymorphism Database dbSNP stores information on diverse DNA variations such as single nucleotide substitutions and short deletion and insertion polymorphisms <ref name="dbsnpp">Wheeler DL, Barrett T, Benson DA, et al. (January 2007). "Database resources of the National Center for Biotechnology Information". Nucleic Acids Res. 35 (Database issue): D5–12. DOI:10.1093/nar/gkl1031 </ref>. It covers neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation, and it is integrated with other NCBI genomic data. It was established by the National Center for Biotechnology Information and is currently available as build 135, last updated in October 2011. The database is designed to accept new submissions, also in a large batch format. Up to now it consists of 292,031,791 submissions <ref name="dbSNPsummary"> http://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi </ref>.
A search for all mutations returns 579 entries in the form of rs IDs. Some rs IDs are obsolete or redundant, so there are 526 unique and up-to-date SNPs altogether for HEXA in Homo sapiens. Of those, 406 lie in intron regions, 14 in mRNA UTRs, and 14 are nonsense (stop gained) mutations. There are 18 unique non-synonymous mutations and 51 unique missense mutations; only these two sets were extracted for further analysis. What kind of mutations the remaining 23 entries represent remains unclear.
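The tally behind these numbers can be checked with a few lines. The counts are those stated above; the variable names and category labels are only chosen for this sketch.

```python
total_rs_ids = 579   # all rs IDs returned by the search
unique_snps = 526    # after removing obsolete and redundant rs IDs

# Counts per category as stated in the text.
categories = {
    "intron": 406,
    "mRNA UTR": 14,
    "nonsense / stop gained": 14,
    "non-synonymous": 18,
    "missense": 51,
}

removed = total_rs_ids - unique_snps
classified = sum(categories.values())
unclassified = unique_snps - classified

print("obsolete or redundant rs IDs:", removed)
print("classified SNPs:", classified)
print("SNPs of unclear type:", unclassified)
```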
SNPdbe is a database of non-synonymous SNPs in the form of single amino acid substitutions (SAASs). It combines the data from dbSNP, SwissProt (including SwissVar), PMD and 1000 genomes. This web interface was designed to provide combined annotation of experimentally derived functional and structural impact, predicted functional effect, associated disease, average heterozygosity, experimental evidence of the nsSNP and evolutionary conservation. The database currently holds 1,691,464 entries and was last updated in March 2012 <ref name="snpdbe"> C Schaefer, A Meier, B Rost, Y Bromberg (2012). SNPdbe: Constructing an nsSNP functional impacts database. Bioinformatics 28(4):601-602.</ref>.
Hex A mutations
The search for hexosaminidase A in human yields 76 entries. Of those 55 are present in dbSNP with protein identifier NP_000511 and 21 are solely from Swissprot or SwissVar with the accession number P06865.
18 SNPs are experimentally validated, see <xr id="snpdbeexp"/>. <xr id="snpdbecons"/> shows the distribution of wildtype and mutant amino acid conservation in the form of the Perc score. For 15 SNPs the wildtype conservation is higher than 50%; these could be considered further in an analysis of possible disease candidates. The mutant conservation does not exceed 15% which is in accordance with the general perception that the mutant allele is very common.
OMIM, the Online Mendelian Inheritance in Man, provides comprehensive and referenced information on human genes and genetic phenotypes. It catalogs genetic disorders with collected details on the corresponding genetic locations <ref name="omimpaper">Hamosh, A.; Scott, A.; Amberger, J.; Bocchini, C.; McKusick, V. (2004). "Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders". Nucleic Acids Research 33 (Database issue): D514–D517.</ref>. OMIM is stated to be updated daily. Up to now it comprises 21,266 entries. The entries aim at offering a multitude of external links to other genetic resources<ref name="omomwebsite">http://www.omim.org/</ref>.
OMIM contains a dedicated subsite for HEXA <ref name="omim_hexa">http://omim.org/allelicVariant/606869</ref> which lists 58 variants in total. Among these are entries that can be mapped to 36 unique rs-identifiers. After exclusion of all variants that are neither missense nor synonymous, 29 missense and 1 synonymous variation remain. All of these were already found by querying dbSNP.
Additionally, there are 19 entries that do not contain cross-references to dbSNP. Of these, 18 are deletions, insertions or intronic variants, none of which are relevant for this task. One mutation, however, Cys58Tyr, has to be added to the 29 missense mutations from above. Interestingly, this mutation is not present in the freely accessible versions of any of the other databases. The original publication <ref name="Najmabadi2011">Najmabadi,H. et al. (2011) Deep sequencing reveals 50 novel genes for recessive cognitive disorders. Nature, 478, 57-63.</ref> lists this mutation in connection with a phenotype of mild Tay-Sachs disease. One reason this entry is missing from the other databases might be the fairly recent publication date. Indeed, comparison to the professional version of HGMD shows that the mutation is found there as well.
SNPedia is a freely available wiki-based database of SNPs <ref name="snpedia">http://en.wikipedia.org/wiki/SNPedia (since SNPedia references this article themselves (http://www.snpedia.com/index.php/SNPedia:FAQ#What_is_SNPedia.3F) Wikipedia will be accepted as a citation)</ref>. The bulk of information are entries crosslinked to dbSNP via Rs-Identifiers, therefore anything in dbSNP can potentially be present in SNPedia as well. In addition there are pages for identifiers from 23andMe <ref name="23andme">https://www.23andme.com/</ref>, haplotypes and even pages describing information gained from complete genomes of specific people <ref name="snpedia_faq_onlysnps">http://www.snpedia.com/index.php/SNPedia:FAQ#SNPs_only.3F_what_about_CNVs.2C_indels.2C_inversions.2C_epigenetics_..._.3F</ref>.
Due to the nature of the wiki system, every user is able to add information at any time. In addition, periodic updates based on text-mining are fed into the database as well <ref name="snpedia_nar">Cariaso,M. and Lennon,G. (2012) SNPedia: a wiki supporting personal genome annotation, interpretation and analysis. Nucleic acids research, 40, D1308-12.</ref>. Quality of the data is ensured on the one hand by manual curation by users and editors and on the other hand by automated external programs that check for inconsistencies or missing information on a regular basis <ref name="snpedia_nar"/>.
At the time of writing SNPedia claims to contain 29135 SNPs <ref name="snpedia_faq_howmany">http://www.snpedia.com/index.php/SNPedia:FAQ#How_many_SNPs_are_in_SNPedia.3F</ref>; however, this number was last updated in August 2011 <ref name="snpedia_faq_history">http://www.snpedia.com/index.php?title=SNPedia:FAQ&action=history</ref>. Given the previous trends <ref name="snpedia_nar"/>, the number should have increased significantly by now.
SNPedia does contain a dedicated subpage for TSD; however, only a few SNPs are listed there. More importantly, SNPedia does not contain a dedicated page for HEXA. Therefore SNPs were searched with the query 'Gene = HEXA'. This results in 36 entries, most of which were last updated in February 2012 by the automated SNPediaBot. One entry (Rs28940871) contains additional, user-added information; all others are empty apart from the cross-references to other databases. During retrieval one entry was already excluded, since it did not contain information on the change at the protein level in dbSNP. Additionally, 6 more entries were excluded because they described neither missense nor synonymous mutations. The final set of SNPs from SNPedia therefore consists of 29 missense mutations and one synonymous mutation, all of which are also contained in the SNPs that were retrieved directly from dbSNP <ref name="snpedia_dbsnp_intersection">Intersection SNPedia and dbSNP</ref>.
Shown below is a map of all SNPs found in the databases previously discussed. The reference protein sequence is denoted in the middle line and highlighted in green. Disease causing SNPs are shown above this line in red, non-disease causing ones below in blue. Synonymous SNPs are highlighted in light-blue independent of the phenotype they are associated with.
<pre>
L 100
T S N R Y 100
P06865 MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLDEAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVV 100
V SP P V L P D P L A 100

C 200
R W H 200
F G Q L H L ST M 200
P06865 TPGCNQLPTLESVENYTLTINDDQCLLLSETVWGALRGLETFSQLVWKSAEGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLSSILDTLDVMAYNKLNV 200
T Q Q R LH I A 200

S 300
D H S 300
GR E FS F T QV L HA D P S L 300
P06865 FHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEYARLRGIRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEF 300
D W W V C PI Q 300
R R 300

R V G V F R M M 400
P06865 MSTFFLEVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQLESFYIQTLLDIVSSYGKGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNY 400
P F V G G T D 400

S Q C 500
C DR Y N C K PR H 500
P06865 MKELELVTKAGFRALLSAPWYLNRISYGPDWKDFYIVEPLAFEGTPEQKALVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKLTSDLTFAYERL 500
L V G L V P V I T A C 500

H 529
C 529
P06865 SHFRCELLRRGVQAQPLNVGFCEQEFEQT 529
D 529
E 529
</pre>
The results from all databases are compared in <xr id="fig:venn"/> (for more detailed information see <xr id="tbl:Overlapping SNPs"/>). For this comparison the synonymous SNP set from dbSNP (listed here) was neglected and only the missense results were integrated. HGMD and SNPdbe are the only databases with unique results, whereby HGMD offers twice as many unique entries as SNPdbe. Furthermore, they share a comparably large subset of 16 SNPs which are not listed anywhere else. 29 of the total 93 SNPs are present in all databases. Apart from this subset common to all sources, the individual services do not overlap very much. A set of 19 mutations is only offered by dbSNP and thus also by SNPdbe. SNPedia and OMIM provide basically only the SNPs present in all other databases, apart from one mutation each.
The overlap between all databases with the free version of HGMD can be viewed here. It is interesting to note that of the 5 new SNPs 3 are available solely within HGMD, one is available in OMIM and one in dbSNP (and accordingly also in SNPdbe).
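The grouping behind such a comparison can be sketched as follows: every SNP is assigned to the exact combination of databases that contain it, which corresponds to one region of the Venn diagram. The function name is an assumption for this illustration, and the example sets are small excerpts of the database contents, not the full data.

```python
from collections import defaultdict


def venn_regions(db_sets):
    """Map each exact combination of databases to the SNPs found in exactly those."""
    membership = defaultdict(set)
    for db, snps in db_sets.items():
        for snp in snps:
            membership[snp].add(db)
    regions = defaultdict(set)
    for snp, dbs in membership.items():
        regions[frozenset(dbs)].add(snp)
    return dict(regions)


# Excerpt only: one or two SNPs per region, not the full database contents.
example = {
    "HGMD": {"R178H", "S4P", "R252H", "A246T"},
    "dbSNP": {"R178H", "S4P", "I436V", "E214D"},
    "SNPdbe": {"R178H", "S4P", "I436V", "R252H", "E214D", "N295Q"},
    "SNPedia": {"R178H", "I436V"},
    "OMIM": {"R178H"},
}

for dbs, snps in sorted(venn_regions(example).items(), key=lambda kv: -len(kv[0])):
    print(", ".join(sorted(dbs)), len(snps), " ".join(sorted(snps)))
```

Using the frozen set of database names as key guarantees that every SNP is counted in exactly one region, so the region sizes add up to the total number of distinct SNPs.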
<figtable id="tbl:Overlapping SNPs">
{| class="wikitable"
! Databases !! Count !! SNPs
|-
| HGMD, dbSNP, SNPdbe, SNPedia, OMIM || 29 || F211S R247W S210F L39R W485R R499C R170W W474C D258H M1V R504C G454S E482K R499H G269S Y180H M301R R178H W420C R178L R170Q R504H L451V V200M H204R G250D K197T L127R R178C
|-
| HGMD, dbSNP, SNPdbe || 3 || S4P Y497C R249W
|-
| dbSNP, SNPdbe, SNPedia || 1 || I436V
|-
| HGMD, SNPdbe || 16 || R252H S226F S279P C458Y N196S V192L R252L V391M G455R G269D I335F D314V G250S P25S L127F R166G
|-
| dbSNP, SNPdbe || 19 || E214D S293I S279C L11P V376G I389T L183H N399D N43D P182L Q35P A13V Q45P P81A A410V A479T E506D S59L G343V
|-
| HGMD, OMIM || 1 || C58Y
|-
| HGMD || 16 || A246T M1L L484P R249Q G250V D465N G454D Y37N W203G I388M F300L D322G D207E T259A M1T Q374R
|-
| SNPdbe || 8 || N295Q N157Q N115Q S331P L484Q A418G N295S F434L
|}
</figtable>
<xr id="fig:dbcount"/> shows, for every position in the sequence, how many databases have a SNP annotated at this position. The maximum possible value for each position is therefore 5. The figure shows that well known and less known mutations are scattered all over the sequence and not clustered in a special region. Also depicted are the enzymatically important residues. Of these residues, 178 and 207 have one or more mutants annotated (cf. <xr id="tbl:Overlapping SNPs"/>). Surprisingly, the active site residue E323 has no annotated SNP in any of the databases; however, HGMD lists one SNP for its direct neighbour 322, which is also highly important for catalysis. Overall, an accumulation of SNPs near the highlighted important residues can be observed. This makes sense, taking into account that all residues but D322/E323 were only considered important because they form hydrogen bonds with the substitute ligand NGT. It is quite conceivable that this selection is not perfectly accurate and that changes in the near vicinity of these residues can have similarly detrimental effects. <figure id="fig:aaprop">
<xr id="fig:aaprop"/> shows the frequency of every amino acid at either the wildtype or the mutant position. Arginine clearly stands out as a frequent target of mutations. This is partly explained by the fact that Arg is a comparably specialised amino acid with distinct chemical properties, whose exchange is expected to have more severe effects than that of a simpler amino acid like alanine. Arginine is also found at the important position 178. However, this cannot completely explain the high number of Arg substitutions, since other special amino acids like glutamic acid do not show comparable behaviour and substitution frequencies for simple amino acids like Ala or Gly are not particularly low either. In fact, variations are generally not very large, with most amino acids falling in the range of 5 to 10 occurrences.
</figure> The final visualisation of all gathered information is displayed in <xr id="fig:structure"/>. The disease causing, non-disease causing and synonymous SNPs are mapped onto the Hexosaminidase PDB structure; the color-coding is adopted from the mutation map.
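The two statistics discussed above, the number of databases per sequence position and the residue frequencies at the wildtype and mutant positions, can be tallied as sketched below. The function names are assumptions for this illustration. The per-database sets are small excerpts, while `shared_by_all` holds the 29 SNPs listed in the overlap table as present in all five databases.

```python
from collections import Counter


def databases_per_position(db_sets):
    """Count, for every sequence position, how many databases list a SNP there."""
    hits = {}
    for db, mutations in db_sets.items():
        for mutation in mutations:
            position = int(mutation[1:-1])  # 'R178H' -> 178
            hits.setdefault(position, set()).add(db)
    return {position: len(dbs) for position, dbs in sorted(hits.items())}


def residue_counts(mutations):
    """Return (wildtype Counter, mutant Counter) over mutations like 'R178H'."""
    wildtype = Counter(m[0] for m in mutations)
    mutant = Counter(m[-1] for m in mutations)
    return wildtype, mutant


# Excerpt only: a few SNPs per database, not the full contents.
excerpt = {
    "HGMD": {"R178H", "R178C", "D322G"},
    "dbSNP": {"R178H", "I436V"},
    "SNPdbe": {"R178H", "I436V"},
    "SNPedia": {"R178L", "I436V"},
    "OMIM": {"R178C"},
}
print(databases_per_position(excerpt))

# The 29 SNPs present in all five databases (from the overlap table).
shared_by_all = (
    "F211S R247W S210F L39R W485R R499C R170W W474C D258H M1V R504C G454S "
    "E482K R499H G269S Y180H M301R R178H W420C R178L R170Q R504H L451V V200M "
    "H204R G250D K197T L127R R178C"
).split()
wildtype, mutant = residue_counts(shared_by_all)
print(wildtype.most_common(3))
```

Counting databases rather than individual mutations per position keeps the value bounded by the number of sources, as in the figure described above.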
dbSNP, as one tool in the NCBI's large pool of programs, is very comprehensive and offers a variety of filters and search functions. A simple search, e.g. for a specific gene, is comparably intuitive, and the results can be narrowed down with the filter check boxes on the web page. Unfortunately, the documentation on how to query the database with SQL syntax is buggy and lacks explanation. Furthermore, a direct or automated query is not possible, as the result page is not accessible through a link (not RESTful). Even once the right results have been found manually, automating the extraction of information from the individual result pages turns out to be challenging as well. For this purpose an FTP server is provided, but its documentation needs refinement and more cross-referencing on several topics to become user friendly. The results from dbSNP, however, are satisfying. It is the only source which provides synonymous mutations as well as missense mutations and, together with HGMD, constitutes the largest source of SNPs.
SNPedia's wiki system can in theory be very powerful; however, for HEXA only a single entry contained additional information added by a human. Unfortunately, the number of active users currently seems to be very low, limiting the possibilities that lie within the system <ref name="snpedia_active_users">http://www.snpedia.com/index.php/Special:Statistics</ref>. The SNPs present cannot make up for this lack of further information, because all of them can be found in dbSNP as well. OMIM, on the other hand, has very similar data and only one unique entry that is not present in any of the other databases (and even that entry is contained in the professional version of HGMD). However, OMIM provides a text with a summary and additional information for every entry. So while SNPedia might have the foundation to become superior through crowd-sourcing, currently the more closed system OMIM seems clearly better.
HGMD contains the largest number of missense SNPs. Drawbacks, however, are the closed system, which makes retrieval of the data harder than necessary, and the fact that direct programmatic access to the data is virtually impossible without paying for the service. For research based on single genes of interest this is not an issue, though.
The authors wish to be excused from the judgement of SNPdbe due to a possible conflict of interest.