Question: Harvesting Known Disease Mutations
5.8 years ago by
Leuven area (Belgium)
Stephane Plaisance330 wrote:

Hi All!

I am trying to collect known disease-causing mutations, with full genome coordinates and allele call information, to build a gold standard (and to search the resulting list against my full genome data). BED is my target format, so that I can run BEDTools or Galaxy on top of it.

A general comment: why are BED, GFF, or similar shared formats not supported by public databases as standard download formats?

I found, with the help of colleagues, several sources of disease mutations including:

  • OMIM variants extracted by Omicia and provided as a track (OMICIA_auto) in the next release of the UCSC tables
  • COSMIC rev54 (now 55 as of a couple of days ago), downloaded as a text table that I had to convert to BED with some Perl magic
  • dbSNP was not an easy catch, and I am still struggling to get the full information out of their difficult batch download system (only feasible through Ensembl BioMart so far: [tip: hg18 BIOMART is at:]). For dbSNP, I searched for records with a phenotype (thanks to another colleague), which is the only available annotation for picking disease variants but in fact includes many association results that are far from causative.
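
The "Perl magic" for turning a text table into BED is not shown in the post. As a rough sketch in Python, assuming each source row gives a chromosome, a 1-based position, and ref/alt alleles (the function name, its arguments, and the column layout are hypothetical, not taken from any COSMIC or dbSNP export), the key step is the shift to BED's 0-based, half-open coordinates:

```python
def to_bed_line(chrom, pos_1based, ref, alt, ident, extra=(), strand="+"):
    """Build one tab-separated BED-like line from a 1-based variant record.

    The column layout (chrom, start, end, name, strand) is an assumption made
    for illustration; standard BED6 places a score column before the strand.
    """
    if pos_1based < 1:
        raise ValueError("positions in the source table must be 1-based")
    # BED intervals are 0-based and half-open: [start, end)
    start = pos_1based - 1
    end = start + len(ref)
    # UCSC-style chromosome names carry a "chr" prefix
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    # Pack the annotation into the name column, pipe-delimited
    name = "|".join([ident, f"{ref}>{alt}", *extra])
    return "\t".join([chrom, str(start), str(end), name, strand])


line = to_bed_line("1", 67478546, "G", "A", "rs11209026", extra=["IL23R"])
print(line)
```

With this convention, a single-base variant at 1-based position N is written as the interval N-1 to N.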

REM: As you may have noticed, I still work with hg18 (Build 36), but more recent data would do as well with some liftover. If someone knows of other sources, it would be great to share them, as this is likely a common request from people wanting to mine patients' full genomes.



disease mutation variant human • 5.4k views
modified 2.4 years ago by Biostar ♦♦ 20 • written 5.8 years ago by Stephane Plaisance330

I wasn't aware of OMICIA, thanks.

dbSNP isn't really a disease database, it just contains variants. These are almost entirely variants associated with normal healthy humans. Despite it being a nonstarter, you might find it easier to download it from the Broad: np_132.hg19.vcf.gz or similar

Also, much of COSMIC isn't disease causative, but that's your call

written 5.8 years ago by Russh1.2k
5.8 years ago by
Boston, MA USA
Larry_Parnell15k wrote:

For one, please see this BioStar question and my response with regard to collecting the clinically relevant SNPs in dbSNP.

Second, it seems that you are interested solely in SNPs, but "known disease mutations" in humans encompass much more, from trisomy to translocations (BCR-ABL and leukemia) to triplet repeat expansion (e.g., Huntington disease) and telomere shortening. Maybe you already have these from OMIM. If not, I would broaden my OMIM search to grab these larger-sized variants, too.

Third, there are emerging datasets from the whole-genome sequencing of tumor vs. normal samples. These efforts uncover numerous variants, but few have been linked definitively to the disease itself. The variants are present but not known to be causative. Nonetheless, you could collect these and annotate them as "bronze standard" until they pass some threshold, say occurring in x% of samples examined, or being a member of pathway X, which is aberrant in some significant percentage of samples examined.
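
The "bronze standard until it passes a threshold" idea can be sketched in a few lines. This is only an illustration: the function name, the tier label "promoted", and the default cutoff of 5% are hypothetical placeholders, since x is left unspecified above.

```python
def evidence_tier(samples_with_variant, samples_examined, min_fraction=0.05):
    """Return "bronze" until the variant recurs in enough samples.

    min_fraction stands in for the unspecified "x%" threshold; 0.05 is an
    arbitrary placeholder, not a recommended value.
    """
    if samples_examined <= 0:
        raise ValueError("samples_examined must be positive")
    if not 0 <= samples_with_variant <= samples_examined:
        raise ValueError("samples_with_variant must lie in [0, samples_examined]")
    fraction = samples_with_variant / samples_examined
    # Variants stay "bronze" until recurrence reaches the chosen cutoff
    return "promoted" if fraction >= min_fraction else "bronze"


print(evidence_tier(2, 200))   # seen in 1% of samples
print(evidence_tier(30, 200))  # seen in 15% of samples
```

A pathway-membership criterion could be added as a second condition in the same function.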

Fourth, don't neglect the GWAS catalog. Its entries may be less than "gold" but could become so if replication/validation studies show them to associate with the phenotype again. But here you need to distinguish between disease risk (high LDL cholesterol) and actual heart disease (say, myocardial infarction).

Fifth, there are also a few cases of two SNPs acting in concert. This is best exemplified by the APOE epsilon-4 allele. Neither SNP by itself is really associated with the disease (Alzheimer's) or disease risk (elevated blood cholesterol), but both together are. That can be difficult to code in a relationship table.
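
One way to code such a two-SNP relationship is to key the table on the pair of alleles rather than on either SNP alone. A minimal sketch for APOE, assuming phased, forward-strand alleles at rs429358 and rs7412 (the two SNPs that define the epsilon alleles; the function and table names are hypothetical):

```python
# Haplotype table keyed on (rs429358 allele, rs7412 allele).
APOE_HAPLOTYPES = {
    ("T", "T"): "e2",
    ("T", "C"): "e3",
    ("C", "C"): "e4",
}


def apoe_allele(rs429358, rs7412):
    """Return the APOE epsilon allele for one phased haplotype, or None."""
    return APOE_HAPLOTYPES.get((rs429358.upper(), rs7412.upper()))


def apoe_genotype(hap1, hap2):
    """Combine two phased haplotypes into a genotype string such as "e3/e4"."""
    alleles = [apoe_allele(*hap1), apoe_allele(*hap2)]
    if None in alleles:
        return None  # allele combination not covered by the table
    return "/".join(sorted(alleles))


print(apoe_genotype(("C", "C"), ("T", "C")))
```

Unphased calls that are heterozygous at both sites cannot be resolved by this table alone, which is part of what makes such relationships awkward to store.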

Good luck! Seems like a cool project and a worthy resource.

Added in edit on 19 Sep 2011: From a position paper in development: The Human Variome Project is the global initiative to collect, curate and share information on all genetic variations affecting human disease. Through the standardised collection and sharing of variant data amongst the global community, the Human Variome Project seeks to reduce the burden of genetic disease on the human population.

In addition, the Human Genome Variation Society has links to mutation databases that may be relevant to your project's goals.

Edit added 13 Oct 2011: I have just learned from following the International Congress of Human Genetics meeting on Twitter that Rong Chen is painstakingly manually curating 5,478 disease-SNP association papers and adding the info to a database of 67,678 SNPs associated with 1,563 diseases.

modified 5.7 years ago • written 5.8 years ago by Larry_Parnell15k
5.8 years ago by
Earlham Institute, Norwich, UK
Daniel Swan13k wrote:

You could also include the public/academic version of HGMD?

written 5.8 years ago by Daniel Swan13k
5.8 years ago by
Baltimore, MD
Nathan Nehrt250 wrote:

Another disease mutation source is SwissVar which contains missense mutations on Swiss-Prot proteins. Be sure to check the mutation classification: either Unclassified, Polymorphism, or Disease. You'll find a lot of overlap with the OMIM mutations, but there are mutations unique to this set as well. However, I haven't seen that the mutations are available in BED or GFF format.

The ICGC data portal is another source of somatic mutations from cancer sequencing studies. As noted in a previous answer, the mutations will contain a mix of causal "driver" mutations and neutral "passenger" mutations. A few specialized predictor tools like mCluster, CanPredict, and CHASM can help distinguish driver and passenger mutations.

modified 5.8 years ago • written 5.8 years ago by Nathan Nehrt250
5.8 years ago by
European Union wrote:

Only for completeness... I found the following links, suggested by GeneCards in the "disorder" section, useful:

genetests ->

pharmGKB ->

HugeNavigator ->

geneatlas ->

GAD ->

5.8 years ago by
Manhattan, NY
Khader Shameer17k wrote:

I recently heard about ClinVar, an upcoming NCBI resource focused on clinical/disease/pharmacological/GWAS-related mutations. Please check this intro for more details.

written 5.8 years ago by Khader Shameer17k

Another NCBI resource that may be of interest is PheGenI (Phenotype-Genotype Integrator); it is still under development.

written 5.8 years ago by Dpsguy140
5.8 years ago by
Dpsguy140 wrote:

Also check out the following:

GWAS catalog


You may also find this discussion informative:

BTW I am also trying to make a database of SNPs associated with age-related disorders, so I find your work interesting. Your resource would be very valuable!

written 5.8 years ago by Dpsguy140
5.8 years ago by
Ivanka Karageorgieva60 wrote:

You can also try PhenomicDB. It is a free multi-organism phenotype-genotype database that unifies a variety of primary sources to provide a wide range of reported genotype-phenotype relationships in a single database and to make them simultaneously searchable, visible and comparable. The reported phenotypes are most often diseases, and the phenotypes/diseases in each entry are always related to a particular gene/genotype. The description details (of both the gene and the phenotype) within each entry provide mutation information if available. On the start page you can search by a gene of interest or a disease of interest, select one organism or run a parallel search across several organisms, select the specific fields to be searched, and even customize your results table to show only the columns of interest. The phenotypic data clusters mapped to each entry could help you further analyze similar phenotypes/diseases caused by different genes or mutations. The gene orthology information could help you suggest a known phenotype/disease for a new and/or orphan genotype/mutation. If you have questions or need support, don't hesitate to ask.

written 5.8 years ago by Ivanka Karageorgieva60

Thanks Ivanka, this looks great, waiting on some export to see what I can make of it. Cheers!

written 5.8 years ago by Stephane Plaisance330

Once I am further along, I will try to post a full list of links with some comparisons. I guess this would clarify the topic and serve as a reference for others (it would make a nice review paper, actually ;-) ).

written 5.8 years ago by Stephane Plaisance330
5.8 years ago by
Leuven area (Belgium)
Stephane Plaisance330 wrote:

Thanks to all of you who answered and provided many links.

I will 'briefly' comment on some of your posts (take a cup of tea and relax ;-) )

An important point about my top comment: when I ask for BED export, it is not just the coordinates I would like to get but also the ref and call alleles, the effect on the codon when translated, the ID of the reference transcript (when transcribed), the target gene symbol ... all those precious things one needs to identify the variation at the sequence level. Often this information is there, but never in the same format and sometimes only partially (no ref allele provided, for instance).

Here is an example of what I would like to get in the BED (from my dbSNP reformat)

chr1    67478545        67478546        rs11209026|G>A|IL23R|ENST00000371002|||INTRONIC|Inflammatory bowel disease      +
chr1    67478545        67478546        rs11209026|G>A|IL23R|ENST00000408806|||UPSTREAM|Inflammatory bowel disease      +
chr1    67478545        67478546        rs11209026|G>A|IL23R|ENST00000395227|c.377G>A|p.126R>Q|NON_SYNONYMOUS_CODING|Crohn's Disease    +
chr1    67478545        67478546        rs11209026|G>A|IL23R|ENST00000395227|c.377G>A|p.126R>Q|NON_SYNONYMOUS_CODING|Inflammatory bowel disease +
rs10492972|T>C|KIF1B|ENST00000355249|||INTRONIC|Multiple Sclerosis  +
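
For anyone wanting to reuse lines like these, here is a minimal parsing sketch in Python. The field names are guesses inferred from the example (they are not defined anywhere in this post), and the regular expression assumes the name column is the only one that may contain spaces.

```python
import re

# Field names are assumptions inferred from the pipe-delimited name column.
NAME_FIELDS = ("rsid", "alleles", "gene", "transcript",
               "cdna_change", "protein_change", "consequence", "phenotype")

# chrom, start, end, name (may contain spaces), strand
BED_RE = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(.*\S)\s+([+-])$")


def parse_variant_line(line):
    """Parse one BED-like line into a dict; return None if it does not match."""
    match = BED_RE.match(line.strip())
    if match is None:
        return None
    chrom, start, end, name, strand = match.groups()
    parts = name.split("|")
    if len(parts) != len(NAME_FIELDS):
        return None
    record = dict(zip(NAME_FIELDS, parts))
    # Split "G>A" into reference and called allele
    ref, _, alt = record["alleles"].partition(">")
    record.update(chrom=chrom, start=int(start), end=int(end),
                  strand=strand, ref=ref, alt=alt)
    return record


example = ("chr1    67478545        67478546        "
           "rs11209026|G>A|IL23R|ENST00000395227|c.377G>A|p.126R>Q|"
           "NON_SYNONYMOUS_CODING|Crohn's Disease    +")
rec = parse_variant_line(example)
print(rec["gene"], rec["ref"], rec["alt"], rec["phenotype"])
```

Lines that lack the coordinate columns are rejected (the function returns None) rather than guessed at.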

Answers to everyone above

  1. Thanks RussH for the Broad link; the liftover back to hg18 in particular seems interesting (if the annotations are rich).
  2. Daniel: I got access to HGMD a few days ago, which would be the perfect solution if I could batch download its content (I could not see a way to do it, and obviously this would not favor their commercial model). Browsing variants one at a time is fine for checking a few variants, but not for using this facility as a filter for whole genomes (please correct me if I am wrong here).
  3. Nathan: the mixture is a problem for my purpose (please read below), but thanks for the link (I'll check it).
  4. Larry: Your 'Clinically-associated SNP’s' is a really interesting one too. I will have a very close look at it, as it may apply to me (pathologic records). The other points are great as well, but I really need gold: this is not a project per se but a tool to quickly recall known causative variants in a panel of full genomes.
  5. ffcccc: thanks for these links; a few hours of surfing in sight.

THANKS you all so much for sharing your knowledge, and thanks BIOStar for this great platform.

more comments:

Many of the links point to valuable data collected from GWAS or from predictions. This is very nice when one wants broad coverage at the cost of confidence. It would indeed be a great and valuable resource to have all these things in one place and cross-referenced, like STRING did for PPIs. BTW: I am willing to share my BED files with anyone interested (but without guarantee for the content).

However, I would like to collect only demonstrated driver mutations (to use the cancer terminology), and many of the reported variants are associated with disease but are not necessarily drivers (or not clearly stated as such).

I therefore believe we should divide these sources into two categories:

  • variations associated with disease (I agree that they likely play their role in it)
  • variations directly causative and sufficient for disease phenotype

So far, I could only find OMIM (via the Omicia track) and COSMIC (via their flat download) to fit the second category (the one I really need).
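
The two-category split can be made explicit when pooling sources. The sketch below is only illustrative: the source labels and the category given to each one reflect the opinion expressed in this post (others in the thread disagree about COSMIC), and the record layout is hypothetical.

```python
# Category per source, as judged in this post; adjust to your own criteria.
SOURCE_CATEGORY = {
    "OMIM_via_Omicia": "causative",
    "COSMIC": "causative",
    "dbSNP_phenotype": "associated",
    "GWAS_catalog": "associated",
}


def split_by_category(records):
    """Group records (dicts with a "source" key) into the two categories.

    Records from sources missing in SOURCE_CATEGORY go under "unknown".
    """
    groups = {"causative": [], "associated": [], "unknown": []}
    for rec in records:
        category = SOURCE_CATEGORY.get(rec.get("source"), "unknown")
        groups[category].append(rec)
    return groups


pooled = [
    {"id": "var1", "source": "COSMIC"},
    {"id": "var2", "source": "GWAS_catalog"},
    {"id": "var3", "source": "SwissVar"},
]
groups = split_by_category(pooled)
print([r["id"] for r in groups["causative"]])
```

Keeping an explicit "unknown" bucket avoids silently promoting a new source into the causative set.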

After some work, I could also make a rich BED file from the BioMart download of both the 129/v54/hg18 and 132/v66/hg19 versions of dbSNP. This took quite some editing but ended up with 960 loci for hg18 and 68783 for hg19 (many variants in dbSNP130+ come from disease samples!). As pointed out above, dbSNP is not designed to store disease variants, so it might not be the best source.

Cheers, Stephane

modified 3.2 years ago • written 5.8 years ago by Stephane Plaisance330

Stephane, I'm pretty sure HGMD has BED files for download, but this may only be available for their commercial offering, rather than the non-commercial/academic offering.

written 5.8 years ago by Daniel Swan13k

Also HGMD public is integrated into ENSEMBL releases, so BioMart should allow you to access the data.

written 5.8 years ago by Daniel Swan13k

Stephane, thank you for your feedback on the responses. It is great to see, and it is something from which BioStar could benefit, as it further informs. You may want to take a look at the links I added on 19 Sep 2011; those could also be useful.

written 5.8 years ago by Larry_Parnell15k
  • HGMD indeed has great tracks, but they cost big $$$$$
  • your remark on Ensembl is right, Daniel, but you will not get the allele from there; it has been masked (I read this in some other post). I need to try it myself using the Ensembl Perl API.
  • I used BioMart already, but the level of detail is not optimal for my needs (the aa change coordinates are not given, for instance, with the latest hg18 build)
  • in short, there is a lot in different places, but nothing totally comparable, and always with fancy private annotations
  • the link below to PhenomicDB looks great; I need to look closer. Thanks All!! Stephane
written 5.8 years ago by Stephane Plaisance330