Question

Harvesting Known Disease Mutations

8

12.6 years ago

Stephane Plaisance ▴ 460

Hi All!

I am trying to collect known disease-causing mutations with full genome coordinates and allele-call information to build a gold standard (and to search the resulting list against my whole-genome data). BED is my target format, so that I can run bedtools or Galaxy on top of it.

A general comment: why are BED, GFF, or similar shared formats not supported by public databases as standard download formats?

I found, with help of colleagues, several sources of disease mutations including:

OMIM variants extracted by Omicia and provided as a track (OMICIA_auto) in the next release of the UCSC tables (http://genome-preview.ucsc.edu/...)
COSMIC rev54 (rev55 came out a couple of days ago), downloaded as a text table that I had to convert to BED with some Perl magic (ftp://ftp.sanger.ac.uk/pub/CGP/cosmic)
dbSNP was not an easy catch, and I am still struggling to get the full information out of its difficult batch download system (so far only feasible through Ensembl BioMart; tip: the hg18 BioMart is at http://may2009.archive.ensembl.org/biomart/martview/). For dbSNP, I searched for records with a phenotype (thanks to another colleague), which is the only available annotation for picking disease variants but in fact includes many association results that are far from causative.
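For readers who want to reproduce the "text table to BED" step without Perl, here is a minimal Python sketch. It is not the script used above: the column names (`chrom`, `pos`, `mutation_id`, `ref`, `alt`) are hypothetical placeholders, and it assumes a tab-delimited table with 1-based positions, so adjust both to the actual export. The demo row reuses the rs11209026 coordinates from the BED example further down this thread.

```python
import csv
import io


def table_to_bed(handle, chrom_col="chrom", pos_col="pos",
                 id_col="mutation_id", ref_col="ref", alt_col="alt"):
    """Yield BED4 lines from a tab-delimited table with 1-based positions.

    All column names are assumptions for illustration only.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        chrom = row[chrom_col]
        if not chrom.startswith("chr"):
            chrom = "chr" + chrom          # UCSC-style chromosome names
        pos = int(row[pos_col])            # 1-based position of first ref base
        ref, alt = row[ref_col], row[alt_col]
        start = pos - 1                    # BED start is 0-based
        end = start + len(ref)             # BED end is exclusive
        name = f"{row[id_col]}|{ref}>{alt}"
        yield "\t".join([chrom, str(start), str(end), name])


# Tiny in-memory demo table (one SNV).
demo = io.StringIO(
    "chrom\tpos\tmutation_id\tref\talt\n"
    "1\t67478546\trs11209026\tG\tA\n"
)
for bed_line in table_to_bed(demo):
    print(bed_line)
```

Keeping the ref and alt alleles in the name column is one way to carry the call information through tools that only look at the first three columns.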

REM: As you may have noticed, I still work with hg18 (Build 36), but more recent data would do just as well after a liftover. If anyone knows of other sources, it would be great to share them, as this is likely a common need for people who want to mine full patient genomes.

Cheers,

Stephane

disease mutation variant human • 9.5k views

ADD COMMENT • link updated 9.2 years ago by Biostar 20 • written 12.6 years ago by Stephane Plaisance ▴ 460

2

I wasn't aware of OMICIA, thanks.

dbSNP isn't really a disease database; it just contains variants, and almost all of them are found in normal, healthy humans. Even though it is a nonstarter for your purpose, you might find it easier to download it from the Broad: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/5974/hg19/dbsnp_132.hg19.vcf.gz or similar

Also, much of COSMIC isn't disease causative, but that's your call

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.6 years ago by Russh ★ 1.2k

Ram · Answer 1 · 2011-09-16

For one, please see this BioStar question and my response with regard to collecting the clinically relevant SNPs in dbSNP.

Second, it seems that you are interested solely in SNPs, but "known disease mutations" in humans encompass much more, from trisomy to translocations (BCR-ABL and leukemia) to triplet repeat expansion (e.g., Huntington disease) and telomere shortening. Maybe you already have these from OMIM. If not, I would broaden my OMIM search to grab these larger-sized variants, too.

Third, there are emerging datasets from the whole genome sequencing of tumor vs normal samples. These efforts uncover numerous variants, but few have been linked definitively to the disease itself. The variants are present but not known to be causative. Nonetheless, you could collect these and annotate them as "bronze standard" until they pass some threshold, say occurring in x% of samples examined, or being a member of pathway X, which is aberrant in some significant percentage of samples examined.
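The recurrence threshold idea can be sketched in a few lines of Python. This is only an illustration: the value of "x%" is not specified above, so `min_fraction` is a placeholder, and the tier name "promoted" and the variant keys in the demo are made up.

```python
from collections import defaultdict


def tier_by_recurrence(calls, n_samples, min_fraction=0.05):
    """Assign a tier to each variant based on how many samples carry it.

    calls: iterable of (sample_id, variant_key) pairs.
    n_samples: total number of samples examined.
    min_fraction: placeholder for the unspecified "x%" threshold.
    """
    samples_with = defaultdict(set)
    for sample_id, variant in calls:
        samples_with[variant].add(sample_id)   # count each sample once

    tiers = {}
    for variant, samples in samples_with.items():
        fraction = len(samples) / n_samples
        tiers[variant] = "promoted" if fraction >= min_fraction else "bronze"
    return tiers


# Hypothetical calls from four tumor samples.
demo_calls = [("s1", "varA"), ("s2", "varA"), ("s3", "varA"), ("s1", "varB")]
print(tier_by_recurrence(demo_calls, n_samples=4, min_fraction=0.5))
```

Recurrence alone does not prove causality, so a promoted variant is still only a candidate until other evidence (the pathway criterion, for example) supports it.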

Fourth, don't neglect the GWAS catalog at genome.gov. These may be less than "gold" but could be if shown in replication/validation studies to again associate with the phenotype. But here you need to distinguish between disease risk (high LDL cholesterol) and actual heart disease (say, myocardial infarction).

Fifth, there are also a few cases of two SNPs acting in concert. This is best exemplified by APOE epsilon-4 alleles. One SNP by itself is not really associated with the disease (Alzheimer) or disease risk (elevated blood cholesterol), but both together. That can be difficult to code in a relationship table.
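One way to code such a relationship is to key the rule on the combination of alleles rather than on each SNP separately. The sketch below uses the commonly cited rs429358/rs7412 scheme for the APOE alleles; please check the allele definitions against a primary source before relying on them. It also assumes phased data (one chromosome at a time), and the function name is made up for illustration.

```python
# Haplotype rule: (base at rs429358, base at rs7412) -> APOE allele.
# Allele definitions follow the commonly cited scheme; verify before use.
APOE_HAPLOTYPES = {
    ("T", "T"): "e2",
    ("T", "C"): "e3",
    ("C", "C"): "e4",
}


def apoe_allele(haplotype):
    """Return the APOE allele for one phased chromosome.

    haplotype: dict mapping rsID -> observed base on that chromosome.
    """
    key = (haplotype["rs429358"], haplotype["rs7412"])
    return APOE_HAPLOTYPES.get(key, "unknown")


print(apoe_allele({"rs429358": "C", "rs7412": "C"}))
```

In a relational table this corresponds to a "rule" row that points to several SNP rows, instead of a disease annotation attached to each SNP on its own.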

Good luck! Seems like a cool project and a worthy resource.

Added in edit on 19 Sep 2011: From a position paper in development: The Human Variome Project is the global initiative to collect, curate and share information on all genetic variations affecting human disease. Through the standardised collection and sharing of variant data amongst the global community, the Human Variome Project seeks to reduce the burden of genetic disease on the human population.

In addition, the Human Genome Variation Society has links to mutation databases that may be relevant to your project's goals.

Edit added 13 Oct 2011: I have just learned from following the International Congress of Human Genetics meeting on Twitter that Rong Chen is painstakingly manually curating 5,478 disease-SNP association papers and adding the info to a database of 67,678 SNPs associated with 1,563 diseases.

score 4 · Answer 2 · 2011-09-16

4

12.6 years ago

User 59 13k

You could also include the public/academic version of HGMD?

http://www.hgmd.org/

ADD COMMENT • link 12.6 years ago by User 59 13k

score 4 · Answer 3 · 2011-09-16

Another disease mutation source is SwissVar which contains missense mutations on Swiss-Prot proteins. Be sure to check the mutation classification: either Unclassified, Polymorphism, or Disease. You'll find a lot of overlap with the OMIM mutations, but there are mutations unique to this set as well. However, I haven't seen that the mutations are available in BED or GFF format.

The ICGC data portal is another source of somatic mutations from cancer sequencing studies. As noted in a previous answer, the mutations will contain a mix of causal "driver" mutations and neutral "passenger" mutations. A few specialized predictor tools like mCluster, CanPredict, and CHASM can help distinguish driver and passenger mutations.

Ram · Answer 4 · 2011-09-16

4

12.6 years ago

ff.cc.cc ★ 1.3k

Only for completeness... I found useful the following links suggested by genecards in the "disorder" section:

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.6 years ago by ff.cc.cc ★ 1.3k

score 4 · Answer 5 · 2011-09-19

4

12.6 years ago

Khader Shameer 18k

I recently heard about ClinVar, an upcoming NCBI resource focused on clinical, disease, pharmacological, and GWAS-related mutations. Please check this intro for more details.

ADD COMMENT • link 12.6 years ago by Khader Shameer 18k

0

Another NCBI resource that may be of interest is PheGenI (Phenotype-Genotype Integrator) (http://www.ncbi.nlm.nih.gov/gap/PheGenI), though it is still under development.

ADD REPLY • link 12.6 years ago by Dpsguy ▴ 140

Ram · Answer 6 · 2011-09-18

3

12.6 years ago

Dpsguy ▴ 140

Also check out the following:

GWAS catalog

SNPedia

You may also find this discussion informative: Disease Associated Snps

BTW I am also trying to make a database of SNPs associated with age-related disorders, so I find your work interesting. Your resource would be very valuable!

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.6 years ago by Dpsguy ▴ 140

score 3 · Answer 7 · 2011-09-27

You can also try PhenomicDB - http://www.phenomicdb.de - it's a free multi-organism phenotype-genotype database that unifies a variety of primary sources to provide a wide range of reported genotype-phenotype relationships in one single database and make them simultaneously searchable, visible and comparable. The reported phenotypes are most often diseases, and the phenotypes/diseases in each entry are always related to a particular gene/genotype. The description details (both of the gene and of the phenotype) within each entry provide mutation information if available. On the start page you can search by a gene of interest or a disease of interest, select an organism or run a parallel search across several organisms, and select the specific fields to be searched; you can even customize your results table to show only the columns of interest. The phenotypic data clusters mapped to each entry could help you further analyze similar phenotypes/diseases caused by different genes or mutations. The gene orthology information could help you suggest a known phenotype/disease for a new and/or orphan genotype/mutation. If you have questions or need support, don't hesitate to ask.

Ram · Answer 8 · 2011-09-18

Thanks to all of you who answered and provided many links.

I will 'briefly' comment on some of your posts (take a cup of tea and relax ;-) )

An important point about my top comment: when I ask for BED export, I want not just the coordinates but also the ref and call alleles, the effect on the codon when translated, the ID of the reference transcript (when transcribed), the target gene symbol ... all those precious things one needs to identify the variation at the sequence level. Often this information is there, but never in the same format and sometimes only partially (no ref allele provided, for instance).

Here is an example of what I would like to get in the BED (from my dbSNP reformat)

chr1    67478545        67478546        rs11209026|G>A|IL23R|ENST00000371002|||INTRONIC|Inflammatory bowel disease      +
chr1    67478545        67478546        rs11209026|G>A|IL23R|ENST00000408806|||UPSTREAM|Inflammatory bowel disease      +
chr1    67478545        67478546        rs11209026|G>A|IL23R|ENST00000395227|c.377G>A|p.126R>Q|NON_SYNONYMOUS_CODING|Crohn's Disease    +
chr1    67478545        67478546        rs11209026|G>A|IL23R|ENST00000395227|c.377G>A|p.126R>Q|NON_SYNONYMOUS_CODING|Inflammatory bowel disease +
rs10492972|T>C|KIF1B|ENST00000355249|||INTRONIC|Multiple Sclerosis  +
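To use records like these programmatically, the pipe-delimited name column has to be split back into fields. Below is a minimal Python sketch of such a parser. The field names are my reading of the example rows (ID, change, gene, transcript, cDNA change, protein change, consequence, phenotype), not an official schema, and the parser expects complete rows with coordinates and a trailing strand column.

```python
from typing import NamedTuple


class DiseaseVariant(NamedTuple):
    # Field names are assumptions inferred from the example rows.
    chrom: str
    start: int
    end: int
    rsid: str
    change: str
    gene: str
    transcript: str
    cdna: str
    protein: str
    consequence: str
    phenotype: str
    strand: str


def parse_record(line):
    """Parse one row: chrom, start, end, pipe-delimited name, strand."""
    chrom, start, end, rest = line.split(None, 3)
    # The phenotype may contain spaces, so take the strand from the right.
    name, strand = rest.rsplit(None, 1)
    fields = name.split("|")
    if len(fields) != 8:
        raise ValueError(f"expected 8 name fields, got {len(fields)}")
    return DiseaseVariant(chrom, int(start), int(end), *fields, strand)


row = ("chr1    67478545        67478546        "
       "rs11209026|G>A|IL23R|ENST00000395227|c.377G>A|p.126R>Q|"
       "NON_SYNONYMOUS_CODING|Crohn's Disease    +")
record = parse_record(row)
print(record.gene, record.protein, record.phenotype)
```

Empty fields (for instance the missing cDNA and protein changes on intronic rows) come back as empty strings, so downstream filters can test for them directly.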

Answers to everyone above

Thanks RussH for the Broad link; the liftover back to hg18 in particular seems interesting (if the annotations are rich)
Daniel: I got access to HGMD a few days ago, which would be the perfect solution if I could batch download its content (I could not see a way to do it, and obviously this would not favor their commercial model). Browsing variants one at a time is fine for checking a few variants, but not for using this facility as a filter for whole genomes (please correct me if I am wrong here).
Nathan: the mixture is a problem for my purpose (please read below), but thanks for the link (I'll check it)
Larry: Your 'Clinically-associated SNPs' is a really interesting one too. I will have a very close look at it, as it may apply to me (pathologic records). The other points are great as well, but I really need gold; this is not a project per se but a tool to quickly recall known causative variants in a panel of full genomes.
ffcccc: thanks for these links, a few hours of surfing in sight.

THANK you all so much for sharing your knowledge, and thanks BioStar for this great platform.

more comments:

Many of the links point to valuable data collected from GWAS or from predictions. This is very nice when one wants large coverage at the cost of confidence. It would indeed be a great and valuable resource to have all these things in one place and cross-referenced, as STRING did for PPIs. BTW: I am willing to share my BED files with anyone interested (but without guarantee for the content)

However, I would like to collect only demonstrated driver mutations (to use the cancer terminology), and many of the reported variants are associated with disease but are not necessarily drivers (or not clearly stated as such).

I therefore believe we should divide these sources into two categories:

variations associated with disease (I agree that they likely play their role in it)
variations directly causative and sufficient for disease phenotype

So far, I could only find OMIM (via the Omicia track) and COSMIC (via their flat download) to fit the second category (the one I really need).
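If each record carries its source, the two categories can be applied as a simple filter. The sketch below is only an illustration of that idea: the record layout and source labels are hypothetical, OMIM and COSMIC are placed in the causative category because that is how they are classified in this post (others in the thread note that much of COSMIC is not causative), and the GWAS catalog and dbSNP phenotype records are placed in the associated category.

```python
from enum import Enum


class Evidence(Enum):
    ASSOCIATED = 1   # associated with disease
    CAUSATIVE = 2    # directly causative and sufficient for the phenotype


# Per-source category; the labels and the assignment are assumptions.
SOURCE_EVIDENCE = {
    "OMIM": Evidence.CAUSATIVE,
    "COSMIC": Evidence.CAUSATIVE,
    "GWAS_catalog": Evidence.ASSOCIATED,
    "dbSNP_phenotype": Evidence.ASSOCIATED,
}


def gold_only(records):
    """Keep only records whose source is classified as causative."""
    return [r for r in records
            if SOURCE_EVIDENCE.get(r["source"]) is Evidence.CAUSATIVE]


# Hypothetical records for illustration.
demo = [
    {"id": "var1", "source": "OMIM"},
    {"id": "var2", "source": "GWAS_catalog"},
    {"id": "var3", "source": "unlisted_source"},
]
print([r["id"] for r in gold_only(demo)])
```

A per-record evidence flag would be more accurate than a per-source one, but most of the sources discussed here do not provide it.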

After some work, I could also make a rich BED file from the BioMart download of both the 129/v54/hg18 and 132/v66/hg19 versions of dbSNP. This took quite some editing but ended up with 960 loci for hg18 and 68783 for hg19 (many variants in dbSNP130+ come from disease samples!). As pointed out above, dbSNP is not designed to store disease variants, so it might not be the best source.

Cheers, Stephane