I am trying to collect known disease-causing mutations, with full genome coordinates and call information, to build a gold standard that I can then search against my whole-genome data. BED is my target format, so that I can run BEDTools or Galaxy on top of it.
A general comment: why don't public databases offer BED, GFF, or a similar shared format as a standard download format?
With the help of colleagues, I found several sources of disease mutations, including:
- OMIM variants extracted by Omicia and provided as a track (OMICIA_auto) on the next release of UCSC tables (http://genome-preview.ucsc.edu/...)
- COSMIC rev54 (rev55 has been out for a couple of days), downloaded as a text table that I had to convert to BED with some Perl magic (ftp://ftp.sanger.ac.uk/pub/CGP/cosmic)
- dbSNP was not an easy catch, and I am still struggling to get the full information out of their difficult batch download system (so far it has only been feasible through Ensembl BioMart; tip: the hg18 BioMart is at http://may2009.archive.ensembl.org/biomart/martview/). For dbSNP, I searched for records with a phenotype (thanks to another colleague). This is the only available annotation for picking out disease variants, but it in fact includes many association results that are far from causative.
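For anyone doing the same text-table-to-BED conversion, here is a minimal Python sketch of the idea (not the Perl script mentioned above). It assumes a tab-separated table with one column holding a 1-based, inclusive position written as chrom:start-end. The column names `mutation_id` and `position`, the function names, and the example coordinates are all made up for illustration; check the header of the actual download before using it.

```python
import csv
import io


def position_to_bed(position, name):
    """Convert a 1-based inclusive 'chrom:start-end' string to a BED4 row."""
    chrom, span = position.split(":")
    start, end = (int(x) for x in span.split("-"))
    # Assumption: bare chromosome names need a 'chr' prefix to match UCSC naming.
    # Other naming differences (e.g. how X, Y, MT are encoded) are not handled here.
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    # BED starts are 0-based and ends are exclusive, so only the start shifts by one.
    return (chrom, start - 1, end, name)


def table_to_bed(handle, pos_col="position", name_col="mutation_id"):
    """Read a tab-separated table and return sorted BED4 rows.

    pos_col and name_col are hypothetical column names; rows without a
    genomic position are skipped.
    """
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        pos = (record.get(pos_col) or "").strip()
        if not pos:
            continue
        rows.append(position_to_bed(pos, record[name_col]))
    # Plain lexicographic sort on chromosome name, then numeric start and end.
    return sorted(rows, key=lambda r: (r[0], r[1], r[2]))


# Made-up example rows, not real COSMIC records.
example = (
    "mutation_id\tposition\n"
    "M1\t12:25289551-25289551\n"
    "M2\t\n"
    "M3\t7:140099605-140099606\n"
)
bed_rows = table_to_bed(io.StringIO(example))
for row in bed_rows:
    print("\t".join(str(field) for field in row))
```

The one step that is easy to get wrong is the coordinate shift: a 1-based inclusive interval keeps its end and loses one from its start when written as BED.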
REM: As you may have noticed, I still work with hg18 (NCBI Build 36), but more recent data would do as well after a liftOver. If someone knows of other sources, it would be great to share them, as this is likely a common need for people who want to mine patient whole genomes.
I wasn't aware of OMICIA, thanks.
dbSNP isn't really a disease database; it just contains variants, and almost all of them are variants found in normal, healthy humans. Although that makes it a nonstarter for your purpose, you might find it easier to download it from the Broad: ftp://email@example.com/bundle/5974/hg19/dbsnp_132.hg19.vcf.gz or similar.
Also, much of COSMIC isn't disease-causative, but that's your call.
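If you do take the VCF route for dbSNP, getting to BED is mostly a coordinate shift. Below is a rough Python sketch, assuming a standard VCF with CHROM, POS, ID and REF as the first four tab-separated columns. The function names are my own, and the interval is taken to be the span of the REF allele, which is one reasonable choice rather than the only one. Note that a file named like the one above would be in hg19 coordinates.

```python
import gzip


def vcf_line_to_bed(line):
    """Turn one VCF data line into a BED4 tuple covering the REF allele."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref = fields[0], int(fields[1]), fields[2], fields[3]
    # VCF POS is 1-based; BED start is 0-based and BED end is exclusive.
    start = pos - 1
    end = start + len(ref)
    return (chrom, start, end, rsid)


def vcf_to_bed(path):
    """Yield BED4 tuples from a plain or gzipped VCF, skipping header lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            yield vcf_line_to_bed(line)
```

For example, `vcf_line_to_bed("chr1\t100\trs1\tA\tG\t.\t.\t.\n")` gives `("chr1", 99, 100, "rs1")`.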