Question: Where Can I Find Data Sets Of Cancer Publicly Available?
6
Pascal (1.4k), Barcelona, wrote 7.3 years ago:

Hi

Where could I find data sets of cancer pairs (tumor/normal genome) publicly available?

I would like to test some variant detection tools against them.

I have already found the Complete Genomics data set, but it doesn't come in SAM format. They provide tools to convert to SAM, but so far I haven't managed to compile or run them.

So please let me know if there are any other data sets available.

Regards

dataset sam cancer • 8.7k views
modified 7.3 years ago by Yogesh Pandit (490) • written 7.3 years ago by Pascal (1.4k)

I guess here you want pairs with known/expected somatic mutations and frequencies. Did you find anything in this format?

written 7.3 years ago by Travis (2.8k)

Yes that would be interesting to have this too. But first I would like to have an alignment of short reads of a tumor against a "normal" genome. I "think" that this is what I would like to have.

written 7.3 years ago by Pascal (1.4k)

It depends on the approach, but I would assume you start with 2 BAM files: one for normal and one for tumor. Then call variants for each and compare. What program do you plan to test?

written 7.3 years ago by Travis (2.8k)
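
The naive approach described in the comment above (call variants on each BAM separately, then compare) can be sketched as a set difference over the two call sets. This is only a minimal illustration, assuming each caller has already written a VCF-style, tab-separated text file; the function names and the toy records are hypothetical and do not come from any of the tools mentioned in this thread.

```python
from typing import Iterable, List, Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def parse_vcf_records(lines: Iterable[str]) -> Set[Variant]:
    """Collect variant keys from VCF-style lines, skipping headers and blanks."""
    variants: Set[Variant] = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # Multi-allelic sites list several ALT alleles separated by commas.
        for alt in alts.split(","):
            variants.add((chrom, pos, ref, alt))
    return variants


def tumor_only_calls(tumor: Iterable[str], normal: Iterable[str]) -> List[Variant]:
    """Return calls present in the tumor set but absent from the normal set."""
    return sorted(parse_vcf_records(tumor) - parse_vcf_records(normal))


# Toy records (hypothetical, for illustration only).
header = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID\tREF\tALT"]
normal_vcf = header + ["chr1\t100\t.\tA\tG", "chr2\t500\t.\tC\tT"]
tumor_vcf = header + [
    "chr1\t100\t.\tA\tG",
    "chr2\t500\t.\tC\tT",
    "chr3\t42\t.\tG\tA,C",
]

candidates = tumor_only_calls(tumor_vcf, normal_vcf)
print(candidates)
```

A plain set difference like this treats every germline call missed in the normal sample as a somatic candidate, which is one reason a purpose-built somatic caller is usually preferred for tumor/normal pairs.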

Precisely, 2 BAM files: one for normal, one for tumor. I'm benchmarking variant detection methods and I want to include a cancer dataset in my study. I'm trying to play with as many algorithms as I can (e.g. GATK, Dindel, BreakDancer, etc. and one home-made prototype). Do you think this is a good idea?

written 7.3 years ago by Pascal (1.4k)

I think that for normal vs tumor somatic mutation detection you need to use a specialised program that considers the heterogeneity of the tumor sample. Have a look at SomaticSniper. Calling polymorphisms is a different game and more suited to Dindel, GATK, SAMtools, etc.

written 7.3 years ago by Travis (2.8k)

A few points on the CG data: it is from cell lines, so it is not applicable if you're planning to benchmark callers that handle variable allele fractions. In addition, converting CG data to BAM is a pain, and would require variant calling algorithms that are tuned to the characteristics of CG's mated gapped reads. Finally, the CG datasets are at double the normal coverage.

written 7.3 years ago by Greg Tyrelle (70)
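
To illustrate why tumor heterogeneity and variable allele fractions matter in the comments above, here is a small sketch of the arithmetic. It assumes a diploid, copy-neutral region and a clonal heterozygous somatic mutation, in which case the expected variant allele fraction is half the tumor purity. The function names and read counts are hypothetical.

```python
def variant_allele_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a site that support the alternate allele."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / depth


def expected_het_vaf(tumor_purity: float) -> float:
    """Expected VAF of a clonal heterozygous somatic mutation.

    Assumes a diploid, copy-neutral region: only tumor cells carry the
    mutation, and each of them carries it on one of two copies.
    """
    if not 0.0 <= tumor_purity <= 1.0:
        raise ValueError("tumor_purity must be between 0 and 1")
    return tumor_purity / 2


# A germline heterozygous site is expected near 0.5, whereas a somatic
# mutation in a sample with 40% tumor cells is expected near 0.2.
observed = variant_allele_fraction(ref_reads=90, alt_reads=10)
expected = expected_het_vaf(tumor_purity=0.4)
print(observed, expected)
```

A caller tuned to germline genotypes expects fractions near 0, 0.5 or 1, so low-fraction somatic variants like these are easy to miss. Cell-line data tends not to exercise this case, which fits the point made about the CG data.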
5
Giovanni M Dall'Olio (26k), London, UK, wrote 7.3 years ago:

Usually, cancer-related sequencing datasets are accessible for free, but they require you to complete a Data Access request and to prove that you are from an academic institution. This is because of privacy issues.

You can have a look at the data from the International Cancer Genome Consortium:

Note that they also provide BioMart access to some of the data.

Alternatively, have a look at dbGaP from NCBI. This is mostly GWAS data, but maybe you can find a use for it:

written 7.3 years ago by Giovanni M Dall'Olio (26k)

These are good sources +1. It is not just privacy issues, though, but commercialization issues that also drive the need to complete a Data Access request.

written 7.3 years ago by Larry_Parnell (16k)
5
Chris Miller (20k), Washington University in St. Louis, MO, wrote 7.3 years ago:

If you hurry, you can grab reads from The Cancer Genome Atlas at the Sequence Read Archive here. This deposition of raw sequence data is hugely costly (and many would argue wasteful) so after Dec 31st, it won't be hosted there anymore. There may be a plan for hosting it elsewhere, but I'm not sure of the details at the moment.

written 7.3 years ago by Chris Miller (20k)
4
Larry_Parnell (16k), Boston, MA USA, wrote 7.3 years ago:

Much work in cancer genomics has been done at the Broad Institute. Some of their data are available here and here. Similarly, you could see what the genome centers at Washington Univ. School of Medicine, Baylor College of Medicine and the Wellcome Trust have to offer. While some of those data may also reside in dbGaP, some will not be there.

written 7.3 years ago by Larry_Parnell (16k)
1
Yogesh Pandit (490), United States, wrote 7.3 years ago:

You can find some more cancer-related data at synapse.sagebase.org

written 7.3 years ago by Yogesh Pandit (490)