Where Can I Find Publicly Available Cancer Data Sets?
12.4 years ago
Pascal ★ 1.5k

Hi

Where can I find publicly available data sets of cancer pairs (tumor/normal genomes)?

I would like to test some variant detection tools against them.

I have already found the Complete Genomics data set, but it doesn't come in SAM format. They provide tools to convert it to SAM, but so far I haven't managed to compile or run them.

So please let me know if there are any other data sets available.

Regards

dataset cancer sam • 12k views

I guess here you want pairs with known/expected somatic mutations and frequencies. Did you find anything in this format?


Yes, it would be interesting to have this too. But first I would like to have an alignment of short reads from a tumor against a "normal" genome. I "think" that this is what I need.


It depends on the approach, but I would assume you start with 2 BAM files, one for normal and one for tumor. Then call variants for each and compare. What program do you plan to test?
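
The "call variants on each sample, then compare" idea can be sketched as a simple set difference. This is only an illustration, not how any particular caller works: the tuple layout `(chromosome, position, ref, alt)`, the function name `candidate_somatic`, and the toy calls below are all assumptions made up for the example. In practice the records would come from the VCF output of whichever caller is being tested.

```python
# Hypothetical variant records: (chromosome, position, ref allele, alt allele).
# The values below are toy data, not real calls.

def candidate_somatic(tumor_variants, normal_variants):
    """Return variants called in the tumor but absent from the matched normal."""
    normal = set(normal_variants)
    return sorted(v for v in set(tumor_variants) if v not in normal)


normal_calls = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")]
tumor_calls = [
    ("chr1", 1000, "A", "G"),  # shared with normal: likely germline
    ("chr2", 500, "C", "T"),   # shared with normal: likely germline
    ("chr3", 250, "G", "A"),   # tumor only: candidate somatic variant
]

print(candidate_somatic(tumor_calls, normal_calls))
```

A plain subtraction like this treats every tumor-only call as somatic, so it inherits the false positives and false negatives of both call sets.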


Precisely, 2 BAM files, one for normal and one for tumor. I'm benchmarking variant detection methods and I want to include a cancer dataset in my study. I'm trying to play with as many algorithms as I can (e.g. GATK, Dindel, BreakDancer, etc. and one home-made prototype). Do you think this is a good idea?


I think that for normal vs tumor somatic mutation detection you need to use a specialised program that considers the heterogeneity of the tumor sample. Have a look at Somatic Sniper. Calling polymorphisms is a different game and more suited to Dindel, GATK, SAMtools, etc.
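
To illustrate why tumor heterogeneity matters, here is a minimal sketch of the expected fraction of reads supporting a somatic mutation when the tumor sample is mixed with normal cells. It assumes a clonal mutation, a known tumor purity, and known copy numbers at the locus; the function name `expected_vaf` and its parameters are hypothetical and not taken from any of the tools mentioned above.

```python
def expected_vaf(purity, mutant_copies=1, tumor_copies=2, normal_copies=2):
    """Expected variant allele fraction of a clonal somatic mutation.

    purity        -- fraction of cells in the sample that are tumor cells (0..1)
    mutant_copies -- copies of the locus carrying the mutation in a tumor cell
    tumor_copies  -- total copies of the locus in a tumor cell
    normal_copies -- total copies of the locus in a normal cell
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    # Total copies of the locus contributed by tumor and normal cells.
    total = purity * tumor_copies + (1.0 - purity) * normal_copies
    return purity * mutant_copies / total


# Assumed example: heterozygous mutation in a sample with 40% tumor cells.
print(round(expected_vaf(0.4), 3))
```

Under these assumptions a heterozygous somatic mutation in a 40% pure sample is expected in about 20% of reads, well below the roughly 50% that a caller tuned for germline heterozygous sites would look for.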


A few points on the CG data: it is from cell lines, so it is not applicable if you're planning to benchmark callers that handle variable allele fractions. In addition, converting CG data to BAM is a pain, and would require variant calling algorithms that are tuned to the characteristics of CG's mated gapped reads. Finally, the CG datasets are at double the normal coverage.

12.4 years ago

Usually, cancer-related sequencing datasets are accessible for free, but they require you to fill in a Data Access request and to prove that you are from an academic institution. This is because of privacy issues.

You can have a look at the data from the International Cancer Genome Consortium:

Note that they also provide a Biomart access to some of the data.

Alternatively, have a look at dbGaP from NCBI. This is mostly GWAS data, but maybe you can find a use for it:


These are good sources +1. It is not just privacy issues, though, but commercialization issues that also drive the need to complete a Data Access request.

12.4 years ago

If you hurry, you can grab reads from The Cancer Genome Atlas at the Sequence Read Archive here. This deposition of raw sequence data is hugely costly (and many would argue wasteful) so after Dec 31st, it won't be hosted there anymore. There may be a plan for hosting it elsewhere, but I'm not sure of the details at the moment.

12.4 years ago

Much work in cancer genomics has been done at the Broad Institute. Some of their data are available here and here. Similarly, you could see what the genome center at Washington Univ. School of Medicine has to offer, and Baylor College of Medicine and the Wellcome Trust genome centers. While some of those data may also reside in dbGAP, some data will not be there.

12.4 years ago
Yogesh Pandit ▴ 520

You can find some more cancer-related data at synapse.sagebase.org
