We recently published a paper and released a comprehensive human NGS cancer dataset intended for tool development, algorithm benchmarking, teaching, and pipeline evaluation.
This data is available for download directly from our FTP site.
Briefly, we sequenced a breast cancer cell line (HCC1395) and a matched normal lymphoblastoid cell line (HCC1395/BL) derived from the same individual. WGS, exome, and RNA-seq data was produced for both samples. All of the data is 2x100 bp Illumina reads from the HiSeq 2000 platform.
A total of 10 lanes of HiSeq 2000 (v3 chemistry) sequence data consisting of ~1.8 billion 2x100 bp reads were produced for HCC1395 and HCC1395/BL. Whole genome sequencing, exome sequencing and RNA-seq were performed as described previously. HCC1395 and HCC1395/BL were sequenced to average coverage levels of 56x (WGS)/155x (exome) and 31x (WGS)/124x (exome), respectively. RNA sequencing achieved 20x coverage of >50% of known junctions for 8,640 genes for HCC1395 and 9,437 genes for HCC1395/BL, respectively. (source)
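To put the read count above in terms of raw bases, here is a small back-of-the-envelope calculation. It is only a sketch: the passage does not say whether "~1.8 billion 2x100 bp reads" counts read pairs or individual reads, so both interpretations are computed. The function name `total_yield_bp` is hypothetical and not part of any tool.

```python
# Rough sequencing-yield arithmetic for the figures quoted above.
# Assumption: the source does not state whether "~1.8 billion reads"
# means read pairs or individual reads, so both cases are shown.

N_UNITS = 1_800_000_000   # ~1.8 billion (pairs or reads, see above)
READ_LENGTH_BP = 100      # 2x100 bp reads
LANES = 10                # 10 HiSeq 2000 lanes


def total_yield_bp(n_units, read_length_bp, reads_per_unit):
    """Total bases = units x reads per unit x bases per read."""
    return n_units * reads_per_unit * read_length_bp


if_pairs = total_yield_bp(N_UNITS, READ_LENGTH_BP, reads_per_unit=2)
if_reads = total_yield_bp(N_UNITS, READ_LENGTH_BP, reads_per_unit=1)

for label, bases in (("counted as pairs", if_pairs),
                     ("counted as single reads", if_reads)):
    print(f"{label}: {bases / 1e9:.0f} Gbp total, "
          f"{bases / LANES / 1e9:.0f} Gbp per lane")
```

These totals cover WGS, exome, and RNA-seq for both samples combined, so they cannot be converted directly into a per-sample coverage figure.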
We provide this data in several versions: the complete dataset, versions downsampled to 1/100th and 1/1000th of the full data, and an exome-only version.
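For readers who want a different fraction of the data, paired-end reads can be subsampled while keeping the two mates of each pair together. The sketch below shows one common approach (independent random selection of each pair with a fixed seed). It is not necessarily how the downsampled versions provided here were made. The function names are hypothetical, and the code assumes plain 4-line FASTQ records with R1 and R2 in the same order.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield 4-line FASTQ records from an open text handle."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def downsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=0):
    """Keep each read pair with probability `fraction`.

    One random draw is made per pair, so both mates are always kept
    or dropped together. Returns the number of pairs written.
    Assumes R1 and R2 list the same fragments in the same order.
    """
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    kept = 0
    for rec1, rec2 in zip(read_fastq(r1_in), read_fastq(r2_in)):
        if rng.random() < fraction:
            r1_out.writelines(rec1)
            r2_out.writelines(rec2)
            kept += 1
    return kept
```

The function takes open text handles rather than paths, so it works with plain files or with gzip-compressed FASTQ opened via `gzip.open(path, "rt")`. Because selection is random per pair, the kept fraction is approximate: passing `fraction=0.01` yields roughly, not exactly, 1/100th of the pairs.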
A detailed description of all data files is provided here.
We describe a basic analysis of this data in the publication listed below. Although this data represents only a single tumor/normal pair, we hope it will be useful to people who are (a) developing alignment or variant calling algorithms and tools, (b) running educational workshops, or (c) benchmarking pipelines.
If you find this data useful, please cite: