Question

What are the most recommended / state-of-the-art whole-genome FASTQ datasets for benchmarking purposes?

1

9.0 years ago

Irene@Sequencing.com ▴ 270

What whole genome datasets would you prefer to be used for benchmarking purposes (such as for benchmarking aligners, callers, etc.)?

While NA12878 (as well as NA12891 and NA12892) has multiple datasets available and has been used extensively for benchmarking, I wanted to see if the community had recommendations for other whole genome datasets that may have been sequenced using more state-of-the-art technology. Please also provide the URL to the dataset(s), if available. Thanks!

FASTQ Variant-calling conversions Alignment BAM • 4.1k views

Ram · Answer 1 · 2015-04-11

For whole-genome germline variant calling, the two benchmarking data sets I use are Genome in a Bottle (GIAB) and the CHM1-NA12878 pair (hapdip). For the former, you can use any NA12878 reads, e.g. from Platinum Genomes. I would recommend NA12878 data from BaseSpace over AllSeq. BaseSpace (free registration required) hosts NA12878 data produced on all kinds of Illumina machines, with both PCR+ and PCR-free preps. In addition, AllSeq said that its data was intended to be available through 09/30/2014, so we don't know when it will be taken down. I guess Illumina is big enough to host its data for longer (via S3). For the latter, you can find the links to the raw data here. That repo also provides evaluation scripts.

GIAB and hapdip are complementary to each other. GIAB is a "typical" benchmark: it provides truth data and you compare your calls against the truth. However, GIAB is biased towards easy regions. GIAB is also "excessively" clean in that it excludes potential CNVs in NA12878, whereas identifying CNVs in a new sample is itself non-trivial. In the end, you frequently get an underestimated error rate. In comparison, hapdip is largely unbiased, but it is more complicated as you have to deal with all kinds of tricky artifacts in variant calling. The data available for this type of benchmark is also limited. I would recommend using both benchmarks if you want to get a more complete picture.
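
To make the "compare your calls against the truth" step concrete, here is a minimal Python sketch of a truth-set comparison restricted to confident regions. It is a toy illustration, not the GIAB or hapdip evaluation procedure: the function names, the tuple representation of a variant, and the 1-based inclusive region format are all assumptions made for this example, and it does exact matching only, whereas a real comparison also has to reconcile different representations of the same variant.

```python
# Toy truth-set comparison. Variants are (chrom, pos, ref, alt) tuples and
# confident regions are {chrom: [(start, end), ...]} with 1-based inclusive
# coordinates. Both formats are assumptions made for this sketch.

def in_confident(variant, regions):
    """Return True if the variant position lies inside a confident region."""
    chrom, pos, _ref, _alt = variant
    return any(start <= pos <= end for start, end in regions.get(chrom, []))


def benchmark(calls, truth, regions):
    """Count TP/FP/FN inside the confident regions and derive summary metrics."""
    calls = {v for v in calls if in_confident(v, regions)}
    truth = {v for v in truth if in_confident(v, regions)}
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: one confident region covering chr1:1-500.
regions = {"chr1": [(1, 500)]}
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 900, "G", "A")}
calls = {("chr1", 100, "A", "G"), ("chr1", 250, "T", "C"), ("chr1", 950, "C", "G")}

result = benchmark(calls, truth, regions)
```

In this made-up example the call at chr1:950 lies outside the confident region, so it is never counted as an error. That is the mechanism behind the underestimated error rate described above: mistakes in regions the truth set excludes simply do not show up in the metrics.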

score 3 · Answer 2 · 2015-04-10

9.0 years ago

Pierre Lindenbaum 161k

I would suggest the Illumina Platinum Genomes WGS data: http://www.illumina.com/platinumgenomes/

0

Greatly appreciate reminding me about the Platinum Genomes from Illumina - I've accessed them and will be utilizing the NA12878 FASTQs for benchmarking.

ADD REPLY • link 9.0 years ago by Irene@Sequencing.com ▴ 270

score 1 · Answer 3 · 2015-04-10

9.0 years ago

donfreed ★ 1.6k

Although access has supposedly closed, HiSeq X Ten WGS data is still available from AllSeq.

http://allseq.com/x-ten-test-data

Previously mentioned in this thread:

Free HiSeq X Ten human genome fastq test data

0

Thank you - I was able to access the FASTQ files and I'll utilize this as a benchmark!

ADD REPLY • link 9.0 years ago by Irene@Sequencing.com ▴ 270

score 0 · Answer 4 · 2015-04-11

9.0 years ago

Brian Bushnell 20k

For benchmarking, synthetic data - for which you know the correct answer - is much better than real data, for which the truth is subjective.
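
As a rough illustration of what "knowing the correct answer" means, the Python sketch below builds a random reference, plants SNVs at known positions, and samples reads whose true origin is recorded in the read name. Everything here (function names, parameters, the read-naming scheme, the constant quality string) is hypothetical and chosen for the example; a realistic benchmark would use an established read simulator with a proper error model, indels, and paired reads.

```python
import random

BASES = "ACGT"


def simulate(ref_len=1000, n_snvs=5, n_reads=20, read_len=50,
             error_rate=0.0, seed=0):
    """Return (reference, truth, reads) for a toy haploid sample.

    truth maps 0-based position -> (ref_base, alt_base).
    reads is a list of (name, sequence, quality) tuples; each name encodes
    the 0-based start position the read was sampled from.
    """
    rng = random.Random(seed)
    ref = [rng.choice(BASES) for _ in range(ref_len)]

    # Plant SNVs at distinct random positions and remember them as the truth.
    sample = ref[:]
    truth = {}
    for pos in rng.sample(range(ref_len), n_snvs):
        alt = rng.choice([b for b in BASES if b != ref[pos]])
        sample[pos] = alt
        truth[pos] = (ref[pos], alt)

    # Sample forward-strand reads uniformly, with optional substitution errors.
    reads = []
    for i in range(n_reads):
        start = rng.randrange(ref_len - read_len + 1)
        seq = [
            rng.choice([b for b in BASES if b != c])
            if rng.random() < error_rate else c
            for c in sample[start:start + read_len]
        ]
        reads.append((f"read{i}_pos{start}", "".join(seq), "I" * read_len))
    return "".join(ref), truth, reads


def to_fastq(reads):
    """Format reads as four-line FASTQ records."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in reads)


ref, truth, reads = simulate(seed=1)
fastq_text = to_fastq(reads)
```

Because every read name carries its true start position and `truth` lists every planted variant, an aligner or caller run on `fastq_text` can be scored exactly, with no judgment calls about what the correct answer is.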

score 0 · Answer 5 · 2017-12-12

6.4 years ago

jackyen • 0

Hi, does anyone know whether there is any publicly available NA12878 RNA-seq data? Thanks