Question

Reference / gold-standard FASTQ file for NGS pipeline comparison?

0

4.5 years ago

knsvar ▴ 20

Dear colleagues,

To evaluate the performance of different NGS analysis pipelines, I need a gold-standard or reference FASTQ file.

With well-characterized data like this, it would be possible to check each pipeline for true positive, false positive and false negative calls.

I was wondering whether such files are freely or commercially available, and whether you have any other suggestions. I would also appreciate your opinion on generating artificial FASTQ files (such as those discussed in 'Generation of Artificial FASTQ Files to Evaluate the Performance of Next-Generation Sequencing Pipelines') or on using Illumina's Platinum Genomes.

Thank you in advance for your help!

Konstantinos

next-gen sequencing • 1.1k views

ADD COMMENT • link 4.5 years ago by knsvar ▴ 20

1

What would be special about this "gold standard" FASTQ data? How would it be different from any random FASTQ data?

ADD REPLY • link 4.5 years ago by Ram 43k

0

As Illumina says, they "have derived a set of high-confidence variant calls for NA12877 and NA12878, by taking into account the inheritance constraints in the pedigree (after having sequenced many of its members) and the concordance of variant calls across different methods". So they have ended up producing a VCF that serves as a reference for comparing different methods/pipelines: you compare your output (e.g. a VCF) with what your output "should be".

ADD REPLY • link 4.5 years ago by knsvar ▴ 20

1

What does that have to do with FASTQ data?

ADD REPLY • link 4.5 years ago by Ram 43k

0

Starting with the same FASTQ file and using different pipelines, you will end up with different VCFs. If you compare these VCFs with the VCF accompanying the reference FASTQ, you have a measure / an idea of what each pipeline may have missed or may have called incorrectly.
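To make that comparison concrete, here is a minimal Python sketch of the idea. It assumes each variant can be identified by an exact (chromosome, position, REF, ALT) match and it ignores genotypes, variant normalization and high-confidence regions, all of which a dedicated benchmarking tool (e.g. hap.py or RTG vcfeval) would handle. The function names and the toy VCF records are hypothetical and only for illustration.

```python
def parse_vcf_lines(lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines.

    Header lines (starting with '#') are skipped and multi-allelic
    records are split into one key per ALT allele.
    """
    variants = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            variants.add((chrom, pos, ref, alt))
    return variants


def compare_calls(truth, calls):
    """Count TP/FP/FN by exact key match and derive precision and recall."""
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Toy data (made up): a "truth" VCF and the output of one pipeline.
truth_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.",
    "chr2\t300\t.\tG\tA,C\t.\tPASS\t.",
]
pipeline_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t250\t.\tT\tC\t30\tPASS\t.",
    "chr2\t300\t.\tG\tA\t45\tPASS\t.",
]

result = compare_calls(parse_vcf_lines(truth_vcf), parse_vcf_lines(pipeline_vcf))
print(result)
```

Running the same function on the VCF from each pipeline gives one set of counts per pipeline, which is the "measure" referred to above. The numbers are only as good as the truth VCF they are compared against.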

ADD REPLY • link 4.5 years ago by knsvar ▴ 20

0

OK, once again, what is special about this FASTQ file? What makes it different from any random FASTQ file that is arbitrarily used in all pipelines?

ADD REPLY • link 4.5 years ago by Ram 43k

score 2 · Answer 1 · 2019-10-17

2

4.5 years ago

GenoMax 141k

You can use data available from the Genome in a Bottle (GIAB) project. Note that this data comes from a NIST (US National Institute of Standards and Technology) project.

ADD COMMENT • link 4.5 years ago by GenoMax 141k

1

As my colleague GenoMax says, GIAB (Genome in a Bottle) is the standard that most people use for benchmarking variant calling, including other classes of DNA variants (structural variants, etc.). A caveat, though, is that GIAB was never itself probed with a gold-standard technology, so the 'truth set' that pertains to GIAB is simply a consensus list of variant calls produced by multiple NGS analysis pipelines looking at the same raw data. The possibility still exists that each pipeline misses a genuine call (a false negative) or reports false positives, in which case the truth set itself inherits those errors.
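A toy Python sketch of that caveat follows. It is not GIAB's actual integration procedure; it only assumes a simple "keep a variant if at least N pipelines call it" rule, and the function name, threshold and variant tuples are all made up for illustration.

```python
from collections import Counter


def consensus_calls(call_sets, min_support):
    """Keep variants reported by at least `min_support` of the call sets."""
    counts = Counter(variant for calls in call_sets for variant in set(calls))
    return {variant for variant, n in counts.items() if n >= min_support}


# Hypothetical calls from three pipelines run on the same raw data.
pipeline_a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
pipeline_b = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
              ("chr2", 300, "G", "A")}
pipeline_c = {("chr1", 100, "A", "G"), ("chr3", 400, "T", "C")}

truth_set = consensus_calls([pipeline_a, pipeline_b, pipeline_c], min_support=2)
print(sorted(truth_set))
```

If the chr2 variant reported only by pipeline_b were genuine, it would still be left out of this consensus "truth set", and any pipeline that called it correctly would be charged with a false positive. A variant that several pipelines get wrong in the same way would, conversely, end up in the consensus.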