Reference / Gold-standard FASTQ file for NGS pipeline comparison?
4.5 years ago
knsvar ▴ 20

Dear colleagues,

In order to evaluate the performance of different NGS analysis pipelines, I would need to use a gold-standard or reference FASTQ file.

This would make it possible to check for true-positive, false-positive, and false-negative calls using well-characterized data.

I was wondering whether such files are freely or commercially available, and whether you have any other suggestions. I would also appreciate your opinion on generating artificial FASTQ files (such as those discussed in 'Generation of Artificial FASTQ Files to Evaluate the Performance of Next-Generation Sequencing Pipelines') or on using Illumina's Platinum Genomes.

Thank you in advance for your help!

Konstantinos

next-gen sequencing • 1.1k views

What would be special about this "gold standard" FASTQ data? How would it be different from any random FASTQ data?


As Illumina says, they "have derived a set of high-confidence variant calls for NA12877 and NA12878, by taking into account the inheritance constraints in the pedigree (after having sequenced many of its members) and the concordance of variant calls across different methods". So they have ended up producing a VCF that serves as a point of comparison for different methods/pipelines (you compare your output, e.g. a VCF, with what your output "should be").


What does that have to do with FASTQ data?


Starting with the same FASTQ file and using different pipelines, you will end up with different VCFs. If you compare these VCFs with the VCF accompanying the reference FASTQ, you have a measure / an idea of what each pipeline may have missed or may have called incorrectly.
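To make the comparison concrete, here is a minimal sketch of scoring one pipeline's VCF against the truth VCF. It assumes exact matching on chromosome, position, REF and ALT; the names `parse_vcf_lines` and `compare_to_truth` are made up for illustration. A real benchmark would also need variant normalization and restriction to the high-confidence regions, which this sketch ignores.

```python
from typing import Iterable, NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_lines(lines: Iterable[str]) -> Set[Variant]:
    """Collect (chrom, pos, ref, alt) keys from uncompressed VCF text lines."""
    variants = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # A multi-allelic record contributes one key per ALT allele.
        for alt in alts.split(","):
            variants.add(Variant(chrom, pos, ref, alt))
    return variants


def compare_to_truth(calls: Set[Variant], truth: Set[Variant]) -> dict:
    """Count true positives, false positives and false negatives by exact match."""
    tp = len(calls & truth)
    fp = len(calls - truth)
    fn = len(truth - calls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision, "recall": recall}
```

Usage would be something like `compare_to_truth(parse_vcf_lines(open("pipeline.vcf")), parse_vcf_lines(open("truth.vcf")))`, where both file names are placeholders.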


OK, once again, what is special about this FASTQ file? What makes it different from any random FASTQ file that is arbitrarily used in all pipelines?

4.5 years ago
GenoMax 141k

You can use the data available from the Genome in a Bottle (GIAB) project. Note that this data comes from a NIST (US National Institute of Standards and Technology) project.


As my colleague genomax says, GIAB (Genome in a Bottle) is the standard that most people use for benchmarking variant calling, including other classes of DNA variants (structural variants, etc.). A caveat, though, is that GIAB itself was never probed with a gold-standard technology, so the GIAB 'truth set' is simply derived from multiple NGS analysis pipelines looking at the same raw data and producing a consensus list of variant calls. The possibility still exists that each pipeline misses a genuine call (a false negative) or that false positives are reported.
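As a toy illustration of that caveat (not GIAB's actual integration method), a consensus list can be sketched as the variants reported by at least some minimum number of pipelines. The function name `consensus_calls` and the `min_support` threshold are assumptions made for this example.

```python
from collections import Counter
from typing import Hashable, Iterable, Set


def consensus_calls(call_sets: Iterable[Set[Hashable]], min_support: int) -> Set[Hashable]:
    """Return the variants reported by at least `min_support` of the call sets."""
    support = Counter()
    for calls in call_sets:
        # Each pipeline counts at most once per variant.
        support.update(set(calls))
    return {variant for variant, n in support.items() if n >= min_support}
```

With this kind of rule, a genuine variant found by only one pipeline is dropped from the consensus, and an artifact shared by several pipelines is kept, which is exactly why such a truth set can still contain false negatives and false positives.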


Thank you both for your help!
