Dear colleagues,
To evaluate the performance of different NGS analysis pipelines, I need a gold-standard (reference) FASTQ file.
With well-characterized data like this, it would be possible to count true-positive, false-positive, and false-negative calls for each pipeline.
I was wondering whether such files are freely or commercially available, and whether you have any other suggestions. I would also appreciate your opinion on generating artificial FASTQ files (such as those discussed in 'Generation of Artificial FASTQ Files to Evaluate the Performance of Next-Generation Sequencing Pipelines') or on using Illumina's Platinum Genomes.
Thank you in advance for your help!
Konstantinos
What would be special about this "gold standard" FASTQ data? How would it be different from any random FASTQ data?
As Illumina says they "have derived a set of high-confidence variant calls for NA12877 and NA12878, by taking into account the inheritance constraints in the pedigree (after having sequenced many of its members) and the concordance of variant calls across different methods". So they have ended up producing a VCF that serves as the comparison for different methods/pipelines (you compare your output, e.g. a VCF, with what your output "should be").
What does that have to do with FASTQ data?
Starting with the same FASTQ file and using different pipelines, you will end up with different VCFs. If you compare these VCFs with the VCF accompanying the reference FASTQ, you have a measure / an idea of what each pipeline may have missed or may have called incorrectly.
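To make the comparison concrete, here is a minimal sketch in Python of what I have in mind. It is only an illustration under simplifying assumptions: variants are matched exactly on chromosome, position, REF and ALT; genotypes, variant normalization and high-confidence regions are ignored; and the records below are made up. The function names (load_variants, compare_calls) are hypothetical, and dedicated comparison tools such as hap.py or RTG vcfeval handle the cases this sketch skips.

```python
import io


def load_variants(handle):
    """Collect (chrom, pos, ref, alt) keys from VCF text, one key per ALT allele."""
    variants = set()
    for line in handle:
        # Skip blank lines and header/meta lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            variants.add((chrom, pos, ref, alt))
    return variants


def compare_calls(truth, calls):
    """Count TP/FP/FN by exact set comparison (an assumed, simplified matching rule)."""
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but missed by the pipeline
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"

# Made-up toy records standing in for the truth VCF and one pipeline's output.
TRUTH_VCF = HEADER + "\n".join([
    "chr1\t100\t.\tA\tG\t.\tPASS\t.",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.",
    "chr2\t50\t.\tG\tA,C\t.\tPASS\t.",
]) + "\n"

PIPELINE_VCF = HEADER + "\n".join([
    "chr1\t100\t.\tA\tG\t.\tPASS\t.",
    "chr1\t300\t.\tT\tC\t.\tPASS\t.",
    "chr2\t50\t.\tG\tA\t.\tPASS\t.",
]) + "\n"

truth = load_variants(io.StringIO(TRUTH_VCF))
calls = load_variants(io.StringIO(PIPELINE_VCF))
result = compare_calls(truth, calls)
print(result)
```

Running the same comparison for each pipeline's VCF against the same truth set gives per-pipeline counts that can be compared directly.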
OK, once again, what is special about this FASTQ file? What makes it different from any random FASTQ file that is arbitrarily used in all pipelines?