I'm testing my somatic variant calling pipeline and I'm looking at Cancer Genome in a Bottle (GIAB) data. I found FASTQ files from the HG008-T sample (a pancreatic ductal adenocarcinoma), but they were generated using Hi-C sequencing:
HG008-T_HiC_PhaseGenomics_20241211_R1.fastq.gz
HG008-T_HiC_PhaseGenomics_20241211_R2.fastq.gz
Since Hi-C data isn't ideal for small variant calling (which is usually done on WGS/WES data from Illumina, Thermo Fisher, or Nanopore platforms), I was wondering:
Are these the correct validated VCFs for that sample? https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data_somatic/HG008/Liss_lab/analysis/NIST_HG008-T_somatic-stvar_DraftBenchmark_V0.3-20250220/
Any advice on how to proceed?
Recent public human tumor sequence data (and matching VCFs) are hard to find because of patient privacy concerns. You can apply for access to many types of cancer data through dbGaP, but the project request has to be submitted by a PI or someone with signing authority.
There are some public datasets mentioned in this past thread: Publicly Available Tumor/Normal Illumina Data For Evaluation Of Somatic Variant Callers
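Before using any benchmark VCF to evaluate a small-variant pipeline, it may help to check what kind of records it actually contains. The "stvar" in the directory name above suggests a structural-variant set, but that is an assumption worth verifying against the README in that folder. Below is a minimal sketch using only the Python standard library. The function names, the 50 bp size cutoff (a common convention, not something taken from the GIAB files), and the file name in the usage comment are all hypothetical.

```python
import gzip
from collections import Counter

SV_MIN_LEN = 50  # assumed cutoff: events of 50 bp or more are counted as SVs


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "r")


def classify_record(ref, alt, info):
    """Roughly label one VCF record as 'SV', 'SNV', or 'indel'."""
    info_keys = {field.split("=", 1)[0] for field in info.split(";")}
    alts = alt.split(",")
    # Symbolic alleles (<DEL>), breakends (contain [ or ]), or an SVTYPE tag
    symbolic = any(a.startswith("<") or "[" in a or "]" in a for a in alts)
    if "SVTYPE" in info_keys or symbolic:
        return "SV"
    # Sequence-resolved events at or above the assumed size cutoff
    if any(abs(len(a) - len(ref)) >= SV_MIN_LEN for a in alts):
        return "SV"
    if len(ref) == 1 and all(len(a) == 1 for a in alts):
        return "SNV"
    return "indel"  # rough bucket: also catches MNVs and short complex events


def summarize_vcf(path):
    """Count record types in a VCF file (plain text or .gz)."""
    counts = Counter()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue  # skip malformed or truncated lines
            counts[classify_record(fields[3], fields[4], fields[7])] += 1
    return counts


# Usage (hypothetical file name):
# print(summarize_vcf("benchmark.vcf.gz"))
```

If the counts turn out to be almost entirely SV, the set is probably not the right truth set for SNV/indel calls, and a separate small-variant benchmark for the same sample would be the thing to look for.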