Question: Difference between NIST VCF and Illumina Platinum Genomes VCF for NA12878
Hi, I found two standard data sets for sample NA12878.



However, the Illumina Platinum set contains slightly more variants than the NIST set.

Also, Sanger sequencing confirmed that the variants found in Illumina Platinum but not in NIST are true positives.

Does anyone know how to explain this? And which variants get reported in a gold-standard data set?


These two gold-standard datasets are derived from different underlying sequencing data and use quite different validation/integration methods.

The GIAB (NIST) set is based on sequencing of NA12878 only, using several different sequencing technologies; the results from several analysis pipelines are then integrated to produce the final set. The GIAB set tends to be a little more conservative, which is probably why its variant set is a little smaller.

The Illumina PG VCF is derived from relatively old Illumina sequencing of the three-generation CEPH-1463 pedigree (of which NA12878 is a member). Various analysis pipelines are run on all members of the pedigree, and the pedigree information is then used to weed out calls that are inconsistent with the pedigree.
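To give a feel for what "inconsistent with the pedigree" means, here is a minimal sketch of a Mendelian consistency check on a single trio. This is only an illustration: the real PG procedure works across the whole pedigree and is more involved, and the function name and genotype encoding below are my own assumptions.

```python
def mendelian_consistent(child, mother, father):
    """Return True if the child's diploid genotype can be explained by
    inheriting one allele from each parent.

    Genotypes are tuples of allele indices as in a VCF GT field,
    e.g. (0, 1) for 0/1. Hypothetical helper, autosomal sites only.
    """
    a, b = child
    # Either allele 'a' came from the mother and 'b' from the father,
    # or the other way round.
    return (a in mother and b in father) or (b in mother and a in father)


# A het child of hom-ref and hom-alt parents is consistent.
print(mendelian_consistent((0, 1), (0, 0), (1, 1)))
# A hom-alt child with a hom-ref mother is not; such a call would be suspect.
print(mendelian_consistent((1, 1), (0, 0), (0, 1)))
```

A call that fails this kind of check in the pedigree is more likely to be a genotyping error than a real variant, which is the intuition behind filtering on inheritance.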

Also, in case you are not aware, both of these sets come with accompanying BED files that specify the high-confidence regions. So one explanation for a variant being present in PG but not in GIAB is that its position falls outside the GIAB high-confidence regions.
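If you want to check this for a specific variant, something along these lines should work. It is a rough sketch that assumes a plain BED file (0-based, half-open intervals) and a 1-based VCF POS; the function names are made up for the example, and in practice a tool such as bedtools does the same job.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into sorted, merged intervals per chromosome."""
    regions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in regions.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_confident_region(regions, chrom, pos):
    """True if 1-based VCF position `pos` lies inside a BED interval."""
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= zero_based.
    i = bisect.bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]


# Toy intervals, not real GIAB regions.
bed = load_bed(["chr1\t100\t200", "chr1\t150\t300"])
print(in_confident_region(bed, "chr1", 101))
print(in_confident_region(bed, "chr1", 301))
```

Running the PG-only variants through the GIAB BED this way tells you whether they sit outside the GIAB high-confidence regions, as opposed to being positions GIAB considered confident and called as reference.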

Both of these datasets are continually being improved and there seems to be gradual convergence. In particular, the Illumina PG set is in the process of being updated with more modern sequencing data, and is looking to include calls currently present only in GIAB if sufficient evidence is present.

