These two gold-standard datasets are built from different underlying sequencing data and use quite different validation/integration methods.
The GIAB set is based on sequencing of NA12878 only, using several different sequencing technologies. The results from several analysis pipelines are then integrated to produce the final set. The GIAB set tends to be a little more conservative, which is probably why its variant set is a little smaller.
The Illumina PG VCF is derived from relatively old Illumina sequencing of the three-generation CEPH-1463 pedigree (of which NA12878 is a member). Various analysis pipelines are run on all members of the pedigree, and the pedigree information is then used to weed out calls that are inconsistent with inheritance in the pedigree.
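To illustrate the idea of a pedigree-consistency filter, here is a minimal sketch for a single trio. This is not the actual Platinum Genomes procedure, which works across the whole pedigree; the function name and the genotype representation (a pair of allele indices, as in a VCF GT field) are assumptions made for the example.

```python
def is_mendelian_consistent(child, mother, father):
    """Return True if a diploid child genotype can be explained by
    inheriting one allele from each parent.

    Genotypes are pairs of allele indices, e.g. (0, 1) for a 0/1 call.
    Hypothetical helper: ignores de novo mutations, missing calls,
    and sex chromosomes.
    """
    a, b = child
    return (a in mother and b in father) or (a in father and b in mother)
```

For example, a 1/1 child with a 0/0 mother and a 0/1 father fails this check, so a filter of this kind would flag that site as suspect in at least one of the three samples.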
Also, in case you are not aware, both of these sets come with accompanying BED files that specify their high-confidence regions. So one explanation for a variant being present in PG but not in GIAB is that its position falls outside the GIAB high-confidence regions.
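A quick way to test this for a given variant is to check its position against the BED intervals. The following is a minimal sketch, assuming the BED intervals on each chromosome do not overlap; the function names are made up for the example. Note that BED coordinates are 0-based and half-open, while VCF POS is 1-based.

```python
from bisect import bisect_right
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)}."""
    regions = defaultdict(list)
    for line in lines:
        # Skip blank lines and BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    for chrom in regions:
        regions[chrom].sort()
    return regions


def in_regions(regions, chrom, pos):
    """Return True if 1-based VCF position `pos` lies in a BED interval.

    Assumes intervals on each chromosome are non-overlapping.
    """
    intervals = regions.get(chrom, [])
    zero_based = pos - 1  # convert VCF POS to BED coordinates
    # Find the last interval whose start is <= zero_based.
    i = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]
```

In practice you would pass an open file handle for the high-confidence BED to `load_bed` and call `in_regions` with the CHROM and POS of each PG-only variant. Tools such as bedtools do the same job at scale.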
Both of these datasets are continually being improved, and they seem to be gradually converging. In particular, the Illumina PG set is being updated with more modern sequencing data, and the plan is to include calls currently present only in GIAB where there is sufficient supporting evidence.