Question: Difference between NIST VCF and Illumina Platinum Genomes VCF for NA12878
sxzhuxu wrote (22 months ago):

Hi, I found two standard data sets for sample NA12878:

1, ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz

2, ftp://platgene_ro@ussd-ftp.illumina.com/2016-1.0/hg19/small_variants/NA12878/NA12878.vcf.gz

However, the Illumina Platinum Genomes VCF contains slightly more variants than the NIST data set.

Also, Sanger sequencing confirmed that a variant present in Illumina Platinum but absent from NIST is a true positive.

Does anyone know how to explain this? And which variants get reported in a gold standard data set?

Thanks

Tags: sequencing, snp, sequence
Len Trigg (New Zealand) wrote (21 months ago):

These two gold standard datasets are derived from different underlying sequencing data and use quite different validation/integration methods.

The GIAB set is based on sequencing of NA12878 only, using several different sequencing technologies. The results from several analysis pipelines are then integrated to produce the final set. The GIAB set tends to be a little more conservative, which is probably why its variant set is a little smaller.

The Illumina PG (Platinum Genomes) VCF is derived from relatively old Illumina sequencing of the three-generation CEPH-1463 pedigree (of which NA12878 is a member). Various analysis pipelines are run on all members of the pedigree, and the pedigree information is then used to weed out calls that are inconsistent with the pedigree.
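To illustrate what "inconsistent with the pedigree" means, here is a minimal sketch of a Mendelian consistency check on a single trio. It is only an illustration of the idea, not the actual Platinum Genomes method, which works across the whole pedigree and is more involved. The function name and the tuple representation of genotypes are my own assumptions.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child's diploid genotype can be formed by
    taking one allele from the mother and one from the father.

    Genotypes are pairs of allele indices, as in a VCF GT field
    (0 = REF, 1 = first ALT, ...). Phase is ignored.
    Hypothetical helper for illustration only.
    """
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((m, f))) == child_sorted
        for m, f in product(mother, father)
    )


# A het child of hom-ref and hom-alt parents is consistent.
print(is_mendelian_consistent((0, 1), (0, 0), (1, 1)))
# A hom-alt child cannot come from a hom-ref mother.
print(is_mendelian_consistent((1, 1), (0, 0), (0, 1)))
```

A call that fails this kind of check in any trio of the pedigree is either a genotyping error or a de novo event, so it is a natural candidate for exclusion from a truth set.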

Also, in case you are not aware, both of these sets come with accompanying BED files that specify the high-confidence regions. One explanation for a variant being present in PG but not in GIAB is that its position falls outside the GIAB high-confidence regions.
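If you want to check this for a particular variant, something like the following sketch would do. It assumes the BED intervals do not overlap within a chromosome, and the function names are mine rather than from any tool. The one detail to get right is that BED coordinates are 0-based and half-open, while VCF positions are 1-based.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)}.
    Hypothetical helper; expects plain-text lines (decompress first)."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    for chrom in regions:
        regions[chrom].sort()
    return regions


def in_high_conf(regions, chrom, vcf_pos):
    """True if a 1-based VCF position lies inside a BED interval.
    Assumes intervals on a chromosome do not overlap."""
    intervals = regions.get(chrom, [])
    pos0 = vcf_pos - 1  # convert to 0-based to match BED
    # Index of the last interval whose start is <= pos0.
    i = bisect.bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


# Toy intervals standing in for a high-confidence BED file.
bed = ["1\t100\t200", "1\t300\t400"]
regions = load_bed(bed)
print(in_high_conf(regions, "1", 101))  # first base of the first interval
print(in_high_conf(regions, "1", 250))  # falls in the gap
```

For whole-genome comparisons a dedicated tool is a better choice than a hand-rolled script, but this is enough to check a handful of Sanger-validated sites against the GIAB BED.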

Both of these datasets are continually being improved and there seems to be gradual convergence. In particular, the Illumina PG set is in the process of being updated with more modern sequencing data, and is looking to include calls currently present only in GIAB if sufficient evidence is present.
