I am validating a bioinformatics pipeline for SNP and INDEL calling. For this purpose I mapped the reads from Illumina Platinum Genomes (https://www.ebi.ac.uk/ena/data/view/ERR194147) to the hg38 assembly (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/) and called variants on a small region, chr19:29,207,790-29,217,448. To verify the detected variants I used two data sets:
- From Illumina Platinum Genome (ftp://firstname.lastname@example.org/2017-1.0/hg38/hybrid/)
- From NIST (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/)
There is one variant at chr19:29215367 that my variant calling pipeline detects at an allele frequency > 50%. This variant is present in the NIST variant dataset but not in the Illumina Platinum Genomes dataset.
The variant is also visible in IGV, at a frequency of 54%, in the CRAM file downloaded from the Illumina PG site (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/illumina_platinum_pedigree/data/CEU/NA12878/alignment/).
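To make the comparison concrete, here is a minimal sketch of the kind of check I am doing: look up each pipeline call in both datasets by exact (chrom, pos, ref, alt) match. It uses only the Python standard library. The function names and the REF/ALT alleles in the example records are hypothetical placeholders for illustration, not the real records from either dataset; only the position comes from my data.

```python
from typing import Dict, Iterable, Set, Tuple

# A variant is identified here by (chrom, pos, ref, alt).
Variant = Tuple[str, int, str, str]


def parse_vcf_lines(lines: Iterable[str]) -> Set[Variant]:
    """Collect variants from uncompressed VCF text lines.

    Header lines (starting with '#') and blank lines are skipped.
    Multi-allelic records are split into one entry per ALT allele.
    """
    variants: Set[Variant] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            variants.add((chrom, pos, ref, alt))
    return variants


def classify(calls: Set[Variant],
             truth_sets: Dict[str, Set[Variant]]) -> Dict[Variant, Dict[str, bool]]:
    """For each call, record whether it is present in each named dataset.

    Matching is exact, so differently represented but equivalent
    INDELs would be reported as absent.
    """
    return {
        v: {name: v in truth for name, truth in truth_sets.items()}
        for v in sorted(calls)
    }


# Hypothetical placeholder records: the A>G alleles are made up.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
pipeline_lines = [header, "chr19\t29215367\t.\tA\tG\t.\tPASS\t."]
nist_lines = [header, "chr19\t29215367\t.\tA\tG\t.\tPASS\t."]
platinum_lines = [header]  # variant absent from this dataset

report = classify(
    parse_vcf_lines(pipeline_lines),
    {
        "NIST": parse_vcf_lines(nist_lines),
        "PlatinumGenomes": parse_vcf_lines(platinum_lines),
    },
)
for variant, membership in report.items():
    print(variant, membership)
```

With real files the lines would come from the decompressed VCFs restricted to the region above. A call that is present in one dataset and absent from the other is exactly the case I am asking about.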
Should I use the NIST variants instead of Illumina PG for validation, since this variant would otherwise be reported as a false positive?
Alternatively, can I merge the two datasets to get a more comprehensive variant call set, and would that be advisable?
I would appreciate any insight into this.