I run into weird results when calling a list of variants from a gold standard genome NA12878. My goal is to evaluate how well perform two different calling tools, so I aim to produce a vcf containing a list of variants from a known genome and then compare it with the high confidence list produced by Illumina in Platinium Genomes (https://www.illumina.com/platinumgenomes.html) to find the true positive, false positive etc... It is quite similar with the work done in those two papers: http://www.nature.com/articles/srep17875 http://www.nature.com/nbt/journal/v32/n3/full/nbt.2835.html
To do so, I have downloaded the sequences located here : ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/ So from what I understand, it corresponds to one of the sequencing runs for NA12878.
I then have merged the fastq in two files (paired end), trimmed the first 10 bp and aligned them using BowTie and the human genome 38 reference on Ensembl. ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz The output was a bam and a bai. So far so good. I hope.
Now I want to call variants so I compare them with the variants list provided by Illumina for NA12878 at this adress: ftp://firstname.lastname@example.org/2016-1.0/hg38/small_variants/NA12878/NA12878.vcf.gz
After importing the BAM/BAI, I use the annotation file located on Ensmbl : ftp://ftp.ensembl.org/pub/release-87/regulation/homo_sapiens/homo_sapiens.GRCh38.motiffeatures.20161111.gff.gz and run the calling software.
Two problem arise:
- The Variant Effect Annotation step tells me the annotation contains no protein gene records. Alright, I can do without this information as long as I have a list of SNPs and indels as an output but that's weird.
- The actual SNP/indel output has identified something like 700 SNPs with default parameters. With parameters not stringent at all, I can find up to 3000 SNPs but nothing even remotely close to the millions found by Illumina.
Any comparison at this point is quite irrelevant. I probably have a problem with the various files I input, but can't figure out where is the problem coming from. Any thoughts on this question ? Is my "roadmap" to produce the list of SNPs and compare it to Illumina's one correct?
Thanks a lot for your input!