Hello all :)
When using sequencing technologies which do not cover the entire genome when calling SNPs (such as calling SNPs from RNA-Seq data or ChIP-Seq data), I am unsure how to do a comparison on the VCF files alone.
For example, I ran the GATK pipeline on RNA-Seq data (https://www.broadinstitute.org/gatk/guide/article?id=3891) to give me VCF files.
All experiments were done on genetically identical mice - however one of my samples I had 10x as many reads as the others. At the end of the GATK pipeline, this sample also has about 3x as many variants called.
So chances are that the SNPs in the deep sample are also present in the shallow samples, it's just that there wasn't as much coverage and/or confidence in these shallow samples for GATK to call them.
Problem now is that when I use VCFTools's vcf-compare, the resultant venn diagram shows considerable number of SNPs unique to the deep sample, and I think this is unfair. Really I need 3 bits of information - SNPs ALT in a sample, SNPs REF in a sample, and 'UNKNOWN' in a sample or samples because of lack of data.
In other words, having a row in the VCF tells me there is a variant, but no row can either be no variant (like REF) or no data.
In an ideal world, much like STAR's 2-pass aligner used in the GATK protocol, there would be a 2-pass SNP calling, where SNPs are found de novo, then using this information from all samples, the reads are reanalysed to see if they match, dont match, or it is unknown if they match.
Is this possible?
Thanks so much for your time! :)