Hello,
I am calling SNPs in some "resequencing" data for isolates of a microbial eukaryote from different parts of the world. I initially aligned each isolate to the "reference" genome separately and called SNPs on them individually. However, most of the protocols ("Simple Fool's Guide", GATK site, etc) recommend merging the individual BAMs and making inferences on the merged file. I tried this with samtools mpileup using the same commands as before and it produced far more variants (about 6x more) than the total from calling on individual isolates.
The protocol for variant calling I am using is:
samtools mpileup -uf reference.fa -L 10000 bamfile | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -d 10 > var.flt.vcf
awk '($3=="*"&&$6>=60)||($3!="*"&&$6>=60)' var.flt.vcf > out.snps.vcf4
I have two questions about this.
1. Am I getting many more variants on the merged BAM because the -d coverage threshold applies to the total depth across all the samples in the merged case? (If so, I might use -d 80 as the equivalent for 8 samples?)
2. If so, this implies that mpileup is somehow using data from all the different samples to determine which sites are variants. This seems problematic in my case, because I don't know the population history: the isolates are not all from the same population, but from different ones (although they are all part of the same "species", to the extent that it is meaningful to use that term). Therefore, I would not like to assume a priori that, for example, the same sites are likely to be variable in all the populations. Does this suggest that calling on the individual sample BAMs might be more justified?
3. If (2) is correct, that seems to mean that I won't be able to use GATK and the variant recalibration approach: each sample would have to be calibrated on its own, and there wouldn't be enough high-quality variants (a couple of thousand variants, and perhaps a few hundred really certain "true positives", in each case).
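To make the first question concrete, here is a toy sketch (not my actual pipeline). It assumes, and this is exactly the point I am unsure about, that the minimum-depth filter is compared against the site-level DP value in the INFO column, and that in a merged call this value is the total over all samples. The function name min_depth_filter and the two records are made up for illustration.

```shell
#!/bin/sh
# Toy illustration only: the records below are invented, not real mpileup output.
# Columns: CHROM POS ID REF ALT QUAL FILTER INFO
# Assumption: the minimum-depth filter looks at the site-level INFO DP,
# which in a multi-sample call would be the depth summed over all samples.

min_depth_filter() {
    # $1 = minimum total depth; reads records on stdin, prints those that pass
    awk -v min="$1" '
        /^#/ { next }                      # skip header lines
        {
            dp = -1
            n = split($8, info, ";")       # INFO is a ;-separated key=value list
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^DP=/) dp = substr(info[i], 4) + 0
            if (dp >= min) print $1 ":" $2 " DP=" dp
        }'
}

# One isolate covered by 2 reads at this site
single_d10=$(printf 'chr1\t100\t.\tA\tG\t70\t.\tDP=2\n' | min_depth_filter 10)

# Eight isolates with 2 reads each, merged: total DP=16
merged_d10=$(printf 'chr1\t100\t.\tA\tG\t70\t.\tDP=16\n' | min_depth_filter 10)
merged_d80=$(printf 'chr1\t100\t.\tA\tG\t70\t.\tDP=16\n' | min_depth_filter 80)

echo "single, min depth 10: ${single_d10:-filtered out}"
echo "merged, min depth 10: ${merged_d10:-filtered out}"
echo "merged, min depth 80: ${merged_d80:-filtered out}"
```

If that assumption holds, a site covered by only two reads in each of eight isolates would pass -d 10 in the merged call but not in any per-isolate call, and scaling the threshold to 80 would remove it again.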
I'm aware these are basic questions; if there is any literature that discusses them, I'd be grateful for a link.