Question: (Closed) Variant Calling From Merged Bam When Population Evolutionary History Is Unknown
7.6 years ago by
Newcastle, UK
Tancata210 wrote:


I am calling SNPs in some "resequencing" data for isolates of a microbial eukaryote from different parts of the world. I initially aligned each isolate to the "reference" genome separately and called SNPs on each one individually. However, most of the protocols ("Simple Fool's Guide", GATK site, etc.) recommend merging the individual BAMs and making inferences on the merged file. I tried this with samtools mpileup using the same commands as before, and it produced far more variants (about 6x as many) than the total from calling on individual isolates.

The protocol for variant calling I am using is:

samtools mpileup -uf reference.fa -L 10000 bamfile | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -d 10 > var.flt.vcf
awk '($3=="*"&&$6>=60)||($3!="*"&&$6>=60)' var.flt.vcf > out.snps.vcf4
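
The awk condition in the last step accepts a line whenever column 6 is at least 60, whatever column 3 holds, so on VCF input it amounts to a cutoff on QUAL (column 6 in VCF). Below is a minimal sketch of that cutoff, assuming tab-separated VCF text on standard input. The names `filter_qual` and `toy_vcf` are hypothetical and the records are made up for illustration; this is not part of the pipeline above.

```shell
#!/bin/sh
# Keep VCF header lines plus records whose QUAL (column 6) meets a cutoff.
# filter_qual is a hypothetical helper; 60 mirrors the cutoff used in the post.
filter_qual() {
    min_qual="$1"
    awk -F'\t' -v q="$min_qual" '
        /^#/ { print; next }   # pass header lines through unchanged
        ($6 + 0) >= q          # numeric test; a missing QUAL (".") counts as 0
    '
}

# Toy tab-separated records, invented for illustration only.
toy_vcf() {
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\t.\tA\tG\t75\t.\tDP=12\n'
    printf 'chr1\t200\t.\tC\tT\t30\t.\tDP=15\n'
    printf 'chr1\t300\t.\tG\tA\t60\t.\tDP=11\n'
}

toy_vcf | filter_qual 60
```

Handling header lines explicitly avoids relying on how awk compares header text against a number.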

I have two questions about this.

  1. Is the reason that I am getting many more variants on the merged BAM because the -d coverage level applies across all the samples in the merged case? (so, I might use -d 80 as an equivalent for 8 samples?)

  2. If so, this implies that mpileup is somehow using data from all the different samples to determine which sites are variants. This seems problematic in my case because I don't know the population history: the isolates are not all from the same population but from different ones (although they are all part of the same "species", to the extent that it is meaningful to use that term). Therefore, I would not like to assume a priori that, e.g., the same sites are likely to be variable in all the populations. Does this suggest that doing the calling on the individual sample BAMs might be more justified?
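
To make the arithmetic behind question 1 concrete, here is a small sketch. It assumes, as the question supposes, that the -d cutoff is compared against the read depth summed over every sample at a site; that assumption is not confirmed here. The per-sample depths and the name `depth_summary` are hypothetical.

```shell
#!/bin/sh
# depth_summary MIN_DEPTH DEPTH...
# Compares a per-sample depth cutoff with the same cutoff applied to pooled depth,
# and with a cutoff scaled by the number of samples (e.g. 10 -> 80 for 8 samples).
depth_summary() {
    min_depth="$1"; shift
    printf '%s\n' "$@" | awk -v d="$min_depth" '
        { total += $1; n++; if ($1 >= d) alone++ }   # tally pooled depth and solo passes
        END {
            scaled = n * d                            # cutoff scaled to the sample count
            printf "samples=%d total=%d pass_alone=%d pooled_pass=%s scaled_cutoff=%d pooled_pass_scaled=%s\n",
                n, total, alone + 0,
                (total >= d ? "yes" : "no"),
                scaled,
                (total >= scaled ? "yes" : "no")
        }'
}

# Hypothetical read depths for 8 samples at one site (made-up values).
depths="3 4 2 3 5 3 2 4"
depth_summary 10 $depths
```

Under that assumption, a site where no single isolate reaches depth 10 can still pass -d 10 on the merged file, while a cutoff scaled to the sample count would reject it.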

If (2) is correct, that seems to mean that I won't be able to use GATK and the variant recalibration approach: each sample would have to be calibrated on its own, and there wouldn't be enough high-quality variants (a couple of thousand variants and perhaps a few hundred really certain "true positives" in each case).

I'm aware these are basic questions. If there is any literature that discusses them, I'd be grateful for a link...

ngs samtools • 2.4k views
modified 10 months ago by Biostar ♦♦ 20 • written 7.6 years ago by Tancata210

Hello Tancata!

We believe that this post has become a zombie post; we're closing it so biostars bot doesn't bump it again.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.


modified 10 months ago • written 10 months ago by _r_am32k
The thread is closed. No new answers may be added.

