Question: Samtools: merge and mpileup vs mpileup alone for variant-calling with multiple BAM
gravatar for Rob
15 months ago by
Rob90 wrote:

Hi, I'm doing some variant-calling, and I was wondering what is the difference between:

merging BAM files with samtools merge and call samtools mpileup on the merged BAM


skip the merge step and call mpileup with all the BAM files?

I did both and expected the same results, but I got 15% of identical SNPs, 60% close to identical but lower quality with the samtools merge step, and the others SNPs are not find by one or the other method.

Is it possible that the quality filter applied in mpileup affect each file separately and cause this behaviour or is it something else?

ADD COMMENTlink modified 3 months ago by Biostar ♦♦ 20 • written 15 months ago by Rob90

More coverage gives greater confidence in variant calls, and decreases the likelihood that only a single allele will be present for heterozygous variants. I would not expect merging several 100x coverage datasets to change the result much, but merging several 5x coverage datasets will yield an extremely large difference.

ADD REPLYlink modified 15 months ago • written 15 months ago by Brian Bushnell15k

But since the RG tags are identical for one sample, doesn't mpileup merge files itself? And so the coverage should be identical compared to the step with merge no?

ADD REPLYlink modified 15 months ago • written 15 months ago by Rob90

that is the part that i don't know if mpileup treats 2 sets of RG tags as if you have passed 2 separate files to mpileup, generating 1 variant call file, but with calls for each sample. Best way just to it until getting vcf and see how many individuals you get in the end

ADD REPLYlink written 15 months ago by stolarek.ir550

The strange things with this is that I have more coverage on a single SNPs without merge the files before. Here one example (same for all SNPs common in both methods) :

                   Ref   Alt     QUAL     GT     DP
Without merge:      T     C    48.0158    0/1    26
With merge   :      T     C    36.8363    0/1    23
ADD REPLYlink modified 15 months ago • written 15 months ago by Rob90

This VCF has only one sample? Hmmm, looks to me like something strange is going on.

ADD REPLYlink written 15 months ago by Brian Bushnell15k

Yep, only one sample (same RG tag for each Lane)

ADD REPLYlink written 15 months ago by Rob90

What you are describing should not happen. But at least, please make sure you are using the latest versions of all the software (I assume that would be samtools and bcftools); that can often resolve issues. It might also help if you could post the exact command lines you used in each case.

ADD REPLYlink written 15 months ago by Brian Bushnell15k

that reminds me also of my old problem a bit using samtools 1.19 and bcftools 1.2. Either use both of the 1.19 or both at 1.2 versions <- you can also experiment with that, and some crazy bat crap is happening sometimes

ADD REPLYlink written 15 months ago by stolarek.ir550

if you just did merge on BAMs, that just adds up coverage and mpileup has no idea, that it's using two files. Depending on how low the coverage was on individual files, the results will vary, though most of the SNP calls should not change <- but again coverage of the individual files

ADD REPLYlink written 15 months ago by stolarek.ir550

But what mpileup do with the 2 files informations? Since it's the same sample (with identical @RG tag), does not it merge them together too?

ADD REPLYlink written 15 months ago by Rob90

Never really did it, but it could treat them as separate samples if when merging you applied RG tags, and then didn't ignore them at mpileup, but you should google that, as I'm not sure

ADD REPLYlink written 15 months ago by stolarek.ir550

I will try more research, because yeah, that is the information I'm looking for :) So which way do you personally suggest as the best? Merge first sample with same RG tags, and do variant calling on the merged file?

ADD REPLYlink written 15 months ago by Rob90

anytime you have more data, and all previous steps are exactly the same <- you are not trying to compare anything between those 2 technical replicates, then just merge and call on that one big file like you did

ADD REPLYlink written 15 months ago by stolarek.ir550

Hi Rob, my advice is to not merge the BAMs and to instead pass them separately to samtools mpileup. I see no reason to merge. Take a look at my workflow here where I actually downsample each sample and then pass a total of 4 BAMs to mpileup for each sample:

SAMtools treats each as separate lines of evidence going into each variant call.

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe19k


Can you provide some context/detail. The original post (and this is an issue we have as well) is about whether the variant calling behaves differently depending on whether all the samples (with appropriate RG info) are merged into a single BAM prior to running samtools mpileup, or having each individual BAM file inputted together as a list in the call to samtools mpileup. As the post demonstrated this results in a difference. The documentation is a bit thin on this, but in their own workflow, they do seem to suggest it

Have you experimentally verified that for your data you are getting the same thing? Based on my reading, I would think you would do better (in particular with samples with lower individual coverage) to have them all merged prior. I look forward to your thoughts.

ADD REPLYlink written 10 weeks ago by ian.dworkin0

I have not replicated this exactly but what is happening is the following: When you merge the samples (by RG) into a single BAM and then call variants on this, the downsampling will only look so far into the reads and will then make a variant call. However, when you pass the BAMs separately to samtools mpileup, it will perform the downsampling and variant calling separately on each BAM and then take a look at the consensus calls.

The pitfall to passing multiple BAMs separately is that you increase the likelihood of missing singletons, i.e., variants present in just a single sample.

You may read about the work that my colleagues and I did in the UK:

ADD REPLYlink written 10 weeks ago by Kevin Blighe19k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1164 users visited in the last hour