Question: Samtools: merge and mpileup vs mpileup alone for variant-calling with multiple BAM
1
gravatar for Rob
2.8 years ago by
Rob100
Rob100 wrote:

Hi, I'm doing some variant-calling, and I was wondering what is the difference between:

merging BAM files with samtools merge and call samtools mpileup on the merged BAM

and

skip the merge step and call mpileup with all the BAM files?

I did both and expected the same results, but I got 15% of identical SNPs, 60% close to identical but lower quality with the samtools merge step, and the others SNPs are not find by one or the other method.

Is it possible that the quality filter applied in mpileup affect each file separately and cause this behaviour or is it something else?

ADD COMMENTlink modified 21 months ago by Biostar ♦♦ 20 • written 2.8 years ago by Rob100
2

More coverage gives greater confidence in variant calls, and decreases the likelihood that only a single allele will be present for heterozygous variants. I would not expect merging several 100x coverage datasets to change the result much, but merging several 5x coverage datasets will yield an extremely large difference.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Brian Bushnell16k

But since the RG tags are identical for one sample, doesn't mpileup merge files itself? And so the coverage should be identical compared to the step with merge no?

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Rob100

that is the part that i don't know if mpileup treats 2 sets of RG tags as if you have passed 2 separate files to mpileup, generating 1 variant call file, but with calls for each sample. Best way just to it until getting vcf and see how many individuals you get in the end

ADD REPLYlink written 2.8 years ago by stolarek.ir620

The strange things with this is that I have more coverage on a single SNPs without merge the files before. Here one example (same for all SNPs common in both methods) :

                   Ref   Alt     QUAL     GT     DP
Without merge:      T     C    48.0158    0/1    26
With merge   :      T     C    36.8363    0/1    23
ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Rob100

This VCF has only one sample? Hmmm, looks to me like something strange is going on.

ADD REPLYlink written 2.8 years ago by Brian Bushnell16k

Yep, only one sample (same RG tag for each Lane)

ADD REPLYlink written 2.8 years ago by Rob100

What you are describing should not happen. But at least, please make sure you are using the latest versions of all the software (I assume that would be samtools and bcftools); that can often resolve issues. It might also help if you could post the exact command lines you used in each case.

ADD REPLYlink written 2.8 years ago by Brian Bushnell16k

that reminds me also of my old problem a bit using samtools 1.19 and bcftools 1.2. Either use both of the 1.19 or both at 1.2 versions <- you can also experiment with that, and some crazy bat crap is happening sometimes

ADD REPLYlink written 2.8 years ago by stolarek.ir620

More coverage gives greater confidence in variant calls, and decreases the likelihood that only a single allele will be present for heterozygous variants. I would not expect merging several 100x coverage datasets to change the result much, but merging several 5x coverage datasets will yield an extremely large difference.

I would just like to question this: our empirical data suggests that increased depth of coverage does not increase confidence in the calls - it is a common misconception that more is better, in terms of variant calling. Due to the fact that NGS reads contain so much error, increase depth of coverage can, I believe, actually make it more difficult for the variant callers to make an accurate call.

Our data suggests that ideal call range for germline variants is 18-30 read depth.

ADD REPLYlink modified 8 months ago • written 8 months ago by Kevin Blighe51k
2
gravatar for stolarek.ir
2.8 years ago by
stolarek.ir620
Poland
stolarek.ir620 wrote:

if you just did merge on BAMs, that just adds up coverage and mpileup has no idea, that it's using two files. Depending on how low the coverage was on individual files, the results will vary, though most of the SNP calls should not change <- but again coverage of the individual files

ADD COMMENTlink written 2.8 years ago by stolarek.ir620

But what mpileup do with the 2 files informations? Since it's the same sample (with identical @RG tag), does not it merge them together too?

ADD REPLYlink written 2.8 years ago by Rob100
1

Never really did it, but it could treat them as separate samples if when merging you applied RG tags, and then didn't ignore them at mpileup, but you should google that, as I'm not sure

ADD REPLYlink written 2.8 years ago by stolarek.ir620

I will try more research, because yeah, that is the information I'm looking for :) So which way do you personally suggest as the best? Merge first sample with same RG tags, and do variant calling on the merged file?

ADD REPLYlink written 2.8 years ago by Rob100

anytime you have more data, and all previous steps are exactly the same <- you are not trying to compare anything between those 2 technical replicates, then just merge and call on that one big file like you did

ADD REPLYlink written 2.8 years ago by stolarek.ir620

Hi Rob, my advice is to not merge the BAMs and to instead pass them separately to samtools mpileup. I see no reason to merge. Take a look at my workflow here where I actually downsample each sample and then pass a total of 4 BAMs to mpileup for each sample: https://github.com/kevinblighe/ClinicalGradeDNAseq

SAMtools treats each as separate lines of evidence going into each variant call.

ADD REPLYlink modified 21 months ago • written 21 months ago by Kevin Blighe51k

Kevin,

Can you provide some context/detail. The original post (and this is an issue we have as well) is about whether the variant calling behaves differently depending on whether all the samples (with appropriate RG info) are merged into a single BAM prior to running samtools mpileup, or having each individual BAM file inputted together as a list in the call to samtools mpileup. As the post demonstrated this results in a difference. The documentation is a bit thin on this, but in their own workflow, they do seem to suggest it http://www.htslib.org/workflow/#mapping_to_variant

Have you experimentally verified that for your data you are getting the same thing? Based on my reading, I would think you would do better (in particular with samples with lower individual coverage) to have them all merged prior. I look forward to your thoughts.

ADD REPLYlink written 20 months ago by ian.dworkin0

I have not replicated this exactly but what is happening is the following: When you merge the samples (by RG) into a single BAM and then call variants on this, the downsampling will only look so far into the reads and will then make a variant call. However, when you pass the BAMs separately to samtools mpileup, it will perform the downsampling and variant calling separately on each BAM and then take a look at the consensus calls.

The pitfall to passing multiple BAMs separately is that you increase the likelihood of missing singletons, i.e., variants present in just a single sample.

ADD REPLYlink modified 8 months ago • written 20 months ago by Kevin Blighe51k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2062 users visited in the last hour