Question: Mpileup On Merged .Bam File
gravatar for Kssr
7.7 years ago by
Kssr110 wrote:

I have 30 + bam files and I have merged the data using samtools merge(adding RG tag using -rh option).I ran mpileup on each of the .bam files separately and ran mpileup on the merged data to look at average coverage.Say if I have 10x average coverage in each of my individual samples,I am expecting a coverage of roughly 10x*n in the merged case where n is the number of samples .However,I see my average coverage on the merged bam dropping to roughly half the value and some of the bases included in individual samples pileup are not included in the merged bam pileup.I am not sure how samtools mpileup is interpreting the merged bam file especially with RG tags or I am wondering if the merge has been done properly or not.Any help would be appreciated.

bam samtools coverage • 7.6k views
ADD COMMENTlink modified 4.3 years ago by Biostar ♦♦ 20 • written 7.7 years ago by Kssr110

Just a thought but are these bams marked for duplicates?

ADD REPLYlink written 7.7 years ago by Vivek2.4k

The bams are not marked for duplicates

ADD REPLYlink written 7.7 years ago by Kssr110

@ Vivek  , I need to ask if the bam files are marked for duplicates and then recaliberated using GATK , does not they work properly with the samtools mpileup?

ADD REPLYlink written 5.7 years ago by ivivek_ngs4.9k

That shouldn't be an issue. I use samtools mpileup all the time on GATK processed BAMs.

ADD REPLYlink written 5.7 years ago by Vivek2.4k

ok thanks probably the version of samtools am using is not updated. I will try with a newer version then. Thanks.

ADD REPLYlink written 5.7 years ago by ivivek_ngs4.9k

Can you post example command strings for each of the steps you describe? How exactly are you calculating average coverage?

ADD REPLYlink written 7.7 years ago by matted7.3k

Merge: samtools merge -rh rg.txt out.merged.bam *.sort.bam

mpileup on merged bam:

samtools mpileup -D -C 50 -q 20 -Q 20 -f ref.fasta  out.merged.bam >merged.mpileup

Average coverage calculations:

awk '{sum+=$4} END {print sum,"\t",sum/genome_length,"\t",NR}'  merged.mpileup

where sum/genome_length should give me the coverage

Also I am getting the following error when I run flag stat on merged bam

samtools flagstat out.merged.bam
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_flagstat_core] Truncated file? Continue anyway.

As far as my knowledge goes this shows up whenever I read uncompressed bam files.So,I think it is not an error and the data has been merged properly.Any ideas?

ADD REPLYlink modified 7.7 years ago by Istvan Albert ♦♦ 84k • written 7.7 years ago by Kssr110

just a suggestion: I have systematically been more successful on analyzing merged bams when creating them using picard's MergeSamFiles rather than samtools' merge, although this well may be due to a wrong samtools usage on my side.

ADD REPLYlink written 7.7 years ago by Jorge Amigo11k
gravatar for matted
7.7 years ago by
Boston, United States
matted7.3k wrote:

Is your sequencing very high coverage? When you run samtools mpileup, it should say something like:

[mpileup] 200 samples in 200 input files
<mpileup> Set max per-file depth to 40

It only outputs the max depth message if it has to raise it from the default, or if you request a value that will result in high memory usage, like:

[mpileup] 200 samples in 200 input files
(mpileup) Max depth is above 1M. Potential memory hog!

What does it say in your case? I think the limit will be 8000 / # samples, which will cause you problems if coverage gets above 266 in a sample (for n=30).

Try changing to higher values of -d, or better yet, use samtools depth. It will do most of the filtering you request in your example (it has -q and -Q), and it shouldn't have any depth constraints (since it doesn't call variants).

I believe the truncated file warning can be ignored and is an artifact from writing uncompressed BAM files. I think the latest version of samtools (or the git version) fixes this.

ADD COMMENTlink modified 7.7 years ago • written 7.7 years ago by matted7.3k

This is what I get when I run mpileup on individual and merged data respectively [mpileup] 1 samples in 1 input files <mpileup> Set max per-sample depth to 8000.

[mpileup] 36 samples in 1 input files

My individual data have avg coverages of roughly 10x and some bases have really high coverage(>8000 and some of them above 200).I am thinking the coverage you are talking about is the avg coverage/depth per sample but not per base position.I am still not clear about the depth constraints in samtools.

I tried samtools depth and it is giving me the expected coverage.GATK depthofcoverage and bedtools genomecoverage bed give similar results(doesn't include some of the bases though).

ADD REPLYlink written 7.7 years ago by Kssr110

The max depths are per-base, since samtools has to hold all local reads in memory to do variant calling around a locus. So if you have areas of high coverage, this could be causing the difference.

And to clarify, did samtools depth give you the correct answer?

ADD REPLYlink written 7.7 years ago by matted7.3k

Thanks for the clarification regarding depth.Samtools depth is giving me the correct answer.I checked this with coverage from GATK DepthOfCoverage and BEDTools as well.

ADD REPLYlink written 7.7 years ago by Kssr110
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1481 users visited in the last hour