Counting number of bases covered across multiple bam files
3
0
Entering edit mode
8.1 years ago

Hi all,

Long-time lurker, first-time poster.

Is there a tool that can be used to count the number of bases with coverage >= X in each of 3 input files? (That is, if chr1:1,000,000 has 30 reads at that position in each of bams A,B,C, then that position would be counted towards the total.)

multiple bam coverage • 4.4k views
0
Entering edit mode
1
Entering edit mode
8.1 years ago

You could do this with pysam (or htslib if you prefer C) and an mpileup. This will allow you to iterate over each base in the genome and count the number of reads in each sample (possibly with filtering metrics that you can specify) that cover that base. This is relatively quick. You could also use samtools depth on each file separately (internally, this is just a pileup) and then parse the results to count the bases. This might prove to be simpler to write in fact.

0
Entering edit mode

Thanks! This seems doable

0
Entering edit mode

would you happen to have a small code sample that shows how to do this?

1
Entering edit mode
8.1 years ago

I sometimes need to do something very similar, and I solve it by using bedtools. the rationale is very simple:

1. extract a bed file per bam file with the regions where the bam file's positions are covered above a threshold (by generating a bedgraph file, filtering it by coverage, and merging all adjacent regions)
2. intersect all the bed files previously obtained

the commands needed would be these ones:

THRESHOLD=30
foreach bamfile in *bam; do
bedtoools genomecov -bga -ibam $bamfile \ | awk -v t=$THRESHOLD '$4>=t' \ | bedtools merge -i - \ >${bamfile/.bam/.above$THRESHOLD.bed} done cp ls *.above$THRESHOLD.bed | head -1 bed.tmp
for bedfile in *.above$THRESHOLD.bed; do bedtools intersect -a bed.tmp -b$bedfile > bed.tmp.2
mv bed.tmp.2 bed.tmp
done
mv bed.tmp above\$THRESHOLD.intersect.bed
0
Entering edit mode

Aha, I really like this solution! I think I'll give this a shot tonight

0
Entering edit mode
8.1 years ago
Dan D 7.3k

Off the top of my head, if I wanted to solve this problem quickly I would merge the BAMs (you could also use samtools merge) and then run Picard's CollectWgsMetrics tool. What you'll get at the end is a text file showing the number of bases at each coverage level, along with some other nice details.

0
Entering edit mode

This is an interesting idea, but how will I know if a base is covered in all 3 files? If a given position has 100 reads in the merged bam, is it possible to identify the extreme case where all of the reads are coming from the same sample?

0
Entering edit mode

For this new query, I would use bedtools genomecov with each of your BAMs (and perhaps with the merged BAM as well) to ascertain the per-base coverage at each reference position.

If you were interested in specific locations of the genome, I would put those locations in a BED file and use samtools to create a BAM for just those regions. Feed that truncated BAM into the above tool and you should get your results quickly.

0
Entering edit mode

Thanks for your response, and sorry if my original post was unclear!

Do you think the genomecov method is feasible with whole genomes?

0
Entering edit mode

It certainly is. The documentation for the tool details several ways you can have the output presented. I'm not sure exactly what you're wanting to get out of it at this point; read through the short documentation for that tool. If the available parameters don't meet your needs, you might be able to accomplish your goal by piping the output into a filtering script or sed/awk command. If you have any questions, post a comment on this answer or your question and I'll do what I can to help. Precise explanations of what you want will be helpful to that end :)

0
Entering edit mode

Yes, of course. I have 3 whole genomes (normal, pre, and post treatment) and want to know the number of bases covered to 10X in all 3 files. I am somewhat familiar with genomecov, and know that I can pipe the output to a simple awk script that will keep only those bases with coverage > X and subsequently use some FNR==NR magic to compare those files. I'm worried that the genomecov outputs will be prohibitively massive, however, even if I gzip them. I guess I was hoping for a nifty tool that could read bams and count covered bases simultaneously...

0
Entering edit mode

You can use joincommand in order to combine outputs from awk:

join <(... file1 | awk ...) <(... file2 | awk ...) | join - <(... file3 | awk)


This will avoid the need for storing output. If the column is not the first one, 'join' has options to set its number.