Question: From per bp coverage to average coverage in 10bp intervals
gravatar for Adrian Pelin
17 months ago by
Adrian Pelin2.4k
Adrian Pelin2.4k wrote:

Hello, I have per bp coverage I got from bedtools for the entire human genome using:

bedtools genomecov -ibam BAM-FILE.sort.grp.bam -d > BAM-FILE.coverage_per.bp

It looks something like this:

chr12  1   106
chr12  2   103
chr12  3   100
chr12  4   101
chr12  5   99
chr12  6   90
chr12  7   88
chr12  8   89
chr12  9   84
chr12  10  79
chr12  11  72
chr12  12  65
chr12  13  66
chr12  14  64
chr12  15  63
chr12  16  60
chr12  17  53
chr12  18  52
chr12  19  51
chr12  20  49

So for every bp position in the 3Gb genome there is a coverage values. I would like an average values for windows of either 10, 50 or 100bp. So for the above example in windows of 10bp expected output would be:

chr12   1   10  93.9
chr12   11  20  59.5

There must be a really easy way to do this and I am struggling to code a shell script for this.

bam dna-seq genome • 618 views
ADD COMMENTlink modified 17 months ago by Alex Reynolds31k • written 17 months ago by Adrian Pelin2.4k
gravatar for Alex Reynolds
17 months ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

The file answer.bed will have the mean signal over 10-base increments over the bounds defined by the coverage region.

$ awk -vFS="\t" -vOFS="\t" '{ print $1, $2, ($2 + 1), ".", $3 }' coverage.txt | sort-bed - > coverage.bed5
$ bedops --merge coverage.bed5 | bedops --chop 10 - | bedmap --echo --mean --delim '\t' - coverage.bed5 > answer.bed
ADD COMMENTlink written 17 months ago by Alex Reynolds31k

This is a very neat approach indeed, thanks Alex! Testing out on mouse genome now.

ADD REPLYlink written 17 months ago by Adrian Pelin2.4k

In theory, if you were interested in doing this for the entire genome for multiple samples, you wouldn't need to do the "bedops --merge coverage.bed5 | bedops --chop 10 -" right? You could get the chopped file from a fasta file, if you had the length of each chromosome.

ADD REPLYlink written 17 months ago by Adrian Pelin2.4k

Yep! In fact, there is a handy tool called fetchChromSizes in the UCSC Kent utilities kit for doing this, without FASTA. You might look at this answer, replacing hg38 with mm10, or whichever mouse assembly you're using, and modifying the chop parameter: A: Is it possible to generate a BED/BEDGRAPH containing every dinucleotide in the g

Instead of chopping up a genome each time you want to do this for each sample, you might chop it once. Then, for each sample, specify it as a file in the bedmap statement, replacing the hyphen with the filename of the chopped regions.

ADD REPLYlink modified 17 months ago • written 17 months ago by Alex Reynolds31k

Indeed, I was using read_fasta from biopieces to get contig sizes. One follow up question. Is there a simpler way to get average coverage from a bam file for 10bp windows? rather than getting a per base coverage with bedtools genomecov, which produces a massive 50+GB file for mouse/human genomes and then processing it your way, which works but seems that it needs a lot of space.

ADD REPLYlink written 17 months ago by Adrian Pelin2.4k

You could use a streaming approach, combining piping with bash process substitutions, e.g.:

$ bedmap --echo --count --delim '\t' bins.bed <(bam2bed < reads.bam) > answer.bed

The file answer.bed will contain the bin and the number of reads which overlap that bin. This is pretty simplistic. You could do more here to do sliding windows or only map a read to one bin, to avoid multiple counts, etc. But the idea is to use a simple example to demonstrate input/output streams as a way to avoid intermediate files.

The file bins.bed are the 10nt bins generated from chopping fetchChromSizes or by merging and chopping contigs.

These bins can come from an upstream process, to avoid creating intermediate files:

$ bedops --merge contigs.bed | bedops --chop 10 - | bedmap --echo --count --delim '\t' - <(bam2bed < reads.bam) > answer.bed

However, if you have lots of bam files, it probably makes more sense (saves more time) to generate bins.bed once and reuse it. Or use a compressed equivalent, described below.

The BEDOPS kit includes a compression format called Starch that can compress BED files to about 3-4% of their original size:

$ bedops --merge contigs.bed | bedops --chop 10 - | starch --omit-signature - > bins.starch

You can use Starch files in BEDOPS operations directly. For instance, here I am replacing the file bins.bed with its compressed equivalent bins.starch:

$ bedmap --echo --count --delim '\t' bins.starch <(bam2bed < reads.bam) > answer.bed

If you have to have an intermediate BED file, then Starch can be a very useful way to save disk space, and still do the same operations.

ADD REPLYlink modified 17 months ago • written 17 months ago by Alex Reynolds31k

Excellent, this is a wonderful speedier alternative, thanks. Two comments. First, starch for some reason omits a contigs name past a dot character ".", ex (hEIF4E2.Homo_sapiens became hEIF4E2). Perhaps this was intended as such, however for custom genome assemblies and contig naming this may be an issue. Second, the approach using bedtools genomecov uses a --mean parameter for bedops while the bam2bed approach uses a --count argument. I tried using --mean with bam2bed but that produced a lot of "NAN" and averages that made no sense. When compare genomecov mean and bam2bed count with the same interval file, there are consistent in where they determine coverage, however not always on the actual values. For instance a genomecov mean can be 1.100000 while a bam2bed count can be 2, a 6.180000 a 14 and so on. Obviously they use different strategies, I am wondering where does the difference lie in.

ADD REPLYlink written 17 months ago by Adrian Pelin2.4k

I think I get it, the second approach, bedops counts amount of reads overlapping by at least 1bp in the intervals specified, so it's not really a mean coverage per bp in the interval but rather amount of reads. So a read can potentially increase the count of two different intervals. Works as well.

ADD REPLYlink written 17 months ago by Adrian Pelin2.4k

Bam2bed only converts reads to BED. It does not do any statistical summarization like coverage.

This is where downstream operations with bedmap --count come in, which counts raw reads over intervals.

Use of a summary operation like --mean makes sense for coverage signal (which has already applied some kind of counting op), but it does not make sense for raw reads.

If a read falls over multiple intervals, it will be counted over each interval without some extra logic. I can suggest an approach that assigns a read to only one bin, if you like.

ADD REPLYlink written 17 months ago by Alex Reynolds31k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1453 users visited in the last hour