Question: Normalize bedtools gene coverage by total number of reads
gravatar for David
3.8 years ago by
David190 wrote:

Hi, I have run a Hiseq 2x150bp from metagenomics samples.

I have computed the gene coverage for a list of genes for a given sample. For each sample i take a subset of genes i´m interested in and mapped all reads to each of these genes generating a bam file that I use as input to bedtools as follows

bedtools genomecov -ibam subsetgenes_sample_A.bam
bedtools genomecov -ibam subsetgenes_sample_B.bam
bedtools genomecov -ibam subsetgenes_sample_C.bam

Here below the output for one sample. I have added column names for clarity.

           V1           V2      V3       V4        V5
        k141_25_1       1       64      165     0.387879
        k141_25_1       2       100     165     0.606061
        k141_25_1       3       1       165     0.00606061
        k141_34_1       2       11      726     0.0151515
        k141_34_1       3       55      726     0.0757576
        k141_34_1       4       36      726     0.0495868
        k141_34_1       5       101     726     0.139118
        k141_34_1       6       123     726     0.169421
        k141_34_1       7       195     726     0.268595

In order to compute gene coverage i run coverage = (V2*V3)/V4. The problem i have is that i have several samples so ideally i should normalize each sample by the total number of reads and multiply by 1M.

Would it make sense to normalize as follows normalized_genes = coverage(from above formula)/totalReadspersample * 1000000

Thanks for comments.

ADD COMMENTlink modified 3.7 years ago • written 3.8 years ago by David190

Thanks for your comments Carlo, Can you generate coverage using FeatureCounts from bam files ??

ADD REPLYlink written 3.7 years ago by David190

Ok found it coverageCounts. Great I´ll normalize with Deseq2 , thanks

ADD REPLYlink written 3.7 years ago by David190

Hi, Couldn´t manage to get genes coverage using coverageCounts. It returns binary files. How would you get the gene coverage as numeric ?

ADD REPLYlink written 3.7 years ago by David190

You can get the raw counts for your set of specific genes when you have Bam or SAM files. Later you can load the raw counts inot R and normalize using DESeq2 or EdgeR

ADD REPLYlink written 3.7 years ago by EVR570
gravatar for Carlo Yague
3.7 years ago by
Carlo Yague5.5k
Carlo Yague5.5k wrote:


It looks like you want to compare the expression of a few genes between conditions. While it is a good start, I think there are a few issues with your method :

  1. bedtools is very useful but more specific tools, such as HTseqCounts or featureCounts, are better to count reads on genes. This involves less steps than with your method and it takes care properly of ambiguous reads.
  2. RPKM is a poor 'between samples' normalization method. Most people here recommend either DESeq2 or edgeR instead.

Hope this helps,


ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Carlo Yague5.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1823 users visited in the last hour