Question: Normalize bedtools gene coverage by total number of reads
1
gravatar for David
17 months ago by
David120
David120 wrote:

Hi, I have run a Hiseq 2x150bp from metagenomics samples.

I have computed the gene coverage for a list of genes for a given sample. For each sample i take a subset of genes i´m interested in and mapped all reads to each of these genes generating a bam file that I use as input to bedtools as follows

bedtools genomecov -ibam subsetgenes_sample_A.bam
bedtools genomecov -ibam subsetgenes_sample_B.bam
bedtools genomecov -ibam subsetgenes_sample_C.bam
....

Here below the output for one sample. I have added column names for clarity.

           V1           V2      V3       V4        V5
        k141_25_1       1       64      165     0.387879
        k141_25_1       2       100     165     0.606061
        k141_25_1       3       1       165     0.00606061
        k141_34_1       2       11      726     0.0151515
        k141_34_1       3       55      726     0.0757576
        k141_34_1       4       36      726     0.0495868
        k141_34_1       5       101     726     0.139118
        k141_34_1       6       123     726     0.169421
        k141_34_1       7       195     726     0.268595

In order to compute gene coverage i run coverage = (V2*V3)/V4. The problem i have is that i have several samples so ideally i should normalize each sample by the total number of reads and multiply by 1M.

Would it make sense to normalize as follows normalized_genes = coverage(from above formula)/totalReadspersample * 1000000

Thanks for comments.

ADD COMMENTlink modified 16 months ago • written 17 months ago by David120

Thanks for your comments Carlo, Can you generate coverage using FeatureCounts from bam files ??

ADD REPLYlink written 16 months ago by David120

Ok found it coverageCounts. Great I´ll normalize with Deseq2 , thanks

ADD REPLYlink written 16 months ago by David120

Hi, Couldn´t manage to get genes coverage using coverageCounts. It returns binary files. How would you get the gene coverage as numeric ?

ADD REPLYlink written 16 months ago by David120

You can get the raw counts for your set of specific genes when you have Bam or SAM files. Later you can load the raw counts inot R and normalize using DESeq2 or EdgeR

ADD REPLYlink written 16 months ago by EVR480
1
gravatar for Carlo Yague
16 months ago by
Carlo Yague4.2k
Belgium
Carlo Yague4.2k wrote:

Hi,

It looks like you want to compare the expression of a few genes between conditions. While it is a good start, I think there are a few issues with your method :

  1. bedtools is very useful but more specific tools, such as HTseqCounts or featureCounts, are better to count reads on genes. This involves less steps than with your method and it takes care properly of ambiguous reads.
  2. RPKM is a poor 'between samples' normalization method. Most people here recommend either DESeq2 or edgeR instead.

Hope this helps,

Carlo

ADD COMMENTlink modified 16 months ago • written 16 months ago by Carlo Yague4.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 821 users visited in the last hour