Question: Using RPK counts as DESeq2 input
gravatar for adityabandla
2.0 years ago by
adityabandla20 wrote:


I am analysing a metagenome dataset. To account for differences in gene length, I first divided the raw counts by the gene length, to end up with RPK (reads per kb). Can this matrix be used as an input to DESeq2 for normalisation and differential abundance analysis?

PS. No library size normalisation was done (which I intend to do using DESeq2)

deseq2 • 921 views
ADD COMMENTlink modified 2.0 years ago by Devon Ryan90k • written 2.0 years ago by adityabandla20
gravatar for Devon Ryan
2.0 years ago by
Devon Ryan90k
Freiburg, Germany
Devon Ryan90k wrote:

No, you can't use that as input into DESeq2. Unless your samples actually have different gene lengths (e.g., due to being different species), you would never even want to normalize for gene length, that'd only decrease your statistical power.

ADD COMMENTlink written 2.0 years ago by Devon Ryan90k

Hi Devon

Thanks for the reply. I am confused, how that could possibly reduce statistical power. I do understand that one would lose statistical power, in case you rarefy the data (to end up with constant sums across samples).

In this case, I am merely dividing counts by the gene length and hence not discarding data. In case, the loss in statistical power is due to the reduction in the magnitude of counts, one can always counter that by multiplying all counts by a million

PS. In my datasets, the number of species is ~17k. One of my objectives, is to summarize the relative proportions of different gene families within samples and hence in this case gene length is a bias

Thinking about this, I might just have to use two different matrices for my objectives I suppose. The raw counts for differential abundance and the RPK matrix for summarizing sample relative proportions

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by adityabandla20

Yes, power scales with the magnitude of the counts, but while multiplying by a million gets around that it scales the numbers beyond reason (you're getting more power by lying). In short, you gain absolutely nothing by dividing by gene length and instead are just causing problems for yourself.

For determining relative proportions you should use completely different methods. Will detection be affected by sequence composition? Yes. Will you have different 5'->3' bias or degradation in each sample? Probably. Will the effective gene length be different in each sample? Yes. Don't use RPKs, use something like Salmon to give you more reliable TPMs that account for the various biases and then use the tximport package to import things before using DESeq2.

ADD REPLYlink written 2.0 years ago by Devon Ryan90k

Thanks for the detailed reply. However, I am adopting an assembly-free approach first, in essence, blasting short reads against the NR database using DIAMOND. Hence, in this case Salmon doesn't work. The counts obtained using this approach are definitely biased by length. Interestingly, they suggest RPK in the humann2 pipeline

ADD REPLYlink written 2.0 years ago by adityabandla20

RPKM, FPKM and TPM are methods for normalizing reads that relies on gene length and deep of sequencing

But in this video, you can see that DESeq2 uses a different normalization method that needs the raw counts as input

ADD REPLYlink written 2.0 years ago by Antonio R. Franco4.0k

Hi Antonio

Thanks for the pointer! I do understand the way DESeq2 normalizes the data. However, dividing the raw counts by gene length is still "raw counts" since you are not normalizing for differences in library size. DESeq2 requires the "raw counts" that have not been corrected for library size differences


ADD REPLYlink written 2.0 years ago by adityabandla20

Hi Devon,

I stumbled upon this thread, looking for an answer to a very similar question: I want to do DE analysis between closely related species (human and primates). Many genes will have different gene lengths in different species. Thus it seems that in this case, I need to include a normalization for the gene length (sum of exons length) in each species separately. What would be the best tool and best way to proceed? Shall I calculate FPKMs and feed it somehow into the Tuxedo suite pipeline? Or, some other tool can be used?

(Currently, I have mapped the data to the respective genomes, and also used featureCounts to generate read count tables.)

Thanks, Alex

ADD REPLYlink written 12 months ago by Alex Kozlenkov0

I suspect that you could use a gene/species-specific offset. I suggest you go over to the bioconductor support site and ask your question there. Mike Love, the author of DESeq2, will reply and can give you the most appropriate advise.

ADD REPLYlink written 12 months ago by Devon Ryan90k

Thank you,

In general, do you recommend employing DEseq2 for that kind of analysis, or have you heard of other tools that people successfully used with cross-species RNA-seq?

ADD REPLYlink written 12 months ago by Alex Kozlenkov0

You could probably use limma too (or maybe edgeR), but I would again suggest asking on the bioconductor support site so the package authors can give you some input. Cross-species comparisons are plagued with difficulties (not just the length, but changes in sequence composition can also affect how sequencing efficiency) and not commonly performed, so getting advise directly from the tool author would be ideal.

ADD REPLYlink written 12 months ago by Devon Ryan90k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 670 users visited in the last hour