Question: Using RPK counts as DESeq2 input
gravatar for adityabandla
3.0 years ago by
adityabandla30 wrote:


I am analysing a metagenome dataset. To account for differences in gene length, I first divided the raw counts by the gene length, to end up with RPK (reads per kb). Can this matrix be used as an input to DESeq2 for normalisation and differential abundance analysis?

PS. No library size normalisation was done (which I intend to do using DESeq2)

deseq2 • 1.3k views
ADD COMMENTlink modified 3.0 years ago by Devon Ryan95k • written 3.0 years ago by adityabandla30
gravatar for Devon Ryan
3.0 years ago by
Devon Ryan95k
Freiburg, Germany
Devon Ryan95k wrote:

No, you can't use that as input into DESeq2. Unless your samples actually have different gene lengths (e.g., due to being different species), you would never even want to normalize for gene length, that'd only decrease your statistical power.

ADD COMMENTlink written 3.0 years ago by Devon Ryan95k

Hi Devon

Thanks for the reply. I am confused, how that could possibly reduce statistical power. I do understand that one would lose statistical power, in case you rarefy the data (to end up with constant sums across samples).

In this case, I am merely dividing counts by the gene length and hence not discarding data. In case, the loss in statistical power is due to the reduction in the magnitude of counts, one can always counter that by multiplying all counts by a million

PS. In my datasets, the number of species is ~17k. One of my objectives, is to summarize the relative proportions of different gene families within samples and hence in this case gene length is a bias

Thinking about this, I might just have to use two different matrices for my objectives I suppose. The raw counts for differential abundance and the RPK matrix for summarizing sample relative proportions

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by adityabandla30

Yes, power scales with the magnitude of the counts, but while multiplying by a million gets around that it scales the numbers beyond reason (you're getting more power by lying). In short, you gain absolutely nothing by dividing by gene length and instead are just causing problems for yourself.

For determining relative proportions you should use completely different methods. Will detection be affected by sequence composition? Yes. Will you have different 5'->3' bias or degradation in each sample? Probably. Will the effective gene length be different in each sample? Yes. Don't use RPKs, use something like Salmon to give you more reliable TPMs that account for the various biases and then use the tximport package to import things before using DESeq2.

ADD REPLYlink written 3.0 years ago by Devon Ryan95k

Thanks for the detailed reply. However, I am adopting an assembly-free approach first, in essence, blasting short reads against the NR database using DIAMOND. Hence, in this case Salmon doesn't work. The counts obtained using this approach are definitely biased by length. Interestingly, they suggest RPK in the humann2 pipeline

ADD REPLYlink written 3.0 years ago by adityabandla30

RPKM, FPKM and TPM are methods for normalizing reads that relies on gene length and deep of sequencing

But in this video, you can see that DESeq2 uses a different normalization method that needs the raw counts as input

ADD REPLYlink written 3.0 years ago by Antonio R. Franco4.5k

Hi Antonio

Thanks for the pointer! I do understand the way DESeq2 normalizes the data. However, dividing the raw counts by gene length is still "raw counts" since you are not normalizing for differences in library size. DESeq2 requires the "raw counts" that have not been corrected for library size differences


ADD REPLYlink written 3.0 years ago by adityabandla30

I know this thread is old, but since i myself often get information from older biostars-threads:

I am pretty sure you are messing up the statistics by doing that. Let's say we are looking at a particular gene with a count of 1000 in condition A and a count of 3000 in condition B. If we divide these numbers by 1000, we get counts of 1 for cond. A and 3 for cond. B. Now the difference between the 2 conditions just looks like noise, we won't detect differential expression.

ADD REPLYlink written 3 months ago by Tom520

Hi Devon,

I stumbled upon this thread, looking for an answer to a very similar question: I want to do DE analysis between closely related species (human and primates). Many genes will have different gene lengths in different species. Thus it seems that in this case, I need to include a normalization for the gene length (sum of exons length) in each species separately. What would be the best tool and best way to proceed? Shall I calculate FPKMs and feed it somehow into the Tuxedo suite pipeline? Or, some other tool can be used?

(Currently, I have mapped the data to the respective genomes, and also used featureCounts to generate read count tables.)

Thanks, Alex

ADD REPLYlink written 2.1 years ago by Alex Kozlenkov0

I suspect that you could use a gene/species-specific offset. I suggest you go over to the bioconductor support site and ask your question there. Mike Love, the author of DESeq2, will reply and can give you the most appropriate advise.

ADD REPLYlink written 2.1 years ago by Devon Ryan95k

Thank you,

In general, do you recommend employing DEseq2 for that kind of analysis, or have you heard of other tools that people successfully used with cross-species RNA-seq?

ADD REPLYlink written 2.1 years ago by Alex Kozlenkov0

You could probably use limma too (or maybe edgeR), but I would again suggest asking on the bioconductor support site so the package authors can give you some input. Cross-species comparisons are plagued with difficulties (not just the length, but changes in sequence composition can also affect how sequencing efficiency) and not commonly performed, so getting advise directly from the tool author would be ideal.

ADD REPLYlink written 2.1 years ago by Devon Ryan95k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1484 users visited in the last hour