Question

Using RPK counts as DESeq2 input

0

Entering edit mode

6.9 years ago

adityabandla ▴ 30

Hi

I am analysing a metagenome dataset. To account for differences in gene length, I first divided the raw counts by the gene length, to end up with RPK (reads per kb). Can this matrix be used as an input to DESeq2 for normalisation and differential abundance analysis?

PS. No library size normalisation was done (which I intend to do using DESeq2)

deseq2 • 3.5k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 6.9 years ago by adityabandla ▴ 30

Ram · Answer 1 · 2017-05-21

1

Entering edit mode

6.9 years ago

Devon Ryan 104k

No, you can't use that as input into DESeq2. Unless your samples actually have different gene lengths (e.g., due to being different species), you would never even want to normalize for gene length, that'd only decrease your statistical power.

ADD COMMENT • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

Hi Devon

Thanks for the reply. I am confused, how that could possibly reduce statistical power. I do understand that one would lose statistical power, in case you rarefy the data (to end up with constant sums across samples).

In this case, I am merely dividing counts by the gene length and hence not discarding data. In case, the loss in statistical power is due to the reduction in the magnitude of counts, one can always counter that by multiplying all counts by a million

PS. In my datasets, the number of species is ~17k. One of my objectives, is to summarize the relative proportions of different gene families within samples and hence in this case gene length is a bias

Thinking about this, I might just have to use two different matrices for my objectives I suppose. The raw counts for differential abundance and the RPK matrix for summarizing sample relative proportions

ADD REPLY • link 6.9 years ago by adityabandla ▴ 30

0

Entering edit mode

Yes, power scales with the magnitude of the counts, but while multiplying by a million gets around that it scales the numbers beyond reason (you're getting more power by lying). In short, you gain absolutely nothing by dividing by gene length and instead are just causing problems for yourself.

For determining relative proportions you should use completely different methods. Will detection be affected by sequence composition? Yes. Will you have different 5'->3' bias or degradation in each sample? Probably. Will the effective gene length be different in each sample? Yes. Don't use RPKs, use something like Salmon to give you more reliable TPMs that account for the various biases and then use the tximport package to import things before using DESeq2.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks for the detailed reply. However, I am adopting an assembly-free approach first, in essence, blasting short reads against the NR database using DIAMOND. Hence, in this case Salmon doesn't work. The counts obtained using this approach are definitely biased by length. Interestingly, they suggest RPK in the humann2 pipeline

ADD REPLY • link 6.9 years ago by adityabandla ▴ 30

0

Entering edit mode

RPKM, FPKM and TPM are methods for normalizing reads that relies on gene length and deep of sequencing

But in this video, you can see that DESeq2 uses a different normalization method that needs the raw counts as input

ADD REPLY • link 6.9 years ago by Antonio R. Franco ★ 5.1k

0

Entering edit mode

Hi Antonio

Thanks for the pointer! I do understand the way DESeq2 normalizes the data. However, dividing the raw counts by gene length is still "raw counts" since you are not normalizing for differences in library size. DESeq2 requires the "raw counts" that have not been corrected for library size differences

Best

ADD REPLY • link 6.9 years ago by adityabandla ▴ 30

1

Entering edit mode

I know this thread is old, but since I myself often get information from older biostars threads:

I am pretty sure you are messing up the statistics by doing that. Let's say we are looking at a particular gene with a count of 1000 in condition A and a count of 3000 in condition B. If we divide these numbers by 1000, we get counts of 1 for cond. A and 3 for cond. B. Now the difference between the 2 conditions just looks like noise, we won't detect differential expression.

ADD REPLY • link updated 10 months ago by Ram 43k • written 4.2 years ago by Tom ▴ 540

0

Entering edit mode

Tom can you explain more please? Because maybe it's really just noise if you got only 1000 reads if the gene length is about 1000 time the length of a read (which explain the division by 1000).

ADD REPLY • link 10 months ago by Jérémy • 0

0

Entering edit mode

Hi Devon,

I stumbled upon this thread, looking for an answer to a very similar question: I want to do DE analysis between closely related species (human and primates). Many genes will have different gene lengths in different species. Thus it seems that in this case, I need to include a normalization for the gene length (sum of exons length) in each species separately. What would be the best tool and best way to proceed? Shall I calculate FPKMs and feed it somehow into the Tuxedo suite pipeline? Or, some other tool can be used?

(Currently, I have mapped the data to the respective genomes, and also used featureCounts to generate read count tables.)

Thanks, Alex

ADD REPLY • link 6.0 years ago by Alex Kozlenkov • 0

0

Entering edit mode

I suspect that you could use a gene/species-specific offset. I suggest you go over to the bioconductor support site and ask your question there. Mike Love, the author of DESeq2, will reply and can give you the most appropriate advise.

ADD REPLY • link 6.0 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you,

In general, do you recommend employing DEseq2 for that kind of analysis, or have you heard of other tools that people successfully used with cross-species RNA-seq?

ADD REPLY • link 6.0 years ago by Alex Kozlenkov • 0

0

Entering edit mode

You could probably use limma too (or maybe edgeR), but I would again suggest asking on the bioconductor support site so the package authors can give you some input. Cross-species comparisons are plagued with difficulties (not just the length, but changes in sequence composition can also affect how sequencing efficiency) and not commonly performed, so getting advise directly from the tool author would be ideal.

ADD REPLY • link 6.0 years ago by Devon Ryan 104k