Question

Counts table as edgeR input for differential gene expression

0

4.0 years ago

Oli • 0

Hello community, sorry for the dumb question, but I'm just a novice. I downloaded a file with the read counts from a previously published paper and ran a DEG analysis with edgeR after following some tutorials and reading the user guide. But I have a problem: all the tutorials I followed started from a table of raw counts made up of integers, while the table from the paper has decimal numbers. I read somewhere that edgeR performs its own normalisation, so raw counts should be used; is that true? Neither the file headers nor the supplementary material mention any kind of normalisation (the file is just called "processed"), so what am I handling? I thought "processed" referred to the fact that the reads were aligned and so on. Is it possible to have raw counts with decimal numbers? Is the analysis still reliable or am I working on some sort of normalisation that messes up my analysis?

Thank you very much in advance!

(This is how my read counts appear: https://imgur.com/1wwexTR)

RNA-Seq • 3.5k views

ADD COMMENT • link updated 4.0 years ago by swbarnes2 14k • written 4.0 years ago by Oli • 0

score 1 · Answer 1 · 2020-05-14

1

4.0 years ago

swbarnes2 14k

Is it possible to have raw counts with decimal numbers?

It's actually possible; RSEM returns expected counts as fractions, because it splits each ambiguously mapped read fractionally among all the places it thinks that read might have come from. Rounded RSEM expected counts are acceptable to use.

But in a set of counts from RSEM, there will be some genes which have integer counts, because there will be some genes where counts can be assigned unambiguously. So I don't think you have that.

You should assume that's normalized data, and not suitable for edgeR.
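A quick way to apply the reasoning above to a table is to count how many non-zero values are whole numbers. The sketch below is only an illustration in plain Python: the function name `summarize_count_table` and both toy tables are made up, and a real table would first have to be read from the downloaded file.

```python
def summarize_count_table(rows):
    """Count how many non-zero values in a gene -> per-sample-values table are whole numbers."""
    values = [v for sample_values in rows.values() for v in sample_values]
    nonzero = [v for v in values if v != 0]
    integer_like = [v for v in nonzero if float(v).is_integer()]
    share = len(integer_like) / len(nonzero) if nonzero else 0.0
    return {
        "nonzero_values": len(nonzero),
        "integer_like": len(integer_like),
        "integer_share": share,
    }


# Hypothetical toy data: a mix of integer and fractional values,
# as expected counts from RSEM would typically look.
rsem_like = {"geneA": [10.0, 25.0], "geneB": [3.5, 7.25], "geneC": [0.0, 12.0]}

# Hypothetical toy data: no whole numbers at all among the non-zero values,
# which points towards some kind of normalization.
normalized_like = {"geneA": [10.37, 25.91], "geneB": [3.52, 7.28], "geneC": [0.0, 12.44]}

rsem_summary = summarize_count_table(rsem_like)
normalized_summary = summarize_count_table(normalized_like)
print("RSEM-like table:", rsem_summary)
print("Normalized-like table:", normalized_summary)
```

A share of whole numbers at or near zero is a hint, not proof, that the values are normalized; asking the authors remains the reliable way to find out.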

ADD COMMENT • link 4.0 years ago by swbarnes2 14k

0

Thank you very much for the clear explanation. Let me ask you one more question, please: does the same problem apply to DESeq too?

ADD REPLY • link 4.0 years ago by Oli • 0

0

What you describe are transcript abundance estimates and not counts. These would need to be aggregated to the gene level and this in turn would produce integers again, right?

ADD REPLY • link 4.0 years ago by ATpoint 82k

0

I don't know, I'm a novice so my knowledge is still severely lacking! You mean there's a way to convert them back into raw counts?

ADD REPLY • link 4.0 years ago by Oli • 0

0

No. I was referring to swbarnes2's comment. It does not apply to your situation.

ADD REPLY • link 4.0 years ago by ATpoint 82k

0

RSEM can output a gene-level "expected_count". RSEM can split a read's gene assignment probabilistically among multiple genes if the read can't be uniquely assigned to a single gene. So some genes will have every count belonging 100% to that gene, but some will not, and those genes will have non-integer expected counts.
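To make the idea concrete, here is a toy sketch in Python. It is not RSEM's actual algorithm (RSEM estimates the assignment probabilities itself); the reads, gene names, and probabilities below are invented, and the code only shows how summing per-read probabilities gives whole numbers for some genes and fractions for others.

```python
from collections import defaultdict


def expected_counts(read_assignments):
    """Sum per-read assignment probabilities into per-gene expected counts."""
    totals = defaultdict(float)
    for assignment in read_assignments:
        for gene, probability in assignment.items():
            totals[gene] += probability
    return dict(totals)


# Hypothetical reads: each dict maps gene -> probability that the read came from it.
reads = [
    {"geneA": 1.0},                  # uniquely assigned
    {"geneA": 1.0},                  # uniquely assigned
    {"geneB": 0.75, "geneC": 0.25},  # ambiguous read, split between two genes
    {"geneB": 1.0},                  # uniquely assigned
    {"geneD": 1.0},                  # uniquely assigned
]

counts = expected_counts(reads)
for gene, count in sorted(counts.items()):
    print(gene, count)
```

In this toy case geneA and geneD end up with whole numbers because all their reads are unambiguous, while geneB and geneC share one read and get fractional totals.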

DESeq2 can import RSEM output

https://support.bioconductor.org/p/94003/#94028

But again, if the OP has no integer counts at all, then that's not what s/he has.

ADD REPLY • link 4.0 years ago by swbarnes2 14k

score 0 · Answer 2 · 2020-05-14

0

4.0 years ago

ATpoint 82k

Is it possible to have raw counts with decimal numbers?

Usually not, no. What you have is probably some kind of normalized data.

Is the analysis still reliable or am I working on some sort of normalisation that messes up my analysis?

If you use edgeR, it will try to re-normalize these already normalized values, and this will produce nonsense results. An explanation of why edgeR is not compatible with already normalized counts can be found in section 2.8.6 of the manual: https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf The point is that the normalized counts are not used directly. The calculated size factors are used as offsets for the GLMs, which are based on the raw counts.
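A small numeric sketch of what gets lost, written in plain Python rather than edgeR: the toy counts are invented, and counts-per-million (CPM) is used only as a stand-in, since we do not know which normalization the paper applied. After scaling, every sample sums to the same total, so the original sequencing depths can no longer be recovered from the table.

```python
def library_sizes(table):
    """Return the per-sample column sums of a gene -> per-sample-values table."""
    n_samples = len(next(iter(table.values())))
    return [sum(values[j] for values in table.values()) for j in range(n_samples)]


def to_cpm(table):
    """Scale each sample to counts per million (one possible normalization)."""
    sizes = library_sizes(table)
    return {
        gene: [value / size * 1e6 for value, size in zip(values, sizes)]
        for gene, values in table.items()
    }


# Hypothetical raw counts for two samples; sample 2 was sequenced 4x deeper.
raw = {"geneA": [100, 400], "geneB": [300, 1200], "geneC": [600, 2400]}

raw_sizes = library_sizes(raw)  # depths differ between samples
cpm = to_cpm(raw)
cpm_sizes = library_sizes(cpm)  # every sample now sums to about one million

print("raw library sizes:", raw_sizes)
print("CPM library sizes:", cpm_sizes)
```

The raw table tells you that a count of 100 came from a library of 1,000 reads and a count of 400 from a library of 4,000; the scaled table shows the same value for both and hides how many reads stood behind it.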

The safest option would be to download the raw data from NCBI and align/quantify them as described in the standard workflows you can find e.g. on Bioconductor. Or email the authors and ask for a table of raw counts; that would be the fastest/easiest option.

If all of that is not possible please google for the limma-trend pipeline. There are multiple threads on Bioconductor that describe use of limma-trend on already normalized counts, and also why this is not optimal.

ADD COMMENT • link 4.0 years ago by ATpoint 82k

0

Thank you sincerely for your response, it helped me greatly. I really lack the proper knowledge to align and quantify reads from raw data, so I'm going to reconsider my whole approach and try to email the authors just in case.

ADD REPLY • link 4.0 years ago by Oli • 0

0

Hello, is what you are saying regarding the presence of decimals in count tables true for every case? I am wondering because I was thinking of the count table that featureCounts outputs when it is allowed to count multi-mapping reads/fragments (-M and --fraction options). Some values, in this case, are decimals and, although some normalization is taking place, I would think it's not as drastic as when you normalize for gene length and sequencing depth.
Thank you!
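For illustration, here is a toy Python sketch of that kind of fractional counting. It assumes, as I understand the featureCounts documentation, that with -M and --fraction each alignment of a read with x reported alignments contributes 1/x; the reads and gene names are invented and this is not featureCounts itself.

```python
from collections import defaultdict


def fractional_counts(read_alignments):
    """Give each gene hit by a read an equal share 1/x, where x is the number of hits."""
    totals = defaultdict(float)
    for genes in read_alignments:
        weight = 1.0 / len(genes)
        for gene in genes:
            totals[gene] += weight
    return dict(totals)


# Hypothetical reads: each list holds the genes that one read aligned to.
reads = [
    ["geneA"],                             # unique alignment, counts as 1
    ["geneA"],                             # unique alignment, counts as 1
    ["geneA", "geneB"],                    # two alignments, 0.5 each
    ["geneB"],                             # unique alignment, counts as 1
    ["geneC", "geneD", "geneE", "geneF"],  # four alignments, 0.25 each
]

counts = fractional_counts(reads)
total = sum(counts.values())
print(counts)
print("total:", total, "reads:", len(reads))
```

In this sketch the per-gene values are decimals, but the total still equals the number of reads, so the sequencing depth is preserved, unlike with normalization for gene length or depth.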

ADD REPLY • link 4.0 years ago by fmerkal ▴ 60

0

I cannot guarantee it is always true. For tools like featureCounts I would expect it even though I am not familiar with the options for featureCounts you mention.

ADD REPLY • link 4.0 years ago by ATpoint 82k