Question

Processing bulk RNA-seq TMM matrix to get DEGs

4.0 years ago

Dataminer ★ 2.8k

Dear Community,

I am using publicly available RNA-seq datasets that were processed as follows:

"Gene expression normalizations were performed using the TMM method (PMID: 20196867), and further normalization was applied by adjusting the expression to gene length. In addition, only the genes and that had reads mapped to them in at least 5% of the samples were kept."

If I had a raw read count table, I could easily follow a DESeq/edgeR/limma workflow. With a TMM-normalized table, however, which method should I follow, and at what point should I bring this table into the workflow?

I am certain some of you have encountered this problem. Could you please share your knowledge?

Many thanks in advance.

RNA-Seq • 1.4k views

ADD COMMENT • link updated 4.0 years ago by ATpoint 81k • written 4.0 years ago by Dataminer ★ 2.8k

"Gene expression normalizations were performed using the TMM method (PMID: 20196867), and further normalization was applied by adjusting the expression to gene length."

AFAIK, adjusting the expression for gene length (EDASeq, cqn) should be performed before the TMM normalization, not after. I would not use these data.

"Adjusting the expression to gene length": this kind of thing drives me crazy. They should say how the data were adjusted.

ADD REPLY • link 4.0 years ago by andres.firrincieli 3.6k

score 3 · Answer 1 · 2020-04-07

4.0 years ago

ATpoint 81k

That is a common problem, and I will never understand why authors of a paper provide normalized data instead of a raw count table. Tools like edgeR do not model the normalized counts directly; they use the normalization factors to produce offsets for their GLMs, which are fitted to the raw counts. That said, no, I do not think you can use the normalized counts with edgeR. Either get the raw counts, download the raw FASTQ files from that paper, or use something like limma. There are a couple of threads where the limma authors discuss using normalized counts as input for the tool, but this is far from optimal.

The problem with normalizing for gene length is that it reduces the counts, and therefore the power, of longer genes relative to shorter ones, which is why you typically do not do it in differential analysis.

As said, try to get the raw data. If that is not possible, browse the Bioconductor support forum for discussions on using normalized counts with limma, and look for threads where the PI of the limma/edgeR projects (Gordon Smyth) gave his expert opinion.
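To make the offset idea concrete, here is a minimal Python sketch (not edgeR code) of how per-sample normalization factors can be turned into GLM offsets while the raw counts stay untouched. The function names and the toy numbers are hypothetical, and the normalization factors are assumed to have been computed already (for example by TMM); the sketch does not compute them.

```python
import math

def effective_library_sizes(counts, norm_factors):
    """Library size of each sample (column sum) times its normalization factor.

    counts: list of gene rows, each a list of raw counts per sample.
    norm_factors: one pre-computed factor per sample (assumed input).
    """
    n_samples = len(norm_factors)
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [lib * f for lib, f in zip(lib_sizes, norm_factors)]

def log_offsets(counts, norm_factors):
    """Per-sample offsets for a log-link count model:
    log(mu_gj) = x_j . beta_g + offset_j, fitted to the RAW counts."""
    return [math.log(s) for s in effective_library_sizes(counts, norm_factors)]

def normalized_cpm(counts, norm_factors):
    """Counts per million using effective library sizes.
    Useful for plotting, but the original counts cannot be recovered
    from these values without the library sizes and factors."""
    eff = effective_library_sizes(counts, norm_factors)
    return [[1e6 * c / e for c, e in zip(row, eff)] for row in counts]

# Toy data: 2 genes x 2 samples, with made-up normalization factors.
toy_counts = [[10, 20], [30, 60]]
toy_factors = [1.0, 0.5]

print(log_offsets(toy_counts, toy_factors))
print(normalized_cpm(toy_counts, toy_factors))
```

The point of the sketch is that the normalized table is a by-product: the model itself needs the raw counts plus the offsets, and a table that only contains the normalized (and additionally length-adjusted) values no longer carries both.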

ADD COMMENT • link 4.0 years ago by ATpoint 81k

Thank you for your reply. I concur that raw reads are more useful than these normalised read counts. I was hoping for a workaround, but it seems that there isn't any.

ADD REPLY • link 4.0 years ago by Dataminer ★ 2.8k

I agree raw counts should be available, but normalized counts are useful too. That way, you are looking at the same data as the authors. Also, why force others to reprocess the data?

ADD REPLY • link 4.0 years ago by igor 13k

Because normalized counts do not allow you to run one of the most essential analyses of all: differential testing. I personally would either provide both or provide the raw counts (and maybe an offset matrix if using something like tximport, CQN or similar methods) together with the processing script. That is just a few lines of code, but it would spare users like OP from having to start from raw FASTQ files.
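One way to see why the raw count magnitude matters for testing is the following minimal Python sketch. It assumes a plain Poisson model, which is a simplification (count-based tools such as edgeR use negative binomial models with extra dispersion), and the two genes and their numbers are made up for illustration.

```python
import math

def poisson_relative_se(count):
    """Relative standard error of a Poisson count: sqrt(n) / n = 1 / sqrt(n)."""
    if count <= 0:
        raise ValueError("count must be positive")
    return 1.0 / math.sqrt(count)

def length_normalized(count, gene_length_kb):
    """Count divided by gene length in kilobases (hypothetical adjustment)."""
    return count / gene_length_kb

# Two hypothetical genes with the same length-normalized value.
genes = {
    "short_gene": {"count": 5, "length_kb": 1.0},
    "long_gene": {"count": 500, "length_kb": 100.0},
}

for name, gene in genes.items():
    value = length_normalized(gene["count"], gene["length_kb"])
    rel_se = poisson_relative_se(gene["count"])
    print(f"{name}: normalized={value:.1f}, relative SE={rel_se:.3f}")
```

Both genes end up with the same normalized value, yet the one backed by 500 reads is measured far more precisely than the one backed by 5. A count-based test can use that difference only if it sees the raw counts.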

ADD REPLY • link 4.0 years ago by ATpoint 81k

My vote is for both files. I just meant that the normalized counts can be useful too. For most people, a few lines of code make the data completely inaccessible.

ADD REPLY • link 4.0 years ago by igor 13k