Question: Processing bulk RNA-seq TMM matrix to get DEGs
Dataminer (2.7k), Netherlands, wrote 7 months ago:

Dear Community,

I am using publicly available RNA-seq datasets that were processed as follows:

"Gene expression normalizations were performed using the TMM method (PMID: 20196867), and further normalization was applied by adjusting the expression to gene length. In addition, only the genes and that had reads mapped to them in at least 5% of the samples were kept."

Now, if I had a raw read count table I could easily follow DESeq/edgeR/limma, but with a TMM-normalized table, which method should I follow, and what should my starting point be for bringing this table into the workflow?

I am certain some of you must have encountered this problem. Could you please share your knowledge?

Many thanks in advance.

rna-seq • 219 views

"Gene expression normalizations were performed using the TMM method (PMID: 20196867), and further normalization was applied by adjusting the expression to gene length."

AFAIK, adjusting the expression for gene length (EDASeq, cqn) should be performed before the TMM normalization, not after. I would not use these data.

"Adjusting the expression to gene length": this kind of thing drives me crazy. They should say how the data were adjusted.

written 7 months ago by andres.firrincieli (1.0k)
ATpoint (42k), Germany, wrote 7 months ago:

That is a common problem, and I will never understand why authors of a paper provide normalized data instead of a raw count table. The thing is that tools like edgeR do not directly model the normalized counts; they use the normalization factors to produce offsets for their GLMs, which are fitted to the raw counts. That said, no, I do not think you can use the normalized counts for edgeR. Either get the raw counts, download the raw fastq files from that paper, or use something like limma. There are a couple of threads where the limma authors talk about using normalized counts as input for the tool, but this is far from optimal.

The problem with normalizing for gene length is that it reduces the counts, and therefore the power, of longer genes relative to shorter ones, which is why you typically do not do it in differential analysis.

As said, try to get the raw data. If this is not possible, please browse the Bioconductor support forum for discussions on using normalized counts with limma. Look for threads where the PI of the limma/edgeR projects (Gordon Smyth) gave his expert opinion.
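To make the point about offsets concrete, here is a minimal Python sketch. It is not edgeR's implementation: the function names, the numbers, and the normalization factors of 1.0 are all made up for illustration, and a plain Poisson model is assumed for simplicity (edgeR uses a negative binomial). It shows two samples whose normalized values are identical even though the underlying raw counts, and hence their precision, differ.

```python
import math

def effective_library_sizes(lib_sizes, norm_factors):
    """Library size times a TMM-style scaling factor, per sample."""
    return [n * f for n, f in zip(lib_sizes, norm_factors)]

def log_offsets(lib_sizes, norm_factors):
    """Per-sample log offsets, the form in which a count GLM uses normalization."""
    return [math.log(e) for e in effective_library_sizes(lib_sizes, norm_factors)]

def cpm(counts, lib_sizes, norm_factors):
    """Normalized counts per million for one gene across samples."""
    eff = effective_library_sizes(lib_sizes, norm_factors)
    return [c * 1e6 / e for c, e in zip(counts, eff)]

def poisson_relative_se(count):
    """Relative standard error of a Poisson count: sqrt(count) / count."""
    return 1.0 / math.sqrt(count)

# Hypothetical numbers: one gene measured in two samples of very different depth.
raw_counts = [10, 100]
lib_sizes = [1_000_000, 10_000_000]
norm_factors = [1.0, 1.0]  # assumed values; real factors would come from TMM

normalized = cpm(raw_counts, lib_sizes, norm_factors)
offsets = log_offsets(lib_sizes, norm_factors)
precision = [poisson_relative_se(c) for c in raw_counts]

print("normalized (CPM):", normalized)   # the same value in both samples
print("log offsets:     ", offsets)      # carry the depth information instead
print("relative SE:     ", precision)    # larger for the sample with fewer reads
```

Once only the normalized values are kept, the two samples look the same and the difference in precision cannot be recovered. Keeping the raw counts and passing the depth information as an offset avoids that loss.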


Thank you for your reply. I concur that raw reads are more useful than these normalised read counts. I was hoping for a workaround, but it seems there isn't one.

written 7 months ago by Dataminer (2.7k)

I agree raw counts should be available, but normalized counts are useful too. That way, you are looking at the same data as the authors. Also, why force others to reprocess the data?

written 7 months ago by igor (11k)

Because normalized counts do not allow you to run one of the most essential analyses of all: differential testing. Personally, I would provide either both or just the raw counts (and maybe an offset matrix if using something like tximport, CQN or similar methods) together with the processing script. That is just a few lines of code, but it would save users like the OP from having to start from the raw fastq files.

written 7 months ago by ATpoint (42k)

My vote is for both files. I just meant that the normalized counts can be useful too. For most people, needing even a few lines of code makes the data completely inaccessible.

written 7 months ago by igor (11k)