Question

Using ComBat-seq on transcript counts

2

3.5 years ago

markddesimone ▴ 60

I am working with transcript counts produced by RSEM which gives me expected_count, TPM and FPKM values. I usually work with TPM values as the counts have been normalized for transcript length. I would like to use ComBat-seq for batch effect removal. The documentation https://github.com/zhangyuqing/ComBat-seq says ComBat-seq requires

untransformed, raw count matrix as input

It also says:

ComBat-seq provides adjusted data which preserves the integer nature of counts.

Since none of the counts produced by RSEM are integer, I'm not clear on what ComBat-seq is asking me to provide. It would seem that TPM would be appropriate as transcript length has been taken into account but the 'integer' part brings that into question.

Can anyone clarify what I should pass into ComBat-seq? Thank you.

RNA-Seq • 5.0k views

ADD COMMENT • link updated 20 months ago by Gordon Smyth ★ 7.0k • written 3.5 years ago by markddesimone ▴ 60

score 4 · Answer 1 · 2020-11-09

3.5 years ago

Gordon Smyth ★ 7.0k

ComBat-seq uses edgeR under the hood, which is able to handle fractional counts such as those from RSEM. You need to use the RSEM expected counts. There is no need to round them to exact integers; in fact, rounding will lose some information.

You absolutely cannot use TPM or FPKM.
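To make the distinction concrete, here is a minimal Python sketch, assuming the standard column layout of an RSEM `*.isoforms.results` file. The two-transcript excerpt, its values, and the helper names `read_expected_counts` and `tpm` are hypothetical and only for illustration. It shows that TPM is rescaled to a fixed total per sample, so it no longer carries the sequencing-depth information that a count-based model relies on, and that rounding changes the expected counts.

```python
import csv
import io

# Hypothetical two-transcript excerpt in the RSEM *.isoforms.results layout.
# The TPM, FPKM and IsoPct columns are unused placeholders here.
RSEM_TEXT = (
    "transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct\n"
    "tx1\tg1\t1200\t1000.00\t150.40\t0.00\t0.00\t0.00\n"
    "tx2\tg1\t700\t500.00\t49.60\t0.00\t0.00\t0.00\n"
)


def read_expected_counts(text):
    """Return {transcript_id: (expected_count, effective_length)} as floats."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        row["transcript_id"]: (
            float(row["expected_count"]),   # kept fractional, not rounded
            float(row["effective_length"]),
        )
        for row in reader
    }


def tpm(counts, eff_lengths):
    """Length-normalised rates rescaled so that they sum to one million."""
    rates = [c / l for c, l in zip(counts, eff_lengths)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


records = read_expected_counts(RSEM_TEXT)
ids = sorted(records)
counts = [records[t][0] for t in ids]
lengths = [records[t][1] for t in ids]

# The same library sequenced ten times deeper: counts scale, TPM does not.
deep_counts = [10 * c for c in counts]
tpm_shallow = tpm(counts, lengths)
tpm_deep = tpm(deep_counts, lengths)

# Rounding moves each expected count by up to 0.5.
rounding_error = [abs(round(c) - c) for c in counts]
```

In this toy case `tpm_shallow` and `tpm_deep` agree up to floating-point error even though the underlying counts differ tenfold, which is the information a raw-count method needs and TPM discards.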

ADD COMMENT • link 20 months ago by Gordon Smyth ★ 7.0k

1

Does ComBat-seq work on transcript-level counts rather than gene-level counts? If so, does mapping uncertainty play a role here?

ADD REPLY • link 3.5 years ago by ATpoint 82k

0

The ComBat-seq paper only mentions gene-level counts. I would guess that mapping uncertainty would be a major issue for transcript-level counts, and that ComBat-seq is only designed for gene-level counts, but the ComBat-seq authors would have to confirm.

It isn't clear to me whether OP has gene level or transcript level counts. The question says "transcript counts" but also mentions "gene lengths".

ADD REPLY • link 3.5 years ago by Gordon Smyth ★ 7.0k

0

Thank you for the response. I am using transcript counts. Sorry, I should have said transcript length, and I have now edited the post to correct that.

I am interested in the relative expression of transcripts within and between samples, which is why I was using TPM: the relative expression of transcripts within a sample has already been normalized in the TPM (i.e. expected_count / effective_length, rescaled per sample). My counts were also measured with different technologies: some via NanoString, which counts occurrences of a particular sequence, and some via RNA-Seq, where multiple reads map to a single transcript and the counts are therefore amplified by effective_length. My current pipeline takes TPM values from RNA-Seq and counts from NanoString and normalizes them together using geometric means, as in DESeq2. I then wanted to use ComBat-seq to correct for batch effects. If TPM is out of the question for ComBat-seq, do you have any suggestions for how to unify these data? Should I amplify my NanoString counts by effective_length to simulate expected_counts and pass these to ComBat-seq?
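For readers unfamiliar with the geometric-mean normalization mentioned above, the following is a minimal Python sketch of the median-of-ratios idea used for DESeq2-style size factors. It is a simplified illustration, not DESeq2's implementation, and it does not address whether combining NanoString and RNA-Seq values this way is appropriate. The function name `size_factors` and the toy matrix are hypothetical.

```python
import math
from statistics import median


def size_factors(matrix):
    """Median-of-ratios size factors, one per sample (column).

    matrix: list of feature rows, each a list of per-sample values.
    Rows containing a zero are skipped because their geometric mean is zero.
    """
    n_samples = len(matrix[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in matrix:
        if any(v <= 0 for v in row):
            continue
        # Log of the geometric mean of this feature across samples.
        log_geo_mean = sum(math.log(v) for v in row) / len(row)
        for j, v in enumerate(row):
            log_ratios[j].append(math.log(v) - log_geo_mean)
    # The size factor is the median ratio to the geometric-mean reference.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical values: sample 2 is exactly twice sample 1.
toy = [
    [10.0, 20.0],
    [50.0, 100.0],
    [0.0, 4.0],      # skipped: contains a zero
    [200.0, 400.0],
]
factors = size_factors(toy)
normalised = [[v / f for v, f in zip(row, factors)] for row in toy]
```

After dividing by the size factors, the two toy samples have matching values for every feature that was used to estimate the factors.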

ADD REPLY • link 3.5 years ago by markddesimone ▴ 60

0

Hello, I'm running into the same issue with RNA-Seq and NanoString data. If you find a solution, would you mind sharing it with me? Thanks in advance.

ADD REPLY • link 3.0 years ago by szutre ▴ 10

score 0 · Answer 2 · 2020-11-09

3.5 years ago

swbarnes2 14k

TPM is a transformed value, and ComBat-seq does not want transformed input. Rounding RSEM expected_counts is appropriate.

ADD COMMENT • link 3.5 years ago by swbarnes2 14k

score 0 · Answer 3 · 2020-11-09

ComBat-seq takes raw, un-normalized counts as input and addresses batch effects using a negative binomial regression model. You can use featureCounts (https://academic.oup.com/bioinformatics/article/30/7/923/232889) or htseq-count (https://htseq.readthedocs.io/en/release_0.11.1/count.html) to get raw, un-normalized counts.
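As a rough sketch of the data-preparation step, the Python below reads a featureCounts-style table into a gene-by-sample matrix of raw integer counts. It assumes the default featureCounts output layout (a leading `#` comment line, then six annotation columns followed by one column per BAM file). The excerpt, sample names, and the helper `read_featurecounts` are hypothetical.

```python
import csv

# Hypothetical excerpt in the default featureCounts output layout.
FEATURECOUNTS_TEXT = (
    "# Program:featureCounts (hypothetical excerpt)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA.bam\tsampleB.bam\n"
    "geneX\tchr1\t100\t900\t+\t801\t12\t30\n"
    "geneY\tchr2\t50\t450\t-\t401\t0\t7\n"
)

# Geneid, Chr, Start, End, Strand, Length precede the per-sample columns.
ANNOTATION_COLUMNS = 6


def read_featurecounts(text):
    """Return (gene_ids, sample_names, counts) with counts as rows of ints."""
    lines = [ln for ln in text.splitlines() if not ln.startswith("#")]
    rows = list(csv.reader(lines, delimiter="\t"))
    header, body = rows[0], rows[1:]
    samples = header[ANNOTATION_COLUMNS:]
    gene_ids = [r[0] for r in body]
    # Raw counts are left untouched: no length or depth normalisation.
    counts = [[int(v) for v in r[ANNOTATION_COLUMNS:]] for r in body]
    return gene_ids, samples, counts


gene_ids, samples, counts = read_featurecounts(FEATURECOUNTS_TEXT)
```

A matrix like `counts`, together with a per-sample batch label, is the kind of input ComBat-seq expects. The `int` conversion assumes integer counts and would need to become `float` if featureCounts was run with a fractional counting option.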