Using FPKM and TPM values for batch correction for Single Cell RNA-Seq
2.0 years ago
hkarakurt ▴ 110

Hello, we are trying to analyze a set of single-cell data sets from different sources, but we have a problem. One of the data sets is in TPM and another is in FPKM format. Batch correction is easy with raw counts (with CCA in Seurat or MNN in scran/batchelor), but we have no idea how to deal with this problem.

Do you think we can use TPM and FPKM values for batch correction, since they are already normalized? Another option is to convert the values to raw counts, but we have no idea how to do that.
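For reference, the one conversion between these units that is well defined is FPKM to TPM within a single cell: the FPKM values are rescaled so that they sum to one million. The minimal Python sketch below shows this; the function name and the numbers are hypothetical. Going the other way, back to raw counts, would need each cell's library size and the gene lengths used by the original pipeline, which an FPKM/TPM table alone does not contain.

```python
def fpkm_to_tpm(fpkm):
    """Rescale one cell's FPKM values so that they sum to one million (TPM)."""
    total = sum(fpkm)
    if total == 0:
        # An empty cell stays all-zero instead of dividing by zero.
        return [0.0 for _ in fpkm]
    return [value / total * 1e6 for value in fpkm]


# Hypothetical FPKM values for four genes in a single cell.
cell_fpkm = [10.0, 30.0, 0.0, 60.0]
cell_tpm = fpkm_to_tpm(cell_fpkm)
print(cell_tpm)
```

This only puts both data sets on the same unit. It does not remove batch effects or differences between the upstream pipelines.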

RNA-Seq scRNA-seq batch correction fpkm tpm • 1.6k views
2.0 years ago
ATpoint 54k

Neither is suited for differential (or any other inter-sample) comparison. Please use Google and the search function. FPKM/TPM as a normalization technique (and why it is a poor choice) has been discussed many times before. Also, please check previous threads on batch correction. You always want identical in silico processing of the data to avoid confounding effects, and you always want the data normalized together, not independently.
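One commonly cited reason for the inter-sample problem is that TPM is a within-sample proportion: every value depends on everything else expressed in that cell. A minimal Python sketch with made-up numbers (gene names and abundances are hypothetical):

```python
def to_tpm(abundances):
    """Scale length-normalised abundances so that they sum to one million."""
    total = sum(abundances.values())
    return {gene: value / total * 1e6 for gene, value in abundances.items()}


# Hypothetical length-normalised abundances (arbitrary units) in two cells.
# geneA and geneB are identical in both; only geneC differs.
cell_1 = {"geneA": 50.0, "geneB": 50.0, "geneC": 0.0}
cell_2 = {"geneA": 50.0, "geneB": 50.0, "geneC": 900.0}

tpm_1 = to_tpm(cell_1)
tpm_2 = to_tpm(cell_2)

# geneA has the same underlying abundance in both cells, yet its TPM differs.
print(tpm_1["geneA"], tpm_2["geneA"])
```

Here geneA looks about ten-fold lower in the second cell purely because geneC takes up most of the total. The same holds for FPKM, on top of any differences in how each data set was processed.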


I am planning to check the tximport package to convert FPKM and TPM values to counts. I found the function. I think that is the one you mean.


No, I did not. Sorry to say, but what you say does not make sense. tximport is meant to summarize transcript abundance estimates to the gene level while correcting for the different transcript lengths, which influence the abundances (longer transcripts => higher abundances). As I said, neither TPM nor FPKM is suited for inter-sample comparisons. Are these at least two datasets from the same study/lab, or two completely different datasets?
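To make "summarizing to the gene level" concrete, here is a small conceptual sketch in Python. It is not the tximport R API, and all names and numbers are hypothetical. It assumes transcript-level quantifications (as produced by tools such as Salmon or kallisto) and computes, per gene, the summed TPM and the abundance-weighted average transcript length.

```python
from collections import defaultdict


def summarise_to_gene(transcripts):
    """Sum transcript TPMs per gene and compute an abundance-weighted mean length."""
    tpm_sum = defaultdict(float)
    weighted_len = defaultdict(float)
    for tx in transcripts:
        tpm_sum[tx["gene"]] += tx["tpm"]
        weighted_len[tx["gene"]] += tx["tpm"] * tx["eff_length"]

    result = {}
    for gene, total in tpm_sum.items():
        # The average length is undefined when a gene has no expressed transcript.
        avg_len = weighted_len[gene] / total if total > 0 else float("nan")
        result[gene] = {"tpm": total, "avg_length": avg_len}
    return result


# Hypothetical transcript-level estimates for one sample.
transcripts = [
    {"id": "tx1", "gene": "geneA", "tpm": 30.0, "eff_length": 1000.0},
    {"id": "tx2", "gene": "geneA", "tpm": 10.0, "eff_length": 3000.0},
    {"id": "tx3", "gene": "geneB", "tpm": 5.0, "eff_length": 500.0},
]
gene_level = summarise_to_gene(transcripts)
print(gene_level)
```

The input here is per-transcript output from a quantification tool. A gene-level FPKM/TPM table from a public resource does not carry that information, so this kind of summarization does not turn it back into counts.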


They are completely different data sets from the same biological sample.


Then it is probably impossible to do what you are aiming for. See, essentially, C: Comparison between scRNA and bulk RNA, which should cover the main arguments regardless of whether the dataset is bulk or single-cell. Points 2 and 4 are the most important.

2.0 years ago
shoujun.gu ▴ 360

From my experience:

1. Do not use FPKM/TPM in any situation.
2. Do not use CCA unless the datasets are technical repeats (or something like this).
3. If the two datasets differ a lot and you still believe there are common populations within them, then MNN may be helpful. But only use it for clustering.
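For readers unfamiliar with the idea behind MNN, the core step is finding pairs of cells, one from each batch, that are each other's nearest neighbour. The Python sketch below shows only that pairing step with k = 1 and Euclidean distance on made-up 2-D coordinates. It is an illustration under those assumptions, not the batchelor implementation, which uses k nearest neighbours on processed expression values and then computes correction vectors from the pairs.

```python
import math


def nearest(point, others):
    """Index of the point in `others` that is closest (Euclidean) to `point`."""
    return min(range(len(others)), key=lambda j: math.dist(point, others[j]))


def mutual_nearest_pairs(batch_a, batch_b):
    """Return (i, j) pairs where a[i] and b[j] are each other's nearest neighbour."""
    a_to_b = [nearest(p, batch_b) for p in batch_a]
    b_to_a = [nearest(q, batch_a) for q in batch_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Hypothetical cells: batch_b contains the two populations of batch_a,
# shifted by a batch effect, plus one population found only in batch_b.
batch_a = [(0.0, 0.0), (5.0, 5.0)]
batch_b = [(1.0, 0.5), (6.0, 5.5), (20.0, 20.0)]

pairs = mutual_nearest_pairs(batch_a, batch_b)
print(pairs)
```

The batch-specific cell at (20, 20) gets no mutual partner, which is why the approach can tolerate populations that exist in only one dataset.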

Thank you for your answer. My problem is that I do not have any raw count data. I am using public data sets, and they provide only FPKM/TPM values. I have used MNN before, but I have never been in a situation like this (data sets with only FPKM values).


If it's a published paper, the raw data should be deposited in SRA. But if it is from some database, you may not be able to get the raw counts. Of course, you can always use any algorithm on any type of data. People don't use FPKM for MNN simply because they don't think the analysis will give reliable results.