Question

How to deal with batch effects of multiple datasets?

2.2 years ago

JACKY ▴ 140

I combined multiple datasets into one. The datasets are bulk RNA-seq data from primary vs. metastatic cancer samples. I now have all the counts in one data frame and all the metadata in another. I want to run a DESeq2 analysis comparing the two groups, and of course I want to use design = ~ condition, because I want the results to reflect only the cancer condition (primary or metastatic). The problem is that my results are being affected by the dataset of origin. In a PCA, for example, each dataset clusters on its own, which is not what I want. I have 7 datasets overall and I don't want the source (the dataset) to affect the results.

Should I adjust the design in DESeq2? Should I use RUVSeq? I'm a bit lost.

DESeq2 r • 919 views

updated 2.2 years ago by swbarnes2 14k • written 2.2 years ago by JACKY ▴ 140

score 0 · Answer 1 · 2022-03-12

2.2 years ago

ATpoint 82k

each dataset clusters alone

Yes, that is normal and expected in RNA-seq, which is strongly affected by the RNA extraction method, RNA integrity, and the kit used for library prep. You most likely cannot compare independent datasets; that's just how it is. It is somewhat wishful thinking that one can simply pull random datasets from GEO and expect them to be comparable. They're not.

2.2 years ago by ATpoint 82k

So this can't be done? Can't limma handle this kind of problem?

2.2 years ago by JACKY ▴ 140

You have 7 datasets, and each is from a different study?

2.2 years ago by ATpoint 82k

Yes. Can't I add the dataset number to the DESeq2 design or something?

2.2 years ago by JACKY ▴ 140

Yes, but that only works if batch is not confounded with condition.

2.2 years ago by swbarnes2 14k
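
To make the last two replies concrete, here is a minimal sketch (not anyone's posted code) of how one might check whether dataset is confounded with condition before putting it in the design. The metadata columns `dataset` and `condition`, the toy sample sheets, and the helper `is_full_rank` are all hypothetical stand-ins for the real merged metadata. The DESeq2 part only runs if the package is installed, and it uses simulated counts from DESeq2's own `makeExampleDESeqDataSet` rather than real data.

```r
# Hypothetical sample sheet: one row per sample, with dataset of origin and
# condition. Here every dataset contains both conditions.
meta_ok <- data.frame(
  dataset   = factor(rep(paste0("ds", 1:3), each = 4)),
  condition = factor(rep(c("primary", "metastatic"), times = 6))
)

# Confounded case: every dataset contains only one condition.
meta_bad <- data.frame(
  dataset   = factor(rep(paste0("ds", 1:3), each = 4)),
  condition = factor(rep(c("primary", "metastatic", "primary"), each = 4))
)

# A cross-tabulation shows at a glance which dataset/condition cells are empty.
print(table(meta_ok$dataset, meta_ok$condition))
print(table(meta_bad$dataset, meta_bad$condition))

# The condition effect is estimable alongside dataset only if the model
# matrix has full column rank.
is_full_rank <- function(meta, design = ~ dataset + condition) {
  mm <- model.matrix(design, data = meta)
  qr(mm)$rank == ncol(mm)
}

ok_rank  <- is_full_rank(meta_ok)   # dataset and condition are separable
bad_rank <- is_full_rank(meta_bad)  # dataset fully determines condition

# Optional: fit the two-factor design with DESeq2 on simulated counts.
if (ok_rank && requireNamespace("DESeq2", quietly = TRUE)) {
  sim <- DESeq2::makeExampleDESeqDataSet(n = 500, m = nrow(meta_ok))
  cts <- DESeq2::counts(sim)
  coldata <- meta_ok
  rownames(coldata) <- colnames(cts)  # sample names must line up with counts

  # Nuisance variable first, variable of interest last.
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = cts,
    colData   = coldata,
    design    = ~ dataset + condition
  )
  dds <- DESeq2::DESeq(dds)
  res <- DESeq2::results(dds,
                         contrast = c("condition", "metastatic", "primary"))
}
```

If the rank check fails for the real metadata (for example, because some datasets contain only primary samples and others only metastatic ones), adding `dataset` to the design cannot separate the two effects, which is the caveat raised in the reply above.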