How to deal with batch effects of multiple datasets?
2.2 years ago
JACKY ▴ 140

I combined multiple datasets into one. They are bulk RNA-seq datasets comparing primary and metastatic cancer samples. I now have all the counts in one data frame and all the metadata in another. I want to run a DESeq2 analysis of the two groups, and of course I want to use design = ~ condition, because I want the results to reflect only whether the cancer is primary or metastatic. The problem is that my results are affected by the dataset of origin. In a PCA, for example, each dataset clusters on its own, which is not what I want. I have 7 datasets overall and I don't want the source (the dataset) to affect the results.

Should I adjust the design in DESeq2? Should I use RUVSeq? I'm a bit lost.

DESeq2 r • 919 views
2.2 years ago
ATpoint 82k

"each dataset clusters alone"

Yes, that is normal and expected in RNA-seq, which is strongly affected by the RNA extraction method, RNA integrity, and the kit used for library prep. You most likely cannot compare independent datasets, that's just how it is. It is somewhat wishful thinking that one can simply pull random datasets from GEO and expect them to be comparable. They're not.


So this can't be done? Can't limma handle this kind of problem?


You have 7 datasets, and each is from a different study?


Yes. Can't I add the dataset as a term in the DESeq2 design or something?


Yes, but that only works if batch is not confounded with condition.
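
To make the confounding point concrete, here is a minimal base-R sketch of how one might check it before fitting anything. The metadata below is made up, and the column names `dataset` and `condition` and the helper `design_is_full_rank` are assumptions for illustration, not something from the original post. The idea is that if every study contributes both primary and metastatic samples, the design matrix for `~ dataset + condition` has full column rank and the condition effect can be estimated within studies. If some level of one factor is perfectly aligned with the other, it does not.

```r
# Hypothetical metadata: one row per sample (column names are assumptions).
# Here every study contains both conditions.
meta_ok <- data.frame(
  dataset   = factor(rep(c("study1", "study2", "study3"), each = 4)),
  condition = factor(rep(c("primary", "metastatic"), times = 6),
                     levels = c("primary", "metastatic"))
)

# Confounded counterexample: every study contains a single condition only.
meta_bad <- data.frame(
  dataset   = factor(rep(c("study1", "study2", "study3"), each = 4)),
  condition = factor(rep(c("primary", "metastatic", "primary"), each = 4),
                     levels = c("primary", "metastatic"))
)

# TRUE if the design matrix for ~ dataset + condition has full column rank,
# i.e. the condition effect can be separated from the dataset effect.
design_is_full_rank <- function(meta) {
  mm <- model.matrix(~ dataset + condition, data = meta)
  qr(mm)$rank == ncol(mm)
}

# Cross-tabulations: zero cells show studies that lack one of the conditions.
print(table(meta_ok$dataset, meta_ok$condition))
print(table(meta_bad$dataset, meta_bad$condition))

ok_rank  <- design_is_full_rank(meta_ok)   # TRUE
bad_rank <- design_is_full_rank(meta_bad)  # FALSE
print(c(ok = ok_rank, bad = bad_rank))

# If the check passes, the blocked DESeq2 analysis would look roughly like this
# (`cts` is a hypothetical genes-by-samples count matrix matching `meta_ok`):
#   dds <- DESeq2::DESeqDataSetFromMatrix(countData = cts, colData = meta_ok,
#                                         design = ~ dataset + condition)
#   dds <- DESeq2::DESeq(dds)
#   res <- DESeq2::results(dds, contrast = c("condition", "metastatic", "primary"))
```

If the check fails because each study contains only one condition, no design formula will rescue the comparison, since the study effect and the condition effect are the same contrast. Studies that contain only one condition also add nothing to the within-study comparison in a `~ dataset + condition` model.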
