Batch effect correction of 2 studies with different covariates
24 days ago
Geo ▴ 20

Hi all,

I want to integrate RNA-seq data from 2 studies. My plan is to merge the count matrices and then perform batch effect correction using ComBat-seq. However, while study A has samples of different ages and sexes, the age and sex of the samples in study B are unknown. Also, study A has samples from 2 tissues, but study B has samples from only 1. Do you have any advice on how I should proceed? I have two options in mind.

1. Treat each sub-dataset of study A (samples grouped by batch, sex, age and tissue) as a separate batch, and treat study B as a single batch.
2. Ignore age and sex and group only by batch and tissue, so there are 4 batches for study A and 1 for study B.

Toy example (I have more samples from each case):

study  batch  group    age  sex  tissue
A      1      control  10   M    cortex
A      2      control  10   F    cortex
A      1      control  20   M    cortex
A      2      control  20   F    cortex
A      1      cases    10   M    cortex
A      2      cases    10   F    cortex
A      1      cases    20   M    cortex
A      2      cases    20   F    cortex
A      1      control  10   M    cerebellum
A      2      control  10   F    cerebellum
A      1      control  20   M    cerebellum
A      2      control  20   F    cerebellum
A      1      cases    10   M    cerebellum
A      2      cases    10   F    cerebellum
A      1      cases    20   M    cerebellum
A      2      cases    20   F    cerebellum
B      3      control  -    -    cortex
B      3      control  -    -    cortex
B      3      control  -    -    cortex
B      3      control  -    -    cortex
B      3      cases    -    -    cortex
B      3      cases    -    -    cortex
B      3      cases    -    -    cortex
B      3      cases    -    -    cortex
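
To make the two options concrete, here is a minimal Python sketch (plain standard library, not tied to any R package) that rebuilds the toy sample sheet and counts how many batches each option would produce and how small the smallest one is. The label format and the helper names `option1_label` and `option2_label` are made up for illustration. It also cross-tabulates batch against sex, since in the toy table batch 1 is all male and batch 2 is all female.

```python
from collections import Counter

# Toy sample sheet from the post: (study, batch, group, age, sex, tissue).
# Study B has no age/sex information, so those fields are None.
samples = []
for tissue in ("cortex", "cerebellum"):
    for group in ("control", "cases"):
        for age in (10, 20):
            samples.append(("A", 1, group, age, "M", tissue))
            samples.append(("A", 2, group, age, "F", tissue))
for group in ("control", "cases"):
    for _ in range(4):
        samples.append(("B", 3, group, None, None, "cortex"))


def option1_label(sample):
    """Option 1: one batch per batch/sex/age/tissue combination; B is one batch."""
    study, batch, _group, age, sex, tissue = sample
    if study == "B":
        return "B"
    return f"A_b{batch}_{sex}_{age}_{tissue}"


def option2_label(sample):
    """Option 2: one batch per batch/tissue combination; B is one batch."""
    study, batch, _group, _age, _sex, tissue = sample
    if study == "B":
        return "B"
    return f"A_b{batch}_{tissue}"


opt1 = Counter(option1_label(s) for s in samples)
opt2 = Counter(option2_label(s) for s in samples)

# Cross-tabulate batch against sex within study A to spot confounding.
batch_sex = Counter((s[1], s[4]) for s in samples if s[0] == "A")
batches_in_a = {batch for batch, _sex in batch_sex}
sex_nested_in_batch = all(
    len({sex for b, sex in batch_sex if b == batch}) == 1
    for batch in batches_in_a
)

print("option 1:", len(opt1), "batches, smallest has", min(opt1.values()), "samples")
print("option 2:", len(opt2), "batches, smallest has", min(opt2.values()), "samples")
print("each batch in study A contains a single sex:", sex_nested_in_batch)
```

If each batch in study A really contains only one sex, as in the toy table, then batch and sex cannot be separated within study A, and adding sex to the grouping in option 1 does not split the batches any further.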


Thanks a lot

batch-effect R RNA-seq • 412 views
23 days ago
ponganta ▴ 220

Try both, then compare sample variability, dispersion estimates and P-value distributions between the two results. By "ComBat-seq", do you mean ComBat (from the sva package)? I'm not sure that is sufficient here.
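
As one possible way to do the P-value check, here is a rough Python sketch that bins P-values into a histogram and attaches a coarse label to its shape. The usual expectation is a roughly flat histogram, possibly with a peak near 0; a hill or a U shape suggests something is off in the model. The P-values below are simulated, not from any real data, and both the function names and the 1.5x / 0.5x thresholds are arbitrary assumptions.

```python
import random


def pvalue_histogram(pvalues, n_bins=10):
    """Count P-values in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[idx] += 1
    return counts


def describe_shape(counts):
    """Very rough heuristic label; the thresholds are arbitrary assumptions."""
    tail = counts[len(counts) // 2:]  # upper half, dominated by null genes
    baseline = sum(tail) / len(tail)
    if max(tail) > 1.5 * baseline or min(tail) < 0.5 * baseline:
        return "irregular"
    if counts[0] > 1.5 * baseline:
        return "peak near 0 over flat background"
    return "approximately flat"


random.seed(0)
# Simulated example: 9000 null genes (uniform P-values) plus 1000 genes
# with signal (P-values pushed towards 0). Purely hypothetical numbers.
null_only = [random.random() for _ in range(10000)]
mixed = [random.random() for _ in range(9000)]
mixed += [random.random() ** 8 for _ in range(1000)]

null_label = describe_shape(pvalue_histogram(null_only))
mixed_label = describe_shape(pvalue_histogram(mixed))
print("null only:", null_label)
print("null + signal:", mixed_label)
```

In practice you would feed in the P-values from the same differential expression test run on the data corrected under each option, and compare the two histograms side by side.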

RUVSeq might be the way to go here, to be honest.

Thanks for the link, didn't know that one!

The question remains: what should I do with the covariates?

If you run RUVSeq on the data from study B only, you'd get estimated factors that may stand in for age and sex, but these will be numerical. I suppose you could re-encode them as categorical variables somehow. For age, you could round the values and work out how the rounded values relate to the actual age classes (perhaps the two are positively correlated). Sex is binary, so it should be somewhat more straightforward, but there might be no way to tell male from female, since there's no intrinsic order in this case. I suppose you could use study A's data to work out these relationships, so that you can map them accurately in B's case.
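
A minimal Python sketch of that mapping idea, for sex only: take the known labels in study A, compute the mean factor score per class, and assign each study B sample to the class with the nearest mean. This assumes the factor scores for A and B come from one joint run, so that they are on the same scale, which is not the same as running on B alone. The scores below are invented, and `class_centroids` and `assign_nearest` are hypothetical helper names.

```python
from statistics import mean


def class_centroids(values, labels):
    """Mean factor score per known class (e.g. per sex) in the labelled study."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: mean(vals) for label, vals in groups.items()}


def assign_nearest(value, centroids):
    """Return the class whose centroid is closest to the given score."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


# Hypothetical factor scores, invented for illustration only.
a_factor = [-1.2, -0.9, -1.1, 1.0, 1.3, 0.8]   # study A samples
a_sex = ["M", "M", "M", "F", "F", "F"]          # known labels in study A
b_factor = [-1.0, 0.9, 1.1, -0.7]               # study B samples, sex unknown

centroids = class_centroids(a_factor, a_sex)
b_sex = [assign_nearest(v, centroids) for v in b_factor]
print(b_sex)
```

Using study A's labels this way fixes the direction of the factor, so the male/female ambiguity goes away, but only if the factor actually tracks sex rather than the study itself. That would need checking on study A first.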

Or you could estimate surrogate variables that stand in for age and sex across the entire data set (i.e., both A and B) and use those instead of the age and sex values you already have for study A. I think this might be the more straightforward option.

The only other alternative would be to drop those variables entirely.

ComBat-seq is a batch effect adjustment tool for bulk RNA-seq count data, based on ComBat. What should I look for in the sample variability, dispersion and P-value distributions?