Hi all,
I want to integrate RNA-seq data from 2 studies. My plan is to merge the count matrices and then perform batch effect correction using ComBat-seq. However, while study A has samples of different ages and sexes, the age and sex of the samples in study B are unknown. Also, study A has samples from 2 tissues but study B from only 1. Do you have any advice on how I should proceed? I have two options in mind.
- Treat each sub-dataset (samples grouped by batch, sex, age and tissue) as a separate batch, with all of study B as a single batch.
- Ignore age and sex and group only by batch and tissue, giving 4 batches for study A and 1 for study B.
Toy example (I have more samples from each case):
study  batch  group    age  sex  tissue
A      1      control  10   M    cortex
A      2      control  10   F    cortex
A      1      control  20   M    cortex
A      2      control  20   F    cortex
A      1      cases    10   M    cortex
A      2      cases    10   F    cortex
A      1      cases    20   M    cortex
A      2      cases    20   F    cortex
A      1      control  10   M    cerebellum
A      2      control  10   F    cerebellum
A      1      control  20   M    cerebellum
A      2      control  20   F    cerebellum
A      1      cases    10   M    cerebellum
A      2      cases    10   F    cerebellum
A      1      cases    20   M    cerebellum
A      2      cases    20   F    cerebellum
B      3      control  -    -    cortex
B      3      control  -    -    cortex
B      3      control  -    -    cortex
B      3      control  -    -    cortex
B      3      cases    -    -    cortex
B      3      cases    -    -    cortex
B      3      cases    -    -    cortex
B      3      cases    -    -    cortex
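To make the two options concrete, here is a minimal Python sketch (standard library only) that rebuilds the toy sample sheet above, assigns batch labels under each option, and counts the samples per resulting batch. The function names `option1_label` and `option2_label` are made up for illustration, and this only tabulates labels; it does not run ComBat-seq. It also lists which sexes occur in each batch of study A, because in the toy table batch 1 is all male and batch 2 all female. That may just be how the toy example was laid out, but it is worth checking in the real data.

```python
from collections import Counter

# Rebuild the toy sample sheet: (study, batch, group, age, sex, tissue).
# "-" marks the unknown age and sex in study B.
rows = []
for tissue in ("cortex", "cerebellum"):
    for group in ("control", "cases"):
        for age in ("10", "20"):
            for batch, sex in (("1", "M"), ("2", "F")):
                rows.append(("A", batch, group, age, sex, tissue))
for group in ("control", "cases"):
    rows.extend([("B", "3", group, "-", "-", "cortex")] * 4)


def option1_label(row):
    """Option 1: every batch/sex/age/tissue sub-dataset of A is its own batch."""
    study, batch, _group, age, sex, tissue = row
    return "B" if study == "B" else f"A_b{batch}_{sex}_{age}_{tissue}"


def option2_label(row):
    """Option 2: ignore age and sex; group A by batch and tissue only."""
    study, batch, _group, _age, _sex, tissue = row
    return "B" if study == "B" else f"A_b{batch}_{tissue}"


# Number of samples in each resulting batch, per option.
sizes1 = Counter(option1_label(r) for r in rows)
sizes2 = Counter(option2_label(r) for r in rows)

# Which sexes appear in each batch of study A (a simple confounding check).
sex_by_batch = {}
for study, batch, _group, _age, sex, _tissue in rows:
    if study == "A":
        sex_by_batch.setdefault(batch, set()).add(sex)

print("Option 1:", len(sizes1), "batches, smallest has", min(sizes1.values()), "samples")
print("Option 2:", len(sizes2), "batches, smallest has", min(sizes2.values()), "samples")
print("Sexes per batch in study A:", sex_by_batch)
```

On the toy table, option 1 splits study A into many small batches and option 2 into fewer, larger ones. The smaller a batch, the less data there is to estimate its batch-specific parameters, so the same tabulation on the full sample sheet is a quick sanity check before choosing.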
Thanks a lot
RUVseq might be the way to go here, to be honest.
Thanks for the link, didn't know that one!
The question remains, what should I do with the covariates?
If you run RUVseq on the data from study B only, you'd get estimated factors that could stand in for age and sex, but these will be numerical. I suppose you could re-encode them as categorical variables somehow, e.g. by rounding the age-like values and working out how the rounded values relate to the actual age classes (the two sets of values may be positively correlated, for instance). Sex is binary, so it should be more straightforward, but there might be no way to tell which value means male and which female, since the factor has no intrinsic order. I suppose you could use study A's data, where age and sex are known, to work out these relationships and then map them accurately onto study B.
Or you could estimate surrogate variables standing in for age and sex for the entire data set (i.e., both A and B) and use those instead of the age and sex values you already have. I think this might be the more straightforward option.
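As a rough illustration of mapping a numerical factor back to a sex label using study A, here is a minimal Python sketch. It assumes the factor was estimated on the combined A and B data, so that values are on the same scale in both studies, and that one factor separates the sexes reasonably well; neither is guaranteed. All numbers are made up, and `fit_sex_rule` and `predict_sex` are hypothetical helper names, not part of RUVseq.

```python
from statistics import mean

# Hypothetical values of one estimated factor for study A samples,
# where sex is known. These numbers are invented for illustration.
factor_a = [-1.2, -0.9, -1.1, 1.0, 0.8, 1.3]
sex_a = ["M", "M", "M", "F", "F", "F"]


def fit_sex_rule(factor, sex):
    """Learn a threshold and orientation from samples with known sex.

    The threshold is the midpoint between the two group means; the
    orientation records which sex sits on the high side of it.
    """
    mean_m = mean(v for v, s in zip(factor, sex) if s == "M")
    mean_f = mean(v for v, s in zip(factor, sex) if s == "F")
    threshold = (mean_m + mean_f) / 2
    high_label, low_label = ("M", "F") if mean_m > mean_f else ("F", "M")
    return threshold, high_label, low_label


def predict_sex(values, rule):
    """Assign a sex label to each factor value using the fitted rule."""
    threshold, high_label, low_label = rule
    return [high_label if v > threshold else low_label for v in values]


rule = fit_sex_rule(factor_a, sex_a)

# Hypothetical factor values for study B samples, where sex is unknown.
factor_b = [-0.7, 1.1, 0.9, -1.4]
predicted_b = predict_sex(factor_b, rule)
print(predicted_b)
```

Study A supplies the orientation that the factor itself lacks: without labelled samples there is no way to say which side of the threshold is male. If the two groups overlap heavily in study A, the factor is probably not capturing sex and the labels for study B should not be trusted.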
The only other alternative would be to drop those variables entirely.
ComBat-seq is a batch effect adjustment tool for bulk RNA-seq count data, based on ComBat. What should I look for in these variables (sample variability, dispersion and P-value distributions)?