Combining different conditions of study from different experiments (same platform) for microarray meta-analysis
3.3 years ago

I would like to conduct a microarray meta-analysis of differential expression between Condition1 and Condition2, but I could not find many studies that include samples from both conditions. However, there are various experiments on the same platform that include either Condition1 alone or Condition2 alone.

Could I just download the CEL files for Condition1 and Condition2 from two or three different experiments, process them as usual as if they were a single experiment, and identify the DEGs?

(Such an approach was used in https://academic.oup.com/nar/article/45/17/9860/4084660 - see Table 1/SCNvsWB)

However, I would like to make sure that this approach is sound. Do I need to apply a batch-effect removal algorithm such as ComBat before merging the two conditions into a single study?

Meta-analysis Microarray Control Test
1
3.2 years ago

It is difficult to give a conclusive answer. I would obtain all of the files and process them together, initially assuming no batch effect. A PCA bi-plot will then quickly reveal any batch effect, and you may also see it in a box-and-whisker plot of the arrays. If there is a batch effect, you have 2 options:

• directly modify your data to adjust for batch, e.g., via ComBat, SVA, removeBatchEffect (limma), etc.
• include batch as a covariate in your design formula
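The PCA and boxplot check described above can be sketched in base R. This is only an illustration on simulated data: the matrix `expr`, the experiment labels `exp1` / `exp2`, and the size of the artificial shift are all assumptions, and in practice `expr` would be the normalised log2 expression matrix built from the merged CEL files.

```r
# Simulated log2 expression: 200 probes x 12 arrays from two hypothetical experiments.
set.seed(1)
n_probes <- 200
batch <- factor(rep(c("exp1", "exp2"), each = 6))
expr <- matrix(rnorm(n_probes * 12, mean = 8, sd = 1), nrow = n_probes)
colnames(expr) <- paste0("array", 1:12)

# Add an artificial batch shift to half of the probes in the second experiment.
shifted <- 1:100
expr[shifted, batch == "exp2"] <- expr[shifted, batch == "exp2"] + 2

# PCA on samples: prcomp() expects samples in rows, so transpose.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
pct_var <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

# If the samples separate by experiment on PC1, a batch effect is likely.
plot(pca$x[, 1], pca$x[, 2], col = as.integer(batch), pch = 19,
     xlab = paste0("PC1 (", pct_var[1], "%)"),
     ylab = paste0("PC2 (", pct_var[2], "%)"))
legend("topright", legend = levels(batch),
       col = seq_along(levels(batch)), pch = 19)

# Per-array boxplots show global shifts in the intensity distribution.
boxplot(expr, col = as.integer(batch) + 1, las = 2, ylab = "log2 expression")

# Numeric summary: mean PC1 score per experiment of origin.
pc1_by_batch <- tapply(pca$x[, 1], batch, mean)
print(pc1_by_batch)
```

Note that if every experiment contributes only one condition, a separation by experiment on the PCA cannot be distinguished from a separation by condition, so the plot alone cannot tell the two apart in that situation.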

Kevin

0

Hello Kevin,

Can you please elaborate on how to include batch as a covariate in the design matrix?

My work plan involves using around 10 different microarray datasets (same platform) to identify the DEGs between case and control. Each of these datasets has its own case and control subgroups. It was suggested that I RMA-normalize each dataset and fit a linear model to each one separately, followed by calculating a meta p-value or meta logFC. However, I am not sure how this process works or which R package to use for it.

If I understand the approach you suggested here correctly, we need to know what the batch covariate is (please correct me if I am wrong). In my work, how should I define batch across all of the experiments? Also, these datasets measure gene expression in different tissues for the same disease condition vs. the controls...

0

Have you looked at the RankProd package? - http://bioconductor.org/packages/release/bioc/html/RankProd.html

If you are going to process each dataset separately, and also analyse the results separately, then you do not need to worry too much about batch. However, if you are going to combine / merge the datasets, then you need to add batch as a term in your design formula.
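For the route where each dataset is processed separately, one simple way to obtain a meta p-value and a meta logFC is Fisher's method plus an inverse-variance weighted average. To be clear, this is a different technique from the rank product statistic used by RankProd, and it is only a sketch: the p-values, logFC values, and standard errors below are made-up numbers standing in for the per-dataset limma results, and `fisher_combine` is a hypothetical helper name.

```r
# Hypothetical per-dataset results for the same 5 genes in 3 studies.
# In practice these would come from each dataset's own model fit.
genes <- paste0("gene", 1:5)
studies <- paste0("study", 1:3)
pvals <- matrix(c(0.001, 0.20, 0.04, 0.50, 0.03,
                  0.010, 0.60, 0.03, 0.40, 0.20,
                  0.005, 0.30, 0.10, 0.70, 0.01),
                nrow = 5, dimnames = list(genes, studies))
logfc <- matrix(c(1.8,  0.3, 1.1, -0.1, 0.9,
                  1.5, -0.2, 1.3,  0.2, 0.4,
                  2.0,  0.4, 0.7, -0.3, 1.2),
                nrow = 5, dimnames = list(genes, studies))
se <- matrix(c(0.4, 0.3, 0.5, 0.3, 0.4,
               0.5, 0.4, 0.5, 0.4, 0.3,
               0.5, 0.4, 0.4, 0.5, 0.4),
             nrow = 5, dimnames = list(genes, studies))

# Fisher's method: -2 * sum(log(p)) follows a chi-squared distribution
# with 2k degrees of freedom under the null, for k independent studies.
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}
meta_p <- apply(pvals, 1, fisher_combine)
meta_fdr <- p.adjust(meta_p, method = "BH")

# Fixed-effect meta logFC: weight each study by 1 / SE^2.
w <- 1 / se^2
meta_logfc <- rowSums(w * logfc) / rowSums(w)
meta_se <- sqrt(1 / rowSums(w))

print(data.frame(meta_logfc, meta_se, meta_p, meta_fdr))
```

Fisher's method applied to two-sided p-values ignores the direction of change, so a gene that is up in one study and down in another can still obtain a small meta p-value; checking the sign of the per-study logFC alongside it is a sensible precaution.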

0

So my design matrix formula (for limma) needs to be something like

mod <- model.matrix(~ as.factor(disease_status) + batch, data = data)
# Is adding the term "batch" to the design matrix enough to account for the confounders?


provided I merged all the columns (cases and controls) into a single Expression matrix and designed my disease_status column in the phenoData accordingly. Am I on the right path?

I am still confused about merging these datasets, though. Each of them studies a different type of tissue, so I feel that merging them could cause problems due to the variability among the samples. Can you please explain how such different datasets can be merged for a DEG analysis?

Regarding RankProd, I am still trying to work out which functions from this package to use to generate the meta-statistic values. Since this is my first ever microarray meta-analysis, it is really difficult to understand and apply several of these approaches.

Thank you so much for your tremendous support!

1

Yes, that model formula looks okay. However, now that you mention that each dataset involves a different tissue, I would recommend not trying to merge them. I had assumed that they were all from the same tissue, e.g., cardiomyocytes / heart tissue. In this case, perhaps even RankProd makes little sense and, instead, the results of each analysis should simply be interpreted separately.
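For the situation where the datasets do come from the same tissue and are merged, the model formula with a batch term can be sketched as follows. This assumes that "batch" is simply the dataset of origin; the sample sheet, the labels `study_A` to `study_C`, and the simulated expression matrix are all hypothetical. The limma calls (`lmFit`, `eBayes`, `topTable`) only run if limma is installed.

```r
# Hypothetical sample sheet: 3 datasets (batches), each with cases and controls.
pheno <- data.frame(
  disease_status = factor(rep(c("control", "case"), times = 6),
                          levels = c("control", "case")),
  batch = factor(rep(c("study_A", "study_B", "study_C"), each = 4))
)
design <- model.matrix(~ disease_status + batch, data = pheno)

# The design must be full rank for the disease effect to be estimable.
is_full_rank <- qr(design)$rank == ncol(design)

# Counter-example: each dataset holds only one condition, so batch and
# disease status are perfectly confounded and cannot be separated.
confounded <- data.frame(
  disease_status = factor(rep(c("control", "case"), each = 4),
                          levels = c("control", "case")),
  batch = factor(rep(c("study_A", "study_B"), each = 4))
)
design_bad <- model.matrix(~ disease_status + batch, data = confounded)
bad_full_rank <- qr(design_bad)$rank == ncol(design_bad)

print(c(balanced = is_full_rank, confounded = bad_full_rank))

# Fit the batch-adjusted model with limma, if it is available.
if (requireNamespace("limma", quietly = TRUE)) {
  set.seed(1)
  expr <- matrix(rnorm(100 * nrow(pheno), mean = 8), nrow = 100,
                 dimnames = list(paste0("probe", 1:100), NULL))
  fit <- limma::eBayes(limma::lmFit(expr, design))
  # "disease_statuscase" is the case vs control effect, adjusted for batch.
  top <- limma::topTable(fit, coef = "disease_statuscase", number = 10)
  print(head(top))
}
```

The rank check matters for the original question in this thread: when Condition1 and Condition2 each come from a different experiment, the second design above is the one that applies, and no batch term or correction method can disentangle the two effects.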

0

Do you mean that I need to run the DEG analysis separately for each of those 10 datasets and then just take the union of the DEGs across them?

I came across a paper where they did a similar study using datasets studying different tissues, but there was no clear explanation of how they went forward and interpreted the DEGs from all the different datasets.

This is the paper

0

This is outside the scope of this topic, but some of the datasets I have can be subdivided into groups based on their phenotype (each group includes cases and controls). I am using Bioconductor packages for this analysis. I understand that during the DEG estimation (using limma) I need to have these groups as separate expression matrices. However, I am not sure whether I can normalize the whole dataset (using RMA) and then subdivide the samples into groups, or whether I need to separate the groups from the parent dataset before normalization. Do the final results vary between these approaches?

1

There is no fixed rule, but most would tell you that it is better to normalise all samples combined and to then 'stratify' / focus your analysis on certain groups. Obviously, remove any outliers that you find, too.
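On whether the results vary between the two approaches: they generally do, because the quantile normalisation step in RMA depends on which arrays are normalised together. The sketch below illustrates this with a minimal quantile normalisation written in base R. It is a stand-in for that one step only (no background correction or probe summarisation), the function `quantile_normalise` and the group labels are hypothetical, and the data are simulated.

```r
# Minimal quantile normalisation: every array is forced onto the same
# target distribution, the mean of the sorted intensities across arrays.
quantile_normalise <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

# Simulated intensities: 50 probes x 8 arrays in two phenotype groups.
set.seed(1)
expr <- matrix(rexp(50 * 8, rate = 0.2), nrow = 50,
               dimnames = list(paste0("probe", 1:50), paste0("array", 1:8)))
group <- rep(c("phenoA", "phenoB"), each = 4)
# Make the phenoB arrays systematically brighter.
expr[, group == "phenoB"] <- expr[, group == "phenoB"] * 1.5

# Approach 1: normalise all arrays together, then keep the phenoA arrays.
norm_all <- quantile_normalise(expr)
a_from_all <- norm_all[, group == "phenoA"]

# Approach 2: subset to phenoA first, then normalise.
a_alone <- quantile_normalise(expr[, group == "phenoA"])

# The values differ, because the target distribution differs...
max_diff <- max(abs(a_from_all - a_alone))
# ...but the rank order of probes within each array is unchanged.
same_ranks <- all(apply(a_from_all, 2, rank) == apply(a_alone, 2, rank))

print(c(max_diff = max_diff, same_ranks = same_ranks))
```

Normalising everything together puts all groups on one common scale, which is what makes later comparisons across groups meaningful.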