Question: How to integrate multiple data sets from microarray platforms prior to meta-analysis?
17 months ago by
Korea, Republic Of
rice.researcher130 wrote:

Efficient methods for comparing data sets generated from different microarray experiments (meta-analysis) have been asked about previously. Based on that discussion, I found the RankProd R package. However, after using the method, I wanted to compare the results with another method, since RankProd was developed years ago. I would like to know of any alternative method available for integrating multiple data sets (RNA-seq not included).

Out of 10 data sets, 8 are from Affymetrix and 2 are Agilent. Expression profiles are based on different developmental stages of an organ in Arabidopsis.

modified 17 months ago by svlachavas560 • written 17 months ago by rice.researcher130
17 months ago by
Kevin Blighe41k
The Ether
Kevin Blighe41k wrote:

Updated October 9, 2018


If the array versions are the same

If the Affymetrix arrays are all of the same version/type, then just process them all together at the same time. Same for the Agilent arrays. Afterwards, use one as the training dataset and the other as the validation dataset, i.e., obtain results from one, and then corroborate these [results] in the other.

You could attempt to perform statistical analyses on them combined, too (i.e. not as training and validation), but, in this case, include 'experiment' as a covariate in downstream analyses - there will likely be some batch effect.

If all data-set arrays are different

  • Process each independently
  • Summarise expression values across whole genes independently
  • Create a list of common features/genes across all arrays
  • Convert the data to Z-scores independently
  • Merge (bind) the data together
  • Build regression models predicting for your outcome and include 'experiment' as a covariate in the model
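The first few steps above can be sketched in base R. This is a minimal illustration, not a definitive pipeline: the two toy matrices (`affy`, `agilent`), their gene and sample names, and the helper `zscore()` are all hypothetical, and the Z-scores are assumed to be computed per gene within each data set.

```r
# Toy data (hypothetical): gene-level expression, genes on rows, samples on columns
set.seed(1)
affy <- matrix(rnorm(5 * 4, mean = 8), nrow = 5,
               dimnames = list(paste0("Gene", 1:5), paste0("A", 1:4)))
agilent <- matrix(rnorm(4 * 3, mean = 2), nrow = 4,
                  dimnames = list(paste0("Gene", 2:5), paste0("B", 1:3)))

# Create the list of genes common to all arrays
common <- Reduce(intersect, list(rownames(affy), rownames(agilent)))

# Convert each data set to Z-scores independently
# (assumption: per gene, across that data set's own samples)
zscore <- function(x) t(scale(t(x)))
affy_z <- zscore(affy[common, ])
agilent_z <- zscore(agilent[common, ])

# Merge (bind) the samples and record which experiment each one came from
merged <- cbind(affy_z, agilent_z)
experiment <- factor(rep(c("Affy1", "Agilent1"),
                         times = c(ncol(affy_z), ncol(agilent_z))))
```

The `experiment` factor is what would then enter the regression model as the covariate.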

My advice is to avoid the use of programs that aim to directly correct for batch. It is preferable to include batch as a covariate in your statistical models.

modified 6 months ago • written 17 months ago by Kevin Blighe41k

Thanks for your suggestion! My Affymetrix data sets are from different versions, so I tried the second option. I have already normalized all data sets separately and obtained the intensity values. The data sets were merged based on common probes. However, I couldn't understand the last point, as my statistical experience is limited. Could you explain it in more detail?

written 17 months ago by rice.researcher130

Hi rice.researcher, on the last point, you could just build a logistic regression model for each gene independently, whilst adjusting for 'batch' / 'array experiment' (note that glm() with family=binomial fits a binary model, not a multinomial one).

In R, it would be:

#Factorise your outcome variable and set 'StageI' as the reference level
DevelopmentalStage <- factor(DevelopmentalStage, levels=c("StageI", "StageII", ..., "StageX"))

#Build the model
model <- glm(DevelopmentalStage ~ Gene1 + ArrayExperiment, data=MyData, family=binomial(link="logit"))

#Get P value and estimates (beta coefficients)
coef(summary(model))

MyData will be a data-frame that contains 1 column for DevelopmentalStage, another for ArrayExperiment (e.g. Affy1, Affy2, ..., Affy8, Agilent1, Agilent2), and then gene expression values. Samples are on rows, and obviously sample names for DevelopmentalStage and ArrayExperiment have to match those of your genes.

The model is essentially testing the hypothesis that the gene's expression is associated with the different developmental stages, but the stats will be adjusted based on ArrayExperiment, i.e., ArrayExperiment is a covariate for which we must adjust. They do similar things in large clinical trials, like adjusting for BMI, smoking status, race/ethnicity, household income, etc.
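To see what the glm() call above does with an outcome of more than two levels, here is a small check on simulated data. Everything in it (the three stage names, two experiments, the random `Gene1` values, the helper column `NotStageI`) is made up for illustration. It relies on the documented behaviour of the binomial family, where a factor response is read as first level = failure and all other levels = success.

```r
# Toy data (hypothetical): 3 stages, 2 experiments, 1 gene
set.seed(42)
MyData <- data.frame(
  DevelopmentalStage = factor(rep(c("StageI", "StageII", "StageIII"), each = 20),
                              levels = c("StageI", "StageII", "StageIII")),
  ArrayExperiment = factor(rep(c("Affy1", "Agilent1"), times = 30)),
  Gene1 = rnorm(60)
)

# Model with the factor outcome, as in the post
m_factor <- glm(DevelopmentalStage ~ Gene1 + ArrayExperiment, data = MyData,
                family = binomial(link = "logit"))

# Explicit binary outcome: StageI = 0, any later stage = 1
MyData$NotStageI <- as.integer(MyData$DevelopmentalStage != "StageI")
m_binary <- glm(NotStageI ~ Gene1 + ArrayExperiment, data = MyData,
                family = binomial(link = "logit"))

# Compare the two sets of coefficients
same_fit <- all.equal(coef(m_factor), coef(m_binary))
```

If each stage needs its own contrast against the reference level, a multinomial model (for example `multinom()` from the nnet package) or a linear model with the gene as the response would be closer to that goal.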

The only issue is that you'll have to run this over the entire dataset for each gene, and then output the stats values. I have posted some code here (scroll down a bit) on how to do this: R functions edited for parallel processing (parallelised for multi-core processing)

However, here is some simpler code (not parallelised): Assuming that your data is in a data-frame called modelling, with the first two columns being DevelopmentalStage and ArrayExperiment:

j <- 1
pvalues <- numeric(ncol(modelling) - 2)

#Write the header line of the results file
write.table(c("Gene\tBeta\tStandard.Error\tZ.Score\tOR\tp.value"), "Results.tsv", sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE, append=FALSE)

for (i in 3:ncol(modelling)) {
    #One model per gene, adjusted for ArrayExperiment
    formula <- as.formula(paste("DevelopmentalStage ~ ArrayExperiment + ", colnames(modelling)[i], sep=""))
    model <- glm(formula, data=modelling, family=binomial(link="logit"))

    #Extract the statistics for the gene term (row 8 of the coefficient table here)
    res <- coef(summary(model))
    Beta <- res[,1][[8]]
    stderror <- res[,2][[8]]
    Z <- res[,3][[8]]
    OR <- exp(coef(model))[[8]]
    p <- res[,4][[8]]

    #Append this gene's results to the output file
    wObject <- data.frame(colnames(modelling)[i], Beta, stderror, Z, OR, p)
    write.table(wObject, "Results.tsv", sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE, append=TRUE)
    pvalues[j] <- p
    j <- j + 1

    if (j %% 500 == 0) {
        print(paste(j, " transcripts processed", sep=""))
    }
}
print(paste("Total models, counter 1: ", i-2, sep=""))
print(paste("Total models, counter 2: ", j-1, sep=""))

The only thing that you'll have to change to adapt this to your own code is the '[[8]]', which is essentially the row number for your gene in the results produced by summary()
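One way to find the right row number, or to avoid hard-coding it, is to look the gene up by name in the coefficient table. A minimal sketch on simulated data follows; the data frame contents and the gene names `GeneA` and `GeneB` are hypothetical.

```r
# Toy data (hypothetical): 2 stages, 3 experiments, 2 genes
set.seed(7)
modelling <- data.frame(
  DevelopmentalStage = factor(rep(c("StageI", "StageII"), each = 15)),
  ArrayExperiment = factor(rep(c("Affy1", "Affy2", "Agilent1"), times = 10)),
  GeneA = rnorm(30),
  GeneB = rnorm(30)
)

gene <- colnames(modelling)[3]
formula <- as.formula(paste("DevelopmentalStage ~ ArrayExperiment +", gene))
model <- glm(formula, data = modelling, family = binomial(link = "logit"))

res <- coef(summary(model))
rownames(res)                             # which row belongs to which term
gene_row <- which(rownames(res) == gene)  # position of the gene term
gene_stats <- res[gene, ]                 # Estimate, Std. Error, z value, Pr(>|z|)
```

With the gene written last in the formula, its row is 1 (intercept) plus the number of ArrayExperiment levels minus 1, plus 1. With three experiments, as in this toy example, that is row 4.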

Failing this, you can try to use limma and aim to include ArrayExperiment as a covariate that way.

modified 17 months ago • written 17 months ago by Kevin Blighe41k

Very well explained!! I am looking into it.

written 17 months ago by rice.researcher130
17 months ago by
svlachavas560 wrote:

Just to add some crucial comments to further extend Kevin's comprehensive answer:

1) Firstly, regarding the two microarray platforms used: are the arrays the same within each platform? For instance, for Affymetrix you have mentioned, from the link above, that it is Affymetrix hgu133a, but are there also other "sub-platforms"? Are the experimental conditions also similar? Or do you have evident variations in the experimental design concerning the developmental stages in Arabidopsis that could lead to a clear batch effect? In other words, when combining any of the datasets (even those from the same Affymetrix platform), you would generally have to construct a rather complicated design to account for experiment/study-specific effects (as well as other potential problems with normalization, variance estimation, etc.).

2) In my opinion, a first "basic and powerful" approach, again provided that you have similar experimental designs and biological questions, would be to perform the DE analysis for each dataset separately. Then:

A) I would initially compare the DE probes (or, more appropriately, annotate them to gene symbols) to find any "common genuine DE genes" that are consistently identified across the different datasets or experiments, as well as possible differences.

B) In parallel, you could then perform a kind of "functional-enrichment meta-analysis": again, for each of your separate DE lists, conduct a "GO/KEGG" analysis and inspect for common biological pathways or biological processes that appear in the different datasets.
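Comparing the separate DE lists can be done with a few lines of base R. The sketch below assumes each dataset's DE genes have already been annotated to a shared identifier; the list names and the gene labels are placeholders, not real results.

```r
# Hypothetical DE gene lists, one per dataset
de_lists <- list(
  Affy1    = c("GeneA", "GeneB", "GeneC", "GeneD"),
  Affy2    = c("GeneB", "GeneC", "GeneE"),
  Agilent1 = c("GeneB", "GeneC", "GeneD")
)

# Genes called DE in every dataset
common_de <- Reduce(intersect, de_lists)

# Number of datasets supporting each gene, most consistent first
support <- sort(table(unlist(de_lists)), decreasing = TRUE)
```

The `support` table also exposes the differences: genes with a count of 1 are specific to a single dataset.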

Finally, if you would like to try the merging approach, you could follow the instructions above and perhaps perform a batch effect correction with ComBat from the R package sva, using the experimental study as a known covariate.

(*Regarding RankProd, it is another possibility, but I would instead suggest the R package RankAggreg, which seems more appropriate for your approach: you would analyze each dataset separately, keep the top-k ranked genes by some criterion, and then perform a similar analysis to keep the most "informative genes".)
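To make the rank-based idea concrete, here is a hand-rolled toy version in base R. It is neither the RankProd nor the RankAggreg API, only a simplified illustration of combining per-dataset rankings through the geometric mean of ranks, and the p-values are invented.

```r
# Hypothetical per-dataset p-values for the same five genes (genes on rows)
pvals <- matrix(
  c(0.001, 0.20, 0.030, 0.50, 0.04,   # Affy1
    0.010, 0.30, 0.002, 0.60, 0.05,   # Affy2
    0.004, 0.10, 0.020, 0.70, 0.90),  # Agilent1
  nrow = 5,
  dimnames = list(paste0("Gene", LETTERS[1:5]), c("Affy1", "Affy2", "Agilent1"))
)

# Rank genes within each dataset (1 = strongest evidence)
ranks <- apply(pvals, 2, rank)

# Combine by the geometric mean of the ranks across datasets
rank_product <- exp(rowMeans(log(ranks)))
ordered_genes <- names(sort(rank_product))
```

Because only ranks are combined, the datasets never need to be put on a common intensity scale, which is the appeal of this family of methods when platforms differ.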

Hope that helps,


modified 17 months ago • written 17 months ago by svlachavas560

Good answer Efstathios!

written 17 months ago by Kevin Blighe41k

Thanks for your views and suggestions. The Affymetrix arrays are from different platforms with different experimental designs. However, I am not doing DE or enrichment analysis, only data integration. Those R packages are new to me, and I will use them for comparison.

written 17 months ago by rice.researcher130