Question

Is it correct to merge two raw CEL datasets and then perform quality control, background correction, and normalization on it?

0


2.9 years ago

Sib ▴ 60

Hello, biostars.

I have raw CEL files from two microarray datasets, which I read into R with the ReadAffy function, and I ultimately want to remove the batch effect between them. Is it correct to merge the two datasets first, then perform quality control, background correction, and normalization on the merged data, and finally remove the batch effect? Or should I perform quality control, background correction, and normalization on each dataset separately, then merge them and remove the batch effect?

R microarray affymetrix batch-effect quality-control • 816 views

updated 8 days ago by Ram 43k • written 2.9 years ago by Sib ▴ 60

score 0 · Answer 1 · 2021-06-16

0


2.9 years ago

Ahill ★ 1.9k

You should read all CEL files from both datasets to yield a single merged starting dataset, and then perform QC, normalization, and batch effect removal on the combined dataset. The raw CEL intensities are not context-dependent: they will be the same whether you read the CEL files as two batches or all together as a single group. Once you have the CEL intensities, carry out the subsequent steps on the entire combined dataset, including checking for batch effects and correcting for them if needed.
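The workflow described above could be sketched roughly as follows. This is only an illustration, not a tested pipeline: the function name `merge_then_process` and its directory arguments are hypothetical, RMA is assumed as the background correction and normalization method, ComBat from the sva package is assumed as the batch correction method, and both datasets are assumed to come from the same array type so that they fit in a single AffyBatch.

```r
# Sketch only: the function name and the two directory arguments are hypothetical.
# Calling it requires the Bioconductor packages affy, Biobase, and sva.
merge_then_process <- function(dir_a, dir_b) {
  # Collect the CEL file paths of each dataset.
  files_a <- affy::list.celfiles(dir_a, full.names = TRUE)
  files_b <- affy::list.celfiles(dir_b, full.names = TRUE)

  # Read every CEL file into one AffyBatch
  # (assumes both datasets use the same array type).
  raw <- affy::ReadAffy(filenames = c(files_a, files_b))

  # RMA background correction, quantile normalization, and summarization,
  # run on all arrays together.
  eset <- affy::rma(raw)

  # Batch labels, in the same order as the files were read.
  batch <- factor(rep(c("A", "B"),
                      times = c(length(files_a), length(files_b))))

  # Batch adjustment on the normalized log2 expression matrix
  # (ComBat is an assumed choice, not one named in the answer).
  corrected <- sva::ComBat(dat = Biobase::exprs(eset), batch = batch)

  list(raw = raw, eset = eset, batch = batch, corrected = corrected)
}
```

A call would look like `res <- merge_then_process("dataset_A", "dataset_B")`, where both directory names are placeholders. QC plots such as `boxplot(res$raw)` can then be inspected on the combined object before and after correction.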


0


Thanks for your answer. What is the reason for that? For example, in background correction, the background intensities are expected to be similar across the probes within one dataset but different from the background intensities of the other dataset. If we merge the two datasets first and then perform background correction, the amount of background subtracted from the probe intensities does not make sense. The same goes for normalization.

The same concern applies to quality control, given the methods used in QC packages such as affyPLM and simpleaffy. affyPLM takes the difference between the log expression on each chip and the log expression on a reference chip, which is constructed as the median expression value over all chips; if we merge two datasets that have a batch effect, the reference chip may not be constructed correctly. simpleaffy assumes that the trimmed mean intensity should be constant across arrays, which cannot hold when we merge two different datasets with a batch effect: within each dataset the chips have a constant intensity, but it cannot match the intensity of the other dataset.

For these reasons, I think we should not merge the two datasets before quality control. What do you think about these reasons? Do they make sense?
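The median reference chip described above can be illustrated with a small simulation in base R. This is a simplified stand-in for the relative-log-expression idea, not affyPLM's actual implementation; the data are simulated, and the batch shift of 1 log2 unit, the array counts, and all variable names are assumptions made purely for illustration.

```r
set.seed(1)
n_probes <- 1000

# Simulated "true" log2 expression per probe, shared by every array.
truth <- rnorm(n_probes, mean = 8, sd = 2)

# Four arrays per batch; batch B carries a hypothetical shift of +1 log2 unit.
batch <- rep(c("A", "B"), each = 4)
shift <- ifelse(batch == "B", 1, 0)
log_expr <- sapply(shift, function(s) truth + s + rnorm(n_probes, sd = 0.1))

# Reference chip built from all arrays pooled: probe-wise median.
reference_pooled <- apply(log_expr, 1, median)

# Relative log expression: each array minus the pooled reference.
rle_pooled <- sweep(log_expr, 1, reference_pooled)
rle_median_pooled <- apply(rle_pooled, 2, median)

# Same calculation with a separate reference chip per batch.
rle_median_within <- unlist(lapply(c("A", "B"), function(b) {
  x <- log_expr[, batch == b]
  apply(sweep(x, 1, apply(x, 1, median)), 2, median)
}))

# Per-batch average of the array-level medians, for comparison.
print(tapply(rle_median_pooled, batch, mean))
print(tapply(rle_median_within, batch, mean))
```

With the pooled reference, the per-array medians separate by batch, whereas with per-batch references they stay close to zero. Whether that separation is a distortion of the QC metric or a useful way to see the batch effect is exactly the point under discussion.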

2.9 years ago by Sib ▴ 60