Question

normalization after merging datasets

0

4.1 years ago

parinv ▴ 80

I merged three datasets, two from the same platform and one from a different platform. After merging, I performed normalization and tried to visualize the result with a boxplot, but I am not getting a proper boxplot. I used the following code:

# normalization of merged file
# (changed to a SummarizedExperiment file)
mergenorm <- normalize(sum, norm.method = "quantile", data.type = "ma")
# converted to a matrix for the boxplot
JM3 <- assay(mergenorm)
# boxplot of the normalized data; JM3 is already a matrix, so exprs() is not needed
boxplot(JM3)

boxplot: ![got a boxplot in this image][1]

Can anyone suggest what went wrong, or whether I should use a different plot?

R • 1.7k views

ADD COMMENT • link updated 4.1 years ago by svlachavas ▴ 790 • written 4.1 years ago by parinv ▴ 80

2

First off, normalizing between different microarray platforms is generally futile - the discrepancies between them are just too vast to compare between platforms.

Second, there is not enough information here for us to help you. I'm assuming this is RNA microarray data, but you should explicitly state that. What are you using to process these? We need a minimal, complete example of how you're dealing with this. What package is the boxplot function from? What is your end goal?

ADD REPLY • link 4.1 years ago by jared.andrews07 ★ 16k

1

4.1 years ago

svlachavas ▴ 790

First, as Jared mentioned, you should provide detailed information about your experimental design and biological question rather than just posting some code chunks; others will then be more willing and able to help you.

In conjunction with the above answers, you might also want to check this:

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2641-8

Additionally, if the datasets share the same phenotype and have similar experimental conditions, you can perform more "elegant" DE tests such as roast and mroast, which test whether the DEG list from one experiment shows the "same behaviour" in another dataset, reducing the need to merge expression data.
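As a rough illustration of the roast idea, here is a minimal sketch on simulated data. It assumes the limma package is installed; the matrix `y`, the gene names, and the DEG list `deg_from_study1` are all made up for the example and are not from the thread.

```r
# Toy second dataset: 200 genes x 8 samples (simulated, illustrative only)
set.seed(3)
n_genes <- 200
y <- matrix(rnorm(n_genes * 8), nrow = n_genes,
            dimnames = list(paste0("gene", seq_len(n_genes)), paste0("S", 1:8)))
group <- factor(rep(c("control", "case"), each = 4),
                levels = c("control", "case"))
design <- model.matrix(~ group)

# Hypothetical DEG list obtained from the first experiment
deg_from_study1 <- paste0("gene", 1:15)
deg_index <- which(rownames(y) %in% deg_from_study1)

# Spike in a shift so that the gene set carries signal in this toy example
y[deg_index, group == "case"] <- y[deg_index, group == "case"] + 1

# Rotation gene set test: does the set change between groups in this dataset?
if (requireNamespace("limma", quietly = TRUE)) {
  res <- limma::roast(y, index = deg_index, design = design,
                      contrast = 2, nrot = 999)
  print(res)
}
```

With several gene sets (for example, up- and down-regulated DEGs kept separately), `limma::mroast()` accepts a list of index vectors in place of the single `index`.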

Finally, a separate semantic or functional analysis of each dataset might reveal commonly perturbed biological mechanisms.

Efstathios

ADD COMMENT • link 4.1 years ago by svlachavas ▴ 790

0

Thank you so much for sharing this.

ADD REPLY • link 4.1 years ago by parinv ▴ 80

score 4 · Accepted Answer · 2020-02-29

4

4.1 years ago

Kevin Blighe 87k

I agree with Jared's comment above. I think that boxplot() may just be the standard function that comes with base R, though.

pv, although the issue that you want to address is related to the boxplot, it is important to understand your general methodology here. One cannot just take 3 datasets from GEO and then 'hack' them together without justification. Even if the datasets are related to the same microarray type and version, batch effects will still exist.

For what it is worth, the boxplot is simply too crowded, and it looks like there is an extreme number of outliers, which is what one would expect when normalising disparate datasets together.
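If the aim is just a less crowded view, something along these lines may help. This is only a sketch in base R on simulated data; `expr`, `batch`, and the dataset labels are placeholders for the merged matrix and the dataset of origin of each sample.

```r
# Simulated stand-in for a merged expression matrix (probes x samples)
set.seed(1)
n_probes <- 500
expr <- cbind(
  matrix(rnorm(n_probes * 20, mean = 7,   sd = 1.5), nrow = n_probes),  # dataset A
  matrix(rnorm(n_probes * 20, mean = 7.5, sd = 1.5), nrow = n_probes),  # dataset B
  matrix(rnorm(n_probes * 10, mean = 9,   sd = 2),   nrow = n_probes)   # dataset C
)
colnames(expr) <- paste0("S", seq_len(ncol(expr)))
batch <- factor(rep(c("A", "B", "C"), times = c(20, 20, 10)))

palette_batch <- c("steelblue", "darkorange", "forestgreen")
cols <- palette_batch[as.integer(batch)]

# Boxplot without individual outlier points, coloured by dataset of origin
boxplot(expr, outline = FALSE, col = cols, las = 2, cex.axis = 0.5,
        ylab = "log2 expression", main = "Samples coloured by dataset")

# Per-sample density curves overlaid on one set of axes
dens <- lapply(seq_len(ncol(expr)), function(j) density(expr[, j]))
plot(NA, xlim = range(expr),
     ylim = c(0, max(sapply(dens, function(d) max(d$y)))),
     xlab = "log2 expression", ylab = "density")
for (i in seq_along(dens)) lines(dens[[i]], col = cols[i])
legend("topright", legend = levels(batch), lty = 1, col = palette_batch)
```

Colouring by dataset makes any residual shift between the datasets visible at a glance, which individual outlier points tend to hide.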

For what it is worth, I have given answers in this area previously:

Kevin

ADD COMMENT • link 4.1 years ago by Kevin Blighe 87k

0

Thank you. Here I merged 3 microarray datasets: two are from Affymetrix HG-U133_Plus_2 and one is from Affymetrix HG-U133A. I normalized each dataset separately using the affy package and removed batch effects using the limma package. I then created a gene list and merged all three datasets. After merging, I again performed normalization and batch-effect removal. To visualize the normalized data, I used the standard boxplot() function.

I have a list of questions if you can answer them:

Is it important to perform normalization after merging the data, or can I skip that step and only remove batch effects?
Should I convert the data to Z-scores?
Can you please elaborate on the Z-score from your previous answer? Why is it important? What difference does it make?

Parinv.

ADD REPLY • link 4.1 years ago by parinv ▴ 80

2

I am not sure that it's a good idea to apply a batch correction twice... Have you tracked the values of some of the probes to see how they have changed after the 2 batch corrections...? Technically, one should not even have to directly modify the data for batch.

Z-scores are intuitive to apply to data that is already normalised. The Z-transformation converts values to 'standard deviations from the mean'. These are sometimes called 'standard scores' because they are standardised across data-types.
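As a small sketch of the Z-transformation in base R: each value becomes (value - probe mean) / probe standard deviation. The matrix `expr` below is simulated and stands in for already-normalised expression data.

```r
# Simulated normalised expression matrix: 5 probes x 6 samples
set.seed(42)
expr <- matrix(rnorm(5 * 6, mean = 8, sd = 2), nrow = 5,
               dimnames = list(paste0("probe", 1:5), paste0("S", 1:6)))

# scale() standardises columns, so transpose to standardise each probe (row)
z <- t(scale(t(expr)))

# After the transformation each probe has mean 0 and standard deviation 1
rowMeans(z)
apply(z, 1, sd)
```

Here the scaling is done per probe across samples; scaling per sample instead would just mean dropping the two `t()` calls.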

With your data, I would process/normalise each dataset separately, filter them for common probes (across the 3), and then merge them together, including batch as a covariate in the limma model.
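A minimal sketch of that workflow, assuming each dataset has already been normalised on its own. The three matrices are simulated, and `make_set`, the probe names, and the control/case grouping are invented for illustration; the limma step only runs if limma is installed.

```r
set.seed(7)

# Helper that simulates one normalised dataset (hypothetical, for illustration)
make_set <- function(prefix, probes, n, shift) {
  m <- matrix(rnorm(length(probes) * n, mean = 8 + shift, sd = 1),
              nrow = length(probes))
  dimnames(m) <- list(probes, paste0(prefix, "_", seq_len(n)))
  m
}
set1 <- make_set("A", paste0("probe", 1:100), 6, 0)
set2 <- make_set("B", paste0("probe", 1:100), 6, 0.5)
set3 <- make_set("C", paste0("probe", 1:60),  6, 1)    # smaller array

# Keep only probes present in all three datasets, then merge by column
common <- Reduce(intersect,
                 list(rownames(set1), rownames(set2), rownames(set3)))
merged <- cbind(set1[common, ], set2[common, ], set3[common, ])

# Batch (dataset of origin) enters the model as a covariate;
# the expression values themselves are left unmodified
batch <- factor(rep(c("A", "B", "C"), each = 6))
group <- factor(rep(rep(c("control", "case"), each = 3), times = 3),
                levels = c("control", "case"))
design <- model.matrix(~ batch + group)

if (requireNamespace("limma", quietly = TRUE)) {
  fit <- limma::eBayes(limma::lmFit(merged, design))
  print(limma::topTable(fit, coef = "groupcase", number = 5))
}
```

Because batch sits in the design matrix, the case-versus-control estimate is adjusted for dataset of origin without a separate batch-removal step on the data.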

ADD REPLY • link 4.1 years ago by Kevin Blighe 87k