Question

How can I tell whether an expression matrix on GEO is normalized?

1

4.7 years ago

MatthewP ★ 1.4k

Hello, I always assumed that expression matrices on GEO are normalized. However, today I got implausibly large log2FC values from GSE85957.

> head(expr_3)
# A tibble: 6 x 8
  SYMBOL  logFC AveExpr     t    P.Value adj.P.Val      B ENTREZID
  <chr>   <dbl>   <dbl> <dbl>      <dbl>     <dbl>  <dbl> <chr>   
1 Spp1    3419.   2421. 10.2  0.00000368   0.00128  0.221 25353   
2 Gstp1   2125.   2338. 10.3  0.00000328   0.00128  0.243 24426   
3 Cyp2e1  2047.   2833.  4.32 0.00204      0.0235  -1.89  25086

Here is how I extracted the expression data:

library(GEOquery)  # provides getGEO()
library(Biobase)   # provides exprs()

gse_path <- "/datapool/pengguoyu/Microarray/20190711_geo/rawdata/GSE85957_series_matrix.txt.gz"
gse <- getGEO(filename=gse_path, AnnotGPL=TRUE)
expr <- exprs(gse)

So I went back to check the expression matrix:

PROBEID GSM2288460      GSM2288461      GSM2288462      GSM2288463      GSM2288464 
1367452_at      1165.0328       1011.4838       1193.8429       1143.6874       1162.2721
1367453_at      512.07166       519.57355       502.8087        433.26254       480.2318    
1367454_at      647.18243       619.50635       673.89526       644.89575       685.5907        
1367455_at      1226.1555       1299.9249       1318.0239       1363.5055       1308.6063          
1367456_at      1530.6841       1611.0748       1768.4469       1761.0474       1751.5911           
1367457_at      426.08826       282.9359        433.74475       421.27148       445.81595

This looks like data that has not been normalized. How can I tell whether a given expression matrix from GEO is normalized, and whether I can apply limma's lmFit function to it directly? Thanks.

geo limma • 4.8k views

score 5 · Accepted Answer · 2019-08-22

5

4.7 years ago

jared.andrews07 ★ 16k

Well, the easiest way is to read the data processing sections for each sample.

The data were analyzed with Microarray Suite version 5.0 (MAS 5.0) using GeneData Expressionist® Pro Refiner. The trimmed mean target intensity of each array was arbitrarily set to 100.

The next easiest is to use GEO's built-in analysis tools (really just R scripts, but whatever) to view the value distributions. RMA-normalized microarrays are typically very obvious, because the quantile normalization step makes the per-sample distributions essentially identical.

That isn't the case here, though the distributions aren't completely nuts. These have clearly been normalized in some capacity, though you'd typically hope for a bit more detail on how it was done. I don't know exactly what limma expects (does it expect log2 values?), so you may want to do a little more reading on that. You could also normalize through limma and just see if the results make more sense.
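The distribution check described above can also be done numerically. The following is only a minimal base-R sketch: it uses the six probes shown in the question (far too few for a real assessment, so on real data you would use the full matrix from exprs()), and the helper name has_identical_distributions is made up for illustration. It sorts each sample's values and tests whether the sorted columns coincide, which is what quantile normalization would be expected to produce.

```r
# Values copied from the matrix excerpt in the question (6 probes x 5 samples).
expr <- matrix(c(
  1165.0328, 1011.4838, 1193.8429, 1143.6874, 1162.2721,
  512.07166, 519.57355, 502.8087,  433.26254, 480.2318,
  647.18243, 619.50635, 673.89526, 644.89575, 685.5907,
  1226.1555, 1299.9249, 1318.0239, 1363.5055, 1308.6063,
  1530.6841, 1611.0748, 1768.4469, 1761.0474, 1751.5911,
  426.08826, 282.9359,  433.74475, 421.27148, 445.81595
), nrow = 6, byrow = TRUE,
dimnames = list(
  c("1367452_at", "1367453_at", "1367454_at",
    "1367455_at", "1367456_at", "1367457_at"),
  c("GSM2288460", "GSM2288461", "GSM2288462", "GSM2288463", "GSM2288464")
))

# Hypothetical helper: TRUE if every sample has the same sorted values,
# i.e. the per-sample distributions are identical (up to a tolerance).
has_identical_distributions <- function(x, tol = 1e-8) {
  sorted <- apply(x, 2, sort)  # sort each column (sample) independently
  all(apply(sorted, 1, function(r) diff(range(r)) < tol))
}

same_dist <- has_identical_distributions(expr)

# Per-sample quantiles give a quick textual view of the distributions.
sample_quantiles <- apply(expr, 2, quantile, probs = c(0.25, 0.5, 0.75))
print(same_dist)
print(sample_quantiles)
```

On this excerpt the sorted columns differ, which is consistent with the data not being quantile normalized; it says nothing about whether some other scaling was applied.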

5

You posted while I was writing my own answer, but, yes, the samples are normalised.

Here was my answer:

---------------------------------

The answer is that you can never be sure. GEO even states on its website (somewhere) that it cannot guarantee that each dataset will be normalised. This is partly why data curation can be so problematic and time-consuming.

I have looked at your dataset, though, and the data is normalised; however, the normalisation method that was used was MAS 5.0, which is not as common as RMA normalisation. If you look at an individual sample record, you will see this:

Data processing: The data were analyzed with Microarray Suite version 5.0 (MAS 5.0) using GeneData Expressionist® Pro Refiner. The trimmed mean target intensity of each array was arbitrarily set to 100.

[source: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2288450]

So, when you download the data, I think that you should log2 transform it. MAS 5.0 normalisation does not involve any log2 transformation (unlike RMA).
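As a rough aid for deciding whether a matrix still needs log2 transformation, one can look at the range of its values. The sketch below is modelled on the kind of quantile check that GEO2R's generated R script appears to use; the thresholds are rules of thumb rather than guarantees, and the function name looks_unlogged is hypothetical.

```r
# Hypothetical helper: guess whether intensities are on the raw (unlogged) scale.
# log2 microarray intensities rarely exceed ~16, so a 99th percentile above 100,
# or a very wide range with a positive lower quartile, suggests unlogged data.
looks_unlogged <- function(x) {
  qx <- as.numeric(quantile(x, c(0, 0.25, 0.5, 0.75, 0.99, 1), na.rm = TRUE))
  (qx[5] > 100) || (qx[6] - qx[1] > 50 && qx[2] > 0)
}

# First sample column from the excerpt in the question (GSM2288460).
gsm2288460 <- c(1165.0328, 512.07166, 647.18243,
                1226.1555, 1530.6841, 426.08826)

raw_flag  <- looks_unlogged(gsm2288460)        # values as downloaded
log2_flag <- looks_unlogged(log2(gsm2288460))  # same values after log2
print(c(raw = raw_flag, log2 = log2_flag))
```

On real data you would pass exprs(gset) instead of a single short vector, and treat the result as a hint to be confirmed against the sample's data processing description.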

If you plot a histogram of your pre- and post-transformed data, you will instantly see the effect of log2 transformation:

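library(Biobase)
library(GEOquery)

# Download the series matrix; skip the platform annotation
gset <- getGEO("GSE85957", GSEMatrix=TRUE, getGPL=FALSE)
# If the series spans several platforms, pick the GPL1355 entry
if (length(gset) > 1) idx <- grep("GPL1355", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# Side-by-side histograms: values as downloaded vs. log2-transformed
par(mfrow=c(1,2))
hist(exprs(gset))
hist(log2(exprs(gset)))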

So, in summary:

your data is normalised by MAS 5.0
for downstream applications, you should log2 transform it
if you prefer RMA normalisation, re-process the CEL files
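To illustrate why the logFC values in the question were in the thousands: for a simple two-group comparison, the coefficient a linear model reports is the difference between group means on whatever scale it is given. The toy base-R sketch below uses invented intensities for a single probe (not values from GSE85957) and does not call limma itself; it only compares the two scales.

```r
# Invented intensities for one probe: 3 control and 3 treated arrays.
control <- c(400, 450, 500)
treated <- c(3600, 3900, 4200)

# Difference of group means on the raw scale: an intensity difference,
# which is what gets labelled "logFC" if unlogged data are modelled.
diff_raw <- mean(treated) - mean(control)

# Difference of group means on the log2 scale: a genuine log2 fold change.
diff_log2 <- mean(log2(treated)) - mean(log2(control))

print(c(raw_scale = diff_raw, log2_scale = diff_log2))
```

The raw-scale difference is in the thousands, while the log2-scale difference is a small number of the size usually expected for a log2 fold change.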

Kevin

ADD REPLY • link 4.7 years ago by Kevin Blighe 87k