Question

How to process the raw data of Illumina arrays (Illumina HumanHT-12 V4.0 expression BeadChip)

0

2.6 years ago

842978151 • 0

To get reliable expression matrices, I want to know how to process the two non-normalized data files from the GEO database (GSE98278 and GSE47472). Here are the column names of the two non-normalized data files. Could someone share the code? This would be a great help to me.

illumina • 1.2k views

ADD COMMENT • link updated 2.5 years ago by Gordon Smyth ★ 7.0k • written 2.6 years ago by 842978151 • 0

0

[Image: column names of the two non-normalized data files]

ADD REPLY • link 2.6 years ago by 842978151 • 0

score 2 · Answer 1 · 2021-09-28

2

2.6 years ago

Gordon Smyth ★ 7.0k

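library(limma)
# Read each GEO file; expr is the text that identifies the intensity columns
x1 <- read.ilmn("GSE98278_Non-normalized_data.txt.gz", expr="signal")
# neqc: normexp background correction, quantile normalization and log2 transformation
y1 <- neqc(x1)
x2 <- read.ilmn("GSE47472_Rawdata_GEO_AAA_Neck.txt.gz", expr="Sample")
y2 <- neqc(x2)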

ADD COMMENT • link 2.6 years ago by Gordon Smyth ★ 7.0k

0

Based on the code you gave me, I got the following results, but they don't look like the expression matrices I've seen before. Is that right? The first picture shows that the gene IDs have been lost; what is the problem? The third picture is an expression matrix that has already been normalized, but the data don't look as even as in the first two pictures. Why?

[Images: screenshots of the resulting expression matrices]

ADD REPLY • link 2.5 years ago by 842978151 • 0

0

OK, I agree that the values look wrong and I have now had a closer look at the contents of the so-called raw data files. The problem is that the data files uploaded by the authors of these GEO submissions are non-standard. Rather than uploading raw intensities exactly as exported by the Illumina software, the authors have uploaded processed files with non-standard column names and non-standard entries.

To read the gene id for the first picture you could specify probeid="ID_REF" in the read.ilmn function call.
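As a minimal sketch of that suggestion, assuming the GEO supplementary file has been downloaded to the working directory and that its probe identifier column really is named ID_REF:

```r
library(limma)

# probeid="ID_REF" tells read.ilmn which column holds the probe identifiers;
# expr="signal" is the text that identifies the expression columns.
x1 <- read.ilmn("GSE98278_Non-normalized_data.txt.gz",
                probeid = "ID_REF", expr = "signal")

# Inspect the result to check whether the probe IDs are now present
head(x1$genes)
head(x1$E)
```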

However there is a greater problem. The GSE98278 file appears to contain normalized log2 expression values instead of raw intensities. Hence you can simply extract the signal columns and assume they are normalized log2 values. Or better, you could download the idat files instead and read them using the limma read.idat function. Then you could be sure that everything is correct.
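Both options might look roughly like the sketch below. It is only an illustration under several assumptions: the helper extract_signal, the directory name GSE98278_idat and the bgx file name are hypothetical, the signal columns are assumed to contain the text "signal" in their names, and the file is assumed to have no extra header lines before the column names. The idat files and the matching HumanHT-12 v4 bgx manifest must be downloaded and decompressed separately, and read.idat needs the illuminaio package to be installed.

```r
# Option 1: treat the "signal" columns of the GEO file as normalized log2 values.
# extract_signal is a hypothetical helper, not part of limma.
extract_signal <- function(tab, id_col = "ID_REF", pattern = "signal") {
  sig <- grep(pattern, colnames(tab), ignore.case = TRUE)
  E <- as.matrix(tab[, sig, drop = FALSE])
  rownames(E) <- tab[[id_col]]
  E
}

geo_file <- "GSE98278_Non-normalized_data.txt.gz"
if (file.exists(geo_file)) {
  tab <- read.delim(geo_file, check.names = FALSE)
  E1 <- extract_signal(tab)
  # Values on the log2 scale should mostly lie below about 16
  print(summary(as.vector(E1)))
}

# Option 2: start again from the idat files.
idat_dir <- "GSE98278_idat"      # hypothetical folder holding the *.idat files
bgxfile  <- "HumanHT-12_V4.bgx"  # hypothetical name for the Illumina manifest file
idatfiles <- dir(idat_dir, pattern = "\\.idat$", full.names = TRUE)
if (length(idatfiles) > 0 && file.exists(bgxfile)) {
  library(limma)
  x <- read.idat(idatfiles, bgxfile)
  # Detection p-values are computed from the negative control probes
  x$other$Detection <- detectionPValues(x)
  y <- neqc(x)
}
```

The file checks are there only so that the script does not fail when a file has not been downloaded yet.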

For the GSE47472 dataset, the authors appear to have converted the intensity values for each sample to percentages relative to the 75th percentile for that sample, a practice that makes it impossible to recover the log2 intensity scale. Very frustrating. You could try running neqc with a smaller offset value for that dataset, say offset=1, and see if it improves things.
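A sketch of that experiment, assuming the file is in the working directory; the summary and boxplot are just one possible way to judge whether the smaller offset helps:

```r
library(limma)

x2 <- read.ilmn("GSE47472_Rawdata_GEO_AAA_Neck.txt.gz", expr = "Sample")

# The default offset in neqc is 16, which is large relative to percentage-scale values
y2_default <- neqc(x2)
y2_small   <- neqc(x2, offset = 1)

# Compare the spread of the normalized values under the two offsets
summary(as.vector(y2_default$E))
summary(as.vector(y2_small$E))
boxplot(y2_small$E, las = 2, main = "GSE47472, neqc with offset=1")
```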

ADD REPLY • link 2.5 years ago by Gordon Smyth ★ 7.0k