How to process the raw data of Illumina arrays (Illumina HumanHT-12 V4.0 expression BeadChip)
2.6 years ago
842978151 • 0

To get reliable expression matrices, I want to know how to process the two non-normalized data files from the GEO database (GSE98278 and GSE47472). Here are the column names of the two non-normalized files. Could someone share the code? This would be a great help to me.


[Two screenshots of the column names, not reproduced here]

2.6 years ago
Gordon Smyth ★ 7.0k
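library(limma)
# Read each GEO file; expr is the pattern that matches the expression columns
x1 <- read.ilmn("GSE98278_Non-normalized_data.txt.gz", expr="signal")
# neqc() background corrects, quantile normalizes and log2-transforms the intensities
y1 <- neqc(x1)
x2 <- read.ilmn("GSE47472_Rawdata_GEO_AAA_Neck.txt.gz", expr="Sample")
y2 <- neqc(x2)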

Based on the code you gave me, I got the following results, but they don't look like the expression matrices I've seen before. Is that right? The first picture shows that we have lost the gene IDs. What is the problem? The third picture is an expression matrix that has already been normalized, but its data don't look as even as in the first two pictures. Why?

[Three screenshots of expression matrices, not reproduced here]


OK, I agree that the values look wrong and I have now had a closer look at the contents of the so-called raw data files. The problem is that the data files uploaded by the authors of these GEO submissions are non-standard. Rather than uploading raw intensities exactly as exported by the Illumina software, the authors have uploaded processed files with non-standard column names and non-standard entries.

To read the gene IDs for the data in the first picture, you could specify probeid="ID_REF" in the read.ilmn function call.
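As a minimal sketch of that suggestion, the first read.ilmn call from the answer above can be repeated with probeid="ID_REF" added. This assumes the first picture corresponds to the GSE98278 file and that its probe identifier column is named ID_REF. The wrapper name `read_gse98278` is hypothetical and only there for illustration.

```r
# Hypothetical helper (not part of limma): re-read the GSE98278 file so that
# the probe IDs are kept. Assumes limma is installed and the file has been
# downloaded from GEO into the working directory.
read_gse98278 <- function(file = "GSE98278_Non-normalized_data.txt.gz") {
  limma::read.ilmn(
    file,
    expr    = "signal",  # pattern matching the expression columns, as in the answer above
    probeid = "ID_REF"   # assumed name of the column holding the probe identifiers
  )
}

# Usage, once the file is available locally:
# x1 <- read_gse98278()
# head(rownames(x1$E))   # the probe IDs should now appear as row names
```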

However, there is a greater problem. The GSE98278 file appears to contain normalized log2 expression values instead of raw intensities. Hence you can simply extract the signal columns and assume they are normalized log2 values. Or, better, you could download the IDAT files instead and read them using the limma read.idat function. Then you could be sure that everything is correct.
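Both options for GSE98278 might look roughly like the sketch below. The helper names, the `upper = 20` cut-off in the range check and the folder layout are my own assumptions, not something taken from the files. The IDAT route also needs the matching Illumina .bgx manifest for the HumanHT-12 V4.0 chip, which has to be obtained separately.

```r
# Option 1: take the "signal" columns as they are and skip neqc().
# Assumption: the values are already normalized log2 expression values.
extract_log2_matrix <- function(file = "GSE98278_Non-normalized_data.txt.gz") {
  x <- limma::read.ilmn(file, expr = "signal", probeid = "ID_REF")
  x$E  # matrix of values exactly as stored in the file
}

# Rough sanity check (a heuristic, not a limma function): log2 intensities
# normally stay well below 20, whereas raw intensities run into the thousands.
looks_log2 <- function(m, upper = 20) {
  max(m, na.rm = TRUE) <= upper
}

# Option 2: start again from the IDAT files.
# Assumptions: the IDAT files have been downloaded and unpacked into idat_dir,
# and bgx_file is the path to the matching .bgx manifest.
read_idat_and_normalize <- function(idat_dir, bgx_file) {
  idat_files <- list.files(idat_dir, pattern = "\\.idat$",
                           full.names = TRUE, ignore.case = TRUE)
  x <- limma::read.idat(idat_files, bgx_file)
  # Optional: detection p-values computed from the negative control probes
  x$other$Detection <- limma::detectionPValues(x)
  limma::neqc(x)  # background correct, quantile normalize, log2 transform
}
```

Running `looks_log2()` on the matrix from option 1 is only a quick plausibility check. It cannot prove how the authors processed the data, which is why the IDAT route is the safer one.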

For the GSE47472 dataset, the authors appear to have converted the intensity values for each sample to percentages relative to the 75th percentile for that sample, a practice that makes it impossible to recover the log2 intensity scale. Very frustrating. You could try running neqc with a smaller offset value for that dataset, say offset=1, and see if it improves things.
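The smaller-offset suggestion could be tried along these lines. This is only a sketch: `normalize_with_small_offset` is a hypothetical wrapper, and the read.ilmn call is copied unchanged from the original answer, so it assumes the file provides whatever neqc needs (control probes or detection p-values).

```r
# Hypothetical helper (not part of limma): rerun neqc() on the GSE47472 file
# with a smaller offset than the neqc() default of 16.
normalize_with_small_offset <- function(
    file   = "GSE47472_Rawdata_GEO_AAA_Neck.txt.gz",
    offset = 1) {
  x2 <- limma::read.ilmn(file, expr = "Sample")  # same call as in the answer above
  limma::neqc(x2, offset = offset)
}

# Usage, once the file is available locally:
# y2 <- normalize_with_small_offset()
# boxplot(y2$E, las = 2)   # compare the per-sample distributions by eye
```

Because the values were rescaled to percentages, the result is still not on a true log2 intensity scale. The boxplot is just a way to judge whether the smaller offset makes the samples look more reasonable.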
