Question

Negative gene expression values in "Illumina HumanHT-12 WG-DASL V4.0 R2 expression beadchip"

0

3.9 years ago

nazaninhoseinkhan ▴ 520

Dear all,

I am trying to analyze an "Illumina HumanHT-12 WG-DASL V4.0 R2 expression" beadchip.

However, when I open the non-normalized data series, several negative values are present.

Some of them are large in magnitude (for example, -20).

Please advise me on how to deal with these negative values.

Is it correct to change all negative values to 0?

I look forward to your comments.

Nazanin

Negative expression values • 2.5k views

ADD COMMENT • link updated 3.9 years ago by Kevin Blighe 87k • written 3.9 years ago by nazaninhoseinkhan ▴ 520

0

Hey Nazanin, could you share the study ID? I would not necessarily convert these to 0.

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

Hi Kevin, the study ID is GSE93825.

ADD REPLY • link 3.9 years ago by nazaninhoseinkhan ▴ 520

1

[image]

ADD REPLY • link 3.9 years ago by ATpoint 81k

score 0 · Answer 1 · 2020-05-16

0

3.9 years ago

Kevin Blighe 87k

The negative values are due to the background correction step, but there should not be negative values in the raw data, as far as I understand. The actual raw data appears to be contained in the GSE93825_RAW.tar file in this GEO record.

To be honest, I would just take the data via GEO2R and use that:

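library(Biobase)
library(GEOquery)
# Download the series matrix for GSE93825 without the platform annotation
gset <- getGEO("GSE93825", GSEMatrix = TRUE, getGPL = FALSE)
# If the series spans several platforms, select the GPL18281 entry
if (length(gset) > 1) idx <- grep("GPL18281", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]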

Kevin

ADD COMMENT • link 3.9 years ago by Kevin Blighe 87k
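To illustrate the point about background correction, here is a minimal simulated sketch. All numbers are invented (they are not taken from GSE93825), and the simple "subtract one global background value" step is an assumption about how an export might have been processed, not a description of what was done for this dataset. It shows why probes with little or no true signal end up scattered around zero, including below it, once a background estimate is subtracted.

```r
# Simulated illustration only: every number here is invented.
set.seed(1)

n_probes    <- 1000
background  <- 100                                        # assumed background level
true_signal <- c(rep(0, 500), rexp(500, rate = 1 / 200))  # half the probes unexpressed

# Observed intensity = true signal + background + measurement noise
observed <- true_signal + background + rnorm(n_probes, mean = 0, sd = 15)

# Naive global background subtraction (an assumption for this sketch)
corrected <- observed - background

min(observed)            # raw intensities stay positive
sum(corrected < 0)       # but many corrected values are negative
range(corrected[1:500])  # unexpressed probes scatter around zero
```

Setting such values to 0 would pile all unexpressed probes onto a single value and make a later log2 transform return -Inf, which is one reason to prefer a model-based correction on the raw intensities (for example limma's neqc(), which adds a positive offset before the log transform) over editing the values by hand.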

0

I need normalized expression values for this data set. Since my study includes several other data sets, I need to remove batch effects from all of them using the SVA package.

I wanted to use the limma package to analyze the raw data; however, the bgx file was not available. This bgx file is required by the read.idat() function.

So I decided to use the non-normalized file instead, but then I ran into the problem of negative values.

I found some solutions to this problem, for example finding the minimum among the negative values and adding it to all expression values, though I am not sure whether this solution is reliable.

ADD REPLY • link 3.9 years ago by nazaninhoseinkhan ▴ 520

0

I see. You can obtain the BGX file from here. I would favour this approach, following the guidance in the limma manual.

I found some solutions to this problem, for example finding the minimum among the negative values and adding it to all expression values, though I am not sure whether this solution is reliable.

This solution is not entirely invalid. Doing this just shifts the distribution:

x <- log2(matrix(rexp(20000, rate=.1), ncol=200))
hist(x, breaks = 50, col = 'skyblue', xlim = c(-20, 20))
hist((x + abs(min(x))), breaks = 50, col = 'pink', add = TRUE)

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

So doing this will shift the distribution away from normal (bell-shaped), and then a t-test cannot be applied?

ADD REPLY • link 3.9 years ago by nazaninhoseinkhan ▴ 520

0

No, it does not modify the shape of the distribution. It just moves it along the x-axis. So, whatever shape the data had before the shift, in my example above or in your data, it has exactly the same shape afterwards.

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k
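The claim that a constant shift leaves the analysis unchanged can be checked directly. The sketch below uses made-up log-scale values for a single probe in two hypothetical groups; it compares a Welch t-test before and after adding abs(min(...)) to every value. The three comparisons should each return TRUE.

```r
set.seed(42)

# Hypothetical log-scale values for one probe in two groups (invented data)
group_a <- rnorm(10, mean = -2, sd = 1)
group_b <- rnorm(10, mean = -1, sd = 1)

# Constant that makes every value non-negative
shift <- abs(min(c(group_a, group_b)))

before <- t.test(group_a, group_b)
after  <- t.test(group_a + shift, group_b + shift)

# The test statistic, p-value and spread are unaffected by the shift
all.equal(unname(before$statistic), unname(after$statistic))
all.equal(before$p.value, after$p.value)
all.equal(sd(group_a), sd(group_a + shift))
```

Note that this only holds when the constant is added on the scale on which the test is run. Adding a constant to raw intensities and then taking log2 is not a pure shift on the log scale, because log2(x + c) compresses low values more than high ones.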

0

I would start with the IDAT files, or the data from GEO2R. It is not clear what is present in the 'non_normalized' file.

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

I normalized the data using the neqc() function from limma in R 3.5.0. However, the majority of values are 8.9, which is strange!

ADD REPLY • link 3.9 years ago by nazaninhoseinkhan ▴ 520

0

Did you exclude the detection p-value columns?

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

No Kevin, I don't think so. This is the bunch of code I used:

idatfiles <- dir(pattern="idat")

bgxfile <- dir(pattern="bgx")

x <- read.idat(idatfiles, bgxfile)

x$other$Detection <- detectionPValues(x)

propexpr(data)

y <- neqc(data)

ADD REPLY • link 3.9 years ago by nazaninhoseinkhan ▴ 520

0

I see, you will want to remove those values, and also the standard deviation values; otherwise, the program will assume that they are [very low] expression levels and this could produce the original result that you observed.

You seem to have two objects here, too: x and data? If you check the colnames() of your data, it should reveal whether or not the detection p-values and standard deviations are still present.

You could try to follow what I am doing in this thread: A: illumina Arrays Illumina HumanHT-12 V3.0 expression beadchip reading data

The Illumina BeadArray data that is available in the public domain is quite annoying.

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k
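The colnames() check suggested above could look like the following sketch. The table is a hypothetical stand-in for a GEO "non-normalized" Illumina export: the column names (ID_REF, SAMPLE_1, SAMPLE_1.Detection.Pval, and so on) are assumptions and vary between submissions, so the pattern would need adapting to the real file.

```r
# Hypothetical table mimicking a "non-normalized" Illumina export;
# column names and values are invented for illustration.
raw <- data.frame(
  ID_REF                  = c("ILMN_1", "ILMN_2", "ILMN_3"),
  SAMPLE_1                = c(132.1, -20.4, 220.9),
  SAMPLE_1.Detection.Pval = c(0.001, 0.62, 0.000),
  SAMPLE_2                = c(122.7, -3.2, 198.4),
  SAMPLE_2.Detection.Pval = c(0.002, 0.48, 0.000),
  stringsAsFactors = FALSE
)

colnames(raw)

# Flag non-expression columns: detection p-values, standard deviations, bead counts
is_extra <- grepl("pval|detection|stdev|nbeads", colnames(raw), ignore.case = TRUE)

# Keep only the intensity columns as a numeric matrix, with probe IDs as row names
expr <- as.matrix(raw[, !is_extra & colnames(raw) != "ID_REF"])
rownames(expr) <- raw$ID_REF

stopifnot(is.numeric(expr))
dim(expr)
```

If the p-value columns were left in, they would be treated as extra "samples" with values between 0 and 1, which is the kind of artefact described in the reply above.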

0

Hey Kevin, I replaced "data" with x throughout the code, because I did not understand what "data" referred to:

idatfiles <- dir(pattern="idat")

bgxfile <- dir(pattern="bgx")

x <- read.idat(idatfiles, bgxfile)

x$other$Detection <- detectionPValues(x)

propexpr(x)

y <- neqc(x)

I also checked the results of the neqc() normalization; there are no columns for p-values or SD.

ADD REPLY • link 3.9 years ago by nazaninhoseinkhan ▴ 520

0

You could check that your columns are all numeric. They may be factors / categorical? What is the output of

class(x)
str(x)

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

Hey Kevin,

These are the results of class(x) and str(x):

> class(x)
[1] "EListRaw"
attr(,"package")
[1] "limma"
> str(x)
Formal class 'EListRaw' [package "limma"] with 1 slot
  ..@ .Data:List of 5
  .. ..$ : chr "illumina"
  .. ..$ :'data.frame': 40 obs. of  1 variable:
  .. .. ..$ IDATfile: chr [1:40] "GSM2463039_9020374101_A_Grn.idat" "GSM2463040_9020374101_B_Grn.idat" "GSM2463041_9020374101_D_Grn.idat" "GSM2463042_9020374101_E_Grn.idat" ...
  .. ..$ :'data.frame': 48210 obs. of  4 variables:
  .. .. ..$ Probe_Id        : chr [1:48210] "ILMN_3166687" "ILMN_3165565" "ILMN_3164808" "ILMN_3165363" ...
  .. .. ..$ Array_Address_Id: int [1:48210] 5270161 4230037 60372 5260356 6060692 6370471 1710435 1400612 5130189 70278 ...
  .. .. ..$ Status          : chr [1:48210] "regular" "regular" "regular" "regular" ...
  .. .. ..$ Symbol          : chr [1:48210] "ERCC-00162" "ERCC-00071" "ERCC-00009" "ERCC-00053" ...
  .. ..$ : num [1:48210, 1:40] 132 122 220 112 146 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:48210] "5270161" "4230037" "60372" "5260356" ...
  .. .. .. ..$ : chr [1:40] "GSM2463039_9020374101_A_Grn" "GSM2463040_9020374101_B_Grn" "GSM2463041_9020374101_D_Grn" "GSM2463042_9020374101_E_Grn" ...
  .. ..$ :List of 2
  .. .. ..$ NumBeads: num [1:48210, 1:40] 24 22 27 25 26 27 28 13 20 27 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:48210] "5270161" "4230037" "60372" "5260356" ...
  .. .. .. .. ..$ : chr [1:40] "GSM2463039_9020374101_A_Grn" "GSM2463040_9020374101_B_Grn" "GSM2463041_9020374101_D_Grn" "GSM2463042_9020374101_E_Grn" ...
  .. .. ..$ STDEV   : num [1:48210, 1:40] 47.7 40.8 87.6 26.2 47.6 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:48210] "5270161" "4230037" "60372" "5260356" ...
  .. .. .. .. ..$ : chr [1:40] "GSM2463039_9020374101_A_Grn" "GSM2463040_9020374101_B_Grn" "GSM2463041_9020374101_D_Grn" "GSM2463042_9020374101_E_Grn" ...
>

ADD REPLY • link 3.9 years ago by nazaninhoseinkhan ▴ 520

0

Thanks - that does not seem too strange.

You mentioned this: "the majority of values are 8.9".

How does the data appear on a histogram after normalisation? Note that you should remove ERCC controls after normalisation (I can see them in your pasted output, above).

The best for this study may be to obtain the data already normalised via GEO2R.

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k
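Removing the ERCC spike-in controls after normalisation could be sketched as follows. The objects below are small mock stand-ins for the probe annotation and expression matrix (what would be y$genes and y$E in a limma EList); the first two probe/symbol pairs are copied from the str() output above, while ILMN_0000001, ILMN_0000002, GENE_A, GENE_B and all expression values are invented.

```r
# Mock probe annotation and normalised expression matrix (partly invented)
genes <- data.frame(
  Probe_Id = c("ILMN_3166687", "ILMN_3165565", "ILMN_0000001", "ILMN_0000002"),
  Symbol   = c("ERCC-00162", "ERCC-00071", "GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)
E <- matrix(c(7.1, 7.3, 9.8, 6.4,
              7.0, 7.4, 9.6, 6.5),
            nrow = 4,
            dimnames = list(genes$Probe_Id, c("sample1", "sample2")))

# ERCC controls are recognised here by their symbol prefix
is_ercc <- grepl("^ERCC-", genes$Symbol)

# Drop the same rows from the annotation and the expression matrix
genes_kept <- genes[!is_ercc, ]
E_kept     <- E[!is_ercc, , drop = FALSE]

genes_kept
E_kept
```

With a real EList from limma, the same logical vector should be usable as a row index on the object itself (y[!is_ercc, ]), which keeps the annotation and expression values in sync.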

0

Hey Kevin,

When I tried to plot a histogram of the data after normalization, I got this error:

Error in hist.default(y, breaks = 50, col = "skyblue", xlim = c(-20, 20)) : 'x' must be numeric

However, I integrated the normalized data with the other data sets and passed them to SVA to remove batch effects.

Everything seems OK, and I was able to complete the differential expression analysis.

ADD REPLY • link 3.9 years ago by nazaninhoseinkhan ▴ 520
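A likely cause of the "'x' must be numeric" error is that neqc() returns an EList object rather than a plain matrix; the log2 expression values sit in its E component, so hist() needs y$E rather than y. The sketch below reproduces the error with a plain list standing in for the EList (an assumption made so that it runs without limma or the IDAT files; the values are simulated), then plots the matrix itself.

```r
# Stand-in for the object returned by neqc(): a list holding the matrix in $E.
# Values are simulated, not taken from GSE93825.
set.seed(7)
y <- list(E = matrix(rnorm(4000, mean = 8, sd = 1.5), ncol = 40))

# Passing the container rather than the matrix triggers the error
msg <- tryCatch(
  {
    hist(y, breaks = 50)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# Passing the numeric values works
h <- hist(as.vector(y$E), breaks = 50, col = "skyblue",
          main = "Normalised log2 expression", xlab = "log2 intensity")

# A numeric summary helps to check whether values really cluster at one number
summary(as.vector(y$E))
```

On the real data, summary(as.vector(y$E)) and the histogram together would show whether most values genuinely sit near 8.9 or whether that impression came from looking at only the first few rows.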