Question

Microarray analysis in R (Limma): how to select genes exclusively expressed in one condition.

0

Entering edit mode

5.1 years ago

reubenmcgregor88 ▴ 50

Background: I am mining some data using the Limma package to look at differentially expressed genes.

I have imported the expression data such that values of 0 are NA.

y <- read_excel("exprs.xlsx", na="0")

My phenotypic data looks like this:

> pData
         tissue donor
    AV3     AV     3
    AV4     AV     4
    AV8     AV     8
    MV3     MV     3
    MV4     MV     4
    MV8     MV     8
    PV3     PV     3
    PV4     PV     4
    PV8     PV     8
    TV3     TV     3
    TV4     TV     4
    TV8     TV     8

So 3 donors (3,4,8), 4 tissue types (AV,MV,PV,TV).

group <- factor(pData$tissue) 

design <- model.matrix(~0+group)

contrast.matrix <- makeContrasts(
  mv_vs_tv = MV-TV,levels=design)

fit <- lmFit(y, design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2, trend=T)

This will give me differential expressed genes between MV and TV.

I would now like to pull out the genes that are expressed exclusively in MV and not TV, AV or PV. By this I mean there is 0 or "NA" in all groups except MV, in any donor.

For example if donor 3 shows expression of "geneX" in MV but donor 4 and 8 do not, and all donors have NA expression in AV, PV or TV, I would like this to be reported. As well as if 2 or 3 donors show geneX expression in MV but not any other tissues.

Can anyone suggest some code that would do this, within or outside f Limma? I tried a few code chunks but am struggling to get it to work.

Many thanks.

microarray R limma • 2.0k views

ADD COMMENT • link updated 5.1 years ago by h.mon 35k • written 5.1 years ago by reubenmcgregor88 ▴ 50

score 3 · Answer 1 · 2019-03-13

Due to the way that microarray profiling works, which is based on fluorescent signal intensities, it is very difficult to say that something is not expressed at all. Probes may fall into the background noise but virtually all will still return some signal, whether it be related to the target mRNA or not. This is different from RNA-seq where a 0 actually makes intuitive sense.

You may consider a better approach by converting the log2-normalised expression values to the Z scale and then taking, as an example, >4 as expressed and <-4 as not expressed. This simple procedure is employed frequently. For one, cBioPortal use Z-scores for identifying genes expressed in tumours over normals, as an example.

score 3 · Answer 2 · 2019-03-13

3

Entering edit mode

5.1 years ago

Friederike 8.9k

I absolutely second Kevin's advice, but if you need to figure out how to subset a data.frame based on NA values, these examples may give you some leads:

> tt <- data.frame(a = c(NA, 1,2,3), b = c(NA, NA, 1,2), c = c(NA, NA, NA, 1))
> tt
   a  b  c
1 NA NA NA
2  1 NA NA
3  2  1 NA
4  3  2  1
> subset(tt, is.na(b))
   a  b  c
1 NA NA NA
2  1 NA NA
> subset(tt, is.na(b) & is.na(a))
   a  b  c
1 NA NA NA

ADD COMMENT • link 5.1 years ago by Friederike 8.9k

0

Entering edit mode

Thank you Friederike, could you please expand on this.

Using your example I would like to select (for instance), row 2 based on a)it has NA in columns b and c, b) it has a non-NA in column a.

Using the "subset" above I could specific which rows to pull out based on NA but not which rows to pull out based on the fact stye have non-NA.

I have found a rather convoluted way of getting it but ay advice on how to simplify much appreciated:

exprs_MV_NA_2 <- 
  exprs%>%
  filteris.na(c(AV3))) %>%
  filteris.na(c(AV4))) %>%
  filteris.na(c(AV8))) %>%
  filteris.na(c(PV3))) %>%
  filteris.na(c(PV4))) %>%
  filteris.na(c(PV8))) %>%
  filteris.na(c(TV3))) %>%
  filteris.na(c(TV4))) %>%
  filteris.na(c(TV8)))

The above step gives me all rows where there is "NA" in all but the AV group

exprs_MV_NA_2 <- exprs_MV_NA_2[rowSumsis.na(exprs_MV_NA_2)) != ncol(exprs_MV_NA_2), ]

Then this step removes rows with all NA and seems to leave behind what I wanted...

ADD REPLY • link 5.1 years ago by reubenmcgregor88 ▴ 50

1

Entering edit mode

the expression for "not NA" is this: !is.na(x), i.e.: subset(tt, !is.na(a) & is.na(b) & is.na(c) )

ADD REPLY • link 5.1 years ago by Friederike 8.9k

score 2 · Answer 3 · 2019-03-13

Why are you considering zero as NAs in your intensity values? NAs have some special properties in R and most analyses drop them before performing calculations. See Gordon Smyth answer at the old Bioconductor thread Question: limma with missing values for a comment about how limma deals with NAs, and a recommendation of not introducing them.

I third the recommendation that it makes no sense to think about "not expressed" genes for microarrays. But I disagree with Kevin Blighe about RNAseq because:

1) even RNAseq has background noise, as most of the genome is transcribed at low levels (see the ENCODE Project paper An integrated encyclopedia of DNA elements in the human genome, also read Exactly 8.2% or 80% of the Human Genome is Functional to get acquainted with the controversy that arose from ENCODE).

2) thus, sequencing depth will play a large role in what you consider "non-expressed" in RNAseq. With shallow sequenced samples you will get more of "non-expressed" genes than with deeper sequenced samples.