Question

Bioconductor, how to select a subset of samples in an ExpressionSet?

0

Entering edit mode

6.5 years ago

Davide Chicco ▴ 120

I'm working on an R script that downloads gene expression data from GEO, through Bioconductor and the getGEO() function.

These commands download all the 436 samples of the repository, but I'm only interested in 157 of them. Precisely, I'm interested in handling only the "samples collection:ch1" column with values ""on the 1st day of MI (admission)" or "N/A". How can I select this subset?

I tried to download the complete dataset into the gset variable, and then to apply the following command:

gset_reduced@phenoData@data <- gset@phenoData@data[gset@phenoData@data$"samples collection:ch1"=="on the 1st day of MI (admission)" | gset@phenoData@data$"samples collection:ch1"=="N/A", ]

but this way only the phenoData feature of the gset_reducted variable correctly contained the 157 samples. The other features of the gset_reducted variable, instead, still contain 436 samples.

How can I select my subset of samples right from the beginning?

Here's my working R code:

setwd(".")
options(stringsAsFactors = FALSE)

source("https://bioconductor.org/biocLite.R")

listOfBiocPackages <- c("oligo", "GEOquery", "affyio", "biomaRt", "sva", "pamr", "limma", "BiocParallel", "genefilter", "GO.db")

bioCpackagesNotInstalled <- which( !listOfBiocPackages %in% rownames(installed.packages()) )
cat("package missing listOfBiocPackages[", bioCpackagesNotInstalled, "]: ", listOfBiocPackages[bioCpackagesNotInstalled], "\n", sep="")

# check there's still something left to install
if( length(bioCpackagesNotInstalled) ) {
    biocLite(listOfBiocPackages[bioCpackagesNotInstalled])
}


library("oligo")
library("GEOquery")
library("affyio")
library("biomaRt")
library("sva")
library("pamr")
library("limma")
library("BiocParallel")

GSE_code <- "GSE59867"
getGEOSuppFiles(GSE_code) 
gset <- getGEO(GSE_code, GSEMatrix =TRUE, getGPL=FALSE)

if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

gset_reduced <- gset

gset_reduced@phenoData@data <- gset@phenoData@data[gset@phenoData@data$"samples collection:ch1"=="on the 1st day of MI (admission)" | gset@phenoData@data$"samples collection:ch1"=="N/A", ]

Can anyone help me?

Thanks!

bioconductor expressionset subset • 9.3k views

ADD COMMENT • link updated 2.4 years ago by GenoMax 152k • written 6.5 years ago by Davide Chicco ▴ 120

0

Entering edit mode

2.4 years ago

bennikkel • 0

Hi Davide,

I adapted Kevin's method. I think it will help. Notice that I'm using the tidyverse package.

filter_terms <- gset@phenoData@data %>%
  filter(str_detect(samples collection:ch1, 'on the 1st day of MI (admission)|N/A')) %>% 
  dplyr::select(samples collection:ch1) %>%
  as_vector()

filter <- gset@phenoData@data[gset@phenoData@data$samples collection:ch1 %in% filter_terms, ] %>% 
  rownames()

gset_filter <- gset[,filter]

For the dataset I used, I needed the row names from the phenoData to filter the column names of the expression data. This is why I used rownames() instead of colnames(gset) like Kevin suggested.

This is a weird combination of tidyverse and base R so it may not be perfect but it definitely worked for me.

Ben

ADD COMMENT • link updated 2.4 years ago by GenoMax 152k • written 2.4 years ago by bennikkel • 0

Kevin Blighe · Accepted Answer · 2019-01-04

4

Entering edit mode

6.5 years ago

Kevin Blighe 89k

The way that ExpressionSet objects work, you can just filter on the main object, and the changes will then carry through to all sub-components of the object:

filter <- colnames(gset)[gset@phenoData@data$"samples collection:ch1"=="on the 1st day of MI (admission)" | gset@phenoData@data$"samples collection:ch1"=="N/A"]

length(filter)
[1] 157

gset.filt <- gset[,filter]

gset.filt

ExpressionSet (storageMode: lockedEnvironment)
assayData: 33297 features, 157 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM1448335 GSM1448338 ... GSM1620804 (157 total)
  varLabels: title geo_accession ... samples collection:ch1 (34 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: GPL6244

dim(exprs(gset.filt))
[1] 33297   157

dim(pData(gset.filt))
[1] 157  34

nrow(gset.filt@phenoData@data)
[1] 157

Kevin

ADD COMMENT • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

I did try to filter out the probes as suggested here, but the output turns out to be empty. I think it has to do with the different column names (GSMIDs) in eset and phenodata (Gender, Case/Control etc). Do I have to transpose one of them so that the filtering works? I plan to group the set in control and case groups and later perform MA plot between the groups. I am not sure if that's the right approach and would appreciate any suggestions.

This is what I tried

filter=colnames(g123)[g123@phenoData@data$"Source"=="CASE"] dim(filter) NULL

The source column in phenodata describes the sample status (case or control), and I would like to filter the samples based on their status.

This is my first post, and would apologize in advance for any missing information.

Thanks

ADD REPLY • link 4.7 years ago by vipulwagh31 ▴ 20

0

Entering edit mode

Try length(filter)

ADD REPLY • link 4.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks for the response Kevin.

I did try length(filter), and yet the out seems to be empty.

ADD REPLY • link 4.7 years ago by vipulwagh31 ▴ 20

0

Entering edit mode

What are the contents of g123@phenoData@data$"Source"?

ADD REPLY • link 4.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Here are the contents of the g123@phenoData@data$"Source".

[1]  CASE     CASE     CONTROL  CASE     CASE     CASE     CASE     CONTROL  CONTROL
[10]  CONTROL  CASE     CONTROL  CONTROL  CASE     CASE     CASE     CASE     CASE   
[19]  CASE     CONTROL  CONTROL  CONTROL  CONTROL  CONTROL  CASE     CONTROL  CASE   
[28]  CONTROL  CASE     CASE     CONTROL  CONTROL  CASE     CASE     CASE     CONTROL
[37]  CASE     CONTROL  CONTROL  CASE     CASE     CASE     CASE     CASE     CONTROL
[46]  CASE     CONTROL  CASE     CONTROL  CASE     CONTROL  CONTROL  CONTROL  CASE   
[55]  CASE     CONTROL  CASE     CONTROL  CASE     CONTROL  CONTROL  CONTROL  CASE   
[64]  CONTROL  CONTROL  CASE   
Levels:  CASE  CONTROL

I just noticed that this column (source) is factorised, do you think that's the reason I cannot filter out the samples fro gset?

ADD REPLY • link updated 4.7 years ago by Kevin Blighe 89k • written 4.7 years ago by vipulwagh31 ▴ 20

0

Entering edit mode

Possibly. Try encoding as characters with as.character()

ADD REPLY • link 4.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Well I did manage to convert the phenoData of g123 into character using 'lapply' function. And still I cannot filter gset based on the column source in the phenodata.

Any more suggestions.

My gset is a LumiBatch object can that be an issue?

Thanks for your response again.

ADD REPLY • link 4.7 years ago by vipulwagh31 ▴ 20