How to export subset of metadata and expression data from BioConductor GEOquery?
10.8 years ago
William ★ 5.4k

I am planning to use Bioconductor GEOquery to download a couple of microarray datasets from NCBI GEO.

Then I would like to export a subset of the metadata and the expression data to flat files that I can import elsewhere.

What I have so far is:

library(GEOquery)
library("R.utils")

geo_id <- "GSE45016"
gse <- getGEO(geo_id,GSEMatrix=FALSE)

#show metadata
Meta(gse)

#show metadata for first sample
GSMList(gse)[[1]]

#select specific field from metadata of first sample
GSMList(gse)[[1]]@header$characteristics_ch1

# Result for sample 1
[1] "tissue: normal prostate (NP) epithelial cells"

GSMList(gse)[[2]]@header$characteristics_ch1

# Result for sample 2
[1] "tissue: prostate cancer cells"   "clinical stage: clinical T4N0M1"
[3] "gleason score: GS 9"             "psa level: PSA 5477ng/ml"

As you can see, the number of key-value pairs differs between samples 1 and 2. What I would like to have is an array for every key under

@header$characteristics_ch1

holding the value, or null where the key is missing, for every sample in the GEO dataset:

key_tissue: normal prostate (NP) epithelial cells\tprostate cancer cells
key_psa_level: null\tPSA 5477ng/ml

Other metadata fields, like "title", luckily have only a single value per sample.

GSMList(gse)[[1]]@header$title = "Normal prostate"
GSMList(gse)[[2]]@header$title = "High-grade PC1"

I would like to have these in an array as well, under the key title.

My second question is how to export the expression data that is stored under every sample. I would like to stream through all the probes, get the expression values of each probe for every sample, and write them to another CSV file.

R GEO bioconductor • 10k views
10.8 years ago
Neilfws 49k

I think that the way you have chosen to read the GSE data into R has created some confusion for you.

Try this instead (note: formatting was lost here so posted as a Gist):

gse <- getGEO("GSE45016") # you want GSEMatrix = TRUE
# ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE45nnn/GSE45016/matrix/
# Found 1 file(s)
# GSE45016_series_matrix.txt.gz
# trying URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE45nnn/GSE45016/matrix/GSE45016_series_matrix.txt.gz'
# ftp data connection made, file length 1486394 bytes
# opened URL
# ==================================================
# downloaded 1.4 Mb
#
# File stored at:
# /var/folders/7j/7r8lt_3s14s12jx4r2x_qc5c0000gn/T//Rtmpq73dcI/GPL570.soft
# gse is a list of length 1 named after series matrix; let's tidy it up
gse <- gse$GSE45016_series_matrix.txt.gz
gse
# ExpressionSet (storageMode: lockedEnvironment)
# assayData: 54675 features, 11 samples
# element names: exprs
# protocolData: none
# phenoData
# sampleNames: GSM1095876 GSM1095877 ... GSM1095886 (11 total)
# varLabels: title geo_accession ... data_row_count (34 total)
# varMetadata: labelDescription
# featureData
# featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
# fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
# fvarMetadata: Column Description labelDescription
# experimentData: use 'experimentData(object)'
# Annotation: GPL570
# now get the phenotypic data (covariates etc.) using pData()
pd <- pData(gse)
names(pd)
# [1] "title" "geo_accession" "status" "submission_date" "last_update_date"
# [6] "type" "channel_count" "source_name_ch1" "organism_ch1" "characteristics_ch1"
# [11] "characteristics_ch1.1" "characteristics_ch1.2" "characteristics_ch1.3" "molecule_ch1" "extract_protocol_ch1"
# [16] "label_ch1" "label_protocol_ch1" "taxid_ch1" "hyb_protocol" "scan_protocol"
# [21] "description" "data_processing" "platform_id" "contact_name" "contact_email"
# [26] "contact_department" "contact_institute" "contact_address" "contact_city" "contact_state"
# [31] "contact_zip/postal_code" "contact_country" "supplementary_file" "data_row_count"
# sample 1 is normal tissue so does not have cancer-specific data values
pd$title
# V2 V3 V4 V5 V6 V7 V8 V9
# Normal prostate High-grade PC1 High-grade PC2 High-grade PC3 High-grade PC4 High-grade PC5 High-grade PC6 High-grade PC7
# V10 V11 V12
# High-grade PC8 High-grade PC9 High-grade PC10
# 11 Levels: High-grade PC1 High-grade PC10 High-grade PC2 High-grade PC3 High-grade PC4 High-grade PC5 High-grade PC6 ... Normal prostate
pd$characteristics_ch1
# V2 V3 V4
# tissue: normal prostate (NP) epithelial cells tissue: prostate cancer cells tissue: prostate cancer cells
# V5 V6 V7
# tissue: prostate cancer cells tissue: prostate cancer cells tissue: prostate cancer cells
# V8 V9 V10
# tissue: prostate cancer cells tissue: prostate cancer cells tissue: prostate cancer cells
# V11 V12
# tissue: prostate cancer cells tissue: prostate cancer cells
# Levels: tissue: normal prostate (NP) epithelial cells tissue: prostate cancer cells
pd$characteristics_ch1.1
# V2 V3 V4 V5
# clinical stage: clinical T4N0M1 clinical stage: clinical T4N1M1 clinical stage: clinical T2bN1M1
# V6 V7 V8 V9
# clinical stage: clinical T3aN0M1 clinical stage: clinical T3bN0M1 clinical stage: clinical T4N1M1 clinical stage: clinical T3bN1M1
# V10 V11 V12
# clinical stage: clinical T3bN0M0 clinical stage: clinical T3aN0M0 clinical stage: clinical T3aN1M0
# 10 Levels: clinical stage: clinical T2bN1M1 clinical stage: clinical T3aN0M0 ... clinical stage: clinical T4N1M1
pd$characteristics_ch1.2
# V2 V3 V4 V5 V6 V7 V8
# gleason score: GS 9 gleason score: GS 9 gleason score: GS 9 gleason score: GS 9 gleason score: GS 9 gleason score: GS 9
# V9 V10 V11 V12
# gleason score: GS 8 gleason score: GS 9 gleason score: GS 9 gleason score: GS 8
# Levels: gleason score: GS 8 gleason score: GS 9
pd$characteristics_ch1.3
# V2 V3 V4 V5 V6
# psa level: PSA 5477ng/ml psa level: PSA 4427ng/ml psa level: PSA1900ng/ml psa level: PSA 630ng/ml
# V7 V8 V9 V10 V11
# psa level: PSA 334ng/ml psa level: PSA 311ng/ml psa level: PSA 1000ng/ml psa level: PSA 275ng/ml psa level: PSA 80ng/ml
# V12
# psa level: PSA 234ng/ml
# 11 Levels: psa level: PSA 1000ng/ml psa level: PSA 234ng/ml psa level: PSA 275ng/ml psa level: PSA 311ng/ml ... psa level: PSA1900ng/ml
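
To get from these "key: value" strings to the one-array-per-key layout asked for in the question, something along the following lines might work. It is only a sketch in base R: the input list is typed in by hand from the two samples shown in the question, and `parse_characteristics` and the output file name are made up for illustration, not part of GEOquery.

```r
# Hand-typed input mirroring the first two samples shown in the question.
# Each list element holds one sample's characteristics_ch1 strings.
chars <- list(
  GSM1095876 = c("tissue: normal prostate (NP) epithelial cells"),
  GSM1095877 = c("tissue: prostate cancer cells",
                 "clinical stage: clinical T4N0M1",
                 "gleason score: GS 9",
                 "psa level: PSA 5477ng/ml")
)

# Hypothetical helper: split "key: value" strings at the first colon
# and return the values as a character vector named by key.
parse_characteristics <- function(x) {
  x <- x[!is.na(x) & nzchar(x)]
  keys <- trimws(sub(":.*$", "", x))
  values <- trimws(sub("^[^:]*:", "", x))
  setNames(values, keys)
}

parsed <- lapply(chars, parse_characteristics)
all_keys <- unique(unlist(lapply(parsed, names)))

# One row per sample, one column per key; a key missing for a sample becomes NA.
meta <- do.call(rbind, lapply(parsed, function(p) unname(p[all_keys])))
colnames(meta) <- all_keys

# Write a tab-separated file, printing NA as "null" as requested in the question.
out_file <- file.path(tempdir(), "GSE45016_characteristics.tsv")
write.table(meta, file = out_file, sep = "\t", quote = FALSE,
            na = "null", col.names = NA)
```

If you prefer one line per key (as in the question's example), write `t(meta)` instead of `meta`. With the series-matrix approach above, the same helper could be applied to each row of the `characteristics_ch1*` columns of `pd` after converting them to character.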

As for exporting the expression data:

exp <- exprs(gse)

returns a matrix with one row per probe and one column per sample; the column names are the sample names (GSM accessions).
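
Because the result is an ordinary matrix, there is no need to loop over probes to write it out. Here is a minimal sketch of the export step, assuming a probes-by-samples matrix; the small matrix below is a stand-in with placeholder numbers (not real GSE45016 values), and `export_expression` and the file name are hypothetical.

```r
# Stand-in for exprs(gse): rows are probes, columns are samples.
# The numbers are placeholders, not real expression values.
expr_mat <- matrix(
  c(7.1, 6.8,
    3.2, 3.9),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1007_s_at", "1053_at"),
                  c("GSM1095876", "GSM1095877"))
)

# Hypothetical helper: write the matrix to CSV with the probe IDs
# in an explicit first column instead of unnamed row names.
export_expression <- function(mat, path) {
  out <- data.frame(probe_id = rownames(mat), mat, check.names = FALSE)
  write.csv(out, file = path, row.names = FALSE)
  invisible(path)
}

csv_path <- file.path(tempdir(), "GSE45016_expression.csv")
export_expression(expr_mat, csv_path)

# With the real objects from above this would be, for example:
# export_expression(exprs(gse), "GSE45016_expression.csv")
# write.csv(pData(gse), "GSE45016_metadata.csv")
```

Keeping the probe IDs in a named column tends to make the file easier to import elsewhere than relying on row names.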


Hi Neilfws, how did you write this reply? First, the code is hosted on GitHub, and second, how do you prepare it there? Thanks.


Nicely done.
