Parsing an ExpressionSet for all SRA addresses
2.8 years ago
kmyers2 ▴ 60

I am using GEOquery to download the soft files for a number of experiments from NCBI GEO. For example, here is one individual experiment:

> soft <- getGEO('GSE104278', GSEMatrix=T)
> soft
ExpressionSet (storageMode: lockedEnvironment)
assayData: 0 features, 12 samples 
  element names: exprs 
protocolData: none
  sampleNames: GSM2793963 GSM2793964 ... GSM2793974 (12 total)
  varLabels: title geo_accession ... strain:ch1 (48 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 30456366 
Annotation: GPL24048

I want to extract the NCBI addresses for all SRA entries present in this expression set, so I am using the following command:

> sra <- data.frame(soft$GSE104278_series_matrix.txt.gz$relation.1)
> sra
1  SRA:
2  SRA:
3  SRA:
4  SRA:
5  SRA:
6  SRA:
7  SRA:
8  SRA:
9  SRA:
10 SRA:
11 SRA:
12 SRA:

However, because I will be doing this on several dozen to hundreds of files, I need a way to automate the step that builds the sra data frame. I'd like to figure out how to automatically refer to the first "entry" (or column, or whatever the ExpressionSet calls it), or pull out that entry to use it in the command. The entry name is different for each experiment, so typing them in one at a time is not practical.

I've tried something like this:

> sra <- data.frame(soft[1]$relation.1)

But that yields a data frame with 0 columns and 0 rows.

I've also tried these:

> data <- soft[1,]$relation.1
> data <- soft[,1]$relation.1

But both yield an error of "incorrect number of dimensions".

I'm sure there's an easy solution, but I'm just not seeing it. Any help would be greatly appreciated. Thanks!

RNA-Seq SRA R expressionSet parse • 887 views
2.8 years ago
Mark ★ 1.1k

I'm not sure exactly what you're trying to do, but I think you want to refer to the element inside the getGEO result automatically. soft in this case is a list with named objects (here, a single ExpressionSet), not a data frame.

tmp <- names(soft)
tmp
[1] "GSE104278_series_matrix.txt.gz"

If the list contains multiple items, you will get a character vector of names that you can subset with square brackets, e.g. tmp[1] (in that case use soft[[tmp[1]]] rather than soft[[tmp]]). With a single name, you can use it to pull the element out of the list like this:

data <- soft[[tmp]]$relation.1

This is equivalent to this:

data <- soft[["GSE104278_series_matrix.txt.gz"]]$relation.1

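The double square bracket notation extracts a single element from a list. Single brackets, as in soft[1], return a list of length one instead of the ExpressionSet itself, which is why soft[1]$relation.1 came back empty.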
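To make the difference between the two bracket styles concrete, here is a minimal sketch on a toy list. The name soft_demo, the accession-like element name, and the "SRA: ..." strings are made up for illustration; the inner list only stands in for the ExpressionSet, so it runs without GEOquery.

```r
# Toy stand-in for a getGEO() result: a named list with one element.
# The inner list plays the role of the ExpressionSet (hypothetical data).
soft_demo <- list(
  "GSE0000_series_matrix.txt.gz" = list(relation.1 = c("SRA: a", "SRA: b"))
)

single <- soft_demo[1]    # still a list of length 1, keeps the element name
double <- soft_demo[[1]]  # the element itself

single$relation.1               # NULL: the outer list has no "relation.1"
data.frame(single$relation.1)   # data frame with 0 columns and 0 rows
double$relation.1               # the two "SRA: ..." strings
```

The empty data frame from the single-bracket version is the same symptom described in the question.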

Putting it all together, given a getGEO object (whatever it's called):

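softGet <- function(soft) {
  # names of the series matrices held in the getGEO result
  tmp <- names(soft)
  # one data frame of relation.1 values per series matrix
  lapply(tmp, function(x) data.frame(soft[[x]]$relation.1))
}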

This should return a list of data frames, one per element of the getGEO object, each containing the relation.1 entries (the biosamples, I think) of that GSE object.
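Since the question mentions dozens to hundreds of experiments, the same idea can be wrapped in a loop over accessions. This is only a sketch under assumptions: collect_relations, its fetch argument, and the stub are hypothetical names introduced here, and the default fetch simply calls GEOquery's getGEO as in the question. The stub lets you try the logic offline before pointing it at real accessions.

```r
# Hypothetical helper: gather relation.1 from every series matrix of every
# accession. `fetch` defaults to GEOquery::getGEO; pass a stub to run offline.
collect_relations <- function(accessions,
                              fetch = function(acc) GEOquery::getGEO(acc, GSEMatrix = TRUE)) {
  out <- list()
  for (acc in accessions) {
    soft <- fetch(acc)
    for (nm in names(soft)) {
      rel <- soft[[nm]]$relation.1
      if (is.null(rel)) next  # this series has no relation.1 column
      out[[nm]] <- data.frame(accession = acc,
                              relation.1 = as.character(rel),
                              stringsAsFactors = FALSE)
    }
  }
  out
}

# Offline stub that mimics the shape of a getGEO() result (made-up values).
stub <- function(acc) {
  setNames(list(list(relation.1 = c("SRA: x", "SRA: y"))),
           paste0(acc, "_series_matrix.txt.gz"))
}

res <- collect_relations(c("GSE1", "GSE2"), fetch = stub)
names(res)
```

With real data you would call collect_relations(c("GSE104278", ...)) and leave fetch at its default. Note that relation.1 is not guaranteed to hold the SRA entry for every series, so it is worth checking a few results by eye.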


Wow! Thanks so much. That's exactly what I wanted. It makes sense and I understand better how the ExpressionSet is organized.


Glad I could help. Good luck.
