Question

Parsing an ExpressionSet for all SRA addresses

4.5 years ago

kmyers2 ▴ 80

I am using GEOquery to download the soft files for a number of experiments from NCBI GEO. For example, here is one individual experiment:

> soft <- getGEO('GSE104278', GSEMatrix=T)
> soft
$GSE104278_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 0 features, 12 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM2793963 GSM2793964 ... GSM2793974 (12 total)
  varLabels: title geo_accession ... strain:ch1 (48 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 30456366 
Annotation: GPL24048

I want to extract the NCBI addresses for all SRA entries present in this expression set, so I am using the following command:

> sra <- data.frame(soft$GSE104278_series_matrix.txt.gz$relation.1)
> sra
          soft.GSE104278_series_matrix.txt.gz.relation.1
1  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217300
2  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217301
3  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217302
4  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217303
5  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217304
6  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217305
7  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217306
8  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217307
9  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217308
10 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217309
11 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217310
12 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217311

However, because I will be doing this on several dozen to hundreds of files, I need a way to automate the sra data frame building step. I'd like to figure out how to automatically call the first "entry" (or column, or whatever the ExpressionSet calls it), or pull out that entry, so I can use it in the command. Each experiment will have a different entry name, so typing them in one at a time is not practical.

I've tried something like this:

> sra <- data.frame(soft[1]$relation.1)

But that yields a data frame with 0 columns and 0 rows.

I've tried this:

> data <- soft[1,]$relation.1

or

> data <- soft[,1]$relation.1

But that yields an error of "incorrect number of dimensions".

I'm sure there's an easy solution, but I'm just not seeing it. Any help would be greatly appreciated. Thanks!

RNA-Seq SRA R expressionSet parse • 1.3k views

updated 4.5 years ago by Mark ★ 1.5k • written 4.5 years ago by kmyers2 ▴ 80

score 3 · Accepted Answer · 2019-10-30

I'm not sure exactly what you're trying to do, but I think you want to pull the relation.1 variable out of each element automatically. soft in this case is not a data frame but a named list of ExpressionSet objects, one per series matrix file:

tmp <- names(soft)
tmp
[1] "GSE104278_series_matrix.txt.gz"

If the list contains multiple items, names(soft) gives you a character vector that you can subset using single square brackets, e.g. tmp[1]. You can use that name to pull the matching ExpressionSet out of the list and read its relation.1 variable like this:

data <- soft[[tmp[1]]]$relation.1

This is equivalent to this:

data <- soft[["GSE104278_series_matrix.txt.gz"]]$relation.1

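The double square bracket notation extracts a single element of a list itself, whereas single square brackets return a sub-list that still wraps it. That is why soft[1]$relation.1 came back empty: soft[1] is a one-element list, not the ExpressionSet. soft[1,] and soft[,1] fail with "incorrect number of dimensions" because a list has no rows or columns.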
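
A minimal sketch of the difference between the two bracket forms, using a toy named list in place of a real getGEO result (the names toy and series_a are made up for illustration):

```r
# Toy stand-in for a getGEO() result: a named list with one element.
toy <- list(series_a = list(relation.1 = c("SRA: x", "SRA: y")))

sub_list <- toy[1]     # single brackets: still a list of length 1
element  <- toy[[1]]   # double brackets: the element itself

class(sub_list)        # a list, so $relation.1 is looked up on the wrapper
sub_list$relation.1    # NULL, because the wrapper has no such element
element$relation.1     # the character vector stored in the element
```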

Putting it all together, given a getGEO object (whatever it's called):

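softGet <- function(soft) {
  # Loop over the list itself so the result keeps the series matrix names
  lapply(soft, function(eset) {
    # One data frame per ExpressionSet, holding its relation.1 values
    data.frame(relation.1 = eset$relation.1)
  })
}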

This should return a list of data frames, one per series matrix in the GSE object, each containing the relation.1 values (the SRA links in your example).
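
If you only need the URLs or SRX accessions rather than the raw strings, something along these lines may help. It is a sketch in base R: parse_sra is a hypothetical helper, not part of GEOquery, and it assumes the values look like the ones printed in the question ("SRA: " followed by a URL ending in term= and the accession).

```r
# Example values copied from the output shown in the question
relation <- c(
  "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217300",
  "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217301"
)

# Hypothetical helper: split "SRA: <url>" strings into URL and accession.
parse_sra <- function(x) {
  x <- as.character(x)
  x <- x[grepl("^SRA: ", x)]              # keep only entries tagged as SRA
  data.frame(
    url       = sub("^SRA: ", "", x),     # drop the "SRA: " prefix
    accession = sub("^.*term=", "", x),   # keep what follows "term="
    stringsAsFactors = FALSE
  )
}

sra <- parse_sra(relation)
```

With a real getGEO result, lapply(soft, function(eset) parse_sra(eset$relation.1)) should give one such data frame per series matrix, assuming relation.1 is the variable that holds the SRA links in every series, which is worth checking for each experiment.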