Parsing an ExpressionSet for all SRA addresses
2.8 years ago
kmyers2 ▴ 60

I am using GEOquery to download the soft files for a number of experiments from NCBI GEO. For example, here is one individual experiment:

> soft <- getGEO('GSE104278', GSEMatrix=T)
> soft
$GSE104278_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 0 features, 12 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: GSM2793963 GSM2793964 ... GSM2793974 (12 total)
  varLabels: title geo_accession ... strain:ch1 (48 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 30456366
Annotation: GPL24048

I want to extract the NCBI addresses for all SRA entries present in this expression set, so I am using the following command:

> sra <- data.frame(soft$GSE104278_series_matrix.txt.gz$relation.1)
> sra
   soft.GSE104278_series_matrix.txt.gz.relation.1
1  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217300
2  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217301
3  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217302
4  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217303
5  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217304
6  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217305
7  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217306
8  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217307
9  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217308
10 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217309
11 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217310
12 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217311

However, because I will be doing this on several dozen to hundreds of files, I need a way to automate the step that builds the sra data frame. I'd like to figure out how to automatically call the first "entry" (or column, or whatever the ExpressionSet calls it), or pull out that entry to use it in the command. I'll need a new entry for each experiment I use, so typing them in one at a time is not practical.

I've tried something like this:

> sra <- data.frame(soft[1]$relation.1)


But that yields a data frame with 0 columns and 0 rows.

I've tried this:

> data <- soft[1,]$relation.1

or

> data <- soft[,1]$relation.1


But that yields an error of "incorrect number of dimensions".

I'm sure there's an easy solution, but I'm just not seeing it. Any help would be greatly appreciated. Thanks!

RNA-Seq SRA R expressionSet parse • 887 views
2.8 years ago
Mark ★ 1.1k

I'm not sure exactly what you're trying to do, but I think you want to look up the element inside soft automatically rather than typing its name. soft in this case is a named list whose elements are ExpressionSet objects.

tmp <- names(soft)
tmp
[1] "GSE104278_series_matrix.txt.gz"


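If the list contains multiple items, then you will get a vector of names that you can subset using the square bracket notation: tmp[1]. You can use this character string to pull the matching element out of the list like this (if there are several names, pass a single one such as tmp[1] rather than the whole vector):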

data <- soft[[tmp]]$relation.1

This is equivalent to this:

data <- soft[["GSE104278_series_matrix.txt.gz"]]$relation.1


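The double square bracket notation extracts a single element from a list. Single brackets, as in soft[1], return a list of length one that still wraps the ExpressionSet, which is why soft[1]$relation.1 gave you an empty data frame. A list also has no rows or columns, hence the "incorrect number of dimensions" error from soft[1,] and soft[,1].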
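To see the difference concretely, here is a minimal sketch in base R. It assumes a plain list called soft_mock (a made-up name) holding an ordinary data frame in place of the real ExpressionSet, so it runs without GEOquery; the two URLs are copied from the question.

```r
# soft_mock is a hypothetical stand-in for the getGEO() result.
# Its single element is a plain data frame, not a real ExpressionSet.
soft_mock <- list(
  "GSE104278_series_matrix.txt.gz" = data.frame(
    relation.1 = c(
      "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217300",
      "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217301"
    ),
    stringsAsFactors = FALSE
  )
)

# Single brackets keep the list wrapper, so there is no relation.1 to find
class(soft_mock[1])                # "list"
is.null(soft_mock[1]$relation.1)   # TRUE

# Double brackets return the element itself
class(soft_mock[[1]])              # "data.frame"
soft_mock[[1]]$relation.1
```

Calling data.frame() on the NULL from the single-bracket version is what produces the data frame with 0 columns and 0 rows.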

Putting it all together, given a getGEO object (whatever it's called):

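softGet <- function(soft) {
  # soft is the named list of ExpressionSets returned by getGEO()
  tmp <- names(soft)
  # build one data frame of relation.1 entries per list element
  lapply(tmp, function(x) {
    data.frame(soft[[x]]$relation.1)
  })
}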


This should return a list of data frames, one per element of the getGEO result, each holding that element's relation.1 entries (the SRA links in your example).
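If you want the bare addresses rather than the "SRA: ..." strings, something along these lines might help. It is only a sketch: extract_sra is a hypothetical helper, it assumes the SRA links sit in one of the columns whose name starts with "relation" (relation.1 in the question, but that may differ between series), and it is demonstrated on a mock data frame with a made-up placeholder in the relation column.

```r
# Hypothetical helper: pull SRA URLs out of a phenotype data frame.
# With a real ExpressionSet `eset`, `pheno` would be Biobase::pData(eset).
extract_sra <- function(pheno) {
  # Consider every column whose name starts with "relation"
  rel_cols <- grep("^relation", names(pheno), value = TRUE)
  values <- unlist(lapply(pheno[rel_cols], as.character), use.names = FALSE)
  # Keep only the SRA entries and strip the "SRA: " prefix
  sra <- values[grepl("^SRA: ", values)]
  sub("^SRA: ", "", sra)
}

# Mock phenotype data; the relation values are placeholders, and the
# relation.1 values are copied from the question.
pheno <- data.frame(
  relation = c("BioSample: placeholder-1", "BioSample: placeholder-2"),
  relation.1 = c(
    "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217300",
    "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3217301"
  ),
  stringsAsFactors = FALSE
)

urls <- extract_sra(pheno)
urls

# On a real getGEO() result this could be applied per element, e.g.
# lapply(soft, function(eset) extract_sra(Biobase::pData(eset)))
```

Searching all relation columns instead of hard-coding relation.1 is a design choice for the case where the SRA link is not always in the same column; if it always is, the original softGet approach is simpler.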


Wow! Thanks so much. That's exactly what I wanted. It makes sense and I understand better how the ExpressionSet is organized.


Glad I could help. Good luck.