Question: Finding and downloading many Gene Expression Matrices from GEO?
7 weeks ago by
breckuh10 wrote:

We are setting up a pipeline that takes in single-cell Gene Expression Matrices (GEMs), runs them through a series of preprocessing steps, and then trains machine learning models to generate classifiers for the label(s) on those datasets.

We'd like to build a collection of 1k datasets to test our pipeline against (~1% of GEO's GSE collection--the number could vary depending on submitted scRNAseq experiments with labels).

We are using the Bioconductor packages GEOquery and GEOmetadb. So far it has been hard to figure out which GSEs (series) include GEMs. Some do, some don't. Some just have links to GSMs (samples). I wonder if I'm doing something dumb, or if most GSEs don't include GEMs?

Maybe someone with more experience using GEO could have some advice?

rna-seq next-gen • 141 views
modified 7 weeks ago by RamRS19k • written 7 weeks ago by breckuh10
7 weeks ago by
Kevin Blighe31k
Republic of Ireland
Kevin Blighe31k wrote:

Sounds like a neat project.

Note that the Series Matrix Files, which virtually always contain the normalised expression data, may not always be listed on the GEO accession homepage; however, they can still be downloaded via GEOquery.

The easiest approach would be to compile your list of GEO datasets of interest and then download each one via:

library(GEOquery)
gset <- getGEO("GSE31432", GSEMatrix = TRUE, getGPL = FALSE)

For the vast majority of datasets, the normalised expression data can then be readily accessed with:

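# getGEO returns one ExpressionSet per platform; pick the one of interest (GPL6947 in this example)
if (length(gset) > 1) idx <- grep("GPL6947", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
expr <- exprs(gset)  # expression matrix: one row per feature, one column per sample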
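
For a long list of accessions, the download step could be wrapped so that one failing series does not stop the whole run. The following is only a minimal sketch: `fetch_many` and its `fetch` argument are hypothetical names introduced here for illustration, and the default fetcher simply reuses the `getGEO` call shown above.

```r
# Hypothetical helper (not part of GEOquery): try each accession in turn,
# log the ones that fail, and return only the successful downloads.
fetch_many <- function(accessions,
                       fetch = function(acc) {
                         GEOquery::getGEO(acc, GSEMatrix = TRUE, getGPL = FALSE)
                       }) {
  results <- lapply(accessions, function(acc) {
    tryCatch(
      fetch(acc),
      error = function(e) {
        message("Skipping ", acc, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  names(results) <- accessions
  # Drop accessions whose download or parsing failed
  Filter(Negate(is.null), results)
}
```

Calling `fetch_many(c("GSE31432"))` would then return a named list with one entry per accession that downloaded successfully. Passing the fetcher as an argument is a design choice that lets the loop be tried out with a stand-in function before hitting GEO.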

You should prepare a list of accession IDs and aim to match them on various attributes in order to reduce bias in your analysis. There are many answers on Biostars about accessing GEO data. Seán has worked a lot on these, but he's now more active on the Bioconductor forum.

Finally, I'm a skeptic about machine learning (ML). Whatever predictive power your algorithm eventually achieves, I guarantee you that I could do better via non-ML based algorithms :)


written 7 weeks ago by Kevin Blighe31k

Thanks Kevin! What I found was that for single cell experiments, the common practice was to store the matrices in the supplementary files section of a GSE entry. I was able to download and extract many thousands of matrices. More cleaning and careful mapping work to do, but looking forward to getting some results soon, and training some DL models to best other algorithms :).
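
As a rough illustration of locating those supplementary files (a sketch only, not necessarily how it was done here): GEO's file server groups series into directories in which the last three digits of the accession are replaced by "nnn". The helpers `geo_suppl_url` and `looks_like_matrix` below are hypothetical, and the filename pattern is an assumed heuristic rather than any GEO convention.

```r
# Hypothetical helper: build the supplementary-files URL for a GSE accession.
geo_suppl_url <- function(gse) {
  stopifnot(grepl("^GSE[0-9]+$", gse))
  digits <- sub("^GSE", "", gse)
  # e.g. GSE31432 -> GSE31nnn; accessions with three or fewer digits -> GSEnnn
  stub <- if (nchar(digits) > 3) {
    paste0("GSE", substr(digits, 1, nchar(digits) - 3), "nnn")
  } else {
    "GSEnnn"
  }
  sprintf("https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/suppl/", stub, gse)
}

# Hypothetical heuristic: flag file names that look like expression matrices.
# The keywords and extensions are assumptions and will miss some files.
looks_like_matrix <- function(files) {
  grepl("(count|matrix|expr|tpm|fpkm|umi).*\\.(txt|tsv|csv|mtx|h5)(\\.gz)?$",
        files, ignore.case = TRUE)
}
```

GEOquery also provides `getGEOSuppFiles()`, which downloads the supplementary files for a given accession, so the URL helper is mainly useful for inspecting what a series offers before fetching anything.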

written 27 days ago by breckuh10

Great, hope that it all goes well!

written 27 days ago by Kevin Blighe31k