Question: Finding and downloading many Gene Expression Matrices from GEO?
7 months ago, breckuh wrote:

We are setting up a pipeline that takes in single-cell Gene Expression Matrices (GEMs), runs them through a series of preprocessing steps, and then trains machine learning models to generate classifiers for the label(s) on those datasets.

We'd like to build a collection of 1k datasets to test our pipeline against (~1% of GEO's GSE collection--the number could vary depending on submitted scRNAseq experiments with labels).

We are using the Bioconductor packages GEOquery and GEOmetadb. So far it's hard to figure out which GSEs have GEMs. Some do, some don't. Some just have links to GSMs. I wonder if I'm doing something dumb, or if most GSEs don't include GEMs?

Maybe someone with more experience using GEO could have some advice?

Tags: rna-seq, next-gen
Modified 7 months ago by RamRS.
7 months ago, Kevin Blighe (Guy's Hospital, London) wrote:

Sounds like a neat project.

Note that the Series Matrix Files, which virtually always contain the normalised expression data, may not always be listed on the GEO accession homepage; however, they can still be downloaded via GEOquery.

The easiest way to carry out your work would be to compile your list of GEO datasets of interest and then download each one via:

gset <- getGEO("GSE31432", GSEMatrix = TRUE, getGPL = FALSE)

For the vast majority of datasets, the ExpressionSet holding the normalised expression data can then be selected with:

if (length(gset) > 1) idx <- grep("GPL6947", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
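
The selection step above can be wrapped so that it does not depend on a hard-coded platform ID. What follows is only a minimal sketch, not part of GEOquery: `pick_platform_index` and `fetch_expression_matrix` are hypothetical helper names, and falling back to the first element when the requested platform is not found is an assumption about the behaviour you would want. `Biobase::exprs()` is then used to pull the features-by-samples matrix out of the ExpressionSet.

```r
# Pick the list element whose name mentions the requested platform (e.g. "GPL6947").
# Falls back to the first element if there is only one, no platform is given,
# or nothing matches (assumed behaviour, adjust to taste).
pick_platform_index <- function(element_names, gpl = NULL) {
  if (length(element_names) <= 1 || is.null(gpl)) return(1L)
  hits <- grep(gpl, element_names, fixed = TRUE)
  if (length(hits) == 0) 1L else hits[1]
}

# Hypothetical wrapper: download a series and return its expression matrix.
# Needs the GEOquery and Biobase packages and network access when called.
fetch_expression_matrix <- function(accession, gpl = NULL) {
  gset <- GEOquery::getGEO(accession, GSEMatrix = TRUE, getGPL = FALSE)
  eset <- gset[[pick_platform_index(names(gset), gpl)]]
  Biobase::exprs(eset)  # numeric matrix, features in rows, samples in columns
}
```

It could be called as `fetch_expression_matrix("GSE31432", gpl = "GPL6947")`. For sequencing-based series the matrix returned this way may have zero rows, so it is worth checking `nrow()` before passing it on.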

You should prepare a list of accession IDs and aim to match them on various attributes in order to reduce bias in your analysis. There are many answers on Biostars about accessing GEO data. Seán has worked a lot on these, but he's now more active on the Bioconductor forum.

Finally, I'm a skeptic about machine learning (ML). Whatever predictive power your algorithm eventually achieves, I guarantee you that I could do better via non-ML based algorithms :)



Thanks Kevin! What I found was that for single cell experiments, the common practice was to store the matrices in the supplementary files section of a GSE entry. I was able to download and extract many thousands of matrices. More cleaning and careful mapping work to do, but looking forward to getting some results soon, and training some DL models to best other algorithms :).
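
For anyone following the same route, here is a rough sketch of how supplementary files could be screened before downloading. `likely_matrix_files` and `fetch_matrix_candidates` are hypothetical helpers, the file-name pattern is only a guess at common naming habits, and the `fetch_files = FALSE` listing mode (with its `fname` and `url` columns) assumes a reasonably recent GEOquery.

```r
# Heuristic (an assumption): keep names that mention a matrix-like keyword
# and end in a tabular or sparse-matrix extension, optionally gzipped.
likely_matrix_files <- function(file_names) {
  pattern <- "(count|matrix|expression|tpm|fpkm|umi).*\\.(txt|csv|tsv|mtx|h5)(\\.gz)?$"
  file_names[grepl(pattern, file_names, ignore.case = TRUE)]
}

# Hypothetical wrapper: list a series' supplementary files without fetching
# them, then download only those that look like expression matrices.
# Needs the GEOquery package and network access when called.
fetch_matrix_candidates <- function(accession, dest_dir = tempdir()) {
  listing <- GEOquery::getGEOSuppFiles(accession, fetch_files = FALSE)
  if (is.null(listing) || nrow(listing) == 0) return(character(0))
  fnames <- as.character(listing$fname)
  keep <- fnames %in% likely_matrix_files(fnames)
  dest <- file.path(dest_dir, fnames[keep])
  urls <- as.character(listing$url[keep])
  for (i in seq_along(dest)) {
    utils::download.file(urls[i], dest[i], mode = "wb")
  }
  dest  # paths of the downloaded candidate files
}
```

The name filter is deliberately loose; archives such as `*_RAW.tar` are skipped here and would need separate unpacking and inspection.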

Reply written 6 months ago by breckuh

Great, hope that it all goes well!

Reply written 6 months ago by Kevin Blighe

