Python: fast way to get ONLY the MAIN metadata for a GSE? (without walking through thousands of underlying GSM samples, which is slow or even endless)
1
0
16 months ago
Alexander ▴ 60

Context: Each GSE page contains some metadata such as title, summary, overall design, citation, ... See e.g. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse118723

There are nice packages for getting GSE metadata, e.g. GEOparse and ffq: https://geoparse.readthedocs.io/en/latest/GEOparse.html https://pypi.org/project/ffq/

Problem: if there are many underlying samples (GSMxxxxx; 7584 in this case), the packages seem to work incredibly slowly.

Question: is there any Python way to get the main metadata for a GSE (without these tons of GSM samples) reasonably fast? Maybe there is an option in these packages not to retrieve the sample info? Or maybe there is a query to the GEO site that returns a table of GSE/title/summary/... information; that would fit my current need.

PS

Some experiments: !ffq -t GSE -o gse_single.json GSE118723 does not seem to finish in 9 hours. https://www.kaggle.com/alexandervc/play-with-ffq-package-get-info-by-gseid-etc (version 3 of that notebook).

GEOparse is also unsatisfactory: https://www.kaggle.com/alexandervc/geoparse
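
One possible form of the "query to the GEO site" asked about above: GEO's acc.cgi accepts targ, view and form parameters, and requesting targ=self with form=text should return the SOFT record of the series alone, without the per-sample records. The following is a minimal sketch under that assumption, not a tested solution for this dataset; the helper names fetch_gse_soft and parse_series_soft are made up for illustration, and the record still lists every sample accession (one "!Series_sample_id" line each).

```python
import urllib.parse
import urllib.request
from collections import defaultdict

GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def fetch_gse_soft(accession, timeout=60):
    """Download the series-only SOFT text for one GSE (needs network access)."""
    # targ=self asks for the series record only, not the GSM/GPL records.
    params = urllib.parse.urlencode(
        {"acc": accession, "targ": "self", "form": "text", "view": "brief"}
    )
    with urllib.request.urlopen(f"{GEO_ACC_URL}?{params}", timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_series_soft(soft_text):
    """Collect '!Series_<key> = <value>' lines into {key: [value, ...]}."""
    meta = defaultdict(list)
    for line in soft_text.splitlines():
        if not line.startswith("!Series_") or " = " not in line:
            continue  # skip '^SERIES = ...' and anything unexpected
        key, value = line[len("!Series_"):].split(" = ", 1)
        meta[key.strip()].append(value.strip())
    return dict(meta)


# Hypothetical usage (performs one HTTP request):
#   meta = parse_series_soft(fetch_gse_soft("GSE118723"))
#   for key in ("title", "summary", "overall_design", "pubmed_id"):
#       print(key, ":", " ".join(meta.get(key, ["<missing>"])))
```

Values are kept as lists because SOFT repeats a key when a field has several entries (for example multiple summary paragraphs).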

GEO GEOparse • 1.5k views
2
16 months ago
GenoMax 122k

Not Python but using EntrezDirect you can get:

$ esearch -db bioproject -query "GSE118723" | esummary | xtract -pattern DocumentSummary -element Project_Description
Quantification of gene expression levels at the single cell level has revealed that gene expression can vary substantially even across a population of homogeneous cells. However, it is currently unclear what genomic features control variation in gene expression levels, and whether common genetic variants may impact gene expression variation. Here, we take a genome-wide approach to identify expression variance quantitative trait loci (vQTLs). To this end, we generated single cell RNA-seq (scRNA-seq) data from induced pluripotent stem cells (iPSCs) derived from 53 Yoruba individuals. We collected data for a median of 95 cells per individual and a total of 5,447 single cells, and identified 241 mean expression QTLs (eQTLs) at 10% FDR, of which 82% replicate in bulk RNA-seq data from the same individuals. We further identified 14 vQTLs at 10% FDR, but demonstrate that these can also be explained as effects on mean expression. Our study suggests that dispersion QTLs (dQTLs), which could alter the variance of expression independently of the mean, have systematically smaller effect sizes than eQTLs. We estimate that at least 300 cells per individual and 400 individuals would be required to have modest power to detect the strongest dQTLs in iPSCs. These results will guide the design of future studies on understanding the genetic control of gene expression variance. Overall design: The goal of our study was to identify quantitative trait loci associated with gene expression variance across cells (vQTLs). Using the Fluidigm C1 platform, we isolated and collected scRNA-seq from 7,585 single cells from induced pluripotent stem cell (iPSC) lines of 54 Yoruba in Ibadan, Nigeria (YRI) individuals. We used unique molecular identifiers (UMIs) to tag RNA molecules and account for amplification bias in the single cell data (Islam et al., 2014).
To estimate technical confounding effects without requiring separate technical replicates, we used a mixed-individual plate study design. The key idea of this approach is that having observations from the same individual under different confounding effects and observations from different individuals under the same confounding effect allows us to distinguish the two sources of variation (Tung et al., 2017).

As for the samples you can do something like:

$ esearch -db bioproject -query "GSE118723" | elink -target biosample | efetch | head -20
1: 11032017-C12-NA19226
Identifiers: BioSample: SAMN09855354; SRA: SRS3686681; GEO: GSM3341993
Organism: Homo sapiens
Attributes:
/source name="LCL-derived iPSC"
/experiment="11032017"
/well="C12"
/individual="NA19226"
/batch="b4"
Accession: SAMN09855354 ID: 9855354

2: 11062017-E01-NA19099
Identifiers: BioSample: SAMN09858071; SRA: SRS3689110; GEO: GSM3342102
Organism: Homo sapiens
Attributes:
/source name="LCL-derived iPSC"
/experiment="11062017"
/well="E01"
/individual="NA19099"
/batch="b4"
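
For readers who want the first lookup from Python without installing EntrezDirect, the esearch | esummary pipeline above can be approximated with two plain HTTP calls to the NCBI E-utilities. This is only a sketch under assumptions: the function names are hypothetical, the esummary XML is assumed to contain DocumentSummary elements with a Project_Description child (as the xtract pattern above suggests), and neither an API key nor rate limiting is handled.

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def _get(endpoint, **params):
    """Issue one E-utilities GET request and return the raw response bytes."""
    url = f"{EUTILS}/{endpoint}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def bioproject_ids(query):
    """esearch step: return the BioProject UIDs matching the query."""
    raw = _get("esearch.fcgi", db="bioproject", term=query, retmode="json")
    return json.loads(raw)["esearchresult"]["idlist"]


def project_descriptions(summary_xml):
    """Return the Project_Description text of every DocumentSummary element."""
    root = ET.fromstring(summary_xml)
    return [
        (doc.findtext("Project_Description") or "").strip()
        for doc in root.iter("DocumentSummary")
    ]


def describe(query):
    """esearch + esummary, mirroring the EntrezDirect pipeline (needs network)."""
    ids = bioproject_ids(query)
    if not ids:
        return []
    xml_bytes = _get(
        "esummary.fcgi", db="bioproject", id=",".join(ids), version="2.0"
    )
    return project_descriptions(xml_bytes)


# Hypothetical usage (performs two HTTP requests):
#   for text in describe("GSE118723"):
#       print(text)
```

Only the project-level summary is requested here, so the number of samples in the series should not affect how long the call takes.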

0

Thank you very much! Is there any way to run it from a Jupyter-like environment? It seems it can be installed, but then I get: "/bin/bash: esearch: command not found"

1

That may simply be a $PATH problem. Find out where esearch (and the other programs) are and then add that directory to your $PATH (export PATH=$PATH:/dir_with_entrezdirect_progs). You can also install using conda.

0

Thank you very much! Indeed it seems to be a problem with PATH, but I cannot resolve it. I am trying !export PATH=/root/edirect/:$PATH

But it does not seem to change PATH. https://www.kaggle.com/alexandervc/entrezdirect?scriptVersionId=70892045&cellId=17

Conda install also does not seem to work properly on Kaggle: https://www.kaggle.com/alexandervc/entrezdirect?scriptVersionId=70892045&cellId=3

1

Is /root/edirect an actual directory? Are you able to do ls -l /root/edirect and get a listing? I assume the ! is because of the kaggle env that you are using, but the command looks correct and yet it is not modifying the $PATH. Not sure if you need sudo access to make changes.

0

It seems '/root/edirect/' is a directory. Both !ls /root/edirect/ and os.listdir('/root/edirect/') work: https://www.kaggle.com/alexandervc/entrezdirect?scriptVersionId=70900026&cellId=13

1

It looks like each command in that pipe is being executed in a separate shell and that is why you are losing the $PATH setting for those sub shells. Can you find a proper unix shell somewhere for this?
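
The sub-shell effect suspected here can be reproduced in plain Python. This is a small sketch, assuming a POSIX shell is available and that the notebook's ! escape behaves like a fresh shell per call, which is what the thread suggests; DEMO_VAR is a made-up variable name. A variable exported inside one shell call disappears when that shell exits, while a change to os.environ is inherited by every child process started afterwards.

```python
import os
import subprocess
import sys

# Child program that reports DEMO_VAR exactly as a newly started process sees it.
CHILD = [sys.executable, "-c", "import os; print(os.environ.get('DEMO_VAR', 'unset'))"]


def value_seen_by_child():
    """Start a fresh child process and return its view of DEMO_VAR."""
    done = subprocess.run(CHILD, capture_output=True, text=True, check=True)
    return done.stdout.strip()


# Start from a clean state so the demonstration is reproducible.
os.environ.pop("DEMO_VAR", None)

# 1) 'export' runs inside a throwaway shell; the setting dies with that shell.
subprocess.run("export DEMO_VAR=from_shell", shell=True, check=True)
after_export = value_seen_by_child()

# 2) os.environ changes the current (notebook) process itself,
#    so children started later inherit the new value.
os.environ["DEMO_VAR"] = "from_python"
after_environ = value_seen_by_child()

print(after_export, after_environ)
```

The same reasoning applies to PATH: exporting it in one shell call cannot affect later, separate shell calls.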

0

Thank you! "each command in that pipe is being executed in a separate shell" - very interesting idea, I could never have imagined it. Do you know whether it is the same in Jupyter notebooks? Colab? "Can you find a proper unix shell somewhere for this?" - any suggestions?

0

If it is a separate shell, then it is indeed one per command and not even one per notebook cell, since putting the commands into a single notebook cell does not change anything: https://www.kaggle.com/alexandervc/entrezdirect?scriptVersionId=70919686&cellId=22

1

A separate shell is NOT desired for each command, since the output of the first command is passed to the second and then to the third. I just speculated that your kaggle environment is possibly using a sub-shell, based on the error you are seeing.

I was referring to a unix shell that you can get via a proper unix machine, a virtual machine running linux/unix or even Windows Subsystem for Linux (WSL) on Windows.

0

Finally, it seems to work even on Kaggle! There is a way to change PATH on Kaggle that works:

os.environ['PATH'] = "/kaggle/working:" + os.environ['PATH']

So here is an example of how to use EntrezDirect:

https://www.kaggle.com/alexandervc/entrezdirect?scriptVersionId=71645515
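
The PATH fix above can be wrapped so that a pipeline is only launched once the tool is actually found. This is a hedged sketch rather than the code from the linked notebook: add_to_path and run_pipeline are hypothetical helper names, a POSIX shell is assumed, and the directory must be whichever one holds the EntrezDirect programs on your system.

```python
import os
import shutil
import subprocess


def add_to_path(directory):
    """Prepend a directory to PATH for this process and its future children."""
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        os.environ["PATH"] = os.pathsep.join([directory] + parts)
    return os.environ["PATH"]


def run_pipeline(command, required_tool="esearch"):
    """Run a shell pipeline after checking that the required tool is on PATH."""
    if shutil.which(required_tool) is None:
        raise FileNotFoundError(f"{required_tool} not found on PATH")
    # shell=True lets the shell handle the '|' between the programs;
    # check=True only reflects the exit status of the last command in the pipe.
    done = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return done.stdout


# Hypothetical usage, with the directory from the post above:
#   add_to_path("/kaggle/working")
#   print(run_pipeline(
#       'esearch -db bioproject -query "GSE118723" | esummary '
#       '| xtract -pattern DocumentSummary -element Project_Description'
#   ))
```

Checking with shutil.which first turns the "command not found" situation seen earlier in the thread into an explicit Python exception.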