How to automatically distinguish single-cell vs. bulk GEO datasets (in the ARCHS4 collection or in general)?
7 days ago
Alexander ▴ 20

Beauty: ARCHS4 by A. Lachmann is a project that gives easy access to many gene expression datasets from the GEO database (Nature paper, 2018).

Pain: it is easy to extract a dataset by GSE ID, but it is not clear to me (or to some of my colleagues) how to tell whether it is a single-cell or a bulk expression dataset.

Is there any way to do this automatically? Maybe there is a list of GSE IDs of single-cell datasets somewhere on the internet?
(Then I could check against this list and take from ARCHS4 only those IDs that are in it.)

Or maybe there is an easy way to parse the GEO web page for a GSE ID, with some fields indicating whether the data are single-cell or bulk?
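
To make the page-parsing idea concrete, here is a minimal Python sketch, not a tested pipeline. It assumes GEO's plain-text (SOFT) view of a series, requested with `targ=self&form=text&view=brief`, and scans the title, summary and overall-design fields for single-cell keywords. The function names and the keyword list are made up for illustration, and a keyword hit is only a hint, not a reliable label.

```python
import re
import urllib.request

# Hypothetical keyword list (an assumption); extend it for your own data.
SC_PATTERN = re.compile(
    r"single[- ]cell|scRNA|10x genomics|drop-seq|smart-seq", re.IGNORECASE
)

# GEO query URL returning the series record as plain SOFT text.
GEO_URL = (
    "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"
    "?acc={acc}&targ=self&form=text&view=brief"
)


def fetch_series_text(gse_id):
    """Download the brief SOFT-format record of a GEO series as text."""
    with urllib.request.urlopen(GEO_URL.format(acc=gse_id), timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def single_cell_hints(soft_text):
    """Return title/summary/design lines that mention a single-cell keyword."""
    fields = ("!Series_title", "!Series_summary", "!Series_overall_design")
    return [
        line
        for line in soft_text.splitlines()
        if line.startswith(fields) and SC_PATTERN.search(line)
    ]


# Usage (needs network access):
# hints = single_cell_hints(fetch_series_text("GSE85917"))
# An empty list means "no keyword found", not "this is bulk".
```

Mixed series, with both single-cell and bulk samples, would still match here, so the per-sample records need a separate look.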

Or is there some other trick?

scRNAseq • 310 views

There is no automated way to the best of my knowledge, at least no built-in NCBI function, but others may prove me wrong. Can you link a relevant accession? Then we can try to point out some features that may help.


Thank you for your remark! We are looking at the ARCHS4 collection: it has about 300 datasets with more than 100 samples for human and about the same number for mouse. We want to benchmark some of our algorithms on single-cell datasets ONLY, and it is not clear how to distinguish single-cell from bulk without much pain. Checking 300+300 datasets manually is rather unpleasant. Some information on these datasets can be found e.g. here (scroll up a few lines above the linked place): https://www.kaggle.com/alexandervc/archs4-extract-datasets-by-gse-and-show-info?scriptVersionId=68318866&cellId=16 This is the metadata included in ARCHS4.


PS: an additional problem is that some datasets, such as https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85917, contain BOTH single-cell (about 365 records) and bulk (about 49 records) samples, and ARCHS4 stores both the single-cell and the bulk data.


esearch -db sra -query PRJNA339754 | efetch -format runinfo


gets you

Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash


Which of these are single-cell and which are bulk? I don't see a distinguishing feature in this output, so please point one out so that I can check. One example of a single-cell run is fine.
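
To check whether any column separates the two groups, the runinfo table can be tallied column by column. The Python sketch below is only an illustration, and `tally_column` is a hypothetical helper name. It assumes the runinfo output was saved as CSV text, and it skips the repeated header rows that can appear in that output. As far as I know, newer SRA submissions may use `TRANSCRIPTOMIC SINGLE CELL` as `LibrarySource`, but older ones often have plain `TRANSCRIPTOMIC` for both kinds, so a missing label proves nothing.

```python
import csv
import io
from collections import Counter


def tally_column(runinfo_csv, column="LibrarySource"):
    """Count runs per value of one column of an SRA runinfo CSV (given as text)."""
    reader = csv.DictReader(io.StringIO(runinfo_csv))
    return Counter(
        row[column]
        for row in reader
        # Skip empty values and header rows repeated inside the output.
        if row.get(column) and row[column] != column
    )


# Usage, assuming the command output was redirected to runinfo.csv:
# with open("runinfo.csv") as handle:
#     print(tally_column(handle.read(), "LibrarySource"))
```

If every run ends up under the same value, the column carries no single-cell/bulk information for that project, and sample titles or descriptions have to be used instead.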


............

GSM2287748 P24_H9b3s_095

GSM2287749 P24_H9b3s_096

GSM2287750 P48_H1_bulk_249

GSM2287751 P48_H1_bulk_250

.....

So the last 48 GSMs are bulk and the others are single-cell, which corresponds to the paper.
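
For this particular series, the split can be automated from the sample titles. The Python sketch below rests on one assumption that holds only for this naming scheme: a title containing `bulk` as a separate token marks a bulk sample, and everything else is treated as single-cell. `label_samples` is a hypothetical helper, and the four titles are the ones listed above.

```python
import re

# "bulk" as its own token, delimited by "_", another non-word character,
# or the start/end of the title (so "bulky" would not match).
BULK_TOKEN = re.compile(r"(^|[_\W])bulk([_\W]|$)", re.IGNORECASE)


def label_samples(titles):
    """Map each GSM ID to 'bulk' or 'single' based on its sample title."""
    return {
        gsm: ("bulk" if BULK_TOKEN.search(title) else "single")
        for gsm, title in titles.items()
    }


titles = {
    "GSM2287748": "P24_H9b3s_095",
    "GSM2287749": "P24_H9b3s_096",
    "GSM2287750": "P48_H1_bulk_249",
    "GSM2287751": "P48_H1_bulk_250",
}
labels = label_samples(titles)
```

Other series use different naming conventions, so the rule has to be checked against the corresponding paper before it is trusted.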