Question

How to automatically distinguish single-cell vs. bulk GEO datasets (in the ARCHS4 collection or in general)?

2.8 years ago

Alexander ▴ 220

Beauty: ARCHS4 by A. Lachmann is a project that gives easy access to many gene expression datasets from the GEO database (Nature Communications paper, 2018).

Pain: it is easy to extract a dataset by GSE id, but it is not clear to me (or to some of my colleagues) how to tell whether it is a single-cell or a bulk expression dataset.

Is there any way to do this automatically? Maybe there is a list of GSE ids of single-cell datasets somewhere on the internet?
(Then I could check against this list and take from ARCHS4 only the ids that are in it.)

Or maybe there is an easy way to parse the GEO web page for a given GSE id, where some fields would say whether the data are single-cell or bulk?

Or some other trick?
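
On the idea of parsing the GEO record by GSE id: GEO's `acc.cgi` page can return a plain-text (SOFT) version of a series when called with `targ=self&form=text&view=brief`, so one rough option is to scan the series title, summary and overall design for single-cell keywords. Below is a minimal Python sketch of that idea. The keyword list, the function names and the two example records are made up for illustration; a keyword hit is only a hint, and series that mix single-cell and bulk samples would still need a per-sample check.

```python
import re
import urllib.request

# Assumption: an illustrative, non-exhaustive keyword list.
SINGLE_CELL_PATTERN = re.compile(
    r"single[\s-]?cell|single[\s-]?nucle|scRNA|snRNA|drop-?seq|smart-?seq",
    re.IGNORECASE,
)
# Series-level SOFT fields that usually describe the experiment.
TEXT_FIELDS = ("Series_title", "Series_summary", "Series_overall_design")


def fetch_series_soft(gse_id):
    """Download the brief plain-text (SOFT) record of one series from GEO."""
    url = (
        "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"
        f"?acc={gse_id}&targ=self&form=text&view=brief"
    )
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_soft_fields(soft_text):
    """Collect '!key = value' lines into a dict of key -> list of values."""
    fields = {}
    for line in soft_text.splitlines():
        if line.startswith("!") and " = " in line:
            key, value = line[1:].split(" = ", 1)
            fields.setdefault(key.strip(), []).append(value.strip())
    return fields


def mentions_single_cell(soft_text):
    """Return True if the descriptive series fields contain a keyword."""
    fields = parse_soft_fields(soft_text)
    text = " ".join(v for key in TEXT_FIELDS for v in fields.get(key, []))
    return bool(SINGLE_CELL_PATTERN.search(text))


# Made-up records, only to show the input shape (no network access needed).
FAKE_SINGLE = "^SERIES = GSE_FAKE_1\n!Series_title = Single-cell RNA-seq of example cells\n"
FAKE_BULK = "^SERIES = GSE_FAKE_2\n!Series_title = Bulk RNA-seq of example tissue\n"
print(mentions_single_cell(FAKE_SINGLE), mentions_single_cell(FAKE_BULK))
```

For a real accession one would call `mentions_single_cell(fetch_series_soft(gse_id))` in a loop over the GSE ids taken from ARCHS4, ideally with a pause between requests.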

scRNAseq • 1.1k views


There is no automated way to the best of my knowledge, at least no NCBI built-in function, but others may prove me wrong. Can you link a relevant accession? Then we can try to point out some features that may help.

2.8 years ago by ATpoint 82k

Thank you for your remark! We are looking at the ARCHS4 collection: it has about 300 datasets with more than 100 samples for human, and about the same number for mouse. We want to benchmark some of our algorithms on single-cell datasets ONLY, and it is not clear how to distinguish single-cell from bulk without much pain. Looking through 300+300 datasets manually is rather unpleasant :) Some info on these datasets can be found e.g. here (scroll up a few lines above the linked place): https://www.kaggle.com/alexandervc/archs4-extract-datasets-by-gse-and-show-info?scriptVersionId=68318866&cellId=16 (that is the information included in the ARCHS4 data).

2.8 years ago by Alexander ▴ 220

PS: an additional problem is that some datasets, like https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85917, contain BOTH single-cell (about 365 records) and bulk (about 49 records) samples, and ARCHS4 stores both the single-cell and the bulk data.

2.8 years ago by Alexander ▴ 220

Using EntrezDirect you can download the SRA run information for the BioProject linked to this accession.

esearch -db sra -query PRJNA339754 | efetch -format runinfo

gets you

Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR4047736,2017-04-05 16:42:18,2016-08-23 15:36:24,754635,38486385,0,51,25,,https://sra-downloadb.st-va.ncbi.nlm.nih.gov/sos2/sra-pub-run-3/SRR4047736/SRR4047736.1,SRX2038571,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP082529,PRJNA339754,2,339754,SRS1632175,SAMN05606660,simple,9606,Homo sapiens,GSM2287384,,,,,,,no,,,,,GEO,SRA454047,,public,2E58AFD9C4D09A4B77BB08CDACA020BE,C7C740C130FFA215F737769EE72E2225
SRR4047737,2017-04-05 16:42:18,2016-08-23 15:36:04,618802,31558902,0,51,22,,https://sra-downloadb.st-va.ncbi.nlm.nih.gov/sos2/sra-pub-run-3/SRR4047737/SRR4047737.1,SRX2038572,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP082529,PRJNA339754,2,339754,SRS1632174,SAMN05606659,simple,9606,Homo sapiens,GSM2287385,,,,,,,no,,,,,GEO,SRA454047,,public,F3DAAE65EDC433A14EA47728A7A37830,EB472DBA1C944ED98695F83EE4D5C6AA

Which of these are single-cell and which are bulk? I don't see a distinguishing feature in this output, so tell me and I can check. One example of a single-cell sample is fine.
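
To make the comparison easier to do for every sample at once, the runinfo CSV can be loaded and grouped by GSM id. Here is a minimal Python sketch, assuming the output of the command above has been saved as text; the function name is hypothetical and the embedded example keeps only five of the columns, with values copied from the two rows shown. For these two rows `LibraryStrategy` is identical, which matches the observation that the field does not separate single-cell from bulk. Treating the read count (`spots`) as a weak hint is an assumption that would need checking against known samples.

```python
import csv
import io

# Five columns taken from the two runinfo rows above; the real output has more.
EXAMPLE_RUNINFO = """Run,spots,LibraryStrategy,LibrarySource,SampleName
SRR4047736,754635,RNA-Seq,TRANSCRIPTOMIC,GSM2287384
SRR4047737,618802,RNA-Seq,TRANSCRIPTOMIC,GSM2287385
"""


def runinfo_by_sample(csv_text):
    """Group runinfo rows by GSM id, summing reads over a sample's runs."""
    samples = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        run = (row.get("Run") or "").strip()
        # Skip blank lines and header lines repeated inside the output.
        if not run or run == "Run":
            continue
        entry = samples.setdefault(
            row["SampleName"],
            {"runs": [], "spots": 0, "strategy": row["LibraryStrategy"]},
        )
        entry["runs"].append(run)
        entry["spots"] += int(row["spots"])
    return samples


samples = runinfo_by_sample(EXAMPLE_RUNINFO)
for gsm, info in samples.items():
    print(gsm, info["runs"], info["spots"], info["strategy"])
```

Columns are looked up by name, so the same function should work on the full 47-column output as long as the header line is present.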

2.8 years ago by GenoMax 141k

At https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85917 we can see:

............

GSM2287748 P24_H9b3s_095

GSM2287749 P24_H9b3s_096

GSM2287750 P48_H1_bulk_249

GSM2287751 P48_H1_bulk_250

.....

So the last 48 GSMs are bulk and the others are single-cell, which corresponds to the paper.
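
For this particular series the split can be done from the sample titles alone. A minimal Python sketch, using the four titles listed above; the function name is hypothetical, and the rule "a title containing `bulk` as a separate token means bulk" is an assumption that holds for the naming used in GSE85917 but is not a GEO convention, so other series will need their own rule.

```python
import re

# Assumption: "bulk" appears as its own token, delimited by "_" or punctuation.
BULK_PATTERN = re.compile(r"(^|[_\W])bulk([_\W]|$)", re.IGNORECASE)

# Titles copied from the GSE85917 sample list shown above.
TITLES = {
    "GSM2287748": "P24_H9b3s_095",
    "GSM2287749": "P24_H9b3s_096",
    "GSM2287750": "P48_H1_bulk_249",
    "GSM2287751": "P48_H1_bulk_250",
}


def split_by_title(titles):
    """Return (single_cell_ids, bulk_ids) based on the sample title."""
    single, bulk = [], []
    for gsm, title in titles.items():
        if BULK_PATTERN.search(title):
            bulk.append(gsm)
        else:
            single.append(gsm)
    return single, bulk


single_ids, bulk_ids = split_by_title(TITLES)
print("single-cell:", single_ids)
print("bulk:", bulk_ids)
```

The resulting id lists can then be used to subset the ARCHS4 samples of the series by GSM id.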

2.8 years ago by Alexander ▴ 220