How to automatically distinguish single-cell vs. bulk GEO datasets (in the ARCHS4 collection or in general)?
2.8 years ago
Alexander ▴ 220

Beauty: ARCHS4 by A. Lachmann is a project that gives easy access to gene expression datasets from the GEO database (Nature Communications paper, 2018).

Pain: it is easy to extract a dataset by GSE id, but it is not clear to me (or to some of my colleagues) how to tell whether it is a single-cell or a bulk expression dataset.

Is there any way to do this automatically? Maybe there is a list of GSE ids of single-cell datasets somewhere on the internet?
(Then I could check against this list and take from ARCHS4 only the ids that are in it.)

Or maybe there is an easy way to parse the GEO web page for a GSE id, where some fields indicate whether the data are single-cell or bulk?
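
One way to try the page-parsing idea is to request the series record in GEO's plain-text (SOFT-style) form and scan the descriptive fields for single-cell keywords. The sketch below is only a heuristic. The keyword list and the function names are assumptions made up for illustration, and the URL parameters (`targ=self`, `form=text`, `view=quick`) are, as far as I know, the ones the GEO accession viewer uses for text export. A keyword hit says nothing about series that mix both kinds of samples.

```python
import re
import urllib.request

# Hypothetical keyword list (an assumption); extend it for your own data.
SINGLE_CELL_PATTERN = re.compile(
    r"single[\s-]?cell|scRNA|sc-RNA|drop-?seq|smart-?seq|10x genomics|chromium",
    re.IGNORECASE,
)

# Descriptive SOFT fields of a series record that are worth scanning.
DESCRIPTIVE_FIELDS = (
    "!Series_title",
    "!Series_summary",
    "!Series_overall_design",
    "!Series_type",
)


def geo_text_url(gse_id):
    """Build the URL of the plain-text view of a GEO series record."""
    return (
        "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"
        f"?acc={gse_id}&targ=self&form=text&view=quick"
    )


def fetch_series_text(gse_id, timeout=30):
    """Download the plain-text series record (needs network access)."""
    with urllib.request.urlopen(geo_text_url(gse_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def single_cell_hits(series_text):
    """Return the sorted, lower-cased keywords found in the descriptive fields."""
    hits = set()
    for line in series_text.splitlines():
        if line.startswith(DESCRIPTIVE_FIELDS):
            hits.update(m.group(0).lower() for m in SINGLE_CELL_PATTERN.finditer(line))
    return sorted(hits)
```

Calling `single_cell_hits(fetch_series_text("GSE85917"))` and treating a non-empty result as "probably single-cell" would give a first-pass filter, which still needs a manual spot check.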

Or is there some other trick?

scRNAseq • 1.1k views

To the best of my knowledge there is no automated way, or at least no built-in NCBI function, but others may prove me wrong. Can you link a relevant accession? Then we can try to point out some features that may help.


Thank you for your remark! We are looking at the ARCHS4 collection, which has about 300 datasets with more than 100 samples for human and about the same number for mouse. We want to benchmark some of our algorithms on single-cell datasets ONLY, and it is not clear how to distinguish single-cell from bulk without much pain. Going through 300+300 datasets manually is rather unpleasant. Some information on these datasets, as included in ARCHS4, can be found e.g. here (scroll up a few lines from the linked place): https://www.kaggle.com/alexandervc/archs4-extract-datasets-by-gse-and-show-info?scriptVersionId=68318866&cellId=16


PS: an additional problem is that some datasets, such as https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85917, contain BOTH single-cell (about 365 records) and bulk (about 49 records) samples, and ARCHS4 stores both.


Using EntrezDirect you can download information about the bioproject for this accession.

esearch -db sra -query  PRJNA339754 | efetch -format runinfo

gets you

Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR4047736,2017-04-05 16:42:18,2016-08-23 15:36:24,754635,38486385,0,51,25,,https://sra-downloadb.st-va.ncbi.nlm.nih.gov/sos2/sra-pub-run-3/SRR4047736/SRR4047736.1,SRX2038571,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP082529,PRJNA339754,2,339754,SRS1632175,SAMN05606660,simple,9606,Homo sapiens,GSM2287384,,,,,,,no,,,,,GEO,SRA454047,,public,2E58AFD9C4D09A4B77BB08CDACA020BE,C7C740C130FFA215F737769EE72E2225
SRR4047737,2017-04-05 16:42:18,2016-08-23 15:36:04,618802,31558902,0,51,22,,https://sra-downloadb.st-va.ncbi.nlm.nih.gov/sos2/sra-pub-run-3/SRR4047737/SRR4047737.1,SRX2038572,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP082529,PRJNA339754,2,339754,SRS1632174,SAMN05606659,simple,9606,Homo sapiens,GSM2287385,,,,,,,no,,,,,GEO,SRA454047,,public,F3DAAE65EDC433A14EA47728A7A37830,EB472DBA1C944ED98695F83EE4D5C6AA

Which of these are single-cell and which are bulk? I don't see a distinguishing feature in this output. Point some out so I can check; one example of a single-cell run is fine.
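
Although the runinfo table has no single-cell flag, its SampleName column holds the GSM id, so every run can at least be joined to its GEO sample record. A minimal sketch, assuming the efetch output above was saved as CSV text; the function name `runs_to_samples` is hypothetical.

```python
import csv
import io


def runs_to_samples(runinfo_csv):
    """Map each SRA run to its GSM id, read count and library strategy."""
    rows = []
    for row in csv.DictReader(io.StringIO(runinfo_csv)):
        # Concatenated efetch output can repeat the header line; skip such rows.
        if not row.get("Run") or row["Run"] == "Run":
            continue
        rows.append(
            {
                "run": row["Run"],
                "gsm": row["SampleName"],
                "spots": int(row["spots"] or 0),  # number of reads in the run
                "strategy": row["LibraryStrategy"],
            }
        )
    return rows
```

The `gsm` values can then be looked up in GEO, where the sample titles or characteristics may say what the run metadata does not.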


On https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85917 we can see:

............

GSM2287748 P24_H9b3s_095

GSM2287749 P24_H9b3s_096

GSM2287750 P48_H1_bulk_249

GSM2287751 P48_H1_bulk_250

.....

So the last 48 GSMs are bulk and the others are single-cell, which corresponds to the paper.
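
For this particular series the split can be automated from the sample titles. A minimal sketch, assuming a dict of GSM id to title has already been collected (for example from the GEO page); the function name and the pattern are assumptions, and the "bulk" naming convention is specific to GSE85917, so the absence of the word does not prove that a sample is single-cell.

```python
import re

# "bulk" as its own token; \b is not used because "_" counts as a word character.
BULK_PATTERN = re.compile(r"(?<![A-Za-z])bulk(?![A-Za-z])", re.IGNORECASE)


def split_by_title(titles):
    """Split GSM ids into (not_marked_bulk, bulk) based on the sample title."""
    not_marked_bulk, bulk = [], []
    for gsm, title in titles.items():
        if BULK_PATTERN.search(title):
            bulk.append(gsm)
        else:
            not_marked_bulk.append(gsm)
    return not_marked_bulk, bulk


titles = {
    "GSM2287748": "P24_H9b3s_095",
    "GSM2287749": "P24_H9b3s_096",
    "GSM2287750": "P48_H1_bulk_249",
    "GSM2287751": "P48_H1_bulk_250",
}
single_like, bulk = split_by_title(titles)
print(bulk)
```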
