Hi all,
If you work with single-cell data, you have probably experienced many times how GEO/SRA butchers the "technical" reads, basically converting single-cell 10X experiments into a strange sort of bulk RNA-seq.
Sometimes, however, the reads are available as 10X BAMs submitted by the users. For example, for this run, you have an option to download the BAM from Amazon without having to pay cloud fees:
What I was wondering is whether it's possible to retrieve these links _en masse_ using something like Entrez utils. Normally, I would run something like this:
esearch -db sra -query SRR18070428 | efetch -format runinfo
However, the output in this case is as follows:
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR18070428,2023-07-01 15:41:47,2022-02-23 17:33:26,393550393,35419535370,0,90,5995,GCA_000001405.29,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-zq-20/SRR018/18070/SRR18070428/SRR18070428.lite.1,SRX14222131,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina NovaSeq 6000,SRP360500,PRJNA808248,3,808248,SRS12043620,SAMN26038351,simple,9606,Homo sapiens,GSM5906307,,,,,,,no,,,,,GEO,SRA1374480,,public,94B93456534C81E666680118B95804E0,AE8023C53DB00098682F3B51E2D46143
As you can see, no Amazon links there. There's another tool that can look up files, namely srapath; however, running
srapath SRR18070428
produces an Amazon link to the same (useless) single-end SRA archive - not to the BAM file you can download manually.
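For anyone checking this across many runs, here is a minimal sketch that pulls the Run and download_path columns out of the runinfo output by header name rather than by position. The function name runinfo_paths is made up for illustration, and the plain comma split assumes no field contains a quoted comma, which may not hold for every record. It only reports what runinfo already returns (the SRA-format path), so it does not recover the submitter BAM link.

```shell
# Hypothetical helper: read runinfo CSV on stdin, print "Run<TAB>download_path".
# Assumption: fields contain no quoted commas, so splitting on "," is safe.
runinfo_paths() {
  awk -F',' '
    # Header line (may repeat in concatenated output): map column names to positions.
    $1 == "Run" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data line: skip blanks, print only once both columns are known.
    ("Run" in col) && ("download_path" in col) && $col["Run"] != "" {
      print $col["Run"] "\t" $col["download_path"]
    }
  '
}
```

It would be used by piping the command above into it: esearch -db sra -query SRR18070428 | efetch -format runinfo | runinfo_paths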
If you have any ideas or knowledge as to how this could be automated, I would be most grateful.
All the best,
-- Alex
This information is not available via Entrez utils. You may need to scrape the pages above.
You meant _not_ available, right? Not sure NCBI allows scraping, at least my lame efforts at it failed. Oh well.
Yes. Corrected above.
Apparently it's possible! See this thread: API way to find "Original submitter" files in SRA?
NCBI is like Bash - you use it for 10 years and there's still a ton of caveats and options you have no idea about.