How to find newly submitted accessions in NCBI
2
1
9 weeks ago
LDT ▴ 220

Dear all,

I want to automate a process to identify newly submitted plant accessions in NCBI. I am scanning the NCBI FTP server, but I have not yet found any address to locate all SRA accessions.

https://ftp.ncbi.nlm.nih.gov/

Does anybody have an idea where I could find this list?

ncbi • 496 views
2
9 weeks ago

If you are looking for SRA accession numbers, you should search the SRA database, either on the website or from the command line. I gave the command line a go:

esearch -db sra -query '"2022/11/28"[Publication Date]' | efetch -format runinfo > 2022-11-18.csv


How many lines?

cat 2022-11-18.csv | wc -l


prints:

4638
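One caveat on that number: `wc -l` counts every line, including the CSV header, and runinfo output may also contain blank lines or repeated headers when results arrive in batches. A minimal sketch of a stricter count, assuming the first column of the runinfo CSV is the run accession (SRR/ERR/DRR); `count_runs` is just a hypothetical helper name:

```shell
# Count only data rows in a runinfo CSV.
# Assumption: column 1 is a run accession such as SRR..., ERR... or DRR...
count_runs() {
    # -c prints the number of matching lines; header and blank lines do not match
    grep -cE '^[SED]RR[0-9]+,' "$1"
}

# Example (file name taken from the command above):
# count_runs 2022-11-18.csv
```

The result should be slightly lower than the `wc -l` figure, by at least one for the header.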


It looks like today, Nov 28, 2022, there were 4638 datasets deposited at SRA ... whoa, I did not expect that. I am extraordinarily surprised, to be honest. That is a lot of data.

What is the size of all that data?

 cat 2022-11-18.csv | csvcut -c size_MB | grep -v size | datamash sum 1


prints:

 2471916


which comes to roughly 2.4 terabytes (about 2.47 TB if you divide by 1000², or 2.36 TiB if you divide by 1024²).
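For anyone who wants to check the conversion, here is a small sketch that turns the summed `size_MB` value into terabytes under both conventions. The helper name `mb_to_tb` is made up for illustration, and whether SRA reports decimal or binary megabytes is an assumption I have not verified.

```shell
# Convert a size in MB to TB (decimal) and TiB (binary).
mb_to_tb() {
    awk -v mb="$1" 'BEGIN {
        printf "%.2f TB (decimal), %.2f TiB (binary)\n", mb / 1e6, mb / (1024 * 1024)
    }'
}

mb_to_tb 2471916
# prints: 2.47 TB (decimal), 2.36 TiB (binary)
```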

0

This is extremely cool, Istvan, and I want to thank you for being so helpful to us. One question: is there a way to restrict the search to plants, animals, or bacteria only?

1

Technically there is a TaxID field in the output produced by the runinfo option in the command above, but sadly it is not populated for many entries (certainly not for new ones). I checked on that yesterday. Instead, you can add a TaxID number to the query in the first part of the command.
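Adding a TaxID to the query could look roughly like this. It is a sketch, not something tested against every case: `build_query` and `plants.csv` are hypothetical names, and it assumes the usual Entrez `txidNNNN[Organism:exp]` syntax, where `:exp` includes everything below that taxon. TaxID 33090 is Viridiplantae (green plants); 33208 is Metazoa and 2 is Bacteria.

```shell
# Build an Entrez query for one publication date, restricted to a taxon.
# $1 = date as YYYY/MM/DD
# $2 = NCBI Taxonomy ID (e.g. 33090 for Viridiplantae)
build_query() {
    printf '"%s"[Publication Date] AND txid%s[Organism:exp]' "$1" "$2"
}

# Usage (needs EDirect installed and network access):
# esearch -db sra -query "$(build_query 2022/11/28 33090)" \
#     | efetch -format runinfo > plants.csv
```

Filtering in the query avoids relying on the TaxID column in the runinfo output, which is the part that is often empty.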

0

thank you so much GenoMax :)

1
9 weeks ago
GenoMax 125k

NCBI publishes a file containing SRA accession numbers. It is updated daily (the file is almost a gigabyte, so a largeish download). It appears to contain accession numbers going a long way back, current up to a given date.

$ head NCBI_SRA_Datalist
Submission  Run  Date
DRA000001  DRR000001  2014-05-26T10:22:28Z
DRA000002  DRR000002  2014-05-26T11:00:19Z
DRA000003  DRR000003  2014-05-26T11:07:49Z
DRA000003  DRR000004  2014-05-26T11:07:46Z

$ tail NCBI_SRA_Datalist

SRA1548151  SRR22428598 2022-11-28T18:25:46Z
SRA1548154  SRR22428656 2022-11-28T18:34:47Z
SRA1548154  SRR22428657 2022-11-28T18:33:44Z
SRA1548154  SRR22428658 2022-11-28T18:33:31Z
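To pull out only the runs added on a given day, something like the following might work. It assumes the three whitespace-separated columns shown above (Submission, Run, Date) with a single header line; `runs_on_date` and `new_runs.txt` are hypothetical names.

```shell
# Print run accessions whose Date column starts with the given day.
# $1 = path to the accession list
# $2 = day as YYYY-MM-DD
runs_on_date() {
    # NR > 1 skips the header; index(...) == 1 means the date starts with the day
    awk -v day="$2" 'NR > 1 && index($3, day) == 1 { print $2 }' "$1"
}

# Example:
# runs_on_date NCBI_SRA_Datalist 2022-11-28 > new_runs.txt
```

The list has no organism column in the excerpt above, so the resulting run accessions would still need a metadata lookup to find out which ones are plants.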

0

This is so cool! I was wondering how I can find the new plant species from there, for example. Do you have an idea? Thank you so much for your time and suggestions