Question

Search for last Nuccore Entries

0

8.5 years ago

emmanuel.bouilhol ▴ 20

Hello everyone,

I'm trying to find an elegant way to retrieve all sequences from Nuccore (the NCBI Nucleotide database) that have been added within a given time window (for example, the last week).

So far I have found the genome report files, which contain a list of all genomes for a given class of organism: ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/viruses.txt (it would be possible to parse these and see what is new...)

I found that efetch and esearch allow searching PubMed with date parameters, but date searches do not seem to be allowed for nuccore.

That's all I've got.

Any good ideas are welcome.

Thanks for your help

Eutilities NCBI Nucleotides • 2.1k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.5 years ago by emmanuel.bouilhol ▴ 20

Ram · Accepted Answer · 2015-11-09

2

8.5 years ago

5heikki 11k

With Entrez Direct, this finds what has been published since October 2015:

esearch -db nuccore -query '("2015/10/01"[Publication Date] : "2015/11/09"[Publication Date])'
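
Since the question asks for a rolling window such as "the last week", the date range could be generated instead of typed by hand. This is only a sketch: the helper names `date_range_query` and `days_ago` are made up for illustration, and `days_ago` assumes GNU `date` (the `-d` option), which BSD/macOS `date` does not support in the same form.

```shell
# Build an Entrez date-range query from two YYYY/MM/DD dates.
# (Hypothetical helper; the query syntax is the one used in the answer above.)
date_range_query() {
    printf '("%s"[Publication Date] : "%s"[Publication Date])\n' "$1" "$2"
}

# Print the date N days ago as YYYY/MM/DD.
# Assumes GNU date; on BSD/macOS something like `date -v-7d` is needed instead.
days_ago() {
    date -d "$1 days ago" +%Y/%m/%d
}

# Example usage (needs Entrez Direct and network access):
#   esearch -db nuccore -query "$(date_range_query "$(days_ago 7)" "$(days_ago 0)")"
```

Keeping the query builder separate from the date arithmetic means a fixed range (as in the answer) and a rolling range go through the same code.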

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by 5heikki 11k

0

Well done, piped with efetch it's perfect:

esearch -db nuccore -query '("2015/11/08"[Publication Date] : "2015/11/09"[Publication Date])' | efetch -format fasta

Many Thanks!

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by emmanuel.bouilhol ▴ 20

1

Unfortunately, it is far from perfect. Efetch quite often fails with larger downloads and doesn't necessarily even print a warning. I would download the GIs instead of FASTA and, to begin with, check that the number of downloaded GIs matches the count reported by:

esearch -db nuccore -query '("2015/10/01"[Publication Date] : "2015/11/09"[Publication Date])' | xtract -element Count

Then I'd split the list of GIs with split into files of, e.g., 500 lines each and loop over those:

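for f in *.splitFile
do
    # Join the IDs in this chunk into one comma-separated string
    IDs=$(tr "\n" "," < "$f" | sed 's/,$//')
    # Upload the IDs, then fetch the matching sequences as FASTA
    epost -db nuccore -id "$IDs" | efetch -format fasta > "$f.fna"
done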
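
The loop above expects files matching `*.splitFile`, and the count check comes before it. One possible way to do both steps is sketched below. The function name `split_ids`, the `chunk_` prefix, and the file name `ids.txt` are illustrative assumptions, not part of Entrez Direct; the expected count is whatever the `xtract -element Count` command reported.

```shell
# Verify an ID list against the expected count, then split it into chunks
# named chunk_aa.splitFile, chunk_ab.splitFile, ...
# Usage: split_ids IDS_FILE EXPECTED_COUNT [LINES_PER_CHUNK]
split_ids() {
    ids_file=$1
    expected=$2
    per_chunk=${3:-500}

    # Refuse to continue if the ID download looks incomplete.
    actual=$(grep -c . "$ids_file")
    if [ "$actual" -ne "$expected" ]; then
        echo "ID count mismatch: got $actual, expected $expected" >&2
        return 1
    fi

    # split writes chunk_aa, chunk_ab, ...; rename them to match the loop's glob.
    split -l "$per_chunk" "$ids_file" chunk_
    for chunk in chunk_*; do
        case $chunk in *.splitFile) continue ;; esac  # skip already-renamed files
        mv "$chunk" "$chunk.splitFile"
    done
}

# Example usage, assuming ids.txt holds one ID per line and 1200 was the reported count:
#   split_ids ids.txt 1200 500
```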

In addition, you need to build some kind of check for these batch downloads. E.g., each FASTA file should have as many headers as there were lines in the corresponding ID file. All is great then, as long as the download didn't fail in the middle of the last sequence :)
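
The header-count check described above might look roughly like this. `check_batch` is a hypothetical helper name, and the sketch assumes one ID per line and the `$f.fna` naming from the loop. As noted, it cannot detect a download that was cut off inside the last sequence.

```shell
# Return 0 if the FASTA file has exactly one header per non-empty ID line.
# Usage: check_batch IDS_FILE FASTA_FILE
check_batch() {
    # A missing file counts as a failed batch.
    if [ ! -f "$1" ] || [ ! -f "$2" ]; then
        echo "missing file: $1 or $2" >&2
        return 1
    fi
    n_ids=$(grep -c . "$1")
    n_headers=$(grep -c '^>' "$2")
    if [ "$n_ids" -ne "$n_headers" ]; then
        echo "$2: $n_headers headers for $n_ids IDs" >&2
        return 1
    fi
}

# Example usage: list every chunk whose download looks incomplete.
#   for f in *.splitFile; do check_batch "$f" "$f.fna" || echo "retry $f"; done
```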

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by 5heikki 11k