Search for last Nuccore Entries
6.2 years ago

Hello everyone,

I'm trying to find an elegant way to retrieve all sequences from Nuccore (the NCBI Nucleotide database) that have been added within a given time span (for example, the last week).

So far I have found the genome report files, which contain a list of all genomes for a certain class of organism: ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/viruses.txt (it would be possible to parse this and see what is new...)

I found that efetch and esearch allow searching PubMed with date parameters, but date searches are not allowed for nuccore...

That's all I've got.

Any good ideas are welcome.

NCBI Eutilities Nucleotides • 1.6k views
6.2 years ago
5heikki 10k

With Entrez Direct, this finds what has been published since October 2015:

esearch -db nuccore -query "("2015/10/01"[Publication Date] : "2015/11/09"[Publication Date])"
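For the "last week" case from the question, the date range can be built rather than typed by hand. This is only a sketch: `pubdate_query` is a hypothetical helper name, not part of Entrez Direct. Note that in the command above the shell consumes the nested double quotes, so the query that reaches esearch has no quotes around the dates; the helper produces that same string.

```shell
# Hypothetical helper (not part of EDirect): print an Entrez query
# covering a publication-date range.
# Arguments: start date and end date, both formatted as YYYY/MM/DD.
pubdate_query() {
    printf '(%s[Publication Date] : %s[Publication Date])\n' "$1" "$2"
}

# Usage (needs EDirect and network access; `date -d` is GNU date syntax,
# which is an assumption about the system):
#   start=$(date -d '7 days ago' +%Y/%m/%d)
#   end=$(date +%Y/%m/%d)
#   esearch -db nuccore -query "$(pubdate_query "$start" "$end")"
```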


Well done, piped with efetch it's perfect:

esearch -db nuccore -query "("2015/11/08"[Publication Date] : "2015/11/09"[Publication Date])" | efetch -format fasta


Many Thanks!


Unfortunately, it's far from perfect. Efetch quite often fails with larger downloads and doesn't necessarily even print a warning. I would download the GIs instead of FASTA and, to begin with, check that the number of downloaded GIs is the same as the count reported by:

esearch -db nuccore -query "("2015/10/01"[Publication Date] : "2015/11/09"[Publication Date])" | xtract -element Count
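That comparison could be scripted roughly as follows. `counts_match` and the file name `ids.txt` are made up for illustration, and fetching the ID list with `efetch -format uid` is my assumption about how to get one ID per line.

```shell
# Hypothetical helper: succeed only if the ID list has exactly as many
# non-empty lines as the count reported by esearch.
# Arguments: expected count, file with one ID per line.
counts_match() {
    expected=$1
    id_file=$2
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    actual=$(grep -c . "$id_file" || true)
    if [ "$actual" -eq "$expected" ]; then
        echo "complete: $actual IDs"
    else
        echo "incomplete: got $actual of $expected IDs" >&2
        return 1
    fi
}

# Usage (needs EDirect and network access):
#   query='(2015/10/01[Publication Date] : 2015/11/09[Publication Date])'
#   expected=$(esearch -db nuccore -query "$query" | xtract -element Count)
#   esearch -db nuccore -query "$query" | efetch -format uid > ids.txt
#   counts_match "$expected" ids.txt
```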


Then I'd split the list of GIs with split into files of e.g. 500 lines each and loop over those:

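for f in *.splitFile
do
    # Join the IDs in this chunk into one comma-separated string
    IDs=$(cat "$f" | tr "\n" "," | sed 's/,$//')
    # Post the IDs to the history server, then fetch them as FASTA
    epost -db nuccore -id "$IDs" | efetch -format fasta > "$f.fna"
done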
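The splitting step that feeds this loop might look like the sketch below. `split_ids` and the `chunk.` prefix are hypothetical names; the only real requirement is that the resulting files match the `*.splitFile` pattern the loop uses.

```shell
# Hypothetical helper: split an ID list into chunks named chunk.XX.splitFile.
# Arguments: ID file, lines per chunk (default 500), output directory (default ".").
split_ids() {
    id_file=$1
    chunk_size=${2:-500}
    out_dir=${3:-.}
    # split writes chunk.aa, chunk.ab, ... using the given prefix.
    split -l "$chunk_size" "$id_file" "$out_dir/chunk." || return 1
    # Add the suffix that the loop's glob pattern matches.
    for part in "$out_dir"/chunk.??; do
        [ -e "$part" ] || continue
        mv "$part" "$part.splitFile"
    done
}

# Usage: split_ids ids.txt 500
```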


In addition, you need to build some kind of check for these batch downloads, e.g. each FASTA file should have as many headers as there were lines in the corresponding ID file. All is great then, as long as the download didn't fail in the middle of the last sequence :)
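A minimal version of that header check, assuming one ID per line and the `$f.fna` naming from the loop; `check_batch` is a hypothetical name. As said, it cannot detect a download that was cut off inside the last sequence.

```shell
# Hypothetical helper: compare the number of IDs requested with the
# number of FASTA headers actually received.
# Arguments: ID file, FASTA file.
check_batch() {
    ids=$1
    fasta=$2
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    n_ids=$(grep -c . "$ids" || true)
    n_headers=$(grep -c '^>' "$fasta" || true)
    if [ "$n_ids" -eq "$n_headers" ]; then
        echo "OK: $n_headers of $n_ids sequences"
    else
        echo "MISMATCH in $fasta: $n_headers of $n_ids sequences" >&2
        return 1
    fi
}

# Usage: report every chunk whose download looks incomplete.
#   for f in *.splitFile; do check_batch "$f" "$f.fna" > /dev/null; done
```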