I went to https://www.ncbi.nlm.nih.gov/assembly, typed in my organism, and now I want to download all of the assemblies that pop up. If I click [Download Assemblies], it only downloads 1/22 of them, and it has been saying "calculating size..." for about 30 minutes now. I tried using https://github.com/kblin/ncbi-genome-download, but not all of the records were downloaded.
If you have a list of NCBI assembly accessions with GCA or GCF prefixes, the easiest way to download the data is to use the new NCBI Datasets tool. There is a web interface for this if you don't want to bother with the command line. If you do want to use the command line, you can use the datasets CLI tool as follows:
datasets download genome accession --inputfile assm_accs.txt --exclude-gff3 --exclude-protein --exclude-rna
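Here, assm_accs.txt contains NCBI assembly accessions, one per line. Recent releases of the datasets CLI replaced the --exclude-* flags with a single --include option (for example, --include genome), so check datasets download genome accession --help for the flags your version accepts.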
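Because the input file is just one accession per line, stray whitespace, Windows line endings, or duplicate entries can cause confusing failures. Below is a minimal sketch of a clean-up step. The function name `clean_accessions` is hypothetical, and the pattern assumes versioned accessions of the form GCA_ or GCF_ followed by nine digits, a dot, and a version number.

```shell
# Print the unique, well-formed accessions found in the file given as $1.
# Assumption: accessions are versioned, e.g. GCA_000000001.1 (made-up value).
clean_accessions() {
    tr -d '\r' < "$1" \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -E '^GC[AF]_[0-9]{9}\.[0-9]+$' \
        | sort -u
}

# Example usage (not run here):
#   clean_accessions assm_accs.txt > assm_accs.clean.txt
```

Lines that do not match the pattern are silently dropped, so it may be worth comparing line counts before and after.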
Note: Currently, only the latest assembly accessions for a taxon are in the scope of NCBI Datasets. If you want to download older assemblies, you will have to use Entrez Direct as follows:
esearch -db assembly -query 'Bos taurus[organism] AND latest[filter]' \
    | esummary \
    | xtract -pattern DocumentSummary -element FtpPath_GenBank \
    | while read -r line ; do
        fname=$(echo "$line" | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/') ;
        wget "$line/$fname" ;
    done
Here, I first fetch the FTP path of the GenBank assembly using the edirect tools, and then use standard Linux commands to build the file name and download the genomic FASTA file.
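The file name step can be checked without touching the network. The sketch below does the same thing as the grep and sed pair, but with shell parameter expansion: the last component of the FTP directory is reused as the file prefix. The function name `fasta_url` is hypothetical, and the path in the example call is a made-up placeholder, not a real assembly.

```shell
# Build the genomic FASTA URL from an assembly FTP directory path.
# The body runs in a subshell, so the helper variables do not leak out.
fasta_url() (
    dir="${1%/}"        # drop a trailing slash, if any
    base="${dir##*/}"   # last path component, e.g. GCA_<digits>.<version>_<name>
    printf '%s/%s_genomic.fna.gz\n' "$dir" "$base"
)

# Placeholder path, for illustration only.
fasta_url "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/000/000/GCA_000000000.1_ExampleAsm"
```

Printing the URLs first and reviewing them before handing them to wget is a cheap way to catch records with an empty FTP path.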
If your starting point is a file with a list of NCBI assembly accessions, you can wrap the command shown above in a bash loop like this:
cat assm_accs.txt | while read -r acc ; do
    esearch -db assembly -query "$acc" </dev/null \
        | esummary \
        | xtract -pattern DocumentSummary -element FtpPath_GenBank \
        | while read -r url ; do
            fname=$(echo "$url" | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/') ;
            wget "$url/$fname" ;
        done ;
done
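Note that the loop above always asks for FtpPath_GenBank, so a GCF accession in the list would yield the paired GenBank (GCA) files rather than the RefSeq ones. If that matters, one possible approach is to pick the esummary field from the accession prefix. This is only a sketch: the function name `ftp_field_for` is hypothetical, and it relies on the assembly document summary also carrying an FtpPath_RefSeq element.

```shell
# Choose which esummary element to extract, based on the accession prefix.
ftp_field_for() {
    case "$1" in
        GCF_*) echo FtpPath_RefSeq ;;    # RefSeq assembly
        GCA_*) echo FtpPath_GenBank ;;   # GenBank assembly
        *) echo "unrecognized accession: $1" >&2 ; return 1 ;;
    esac
}

# Example usage inside the loop (not run here):
#   xtract -pattern DocumentSummary -element "$(ftp_field_for "$acc")"
```

With this change, the grep pattern used to build the file name would also need to accept both prefixes, for example 'GC[AF]_.*' instead of 'GCA_.*'.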