Question: Download Assembly Files from NCBI Genomes Site in Batch
taraeicher wrote (2.1 years ago):

I'd like to download the assembly files for bacteria, archaea, virus, fungi, and protozoa from the NCBI website. Since there are so many files, it isn't practical for me to download each one manually. Using wget, I'm able to download at the directory level. For instance, using wget -r -l 20 --no-parent --reject "index.html*" "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/" gives me everything in the archaea directory for each species. The problem is that it skips the assembly directory, which is the part I really need. For instance, I get everything in ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Acidianus_hospitalis/ except for latest_assembly_versions/GCF_000213215.1_ASM21321v1, which is the assembly directory. Does anybody know how I can download this data in batch?

Istvan Albert, University Park, USA, wrote (2.1 years ago):

This requires a series of convoluted (as well as ridiculous) steps, as described in:

https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete

Approximately this for bacteria:

# Get the summary as a tabular text file.
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

# Filter for complete genomes.
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt > ftpdirpaths

# Build the URL of each genomic FASTA file (.fna.gz). Change filesuffix to download other file types.
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths > ftpfilepaths

# Download everything in parallel
mkdir -p all
cat ftpfilepaths | parallel -j 20 --verbose --progress "cd all && curl -O {}"
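
To cover the other groups named in the question, the two awk steps can probably be folded into one function and run once per group. What follows is only a minimal sketch under a few assumptions: that the RefSeq directories are named archaea, bacteria, fungi, protozoa and viral, that each one carries an assembly_summary.txt with the same column layout as the bacteria file, and that a POSIX shell is used. The function names summary_to_urls and list_group_urls are made up for illustration. Instead of relying on field 10 of the path, the sketch takes the last path component as the assembly name.

```shell
# Read an assembly_summary.txt stream on stdin and print one download URL per
# complete, latest assembly. Column numbers (11 = version status,
# 12 = assembly level, 20 = FTP directory) follow the awk filter above.
summary_to_urls() {
    # $1: file suffix; defaults to the genomic FASTA suffix used above.
    awk -F '\t' -v suffix="${1:-genomic.fna.gz}" '
        # The "na" check is a defensive assumption for rows without a path.
        $11 == "latest" && $12 == "Complete Genome" && $20 != "na" {
            n = split($20, parts, "/")   # last path component = assembly name
            print $20 "/" parts[n] "_" suffix
        }'
}

# Hypothetical wrapper: fetch the summary of each group given as an argument
# and write the URL list to <group>_ftpfilepaths. Nothing runs until called.
list_group_urls() {
    for group in "$@"; do
        curl -s "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/${group}/assembly_summary.txt" \
            | summary_to_urls > "${group}_ftpfilepaths"
    done
}
```

Calling list_group_urls archaea bacteria fungi protozoa viral should leave one *_ftpfilepaths file per group, and each file can then be fed to the parallel download command above.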

taraeicher replied (2.0 years ago):

Hi, I just wanted to say thanks for the solution. This worked well.