Question: Download Assembly Files from NCBI Genomes Site in Batch
18 months ago by
taraeicher10
taraeicher10 wrote:

I'd like to download the assembly files for bacteria, archaea, viruses, fungi, and protozoa from the NCBI website. There are far too many files to download each one manually. Using wget, I'm able to download at the directory level. For instance,

wget -r -l 20 --no-parent --reject "index.html*" "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/"

gives me everything in the archaea directory for each species. The problem is that it skips the assembly directories, which are the part I really need. For example, I get everything in ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Acidianus_hospitalis/ except latest_assembly_versions/GCF_000213215.1_ASM21321v1, which is the assembly directory. Does anybody know how I can download this data in batch?

18 months ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

This requires a series of convoluted (as well as ridiculous) steps, as described in:

https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete

Approximately this for bacteria:

# Get the summary as a tabular text file.
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

# Filter for complete genomes.
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt > ftpdirpaths

# Build the URLs of the FASTA files (.fna.gz). Change filesuffix to download other file types.
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths > ftpfilepaths

# Download everything in parallel
mkdir -p all
cat ftpfilepaths | parallel -j 20 --verbose --progress "cd all && curl -O {}"
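
The same steps can be repeated for the other groups in the question. Below is a rough sketch that folds the two awk steps into one function and wraps the download in a second one. The function names summary_to_urls and download_group are made up for illustration, the "na" check is an extra guard for rows without an FTP path, and -j 4 is an arbitrary, more conservative job count than the 20 used above. It assumes the group directories under genomes/refseq are named archaea, bacteria, fungi, protozoa, and viral, and that GNU parallel is installed.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; column numbers (11, 12, 20)
# are the same ones used in the awk commands above.

summary_to_urls() {
    # Reads an assembly_summary.txt on stdin, prints one download URL per line.
    # $1: file suffix (defaults to the genomic FASTA suffix used above).
    local suffix="${1:-genomic.fna.gz}"
    awk -F "\t" -v suffix="$suffix" '
        $12 == "Complete Genome" && $11 == "latest" && $20 != "na" {
            n = split($20, parts, "/")   # last path component is the assembly name
            print $20 "/" parts[n] "_" suffix
        }'
}

download_group() {
    # $1: group directory under genomes/refseq (assumed: archaea, bacteria,
    #     fungi, protozoa, viral).
    local group="$1"
    mkdir -p "$group"
    curl -s "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/$group/assembly_summary.txt" \
        | summary_to_urls \
        | (cd "$group" && parallel -j 4 curl -O {})
}
```

Usage would be something like: for g in archaea bacteria fungi protozoa viral; do download_group "$g"; done. Each group ends up in its own directory. Taking the last path component with split avoids hard-coding field 10, which only matters if the path depth ever changes.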

Hi, I just wanted to say thanks for the solution. This worked well.

written 18 months ago by taraeicher10