How to download protozoa reference genomes from NCBI
3 months ago
anna ▴ 70

I want to download all available protozoa genomes from the NCBI database. Using ncbi-datasets download is unfortunately not an option, as it doesn't recognize "protozoa" as a valid taxon.

However, I found that the genomes are shared via FTP at the following locations: GenBank - protozoa and RefSeq - protozoa.

I tried downloading the contents using wget, but it only retrieves the directory listings — the actual genome files inside the subfolders (e.g., .fna.gz, .gbff.gz, etc.) are not being downloaded recursively.

This is the command I used:

    wget -r --continue --progress=bar:force:noscroll ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/protozoa/

How can I modify this or use another method to properly download all genome files, including those in the subdirectories?

Any help or suggestions would be greatly appreciated!

P.S. I'd also like to know whether there is a way to download only microfungi genomes.

3 months ago
anna ▴ 70

I found a solution to my problem. In case it is helpful to anyone, I'm leaving the workflow here:

  1. Download the assembly_summary.txt file

    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/protozoa/assembly_summary.txt

  2. Extract the FTP paths that match the required parameters

    grep -v "^#" assembly_summary.txt | \
    awk -F '\t' '
    $25 == "protozoa" && $11 == "latest" && $5 == "reference genome" {
        print $20 > "protozoa_ftp_paths.txt";
        print $0 > "protozoa_assembly_summary_filtered.txt"
    }'

  3. Append the filename, derived from the directory name, to each path and download it

    mkdir -p protozoa_fna

    cd protozoa_fna

    while read -r dir; do
        base=$(basename "$dir")
        file="${base}_genomic.fna.gz"
        full_url="${dir}/${file}"
        echo "Downloading $full_url"
        wget -c "$full_url" --progress=bar:force:noscroll
    done < ../protozoa_ftp_paths.txt
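
For anyone who wants to reuse this, here is a rough sketch of steps 2 and 3 wrapped into two small functions. The function names `filter_ftp_paths` and `genomic_fna_url` are hypothetical helpers I made up for illustration, not part of any NCBI tool. The column numbers are the same ones used in the workflow above. The extra `$20 != "na"` check assumes that rows without a usable path carry `na` in that column, so treat it as an assumption and check it against your own copy of the file.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper functions mirroring steps 2 and 3 above.

# Print the FTP directory (column 20) for rows that pass the same filter as
# step 2. Header lines starting with "#" are skipped; rows whose path is
# "na" are skipped too (assumption: "na" marks a missing path).
filter_ftp_paths() {
    awk -F '\t' '
        /^#/ { next }
        $25 == "protozoa" && $11 == "latest" && $5 == "reference genome" && $20 != "na" {
            print $20
        }' "$1"
}

# Build the URL of the *_genomic.fna.gz file from an assembly directory URL,
# the same way the loop in step 3 does with basename.
genomic_fna_url() {
    local dir="${1%/}"       # drop a trailing slash, if present
    local base="${dir##*/}"  # last path component (what basename returns)
    printf '%s/%s_genomic.fna.gz\n' "$dir" "$base"
}

# Possible usage (commented out so that sourcing this file downloads nothing):
# filter_ftp_paths assembly_summary.txt | while read -r dir; do
#     wget -c "$(genomic_fna_url "$dir")"
# done
```

Keeping the URL construction in its own function makes it easy to print and inspect the URLs before starting a long download.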
3 months ago

I know they're not the same, but you might be interested in https://protists.ensembl.org/index.html as well. I don't know how up to date and consistent it is compared with NCBI.
