Question

Having issues downloading reference bacteria genome from NCBI FTP website

2.4 years ago

krastegar0 • 0

Hi everyone, I am new to bioinformatics and I am working on my thesis project, which requires me to download reference bacterial genomes from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt. I am super green in this field, so I really don't know what I am doing. Here is the code that I was given to download the raw fastq files.

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
grep 'Complete Genome' assembly_summary.txt \
    > assembly_summary_complete_latest_reference_genomes.txt
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt \
    > assembly_summary_complete_latest_reference_genomes_paths.txt
mkdir BacterialGenomes
for i in $(cat assembly_summary_complete_latest_reference_genomes_paths.txt)
do
    wget -P BacterialGenomes ${i}/*genomic.fna.gz
done

When I run this script, it keeps looping and printing the same error messages (posted below). I am using Ubuntu Linux, in case anyone is wondering.

Warning: wildcards not supported in HTTP.
--2021-12-25 21:14:23--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/157/365/GCA_002157365.2_ASM215736v2/*genomic.fna.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.230, 130.14.250.10, 2607:f220:41f:250::230, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.230|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-12-25 21:14:23 ERROR 404: Not Found.

Thank you for any help you may be able to provide!

wget troubleshooting Linux • 1.7k views

ADD COMMENT • link updated 2.3 years ago by MirianT_NCBI ▴ 720 • written 2.4 years ago by krastegar0 • 0

I also tried doing this in R using

biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome")

but I get an error saying

The FTP link: 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/128/725/GCF_900128725.1_BCifornacula_v1.0/GCF_900128725.1_BCifornacula_v1.0_genomic.fna.gz' seems not to be available at the moment. This might be due to an instable internet connection, a firewall issue, or wrong organism name. Could you please try to re-run the function to see whether it works now?
The FTP link: 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/128/725/GCF_900128725.1_BCifornacula_v1.0/md5checksums.txt' seems not to be available at the moment. This might be due to an instable internet connection, a firewall issue, or wrong organism name. Could you please try to re-run the function to see whether it works now?
Genome download of Buchnera_aphidicola is completed!
The download session seems to have timed out at the FTP site 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/128/725/GCF_900128725.1_BCifornacula_v1.0/GCF_900128725.1_BCifornacula_v1.0_genomic.fna.gz'. This could be due to an overload of queries to the databases. Please restart this function to continue the data retrieval process or wait for a while before restarting this function in case your IP address was logged due to an query overload on the server side.
Error: Please provide a valid file path to your genome assembly file.
In addition: There were 11 warnings (use warnings() to see them)

ADD REPLY • link updated 2.3 years ago by GenoMax 142k • written 2.4 years ago by krastegar0 • 0

score 3 · Answer 1 · 2021-12-28

Hi, based on your question, I assume you're trying to download all bacterial reference genomes from NCBI, right? You can use the NCBI datasets command line tool (https://www.ncbi.nlm.nih.gov/datasets/docs/v1/quickstarts/command-line-tools/) for that. Here's the GitHub page with more info if that's helpful.

After you download the program (which can also be installed using conda), here are the steps:

1. Download a dehydrated data package that contains metadata and the paths to all reference bacterial genomes (by "reference", I'm assuming you mean all bacterial genomes with GCF accession numbers):

datasets download genome bacteria --assembly-source refseq --dehydrated --filename bacteria_refseq.zip

2. Unzip the file:

unzip bacteria_refseq.zip -d bacteria_refseq

3. Rehydrate the file:

datasets rehydrate --directory bacteria_refseq/

I'm recommending the dehydrated option because it's actually faster and more reliable, despite the additional steps. By default, the data package includes genomic, transcript, protein and CDS sequences, in addition to GFF3 annotation. If you only need the genomic FASTA sequences, you can use this command instead:

datasets download genome bacteria --assembly-source refseq \
--dehydrated --exclude-protein --exclude-genomic-cds \
--exclude-rna --exclude-gff3 --filename bacteria_refseq_fasta.zip

After that, you can follow steps 2 and 3 (unzip and rehydrate) in the same way, using the new zip file name.
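
If it helps, the download, unzip and rehydrate steps can be strung together in one small shell function. This is only a sketch: it assumes the `datasets` CLI (with the v1 flag names used in this answer) and `unzip` are on your PATH, and `fetch_refseq_bacteria` and its two arguments are made-up names for illustration, not part of the tool.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the three steps from this answer in one function.
# Assumption: the NCBI `datasets` CLI accepts the v1 flags shown in the answer.
# `fetch_refseq_bacteria` is a hypothetical helper name.

fetch_refseq_bacteria() {
    local zip_file="${1:-bacteria_refseq_fasta.zip}"  # package file name
    local out_dir="${2:-bacteria_refseq}"             # where to unpack it

    # Step 1: dehydrated package, genomic FASTA only
    datasets download genome bacteria --assembly-source refseq \
        --dehydrated --exclude-protein --exclude-genomic-cds \
        --exclude-rna --exclude-gff3 --filename "$zip_file" || return 1

    # Step 2: unpack the metadata and the list of files to fetch
    unzip "$zip_file" -d "$out_dir" || return 1

    # Step 3: download the actual sequence files
    datasets rehydrate --directory "$out_dir/" || return 1
}
```

Calling `fetch_refseq_bacteria` with no arguments uses the default names above; each step stops the function if the previous one failed, so a broken download is not unzipped or rehydrated.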

Let me know if that works or if you have any other questions. :)
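
As a side note on the original wget loop: the warning in the log ("wildcards not supported in HTTP") is the key. wget only expands `*` for FTP URLs, and the paths printed from column 20 are https, so the literal `*genomic.fna.gz` is requested and returns 404. One possible workaround is to build the full file name yourself. The sketch below assumes every assembly directory holds a file named after the directory plus `_genomic.fna.gz`, which is the pattern visible in the biomartr messages above; `genome_url` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: turn an assembly directory URL into an explicit file URL,
# so no wildcard is needed. `genome_url` is a hypothetical helper.
# Assumption: the file is named <directory name>_genomic.fna.gz.

genome_url() {
    local dir="${1%/}"        # strip a trailing slash, if any
    local name="${dir##*/}"   # last path component, e.g. GCA_002157365.2_ASM215736v2
    printf '%s/%s_genomic.fna.gz\n' "$dir" "$name"
}

# Possible use with the paths file from the question (not run here):
#   while read -r dir; do
#       wget -P BacterialGenomes "$(genome_url "$dir")"
#   done < assembly_summary_complete_latest_reference_genomes_paths.txt
```

Reading the paths file line by line also avoids the `for i in $(cat ...)` pattern, which splits on any whitespace.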