How to download COMPLETE bacterial genomes from NCBI based on a list of names?
3.5 years ago
taojincs ▴ 50

I have a long list of complete bacterial organism names (more than 100,000, so searching and downloading them one by one is not feasible). The format is one name per line. I need to download the GCA (it must be GCA, not GCF) FASTA files of the corresponding genomes from https://www.ncbi.nlm.nih.gov/genome/browse/ (with Levels set to Complete).

I have to do this from the command line. How can I do it efficiently? Thank you.

search ncbi • 2.9k views

This code didn't generate anything for me. Also it didn't give me any error. Did you manage to solve the issue?

3.5 years ago
5heikki 9.7k
cat species.txt
Porphyromonas levii
Porphyromonas somerae

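# Fetch the GenBank (GCA) assembly summary for bacteria
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

# Columns: 8 = organism name, 12 = assembly level, 20 = FTP directory of the assembly.
# For each name, print one wget command per matching complete genome, then run them with sh.
IFS=$'\n'; for next in $(cat species.txt); do awk -v SPECIES="^$next" 'BEGIN{FS="\t"}{if($8 ~ SPECIES && $12=="Complete Genome"){print $20}}' assembly_summary.txt \
    | awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_genomic.fna.gz"}'; done \
    | sh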

NOTE: Only 8,413 bacterial genomes have the "Complete Genome" assembly level (not even 10% of your list of names). For example, nothing will be downloaded for the two species in the example above. Do you really need to limit yourself to such a small subset? Counts per assembly level:

  1577 Chromosome
  8413 Complete Genome
  52594 Contig
  54565 Scaffold
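
Counts like these can be recomputed from assembly_summary.txt itself. A minimal sketch, assuming the assembly level sits in tab-separated column 12 (the same column the awk filter above tests) and that header lines start with `#`; the function name `count_levels` is made up for illustration:

```shell
# count_levels FILE
# Tally how many assemblies fall into each assembly level (column 12).
# Header lines (starting with '#') are skipped.
count_levels() {
    grep -v '^#' "$1" | cut -f12 | sort | uniq -c
}
```

Running `count_levels assembly_summary.txt` after the wget step should print one line per assembly level; the numbers will differ from those quoted above as the table is updated over time.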

This didn't download the fasta file in my directory. Nothing happened. Could you please double check it?


You need your list of species in the same directory where you run it. In my example the list is called species.txt. Modify accordingly.
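
A quick sanity check on the list before running the loop might look like the sketch below. It only tests that the file exists and is non-empty, and flags Windows (CRLF) line endings; the CRLF check is an assumption about one way a name can silently fail to match, not something reported in this thread. The function name `check_list` is hypothetical.

```shell
# check_list FILE
# Returns 1 if FILE is missing or empty, 2 if it contains carriage returns
# (Windows line endings), 0 otherwise.
check_list() {
    [ -s "$1" ] || { echo "missing or empty: $1" >&2; return 1; }
    if grep -q "$(printf '\r')" "$1"; then
        echo "carriage returns found in $1 (CRLF line endings?)" >&2
        return 2
    fi
}
```

For example, `check_list species.txt && echo "list looks usable"`.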


Hi, if I want to download proteomes for organisms that have been completely sequenced, is it enough to change "wget "$0,$NF"_genomic.fna.gz" to "wget "$0,$NF"_protein.faa.gz"? What I downloaded seems much larger than it should be: 11.9 GB of sequence data for 50 given organisms, while someone else who worked from the same list downloaded 68 MB. Thanks.
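
One thing worth checking here is how many assemblies each name actually matches: the filter in the answer compares names by prefix, so a species name also matches every strain whose organism name starts with it. Whether that explains this particular size difference is a guess. The sketch below is a dry run that lists what would be fetched without downloading anything. The function name `list_matches` is hypothetical, it uses a literal prefix comparison instead of the regex match in the answer, and it assumes the same column layout (8 = organism name, 12 = assembly level, 20 = FTP directory).

```shell
# list_matches SPECIES_FILE SUMMARY_FILE [SUFFIX]
# Dry run: print "name <TAB> URL" for every complete-genome assembly whose
# organism name starts with a name from SPECIES_FILE. Nothing is downloaded.
list_matches() {
    awk -F'\t' -v suffix="${3:-_protein.faa.gz}" '
        NR == FNR { if ($0 != "") names[$0]; next }   # first file: species names
        /^#/ || $12 != "Complete Genome" { next }    # skip headers and other levels
        {
            for (name in names)
                if (index($8, name) == 1) {           # literal prefix match
                    n = split($20, part, "/")         # last path element = file prefix
                    print name "\t" $20 "/" part[n] suffix
                }
        }
    ' "$1" "$2"
}
```

`list_matches species.txt assembly_summary.txt | cut -f1 | sort | uniq -c` then shows how many assemblies each name pulls in, which makes an unexpectedly large download easier to explain before running wget.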
