How to download COMPLETE bacterial genomes from NCBI based on a list of names?
3.5 years ago
taojincs ▴ 50

I have a long list of complete bacterial organism names (more than 100,000, so searching and downloading them one by one is not feasible). The format is one name per line. I need to download the GCA (it must be GCA, not GCF) FASTA files of the corresponding genomes from https://www.ncbi.nlm.nih.gov/genome/browse/ (with Levels set to Complete).

I have to do this from the command line. How can I do it efficiently? Thank you.

search ncbi • 2.9k views

This code didn't generate anything for me. Also it didn't give me any error. Did you manage to solve the issue?

3.5 years ago
5heikki 9.7k
cat species.txt
Porphyromonas levii
Porphyromonas somerae

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

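# For each species name, print a wget command for every "Complete Genome" assembly
# whose organism_name (column 8) starts with that name, then run the commands with sh.
IFS=$'\n'; for next in $(cat species.txt); do awk -v SPECIES=^"$next" 'BEGIN{FS="\t"}{if($8 ~ SPECIES && $12=="Complete Genome"){print $20}}' assembly_summary.txt \
| awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_genomic.fna.gz"}'; done \
| sh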


NOTE: Only 8,413 bacterial genomes have the "Complete Genome" assembly level (fewer than 10% of the number of names in your list). For example, nothing is downloaded for the two species in the example above. Do you really need to limit yourself to such a small subset?

 1577 Chromosome
 8413 Complete Genome
52594 Contig
54565 Scaffold
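
A tally like the one above can presumably be reproduced by counting the values in column 12 (assembly_level) of assembly_summary.txt. This is only a sketch: it assumes the file is tab-separated with comment lines starting with `#`, and `count_levels` is a made-up helper name, not something NCBI provides.

```shell
# count_levels FILE   (hypothetical helper)
# Tally the assembly_level column (field 12) of an assembly_summary.txt-style
# file, skipping "#" header lines. Prints "level<TAB>count", sorted by level.
count_levels() {
    awk -F '\t' '
        !/^#/ && NF >= 12 { n[$12]++ }
        END { for (k in n) print k "\t" n[k] }
    ' "$1" | sort
}
```

Running `count_levels assembly_summary.txt` on a current copy of the file will give different numbers from the ones quoted here, since the file is updated regularly.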


This didn't download the fasta file in my directory. Nothing happened. Could you please double check it?


You need your list of species in the same directory where you run it. In my example the list is called species.txt. Modify accordingly.
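
If still nothing happens, a quick sanity check of the list itself may help before running the loop. The sketch below assumes the list is a plain text file with one name per line; `check_list` is a hypothetical helper. It also looks for carriage returns, because a list saved with Windows line endings would leave a stray character at the end of every name, and the awk pattern would then never match.

```shell
# check_list [FILE]   (hypothetical helper; FILE defaults to species.txt)
# Returns 1 if the file is missing or empty, 2 if it contains carriage
# returns, otherwise prints the number of non-empty lines.
check_list() {
    list=${1:-species.txt}
    cr=$(printf '\r')
    if [ ! -s "$list" ]; then
        printf '%s\n' "missing or empty: $list" >&2
        return 1
    fi
    if grep -q "$cr" "$list"; then
        printf '%s\n' "carriage returns found in $list (Windows line endings?)" >&2
        return 2
    fi
    printf '%s\n' "$(grep -c . "$list") names in $list"
}
```

Call it as `check_list species.txt` in the directory where you run the download loop.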


Hi, if I want to download proteomes for organisms that have been completely sequenced, should I only change "wget "$0,$NF"_genomic.fna.gz" to "wget "$0,$NF"_protein.faa.gz"? It seems that what I downloaded is much larger than it should be. For example, I downloaded 11.9 GB of sequence data for 50 given organisms, while someone else who worked on the same list downloaded 68 MB. Thanks.
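
Swapping the suffix should be enough as far as the file name goes. One possible reason for the size difference (a guess, not verified against your list) is that the loop matches every assembly whose organism_name starts with the given name, so a single entry such as a common species can pull in many strain-level assemblies. Listing the URLs before downloading makes that visible. The sketch below is one way to do it in a single pass; `list_urls` is a hypothetical helper, and it uses a literal prefix match instead of a regex so that characters like `.` or `[` in names are not treated as pattern syntax.

```shell
# list_urls SPECIES_FILE SUMMARY_FILE SUFFIX   (hypothetical helper)
# Prints "name<TAB>url" for every "Complete Genome" row whose organism_name
# (field 8) starts with a name from SPECIES_FILE. SUFFIX is the file ending,
# e.g. genomic.fna.gz or protein.faa.gz.
list_urls() {
    awk -F '\t' -v suffix="$3" '
        # First file: the species list (strip any trailing carriage return).
        NR == FNR { sub(/\r$/, ""); if ($0 != "") names[$0] = 1; next }
        # Second file: skip headers, other assembly levels, and rows without a path.
        /^#/ || $12 != "Complete Genome" || $20 == "na" { next }
        {
            for (s in names) if (index($8, s) == 1) {
                n = split($20, part, "/")   # last path element is the file prefix
                print s "\t" $20 "/" part[n] "_" suffix
            }
        }
    ' "$1" "$2"
}
```

To see how many files each name would fetch: `list_urls species.txt assembly_summary.txt protein.faa.gz | cut -f1 | sort | uniq -c`. To download them: `list_urls species.txt assembly_summary.txt protein.faa.gz | cut -f2 | wget -i -`. Note that the inner loop compares every row against every name, so for a list of 100,000 names this will be slow.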