Question: How to download COMPLETE bacterial genomes from NCBI based on list of names?
16 months ago, taojincs wrote:

I have a long list of names of completely sequenced bacterial organisms (more than 100,000, so searching for and downloading them one by one is impossible). The format is one name per line. I need to download the GCA (it must be GCA, not GCF) FASTA files of the corresponding genomes from https://www.ncbi.nlm.nih.gov/genome/browse/ (with Levels set to Complete).

I have to do this from the command line. How can I do it efficiently? Thank you.


This code didn't generate anything for me. Also it didn't give me any error. Did you manage to solve the issue?

written 13 months ago by arsilan32470

My answer: C: How to retrieve single protein fasta file for multiple species?

written 13 months ago by genomax
16 months ago, 5heikki (Finland) wrote:
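# species.txt holds one organism name per line:
cat species.txt
Porphyromonas levii
Porphyromonas somerae

# Download the GenBank (GCA) assembly summary for bacteria:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

# For each name, print the FTP path (column 20) of every assembly whose organism name
# (column 8) starts with that name and whose assembly level (column 12) is "Complete Genome".
# The second awk turns each path into a wget command, and sh runs those commands.
IFS=$'\n'; for next in $(cat species.txt); do awk -v SPECIES=^"$next" 'BEGIN{FS="\t"}{if($8 ~ SPECIES && $12=="Complete Genome"){print $20}}' assembly_summary.txt \
    | awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_genomic.fna.gz"}'; done \
    | sh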
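
If the pipeline appears to do nothing, it can help to print the URLs instead of sending them to sh. The following is only a minimal sketch under the same column assumptions as the command above (column 8 is the organism name, 12 the assembly level, 20 the FTP path). The function name list_urls is hypothetical, and it swaps the regex test for a literal prefix match with index(), so characters such as "[" or "(" in a name are not treated as regex syntax.

```shell
# list_urls is a hypothetical helper: it prints download URLs instead of running wget.
# Assumption: column 8 = organism name, 12 = assembly level, 20 = FTP path.
list_urls() {
    # $1: species list (one name per line), $2: assembly_summary.txt
    awk -F'\t' '
        # First file: remember every non-empty name.
        NR == FNR { if ($0 != "") names[$0] = 1; next }
        # Second file: skip header lines and anything that is not a complete genome.
        /^#/ || $12 != "Complete Genome" { next }
        {
            for (n in names) {
                if (index($8, n) == 1) {           # literal prefix match on the organism name
                    k = split($20, part, "/")      # last path component names the assembly
                    print $20 "/" part[k] "_genomic.fna.gz"
                    break                          # print each assembly only once
                }
            }
        }
    ' "$1" "$2"
}
```

Running list_urls species.txt assembly_summary.txt | head shows what would be fetched; an empty result means no listed name has a "Complete Genome" row. Appending | xargs -n 1 wget would then download the files.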

NOTE: Only 8,413 bacterial genomes have "Complete Genome" assembly level status (not even 10% of your list of names). For example, nothing will be downloaded for the two species in the example shown above. Do you really need to limit yourself to such a small subset? Assembly counts by level:

   1577 Chromosome
   8413 Complete Genome
  52594 Contig
  54565 Scaffold
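
Counts like these can be tallied from assembly_summary.txt itself. A minimal sketch, assuming column 12 holds the assembly level (as in the awk command above) and that lines starting with "#" are header lines. The function name count_levels is hypothetical, and current numbers will differ from the ones quoted here.

```shell
# count_levels is a hypothetical helper: it tallies assemblies per assembly level.
# Assumption: column 12 = assembly level; lines starting with "#" are headers.
count_levels() {
    # $1: path to assembly_summary.txt
    awk -F'\t' '
        !/^#/ && NF >= 12 { n[$12]++ }               # count each data row by its level
        END { for (k in n) print n[k] "\t" k }       # emit count, tab, level
    ' "$1" | sort -t "$(printf '\t')" -k2            # sort by level name
}
```

Usage would be count_levels assembly_summary.txt, which prints one tab-separated "count level" pair per line.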

This didn't download the fasta file in my directory. Nothing happened. Could you please double check it?

written 16 months ago by taojincs

You need your list of species in the same directory where you run it. In my example the list is called species.txt. Modify accordingly.

written 16 months ago by 5heikki

Hi, if I want to download proteomes for organisms that have been completely sequenced, is it enough to change "wget "$0,$NF"_genomic.fna.gz" to "wget "$0,$NF"_protein.faa.gz"? What I downloaded seems much larger than it should be: I got 11.9 GB of sequence data for 50 given organisms, while someone else working from the same list downloaded 68 MB. Thanks.

written 14 months ago by taojincs
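
One thing worth checking when the download is larger than expected: the loop in the answer fetches every "Complete Genome" assembly whose organism name starts with a listed name, so a single name can match many assemblies. Whether that explains the difference here is not known. The sketch below only counts matches per name, under the same column assumptions as the answer (column 8 is the organism name, 12 the assembly level); count_matches is a hypothetical name, and it uses a literal prefix match rather than the regex in the answer.

```shell
# count_matches is a hypothetical helper: for each listed name, it counts the
# "Complete Genome" rows whose organism name starts with that name.
count_matches() {
    # $1: species list (one name per line), $2: assembly_summary.txt
    awk -F'\t' '
        # First file: register every non-empty name with a zero count.
        NR == FNR { if ($0 != "") names[$0] = 0; next }
        # Second file: skip header lines and other assembly levels.
        /^#/ || $12 != "Complete Genome" { next }
        { for (n in names) if (index($8, n) == 1) names[n]++ }
        END { for (n in names) print names[n] "\t" n }
    ' "$1" "$2" | sort -rn                           # largest counts first
}
```

Names at the top of the output of count_matches species.txt assembly_summary.txt contribute the most files; a count of 0 means nothing is downloaded for that name.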