How to download all the complete genomes for mycobacteria from NCBI?
1
0
Entering edit mode
3.9 years ago
Paul ▴ 80

How to download all the complete genomes for mycobacteria from NCBI?

I tried downloading the complete genomes from the NCBI FTP site:

ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/

But I couldn't find the FASTA files for the individual mycobacteria there, and a search at https://www.ncbi.nlm.nih.gov/genome/?term=mycobacteria returned 421 hits.

genome NCBI sequence • 1.9k views
3.9 years ago
5heikki 9.7k
# Get the GenBank assembly summary file
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt

# Get all lines that contain "Mycobacter" (the match is case-sensitive).
# If the 12th field is "Complete Genome", print the 20th field (URL of the assembly directory).
# The actual filename is the directory name plus "_genomic.fna.gz", so append that too.
grep Mycobacter assembly_summary_genbank.txt \
    | awk 'BEGIN{FS="\t"}{if($12=="Complete Genome"){print $20}}' \
    | awk 'BEGIN{OFS=FS="/"}{print $0,$NF"_genomic.fna.gz"}' \
    > urls.txt

# Now download every URL in the file, one per line
while IFS= read -r NEXT; do wget "$NEXT"; done < urls.txt
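
The filtering steps above could also be done in a single awk pass that matches only the organism name instead of the whole line. This is a rough sketch, not part of the original answer: `complete_genome_urls` and `make_row` are hypothetical helper names, the sample rows are made up, and it assumes field 8 holds the organism name (fields 12 and 20 are used as in the answer).

```shell
# Hypothetical helper: read a summary file on stdin and print download URLs
# for complete genomes whose organism name (assumed: field 8) contains $1.
complete_genome_urls() {
    awk -F'\t' -v pat="$1" '
        /^#/ { next }                                # skip header/comment lines
        index($8, pat) && $12 == "Complete Genome" {
            n = split($20, parts, "/")               # last path component = assembly directory name
            print $20 "/" parts[n] "_genomic.fna.gz"
        }'
}

# Hypothetical helper: emit one made-up row with 20 tab-separated fields.
# usage: make_row ORGANISM LEVEL URL
make_row() {
    printf 'acc\t-\t-\t-\t-\t-\t-\t%s\t-\t-\t-\t%s\t-\t-\t-\t-\t-\t-\t-\t%s\n' "$1" "$2" "$3"
}

# Demonstration on made-up rows (not real NCBI data).
{
    make_row "Mycobacterium example" "Complete Genome" "https://example.org/GCA_1_asmA"
    make_row "Mycobacterium example" "Contig"          "https://example.org/GCA_2_asmB"
    make_row "Escherichia example"   "Complete Genome" "https://example.org/GCA_3_asmC"
} | complete_genome_urls Mycobacter
# prints: https://example.org/GCA_1_asmA/GCA_1_asmA_genomic.fna.gz
```

On the real file this would be called as `complete_genome_urls Mycobacter < assembly_summary_genbank.txt > urls.txt`.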

Thanks, this worked.


I tried your method, but my urls.txt file is empty. Has the format changed?


It hasn't changed. I just tried the above and see 2,481 Mycobacter genomes with the status "Complete Genome".
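
If urls.txt comes out empty, one way to check whether the filter still lines up with the file is to tally the values in field 12 for the matching lines. This is only a sketch: `assembly_level_counts` is a hypothetical helper name and the sample rows are made up.

```shell
# Hypothetical helper: among stdin lines containing $1 (case-sensitive),
# count how often each value of field 12 occurs.
assembly_level_counts() {
    grep -- "$1" \
        | awk -F'\t' '{ count[$12]++ } END { for (k in count) print k "\t" count[k] }' \
        | sort
}

# Demonstration on three made-up rows with 12 tab-separated fields each;
# printf reuses the format once per argument.
printf 'acc\t-\t-\t-\t-\t-\t-\tMycobacterium example\t-\t-\t-\t%s\n' \
    "Complete Genome" "Contig" "Complete Genome" \
    | assembly_level_counts Mycobacter
```

If "Complete Genome" is missing from the tally, or the tally is empty, the search string or the field number is the first thing to check. On the real file the call would be `assembly_level_counts Mycobacter < assembly_summary_genbank.txt`.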


Okay, thank you for your answer.

$ grep klebsiella assembly_summary_genbank.txt | awk 'BEGIN{FS="\t"}{if($12=="Complete Genome"){print $20}}' | wc -l
0
$ grep Klebsiella assembly_summary_genbank.txt | awk 'BEGIN{FS="\t"}{if($12=="Complete Genome"){print $20}}' | wc -l
1523
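
The two counts above differ because grep matches case-sensitively by default. A minimal sketch of a case-insensitive count using grep's `-i` option follows; `count_matches_ci` is a hypothetical helper name and the sample lines are made up.

```shell
# Hypothetical helper: count stdin lines that contain $1, ignoring case.
count_matches_ci() {
    grep -ci -- "$1"
}

# Made-up sample lines standing in for rows of the summary file.
sample='Klebsiella example one
klebsiella example two
Escherichia example'

printf '%s\n' "$sample" | grep -c klebsiella            # case-sensitive: prints 1
printf '%s\n' "$sample" | count_matches_ci klebsiella   # ignoring case: prints 2
```

In the original pipeline, the equivalent change would be `grep -i` in place of `grep`.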