How to download all the complete genomes for mycobacteria from NCBI?
6.9 years ago
Paul ▴ 80

How to download all the complete genomes for mycobacteria from NCBI?

I tried downloading the complete genomes from the NCBI FTP site:

ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/

But I couldn't find the FASTA files for the individual mycobacteria, and https://www.ncbi.nlm.nih.gov/genome/?term=mycobacteria gave me 421 hits.

genome NCBI sequence • 3.9k views
6.9 years ago
5heikki 11k
#Get GenBank assembly summary file
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt

#Keep the lines that contain "Mycobacter"; if the 12th field is "Complete Genome", print the 20th field (URL of the assembly directory).
#The actual file name is the directory name plus "_genomic.fna.gz", so append that to the URL too.
grep Mycobacter assembly_summary_genbank.txt \
    | awk 'BEGIN{FS="\t"}{if($12=="Complete Genome"){print $20}}' \
    | awk 'BEGIN{OFS=FS="/"}{print $0,$NF"_genomic.fna.gz"}' \
    > urls.txt

#Now you can go through your urls file
IFS=$'\n'; for NEXT in $(cat urls.txt); do wget "$NEXT"; done
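
A variation on the commands above, offered as a rough sketch rather than a drop-in replacement. It matches the organism name in one column instead of grepping the whole line, and it skips rows whose URL field is "na". The function name complete_genome_urls is made up for illustration. Fields 12 and 20 are used as in the answer above; treating field 8 as the organism name is an assumption worth checking against the header line of your copy of the file.

```shell
# Print download URLs for complete genomes whose organism name matches a regex.
# Usage: complete_genome_urls <assembly_summary_file> <organism_regex>
# Assumed columns: 8 = organism name, 12 = assembly level, 20 = FTP path.
complete_genome_urls() {
    awk -F '\t' -v pat="$2" '
        /^#/ { next }                         # skip the commented header lines
        $8 ~ pat && $12 == "Complete Genome" && $20 != "na" {
            n = split($20, parts, "/")        # parts[n] is the assembly directory name
            print $20 "/" parts[n] "_genomic.fna.gz"
        }
    ' "$1"
}

# Example, assuming the summary file has already been downloaded:
#   complete_genome_urls assembly_summary_genbank.txt '^Mycobacterium ' > urls.txt
#   wget -i urls.txt    # -i reads the URLs to fetch from a file
```

Doing the filter and the URL construction in a single awk call also removes the need for the intermediate grep.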

Thanks, this worked.


I tried your method but ended up with an empty urls.txt file. Has the format changed?


It hasn't changed. I just tried the above and see 2,481 Mycobacter genomes with the status "Complete Genome".


Okay, thank you for your answer.

$ grep klebsiella assembly_summary_genbank.txt | awk 'BEGIN{FS="\t"}{if($12=="Complete Genome"){print $20}}' | wc -l
0
$ grep Klebsiella assembly_summary_genbank.txt | awk 'BEGIN{FS="\t"}{if($12=="Complete Genome"){print $20}}' | wc -l
1523
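
The two counts above differ because grep is case-sensitive by default, so a lowercase genus name matches nothing and leaves urls.txt empty. A minimal sketch of a count that ignores capitalisation, using grep's -i option; the function name count_complete is hypothetical, and field 12 is taken to be the assembly level as in the answer above.

```shell
# Count "Complete Genome" rows for a name, ignoring upper/lower case.
# Usage: count_complete <assembly_summary_file> <name>
count_complete() {
    grep -i -- "$2" "$1" \
        | awk -F '\t' '$12 == "Complete Genome"' \
        | wc -l
}

# Example, assuming the summary file is in the current directory:
#   count_complete assembly_summary_genbank.txt klebsiella
```

With this, "klebsiella" and "Klebsiella" should give the same count.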

Is it possible to put all the output sequences in one file (a single multi-FASTA file), please?

$ ls
file1.fna  file2.fna
$ cat file1.fna
>seq1
aaaaaaaaaa
$ cat file2.fna
>seq2
gggggg
$ cat file1.fna file2.fna > file3.fna
$ cat file3.fna
>seq1
aaaaaaaaaa
>seq2
gggggg

Thank you very much for your answer. But I have 10668 output files. Is there a command I can add after the download loop to combine them? I tried redirecting the loop's output:

IFS=$'\n'; for NEXT in $(cat urls.txt); do wget "$NEXT"; done >doc.txt

but it didn't work.


The output files all end in ".gz", right?

So zcat *.gz > all.fna

Use zcat instead of cat because the files are gzip-compressed.
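
With thousands of files, a glob such as *.gz can in principle run into the shell's argument-length limit ("Argument list too long"). A hedged sketch that sidesteps this by letting find pass the files to gzip in batches; gzip -dc does the same job as zcat for .gz files. The function name merge_genomes is made up for illustration, and the order in which genomes land in the output is whatever order find returns.

```shell
# Decompress every *_genomic.fna.gz under a directory into one multi-FASTA file.
# Usage: merge_genomes <directory> <output.fna>
merge_genomes() {
    # "-exec ... {} +" runs gzip on as many files per call as fit on a command line
    find "$1" -type f -name '*_genomic.fna.gz' -exec gzip -dc {} + > "$2"
}

# Example, run in the directory that holds the downloads:
#   merge_genomes . all.fna
```

The compressed originals are left in place, so the merge can be repeated if needed.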


Hi, I'm trying to do this in Python. I've already loaded the table with pandas and extracted the FTP path, but I need to go from:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/316/945/GCA_001316945.3_ASM131694v3

to:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/316/945/GCA_001316945.3_ASM131694v3/GCA_001316945.3_ASM131694v3_genomic.fna.gz

Thanks
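
The transformation asked about is the same one the second awk step in the answer performs: take the last component of the path and append it again, followed by "_genomic.fna.gz". A minimal sketch of that rule, written in shell like the rest of this thread rather than in Python; the function name genomic_url is hypothetical. The same two steps (split on "/", append the last piece plus the suffix) should carry over to string operations on a pandas column.

```shell
# Build the *_genomic.fna.gz URL from an assembly directory URL (the FTP path field).
# Usage: genomic_url <ftp_path>
genomic_url() {
    dir=${1%/}          # drop a trailing slash, if there is one
    name=${dir##*/}     # last path component, i.e. the assembly directory name
    printf '%s/%s_genomic.fna.gz\n' "$dir" "$name"
}

# Example:
#   genomic_url ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/316/945/GCA_001316945.3_ASM131694v3
```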
