Download RefSeq FASTA files for thousands of assemblies from NCBI
6.2 years ago
Shelle ▴ 30

I want to download many bacterial FASTA files with the .fna.gz extension from NCBI. I have tried the commands below, but neither works as it should: I get the directories but not the FASTA files. Can anyone tell me what I should change to get the RefSeq FASTA files?

wget -b -r --no-parent -A 'GCF_*_genomic.fna.gz'  ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/ 

wget -b -r --no-parent accept-regex=*/latest_assembly_versions/*/*_genomic.fna.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria
bash wget refseq fasta • 5.1k views

Modify @5heikki's solution in how to download all the complete genomes for mycobacteria from NCBI?. It refers only to mycobacterial genomes, but you can remove that restriction.

$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
$ cat assembly_summary_refseq.txt \
    | awk 'BEGIN{FS="\t"}{print $20}' \
    | awk 'BEGIN{OFS=FS="/"}{print $0,$NF"_genomic.fna.gz"}' \
    > urls.txt

Then limit the URLs to the assemblies listed in the bacterial directory.
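
One way to restrict the list to bacteria might be to start from the summary file inside the bacteria directory instead of the global one. This is only a sketch: it assumes ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt uses the same column layout as assembly_summary_refseq.txt (column 11 = version_status, column 20 = ftp_path), and the function name summary_to_urls is made up for illustration.

```shell
# Turn an NCBI assembly summary table (read from stdin) into one
# *_genomic.fna.gz URL per assembly.
summary_to_urls() {
    awk 'BEGIN { FS = "\t" }
         /^#/ { next }                      # skip header/comment lines
         $11 == "latest" && $20 != "na" {   # keep current assemblies that have a path
             n = split($20, parts, "/")     # last path component is the assembly directory
             print $20 "/" parts[n] "_genomic.fna.gz"
         }'
}

# Example usage (needs network access; not run here):
#   wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
#   summary_to_urls < assembly_summary.txt > urls.txt
#   wget -i urls.txt
```

Skipping lines that start with # avoids turning the header rows into bogus URLs, and the "na" check drops entries that have no download path.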


Thanks for your answer, but what I want are the FASTA files from this site: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/ (the link below is one of the files I am interested in):

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Abditibacterium_utsteinense/latest_assembly_versions/GCF_002973605.1_ASM297360v1/GCF_002973605.1_ASM297360v1_genomic.fna.gz

I used the command from your post, but it doesn't give me the right files!


That is not the final command. As I said, you will need to limit the URLs produced if you need just bacterial data.

You may be better off using the program @joe included below. There you can just say ncbi-genome-download bacteria to get all bacterial genomes.


Thanks. I am using this. It gives me a bunch of different directories. When I go into each directory, there is a file with content like the below:

d3d4a4c01a15dee5a054b38a3178bf12  ./GCF_000007725.1_ASM772v1_assembly_report.txt
c132f1a3ba2b00383f2a1d92e4460e2b  ./GCF_000007725.1_ASM772v1_assembly_stats.txt
7a2f6dc85caefaf326362077f72bb1ad  ./GCF_000007725.1_ASM772v1_cds_from_genomic.fna.gz
7e65c3da25f5a35d8a7860d6c478bf67  ./GCF_000007725.1_ASM772v1_feature_count.txt.gz
2d82d4315ca7a2004a3b03bc55aa42af  ./GCF_000007725.1_ASM772v1_feature_table.txt.gz
576cc643ef00d289009c95518f3792f5  ./GCF_000007725.1_ASM772v1_genomic.fna.gz
5a491b9ae2550dd9b6379e4f9054c4a2  ./GCF_000007725.1_ASM772v1_genomic.gbff.gz
25b139d63e6cd46484ac27daa8532b79  ./GCF_000007725.1_ASM772v1_genomic.gff.gz
10e61215025d12b872b28847e4a389fa  ./GCF_000007725.1_ASM772v1_protein.faa.gz
88b5db0e6f27fd5455e76d2b9180a67b  ./GCF_000007725.1_ASM772v1_protein.gpff.gz
64ae08d0ceff2696234aded52fdf8955  ./GCF_000007725.1_ASM772v1_rna_from_genomic.fna.gz
02c785fd2336a0cc3fd20687d3053460  ./GCF_000007725.1_ASM772v1_translated_cds.faa.gz
313d29e74f85d37ae6d701f606f1acac  ./annotation_hashes.txt

I am only interested in "./GCF_000007725.1_ASM772v1_genomic.fna.gz". Does anyone know how I can extract this and work with it separately?


You could use something like find . -name "*.fna.gz" and move those files to a new location and then delete the rest of the files if you don't want to keep them.
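
A rough sketch of that find-and-move idea, assuming the destination directory lives outside the tree being searched. The function name and both directory arguments are hypothetical. Note that a plain "*.fna.gz" pattern would also pick up the cds_from_genomic and rna_from_genomic files, so this version excludes them.

```shell
# Move every whole-genome FASTA under a source tree into one destination directory.
# Usage: collect_genomic_fasta SOURCE_DIR DEST_DIR   (DEST_DIR should be outside SOURCE_DIR)
collect_genomic_fasta() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    # Match *_genomic.fna.gz but skip *_cds_from_genomic / *_rna_from_genomic files,
    # then move each match into the destination directory.
    find "$src" -type f -name '*_genomic.fna.gz' \
        ! -name '*_from_genomic.fna.gz' \
        -exec mv {} "$dest"/ \;
}

# Example:
#   collect_genomic_fasta ./refseq ./fasta_only
```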


Unfortunately, the command you mentioned doesn't work. I used the command "grep -w "GCF_000007365.1_ASM736v1_genomic.fna.gz" MD5SUMS.txt >> newfile/new.txt" to separate out the FASTA file, but the resulting new.txt contains something like this:

576cc643ef00d289009c95518f3792f5  ./GCF_000007725.1_ASM772v1_genomic.fna.gz

which again isn't useful, as I want to work with this FASTA file later on, for example to decompress it...


For genomax's command to work you need to be working in a directory at the top of the tree. find is absolutely one of the best ways to do what you want, so you'll need to give us more info about what didn't work.


It worked, thanks for your comment about being at the top of the tree! But how can I decompress them when they are listed like below:

./bacteria/bacteriagz/GCF_000008885.1_ASM888v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000009305.1_ASM930v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000019705.1_ASM1970v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000024505.1_ASM2450v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000147695.2_ASM14769v3_genomic.fna.gz
./bacteria/bacteriagz/GCF_000156275.1_ASM15627v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000285255.1_ASM28525v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000287295.1_ASM28729v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000831225.1_ASM83122v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_001700895.1_ASM170089v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_001705605.1_ASM170560v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_002083165.2_ASM208316v2_genomic.fna.gz
./bacteria/bacteriagz/GCF_002257505.1_ASM225750v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_002849875.1_ASM284987v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_002855775.1_ASM285577v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_003019755.1_ASM301975v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_003019785.1_ASM301978v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_003034925.1_ASM303492v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_003043915.1_ASM304391v1_genomic.fna.gz
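
Files listed like this could probably be decompressed in one pass by letting find hand them to gzip. A minimal sketch, with a made-up function name; note that gzip -d replaces each .fna.gz with the uncompressed .fna (recent gzip versions also accept -k to keep the compressed copy).

```shell
# Decompress every .fna.gz file found under the given directory.
# Each FILE.fna.gz is replaced by FILE.fna in the same location.
decompress_fasta() {
    find "$1" -type f -name '*.fna.gz' -exec gzip -d {} +
}

# Example:
#   decompress_fasta ./bacteria/bacteriagz
```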

Are you familiar with genomax's use of the . syntax with find? It's shorthand for the current directory.

More fully that command would look like:

find /path/to/search/from -name "some_string"

Hi jrj.healey, I noticed that the command 'find . -name ".fna.gz" ' is not working at all. Even if I change '.' to the path where all the downloaded bacteria directories are, the command returns no results. Even when I am at the top of the directory tree, it gives me nothing! When I commented earlier that it worked, that was my mistake: I had some other bacterial FASTA files in a different directory, and those showed up when I ran 'find . -name ".fna.gz" '. I deleted them to confirm whether the command works. It turns out that, for the bacteria directories downloaded with this NCBI download command, find does not work in any of the forms I have tried. Any idea how to solve this?
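
One thing worth checking, assuming the pattern was typed exactly as quoted (the asterisk may also simply have been lost in the forum formatting): find's -name test matches the whole file name, so ".fna.gz" without a wildcard only matches a file literally called .fna.gz. The small helper below, whose name is made up, counts matches for a given pattern so the two forms can be compared.

```shell
# Print how many regular files under a directory match a -name pattern.
# Usage: count_matches DIR PATTERN
count_matches() {
    find "$1" -type f -name "$2" | wc -l | tr -d ' '
}

# Example:
#   count_matches . '.fna.gz'     # exact name only, usually finds nothing
#   count_matches . '*.fna.gz'    # any file name ending in .fna.gz
```

If the wildcard form also reports zero, the downloaded directories probably do not contain any .fna.gz files at all.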

6.2 years ago
Joe 21k

You should be able to use ncbi-genome-download for this I think.

https://github.com/kblin/ncbi-genome-download
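
As far as I know, ncbi-genome-download fetches GenBank-format files by default, so FASTA has to be requested explicitly. A minimal sketch, assuming the -F (formats) and -o (output folder) options behave as described in the project's README; the function name and the output directory are made up for illustration.

```shell
# Download genomic FASTA for all RefSeq bacterial assemblies into a chosen folder.
# Assumes ncbi-genome-download is installed and on PATH.
download_bacterial_fasta() {
    outdir=$1
    # -F fasta: request the *_genomic.fna.gz files instead of the default format
    # -o DIR:   write the downloaded directory tree under DIR
    ncbi-genome-download -F fasta -o "$outdir" bacteria
}

# Example (very large download, needs network access):
#   download_bacterial_fasta ./refseq_bacteria
```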
