6.2 years ago
Shelle
I want to download many bacterial FASTA files with the .fna.gz extension from NCBI. I have tried the commands below, but neither works as it should: I get the directories but not the FASTA files. Can anyone tell me what I should change to get the RefSeq FASTA files?
wget -b -r --no-parent -A 'GCF_*_genomic.fna.gz' ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/
wget -b -r --no-parent accept-regex=*/latest_assembly_versions/*/*_genomic.fna.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria
Modify @5heikki's solution in "how to download all the complete genomes for mycobacteria from NCBI?". It covers only mycobacterial genomes, but you can remove that restriction.
Limit it to the list you have from the bacterial directory.
Thanks for your answer, but what I want is the FASTA files from this site: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/. The link below is one of the files I am interested in:
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Abditibacterium_utsteinense/latest_assembly_versions/GCF_002973605.1_ASM297360v1/GCF_002973605.1_ASM297360v1_genomic.fna.gz
I used the command you mentioned in your post, but it doesn't give me the right files!
That is not the final command. As I said, you will need to limit the URLs produced if you need just bacterial data.
You may be better off using the program @joe included below. There you can just say
ncbi-genome-download bacteria
to get all bacterial genomes.
Thanks, I am using this. It gives me a bunch of different directories. When I go into each directory, there is a file with content like below:
I am only interested in "./GCF_000007725.1_ASM772v1_genomic.fna.gz". Does anyone know how I can extract this and work with it separately?
You could use something like
find . -name "*.fna.gz"
and move those files to a new location, then delete the rest of the files if you don't want to keep them.
The command you mentioned doesn't work, unfortunately. I used the command 'grep -w "GCF_000007365.1_ASM736v1_genomic.fna.gz" MD5SUMS.txt >> newfile/new.txt' to separate the FASTA file, but as a result the content of the new.txt file is something like below:
which again isn't useful, as I want to work with this FASTA file later on, for example to decompress it.
For genomax's command to work, you need to be working in a directory at the top of the tree. find is absolutely one of the best ways to do what you want, so you'll need to give us more info about what didn't work.
It worked, thanks for your comment about being at the top of the tree! But how can I decompress them when they are in the format like below:
Are you familiar with genomax's use of the '.' syntax with find? It's shorthand for "my current directory". More fully, that command would look like:
Easy to find out with google:
https://www.cyberciti.biz/faq/howto-compress-expand-gz-files/
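For reference, the collect-then-decompress steps discussed above could be sketched roughly as follows. This is only an illustration: the directory layout, the accession GCF_000000000.1, and the file contents are made-up placeholders created in a temporary directory, not what ncbi-genome-download actually produces.

```shell
# Build a small stand-in for the download tree (hypothetical names and contents).
workdir=$(mktemp -d)
mkdir -p "$workdir/refseq/bacteria/GCF_000000000.1"
printf '>seq1\nACGT\n' | gzip > "$workdir/refseq/bacteria/GCF_000000000.1/GCF_000000000.1_demo_genomic.fna.gz"
printf 'placeholder\n' > "$workdir/refseq/bacteria/GCF_000000000.1/MD5SUMS"

# Collect every .fna.gz file found under the tree into one directory.
mkdir -p "$workdir/fasta"
find "$workdir/refseq" -type f -name "*.fna.gz" -exec mv {} "$workdir/fasta/" \;

# Decompress them; gunzip replaces each .fna.gz with the plain .fna file.
gunzip "$workdir"/fasta/*.fna.gz

ls "$workdir/fasta"
```

On a real download you would replace "$workdir/refseq" with the top of your own directory tree. If you want to keep the compressed copies as well, 'gunzip -k' (or 'gzip -dk') leaves the .gz files in place.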
Hi jrj.healey, I noticed that the command 'find . -name ".fna.gz" ' is not working at all. Even if I change '.' to the path where all the downloaded bacteria directories are, I get no result when running the command. Even when I am at the top of the directory tree, it gives me nothing! When I commented earlier that it worked, that was my mistake: I had some other bacterial FASTA files in a different directory, and those showed up when I used 'find . -name ".fna.gz" '. I deleted them to confirm whether the command works. It turns out that for the bacteria directories I downloaded with the ncbi command, the find command does not work in any of the forms I have tried. Any idea how to solve this issue?
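One thing worth checking, assuming the command was typed exactly as quoted above (the asterisk may simply have been lost when the comment was posted): without a wildcard, find's -name test only matches a file whose whole name is literally ".fna.gz". A small sketch on a throwaway directory with a made-up file name shows the difference:

```shell
# Throwaway directory with one hypothetical, empty .fna.gz file.
demo=$(mktemp -d)
mkdir -p "$demo/GCF_000000000.1"
: > "$demo/GCF_000000000.1/GCF_000000000.1_demo_genomic.fna.gz"

# Without a wildcard, -name matches only a file called exactly ".fna.gz".
no_wildcard=$(find "$demo" -name ".fna.gz" | wc -l | tr -d ' ')

# With a leading *, -name matches any file name ending in .fna.gz.
with_wildcard=$(find "$demo" -name "*.fna.gz" | wc -l | tr -d ' ')

echo "matches without wildcard: $no_wildcard"
echo "matches with wildcard: $with_wildcard"
```

If the wildcard version also finds nothing in your download directory, listing everything with 'find . -type f' would at least show what file names are actually there.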