2.8 years ago

Good morning, folks. I hope you're all well. In short, I wanted to ask whether there is a package to download 2333 genomes from the NCBI database.

Thank you all.

R sequence genome • 916 views

NCBI Datasets is a new NCBI resource designed specifically to address tasks like these. If you can describe a little of what you are trying to download, I'll be able to help you more. What kind of genomes are these? Which file types are you interested in? And what is your starting point: a list of NCBI assembly accessions, species names, etc.?

First, thank you for the help. I'm working on the genome of a bacterium (Clostridium difficile). I need all the genomes that have been deposited in NCBI so that I can compare them with the ones we have. I want to download the assembled genomes.

2.8 years ago
vkkodali_ncbi ★ 3.4k

For Clostridium difficile, you can use either the NCBI Datasets command-line application or the API. There is a Python library to parse the assembly descriptions and navigate the directory hierarchy, described in more detail here, as well as Jupyter Notebooks that can be run on Binder.

For the purpose of this post, I will use the command-line application. Assuming you have followed the instructions from this page and downloaded the application, follow the commands shown below:

## download assembly descriptors and make a list of assembly accessions
## NCBI Taxonomy ID for Clostridium difficile is 1496
$ datasets assembly_descriptors tax_id 1496 -l 'ALL' | python -m json.tool > cdiff.json
## make a list of assembly (GCA/GCF) accessions
$ grep -o 'GC[AF]_[0-9]*\.[0-9]*' cdiff.json | sort -u > cdiff.accs



The datasets download command will then produce a file ncbi_dataset.zip containing the genome sequences for >3000 Clostridium difficile assemblies in FASTA format. There are additional options in the datasets assembly_descriptors command to restrict the list to RefSeq assemblies only, and additional file type options in the datasets download command that may be of interest to you. I suggest you take a quick look at the documentation and the help files.
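If you would rather filter the accession list after the fact, here is a minimal sketch that works on cdiff.accs directly. It relies only on the accession prefix convention (GCF_ for RefSeq, GCA_ for GenBank); the function names refseq_only and count_by_prefix are made up for illustration and are not part of NCBI Datasets.

```shell
# Hypothetical helpers (not part of the datasets tool).
# Both take the path to an accession list with one accession per line.

# Print only the RefSeq (GCF_) accessions, de-duplicated.
refseq_only() {
    grep '^GCF_' "$1" | sort -u
}

# Print "<prefix> <count>" for each accession prefix (GCA = GenBank, GCF = RefSeq).
count_by_prefix() {
    cut -c1-3 "$1" | sort | uniq -c | awk '{print $2, $1}'
}
```

For example, `refseq_only cdiff.accs > cdiff.refseq.accs` writes a RefSeq-only list, and `count_by_prefix cdiff.accs` shows how many accessions of each kind the query returned.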

Thanks a lot, vkkodali. How can I change the .accs extension to .zip to get the different genomes in .fasta format? I then want to run prokka for the annotation in order to retrieve the different 16S sequences and build the phylogenetic tree. If you have any pointers, I would be grateful if you could explain. Thanks again.

The file cdiff.accs is just a simple text file with a list of NCBI assembly accessions whose genomes you'd like to download. You can view it in any text editor such as Notepad. The downloaded data are in the ncbi_dataset.zip file. The query returns nearly 4000 assemblies, so that is the number of FASTA files you have in the ncbi_dataset.zip archive. For prokka, do you need individual FASTA files, or a single multi-FASTA file with all of the genomes? On a Unix machine, if you want the former, you can use something like unzip -d cdiff_fasta/ ncbi_dataset.zip 'ncbi_dataset/data/GC*/GC*.fna' to extract individual FASTA files to the cdiff_fasta directory (you may have to create the directory first). If you want the latter, you can use unzip -p ncbi_dataset.zip 'ncbi_dataset/data/GC*/GC*.fna' > cdiff.fasta
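After extracting, it can be worth checking that the number of FASTA files matches what you expect. The following is a rough sketch using only find and standard text tools; the function names are hypothetical, and it assumes the extracted tree keeps one directory per assembly, as the ncbi_dataset/data/GC*/ paths above suggest.

```shell
# Hypothetical helpers for a quick sanity check after unzip.
# Both take the directory the archive was extracted into, e.g. cdiff_fasta.

# Count every .fna file below the given directory.
count_fna_files() {
    find "$1" -type f -name '*.fna' | wc -l | tr -d ' '
}

# Count distinct directories that contain at least one .fna file
# (assumed here to correspond to one assembly each).
count_assemblies() {
    find "$1" -type f -name '*.fna' -exec dirname {} \; | sort -u | wc -l | tr -d ' '
}
```

You can then compare `count_assemblies cdiff_fasta` against `wc -l < cdiff.accs` to see whether every accession in the list produced a FASTA file.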

I need one fasta file for each genome, a total of 2,333 .fasta files. It works fine, thanks, but how can I extract the .fna files from each folder so that I can run prokka in a for loop?

I don't know how prokka works, so I cannot go into much detail. First, unzip the ncbi_dataset.zip archive. This will create a few files and a directory called ncbi_dataset. You can loop through all of the fasta files by doing something like for fa_file in ncbi_dataset/data/GC*/*.fna ; do ...
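A minimal sketch of such a loop is shown here, offered as a starting point rather than a tested pipeline. It assumes that each directory under ncbi_dataset/data/ is named after its assembly accession and that prokka is called with its --outdir and --prefix options; the function name annotate_all and the DRY_RUN switch are made up for this example. By default it only prints the commands so you can inspect them first.

```shell
# Hypothetical wrapper: run (or just print) one prokka command per FASTA file.
#   $1 = directory holding the per-assembly folders, e.g. ncbi_dataset/data
#   $2 = directory under which per-assembly output folders are created
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to actually call prokka.
annotate_all() {
    data_dir=$1
    out_root=$2
    for fa_file in "$data_dir"/GC*/*.fna; do
        # Skip the literal pattern if the glob matched nothing.
        [ -e "$fa_file" ] || continue
        # Assumption: the parent folder name is the assembly accession.
        acc=$(basename "$(dirname "$fa_file")")
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo prokka --outdir "$out_root/$acc" --prefix "$acc" "$fa_file"
        else
            prokka --outdir "$out_root/$acc" --prefix "$acc" "$fa_file"
        fi
    done
}
```

Running `annotate_all ncbi_dataset/data prokka_out` prints the commands; once they look right, `DRY_RUN=0 annotate_all ncbi_dataset/data prokka_out` executes them.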

Okay, thanks for the support.

Hello sir, do you have any idea how I can extract the 16S ribosomal RNA sequences from all the Clostridium strains to build a phylogenetic tree in MEGA-X?