Question

downloading genomes in fasta format from accession ids

0

17 months ago

Lior • 0

Hi all, I have a list of assembly accession numbers (GCF/GCA) and I want to download their complete genomes from NCBI in FASTA format. I have seen many recommendations to use the NCBI datasets and dataformat tools; is that really the best option? As far as I understand, I need to use datasets with:

datasets download genome accession {acc} --exclude-gff3 --exclude-protein --exclude-rna > outdir/{acc}.zip

to get a zipped folder with a lot of irrelevant data inside. Is there another tool, maybe in Python, that I can use to download the FASTA directly from an accession number?

Also, I want to download the metadata as well. If I use:

datasets summary genome accession {acc} > outdir/{acc}.json

I will also need to convert it with:

dataformat tsv genome --input-file outdir/{acc}.json > outdir/{acc}.tsv

Am I correct in thinking that there should be a way to do this with fewer conversions and without having to delete useless data (as with the sratoolkit)?

Any help will be much appreciated!

dataformat fasta datasets ncbi genome • 3.0k views

ADD COMMENT • link updated 8 months ago by dokdonia • 0 • written 17 months ago by Lior • 0

1

You can try the Bio.Entrez package (part of Biopython), which gives you Python access to the Entrez utilities that are traditionally invoked from the command line. Given an accession number $acc, the command-line call to retrieve the corresponding FASTA file is:

efetch -db nuccore -format fasta -id $acc
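For a Python-only route with no extra packages, the same efetch request can be sent to the E-utilities HTTP endpoint. This is a minimal sketch using only the standard library; the function names `efetch_fasta_url` and `fetch_fasta` are made up for illustration. Note the assumption that `$acc` is a nucleotide accession (for example NZ_BAZI01000829.1): the nuccore database is not expected to resolve an assembly accession (GCF_/GCA_) directly.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(acc: str, db: str = "nuccore") -> str:
    """Build the URL equivalent to: efetch -db nuccore -format fasta -id acc."""
    params = {"db": db, "id": acc, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def fetch_fasta(acc: str, db: str = "nuccore") -> str:
    """Download one record as FASTA text (needs network access)."""
    with urlopen(efetch_fasta_url(acc, db)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call fetch_fasta() to actually download.
    print(efetch_fasta_url("NZ_BAZI01000829.1"))
```

For many accessions, NCBI asks clients to throttle their requests, so a short pause between calls is advisable.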

ADD REPLY • link 17 months ago by acvill ▴ 340

1

17 months ago

MirianT_NCBI ▴ 720

Hi, thanks for giving NCBI Datasets a try :) I hope I can help you a bit with your question!

We recently released a new version of datasets with a modified genome package: only the genome FASTA and data report are included. Based on user feedback, we also changed some of our flags to make things more concise. So, with datasets v14 and later, your command would be:

datasets download genome accession {acc} --filename outdir/{acc}.zip

Metadata is already included in the data package, so there's no need to run datasets summary to get it. You can still use dataformat to convert the information to TSV format. In this case, you have two options: pass the file data_report.jsonl that's included with each data package (with the flag --inputfile), or pass the zip file itself as input (with the flag --package).
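If you would rather stay in Python than unzip by hand, the package can be read directly with the standard library. The sketch below is an illustration, not part of the datasets tooling: the function name `extract_fasta_and_report` is made up, and it assumes that FASTA members of the zip end in `.fna` and that the report file name ends in `data_report.jsonl`. Check one of your own packages to confirm the layout.

```python
import json
import shutil
import zipfile
from pathlib import Path


def extract_fasta_and_report(package_zip, outdir):
    """Copy every *.fna member out of a datasets zip and parse the JSON Lines report.

    Returns (list of written FASTA paths, list of report records as dicts).
    Assumes FASTA members end in ".fna" and the report ends in "data_report.jsonl".
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    fasta_paths, records = [], []
    with zipfile.ZipFile(package_zip) as zf:
        for name in zf.namelist():
            if name.endswith(".fna"):
                # Flatten the nested folders: keep only the file name.
                target = outdir / Path(name).name
                with zf.open(name) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)  # streams, so large genomes are fine
                fasta_paths.append(target)
            elif name.endswith("data_report.jsonl"):
                for line in zf.read(name).decode("utf-8").splitlines():
                    if line.strip():
                        records.append(json.loads(line))  # one JSON object per line
    return fasta_paths, records
```

Usage would look like `fastas, records = extract_fasta_and_report("outdir/GCF_001310775.1.zip", "fasta/")`, after which the zip can be deleted.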

Just one more thing: if that's useful for you, you can provide a list of accessions instead of downloading a separate data package for each one. Maybe that won't work for your pipeline, but I thought it might be helpful. Here's how to do it:

datasets download genome accession --inputfile list.txt --filename outdir/all_accessions.zip
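If the accessions live in a Python list, a small helper can write list.txt and catch malformed entries first. This is only a sketch: `write_accession_list` is a made-up name, the file is assumed to hold one accession per line, and the regular expression encodes an assumed pattern (GCA_ or GCF_, nine digits, optional version suffix).

```python
import re
from pathlib import Path

# Assumed pattern: GCA_/GCF_ prefix, nine digits, optional ".version" suffix.
ASSEMBLY_ACC = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")


def write_accession_list(accessions, path="list.txt"):
    """Write unique, well-formed accessions one per line; return the rejected ones."""
    kept, rejected = [], []
    for raw in accessions:
        acc = raw.strip()
        if not acc:
            continue  # skip blank entries
        if ASSEMBLY_ACC.match(acc):
            if acc not in kept:  # drop duplicates, keep first-seen order
                kept.append(acc)
        else:
            rejected.append(acc)
    Path(path).write_text("".join(f"{acc}\n" for acc in kept))
    return rejected
```

Anything returned in the rejected list (for example a bare "001310775") is worth fixing by hand before running the download.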

I hope this helps!

ADD COMMENT • link 17 months ago by MirianT_NCBI ▴ 720

0

8 months ago

dokdonia • 0

I had the same question, and it was very hard for me to find a simple tool. So I have written a Python script that gets an FTP link from NCBI's GenBank using accession numbers. If anyone wants to use this tool, feel free to download it.

https://github.com/ryu1013/accession_to_genbank_link

ADD COMMENT • link 8 months ago by dokdonia • 0

score 4 · Accepted Answer · 2022-11-08

4

17 months ago

GenoMax 141k

Use the answer here to get just sequence data with datasets --> How to retrive genomes of the isolates from specific regions? For example, If I want to retrive all the Escherichia genome fasta files from NCBI which are submitted from USA.

~~I also have an Entrezdirect way of doing this in case you want to try it in the same thread. Replace the `taxID` with your `accession` in this answer --> https://www.biostars.org/p/9529874/#9529935~~

ADD COMMENT • link 17 months ago by GenoMax 141k

0

Thank you for the answer. Do you perhaps know how to get the FASTA file? I tried:

esearch -db assembly -query "001310775 [taxID]" | efetch -format fasta
esearch -db assembly -query "001310775 [accession]" | efetch -format fasta
esearch -db assembly -query "GCF_001310775 [accession]" | efetch -format fasta

and many other commands, but nothing works. Or maybe you can point me to a manual for writing these queries, because I haven't found one.

ADD REPLY • link 17 months ago by Lior • 0

1

My apologies. One can't directly retrieve the actual sequence data using EntrezDirect.

You can get the FTP paths (GCA* is the GenBank version, GCF* is the RefSeq version):

$ esearch -db assembly -query "GCF_001310775" | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank,FtpPath_RefSeq
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/310/775/GCA_001310775.1_ASM131077v1      ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/310/775/GCF_001310775.1_ASM131077v1

Please replace ftp:// with http:// since most browsers no longer support FTP.
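Once you have the FTP path, the genomic FASTA sits inside that directory under a name built from the directory name plus `_genomic.fna.gz`. The sketch below builds that URL; `genomic_fasta_url` and `download` are made-up names, and it swaps in `https://` rather than `http://` on the assumption that the NCBI server also answers over HTTPS.

```python
import shutil
from urllib.request import urlopen


def genomic_fasta_url(ftp_path: str) -> str:
    """Turn an assembly FtpPath into an HTTPS URL for its gzipped genomic FASTA."""
    base = ftp_path.rstrip("/")
    if base.startswith("ftp://"):
        base = "https://" + base[len("ftp://"):]
    # Last path component, e.g. GCF_001310775.1_ASM131077v1
    assembly_dir = base.rsplit("/", 1)[-1]
    return f"{base}/{assembly_dir}_genomic.fna.gz"


def download(url: str, dest: str) -> None:
    """Stream the file at url to dest (needs network access)."""
    with urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
```

For example, `genomic_fasta_url("ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/310/775/GCF_001310775.1_ASM131077v1")` gives the URL of `GCF_001310775.1_ASM131077v1_genomic.fna.gz`, which can then be passed to `download`.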

If you have accession numbers then also consider the tools mentioned by @shenwei --> How to download all Pseudomonas aeruginosa Genomes from NCBI Genomes database?

ADD REPLY • link 17 months ago by GenoMax 141k

0

I can't see your previous comment for some reason, but I tried the esearch -> elink -> efetch line and it's perfect! Thanks a lot.

ADD REPLY • link 17 months ago by Lior • 0

0

Lior: For reference, I had originally posted this (it retrieves entries from WGS):

$ esearch -db assembly -query "GCF_001310775" | elink -target nuccore | efetch -format fasta
>NZ_BAZI01000829.1 Stenotrophomonas pictorum JCM 9942, whole genome shotgun sequence
CTGGCAGCGTGGCGGCGTGCTGGCGCTGCTGTCGGAGCTGTGCGCCGAACAGGCCGAACGCCTGCTGGCG
CTGGTCGATGGCGAGCGCCGGCTGACCAACTACCTGCAGCTGGCCGAACAGCTGCAGGAGGCCAGTCACC
GCAGCATCGGCCTGCACGGCCTGCTGGACTGGCTGCAGACCCGGATCGCCCACGCCGACGAGGGCGACGA

I don't know if that is exactly the same sequence that is in the GCA*/GCF* accession. The entry above is from the Whole Genome Shotgun database.

So please take that into consideration.
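One way to check this for yourself is to compare the FASTA obtained through elink/efetch with the genomic FASTA of the assembly, record by record. The sketch below is only an illustration with made-up function names. It assumes both files are uncompressed, that records can be matched on the first word of the header line, and that case differences (soft-masking) should be ignored.

```python
import hashlib


def sequence_digests(fasta_path):
    """Map each record ID (first word of the header) to the MD5 of its uppercased sequence."""
    digests, current_id, md5 = {}, None, None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current_id is not None:
                    digests[current_id] = md5.hexdigest()
                parts = line[1:].split()
                current_id = parts[0] if parts else ""
                md5 = hashlib.md5()
            elif line and md5 is not None:
                # Hashing line by line makes the result independent of line wrapping.
                md5.update(line.upper().encode("ascii"))
    if current_id is not None:
        digests[current_id] = md5.hexdigest()
    return digests


def compare_fasta(path_a, path_b):
    """Report record IDs unique to each file and shared IDs whose sequences differ."""
    a, b = sequence_digests(path_a), sequence_digests(path_b)
    shared = a.keys() & b.keys()
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "differing": sorted(i for i in shared if a[i] != b[i]),
    }
```

If all three lists come back empty, the two files hold the same set of records with identical sequences under these assumptions.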

ADD REPLY • link 17 months ago by GenoMax 141k