Question

How to download GenBank-format files from GCA identifiers?


24 months ago

MSRS ▴ 580

Hi, I have found a number of posts about downloading files from NCBI. One of them pointed me to CLI tools, but they can only download FASTA, GFF3, and protein files for a list of GenBank assembly accessions (GCA identifiers such as GCA_001874685.1, GCA_021460555.1), not GenBank files.

Is there any way to download the full GenBank files for a list of GenBank assembly accessions?

GCA_001874685.1
GCA_021460555.1
GCA_001874915.1

Thanks in advance

NCBI GenBank • 1.3k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 24 months ago by MSRS ▴ 580

score 3 · Accepted Answer · 2022-04-27


24 months ago

vkkodali_ncbi ★ 3.7k

NCBI Datasets and its associated command-line tool, datasets, can be used to download GenBank flat files for a GCA accession. This is not the default, so you need to request it on the command line as shown below:

datasets download genome accession GCA_001874685.1 --include-gbff
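Since the question involves a list of accessions, the same command can be repeated per accession. The following is only a sketch: `print_download_cmd` is a hypothetical helper name, and it assumes the installed version of datasets accepts both the `--include-gbff` option shown above and `--filename` (newer releases may spell the include option differently, so check `datasets download genome accession --help`). It prints the commands rather than running them, so the list can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: build one "datasets download" command per assembly accession.
set -euo pipefail

# Hypothetical helper: prints (does not run) the download command for one
# accession, writing each data package to its own zip file.
print_download_cmd() {
    local acc="$1"
    printf 'datasets download genome accession %s --include-gbff --filename %s.zip\n' \
        "$acc" "$acc"
}

# The three accessions from the question.
for acc in GCA_001874685.1 GCA_021460555.1 GCA_001874915.1; do
    print_download_cmd "$acc"
done
```

Piping the script's output to `bash` would run the downloads; each resulting zip can then be unpacked with `unzip`.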

ADD COMMENT • link 24 months ago by vkkodali_ncbi ★ 3.7k


Thank you so much.

ADD REPLY • link 24 months ago by MSRS ▴ 580

score 2 · Accepted Answer · 2022-04-27

Hi, NCBI provides FTP access to the required files, with a directory structure based on the accession numbers.

FTP method

e.g. the files for GCA_001874685.1 are stored in ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/874/685/GCA_001874685.1_ASM187468v1/ and among them is GCA_001874685.1_ASM187468v1_genomic.gbff.gz.

So the only piece of information you don't know is the '_ASM...' part. You can look inside the .../685 directory and download only the gbff.gz file from the directory starting with GCA_001874685.1. This can be done with an FTP client (or NCBI's Aspera download utility).
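The directory layout described above can be derived from the accession itself. A minimal sketch, assuming the accession always has nine digits after the `GCA_` prefix as in the examples here; `gca_ftp_dir` is a hypothetical helper name, not an NCBI tool:

```shell
#!/usr/bin/env bash
# Sketch: turn a GCA accession into the FTP directory that holds its assembly folders.
set -euo pipefail

gca_ftp_dir() {
    local acc="$1"
    # Keep only the digits between the "GCA_" prefix and the ".version" suffix.
    local digits="${acc#GCA_}"
    digits="${digits%%.*}"
    # Split the nine digits into three groups of three, one per directory level.
    printf 'ftp.ncbi.nlm.nih.gov/genomes/all/GCA/%s/%s/%s/\n' \
        "${digits:0:3}" "${digits:3:3}" "${digits:6:3}"
}

gca_ftp_dir GCA_001874685.1
# prints: ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/874/685/
```

The assembly-name suffix of the final folder still has to be looked up inside that directory, as noted above.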

Entrez method

esearch -query GCA_001874685.1 -db assembly | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/874/685/GCA_001874685.1_ASM187468v1

Now you can run wget on that path:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/874/685/GCA_001874685.1_ASM187468v1/*.gbff.gz

# if you want e.g. only the genomic gbff then "*_genomic.gbff.gz" will do the trick
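If you prefer an exact URL over a wildcard, the file name can be built from the FtpPath_GenBank value, because the genomic GenBank file in the example above is named after its folder. A sketch under that assumption; `gbff_url` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: build the full *_genomic.gbff.gz URL from an FtpPath_GenBank value.
set -euo pipefail

gbff_url() {
    # Drop a trailing slash, if any, so the folder name can be extracted.
    local ftp_path="${1%/}"
    # The file name is the last path component plus "_genomic.gbff.gz".
    printf '%s/%s_genomic.gbff.gz\n' "$ftp_path" "${ftp_path##*/}"
}

gbff_url ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/874/685/GCA_001874685.1_ASM187468v1

# Possible use with the esearch pipeline above (not run here):
#   esearch ... | esummary | xtract ... | while read -r path; do
#       wget "$(gbff_url "$path")"
#   done
```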

Entrez method 2

# Of course you can rely on Entrez further, so something like this will work
esearch -query GCA_001874685.1 -db assembly | elink -target nuccore | efetch -format gb

# But note that this returns records from RefSeq instead of GenBank (for which you have the accession).
# I don't know off the top of my head how to filter the RefSeq records out.