Download GenBank-format files from GCA identifiers?
2.6 years ago
MSRS ▴ 590

Hi, I have found a number of posts about downloading files from NCBI. One of them pointed me to CLI tools, but with a list of GenBank assembly accessions (GCA identifiers such as GCA_001874685.1, GCA_021460555.1) they can only download FASTA, GFF3, and protein files, not GenBank files.

Is there any way to download the full GenBank file for each accession in a list of GenBank assembly accessions?

GCA_001874685.1
GCA_021460555.1
GCA_001874915.1

Thanks in advance

NCBI GenBank • 1.8k views
2.6 years ago
vkkodali_ncbi ★ 3.8k

NCBI Datasets and its associated command-line tool, datasets, can be used to download GenBank flat files for a GCA accession. This is not the default, so you need to request it on the command line as shown below (newer releases of the tool spell the option as --include gbff):

datasets download genome accession GCA_001874685.1 --include-gbff
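
For a whole list of accessions, the same command can be generated once per accession. The sketch below is a convenience wrapper, not part of the tool: build_datasets_cmds is a hypothetical helper name, accessions.txt is a hypothetical file with one accession per line, and the --include-gbff spelling is copied from the command above (newer releases may expect --include gbff instead). It only prints the commands, so they can be reviewed before anything is downloaded.

```shell
# Print one `datasets` command per accession read from standard input.
# Assumption: the flag spelling matches the installed version of the tool.
build_datasets_cmds() {
    while IFS= read -r acc; do
        # Skip blank lines in the input list.
        [ -n "$acc" ] || continue
        # --filename gives each archive its own name instead of ncbi_dataset.zip.
        printf 'datasets download genome accession %s --include-gbff --filename %s.zip\n' "$acc" "$acc"
    done
}

# Usage (accessions.txt is a hypothetical file name):
#   build_datasets_cmds < accessions.txt
```

Without --filename, every download would be written to the default ncbi_dataset.zip, so each run would overwrite the previous archive.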

Thank you so much.

2.6 years ago

Hi, NCBI provides FTP access to the required files, in a directory structure based on the accession numbers.

FTP method

E.g. the files for GCA_001874685.1 are stored in ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/874/685/GCA_001874685.1_ASM187468v1/, and among them is GCA_001874685.1_ASM187468v1_genomic.gbff.gz.

So the only bit of information that you don't know is the '_ASM...' part. You can now look inside the .../685 directory and download only the gbff.gz file from the directory starting with GCA_001874685.1. This can be done with some ftp client (or NCBI's aspera download utility).
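
The part of the path that is known in advance can be derived from the accession alone. A minimal sketch, assuming the accession always has the form of a prefix, an underscore, nine digits, and a version suffix (as in the example above); accession_to_ftp_dir is a hypothetical helper name:

```shell
# Turn an assembly accession such as GCA_001874685.1 into the FTP directory
# that holds its versioned sub-directory (the '_ASM...' part is not derivable).
accession_to_ftp_dir() {
    acc=$1
    prefix=${acc%%_*}        # text before the underscore, e.g. GCA
    digits=${acc#*_}         # text after the underscore, e.g. 001874685.1
    digits=${digits%%.*}     # drop the version suffix, e.g. 001874685
    # Split the nine digits into three groups of three.
    p1=$(printf '%s' "$digits" | cut -c1-3)
    p2=$(printf '%s' "$digits" | cut -c4-6)
    p3=$(printf '%s' "$digits" | cut -c7-9)
    printf 'ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s\n' "$prefix" "$p1" "$p2" "$p3"
}

# Usage:
#   accession_to_ftp_dir GCA_001874685.1
```

The directory that this prints still has to be listed to find the sub-directory starting with the accession.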

Entrez method

esearch -query GCA_001874685.1 -db assembly | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/874/685/GCA_001874685.1_ASM187468v1

Now you can download the file(s) with wget:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/874/685/GCA_001874685.1_ASM187468v1/*.gbff.gz

# If you want, e.g., only the genomic gbff, then "*_genomic.gbff.gz" will do the trick.
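
The two steps above can be combined for a list of accessions. The following is a sketch under assumptions, not a tested pipeline: gbff_url_from_ftp_path is a hypothetical helper name, accessions.txt is a hypothetical file with one accession per line, and the file naming (directory name plus _genomic.gbff.gz) is inferred from the single example shown above.

```shell
# Given the FtpPath_GenBank value for an assembly, print the URL of its
# genomic GenBank flat file, assuming the file is named after the directory.
gbff_url_from_ftp_path() {
    ftp_path=${1%/}            # tolerate a trailing slash
    base=${ftp_path##*/}       # last path component, e.g. GCA_..._ASM187468v1
    printf '%s/%s_genomic.gbff.gz\n' "$ftp_path" "$base"
}

# Possible usage with the Entrez Direct commands shown above (not run here):
#   while IFS= read -r acc; do
#       # </dev/null keeps esearch from consuming the accession list on stdin
#       path=$(esearch -query "$acc" -db assembly </dev/null | esummary |
#              xtract -pattern DocumentSummary -element FtpPath_GenBank)
#       wget "$(gbff_url_from_ftp_path "$path")"
#   done < accessions.txt
```

Building the exact file URL avoids relying on wildcard matching over FTP.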

Entrez method 2

# Of course, you can do more of the work within Entrez, so something like this will also work:
esearch -query GCA_001874685.1 -db assembly | elink -target nuccore | efetch -format gb

# But note that you will receive records from RefSeq instead of GenBank (for which you have the accession).
# I don't know off the top of my head how to filter out the RefSeq records.
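
One possible way to do that filtering is by accession pattern, since RefSeq accessions start with a two-letter prefix and an underscore (for example NC_ or NZ_) while GenBank sequence accessions contain no underscore. The sketch below only filters a list of accessions; drop_refseq_accessions is a hypothetical helper name, and the pipeline in the usage comment is an untested assumption about how it could be wired into Entrez Direct.

```shell
# Read sequence accessions from standard input, one per line, and print only
# those that do not look like RefSeq accessions (two letters + underscore).
drop_refseq_accessions() {
    # `|| true` keeps the function from failing when every line is filtered out.
    LC_ALL=C grep -v '^[A-Z][A-Z]_' || true
}

# Possible usage (not verified here):
#   esearch -query GCA_001874685.1 -db assembly | elink -target nuccore |
#       efetch -format acc | drop_refseq_accessions |
#       epost -db nuccore -format acc | efetch -format gb
```

This relies only on the accession naming convention, so it should be checked against the records actually returned for a given assembly.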

Thank you so much.
