Batch download genome feature file by RefSeq
3
1
Entering edit mode
3.2 years ago
ben83 ▴ 50

I have a list of hundreds or thousands of RefSeqs, e.g. NC_000964.1

If I point my browser to https://www.ncbi.nlm.nih.gov/nuccore/NC_000964.1, I see the gbff file displayed and I can download it through various clicks. What I can't figure out is how to automate this, how to construct some url or ftp address I can just use wget or rsync on. Any advice?

Edit: the specific file I'm after is the one called GenBank(full) in the drop-down:

screenshot

gbff ncbi refseq • 3.9k views
ADD COMMENT
1
Entering edit mode
3.2 years ago
vkkodali_ncbi ★ 3.7k

NCBI Datasets is designed specifically for something like this! You can query the bacterial genomes by taxids if you would like and download all of the data in one package. There is a command line tool that you can use on Windows, Mac or Linux machines to download sequence and annotation data starting from either taxids or accessions.

At this time, NCBI Datasets includes only the latest assemblies (see here). If you have a list of NCBI assembly accessions that are out of the scope of NCBI Datasets, you can download directly from the FTP paths as shown below. The NCBI Genomes FTP has a bunch of assembly_summary.txt files (this one, for example) that have the full FTP paths for assemblies that can be used with a tool like lftp to download GFF3 and GBFF files.

$ cat assm_list.txt 
GCF_009708215.1
GCF_000311725.1
GCF_004216875.1
GCF_009912635.1
GCF_001885905.1

$ grep -f assm_list.txt -w assembly_summary.txt \
    | cut -f1,20 | while read -r acc url ; do 
        echo -e "$(date) Downloading $acc" ; 
        lftp -e "mget ${url}/*.gff.gz ; exit" ; 
    done

Technically, you can use Entrez Direct to download the GenBank flatfiles for a given set of nucleotide accessions like NC_000913 but it does not scale very well if you have many thousands of accessions. Note, there is a limit on the number of requests you can make using e-utilities API which can be increased by getting an API key as described here.

ADD COMMENT
0
Entering edit mode

Thanks that looks promising. Unfortunately I can't get it to work, for example:

./datasets download genome accession NC_000913.3

Error: invalid or unsupported assembly accession: NC_000913.3

But that refseq certainly exists: https://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3/

EDIT: OK I guess there is a difference between a RefSeq accession and an assembly accession? The following works

./datasets download genome accession GCF_000005845.2

I don't understand the difference between a refseq accession and a "assembly" accession, nor how to go from one to the other.

ADD REPLY
2
Entering edit mode

You can use Entrezdirect to go from one to other:

$ esearch -db nuccore -query NC_000913 | elink -target assembly | esummary | xtract -pattern DocumentSummary -element RefSeq
GCF_000005845.2

or download this file (RefSeq) and parse it.

RefSeq accessions are manually curated.

ADD REPLY
0
Entering edit mode

Awesome. Thank you!

ADD REPLY
0
Entering edit mode

Hi, an update. Your piped chain of commands works for me, thank you, so I can proceed with that. For completeness, I cannot find refseq accessions in the file you linked to (ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt). For example, searching that file for NC_000913.3 or just NC_000913 returns no hits.

ADD REPLY
0
Entering edit mode

Sorry I forgot that these report files don't use the RefSeq ID's in NC* format. You could grep for the organism names in this file.

ADD REPLY
0
Entering edit mode

Thanks. I want to avoid doing that because the organism names get confusing fast with different strains of the same bacterial species (and the strains can be pretty different). I will stick to the Entrez Direct method. In case someone reads this later, NCBI's instructions on installing these utilities omit the simplest method, at least on Ubuntu 20 you can just do sudo apt install ncbi-entrez-direct.

ADD REPLY
0
Entering edit mode

Note, the version of Entrez Direct in Ubuntu repositories is typically behind the latest version. I recommend following the directions on the NCBI website.

ADD REPLY
0
Entering edit mode

refseq and assembly accessions are database accessions or unique identifiers in their own worlds. The refseq or nuccore accession is an accession for a particular sequence id, while the assembly accession can group multiple contigs, scaffolds, chromosomes, etc. In your case, with a near perfect assembly of a bacterial genome, the difference might feel semantic, however think about about for example eukaryotes.

ADD REPLY
0
Entering edit mode

Thanks that is helpful. When you say "The refseq or nuccore accession is an accession for a particular sequence id", can a "particular sequence id" only refer to single physically connected sequence? I.e., the refseq accession can be a genome in the bacterial case only because there is a single "chromosome" that contains the entire genome?

ADD REPLY
2
Entering edit mode

You can find information about RefSeq records in this FAQ. Accession prefixes for RefSeq records are in this table. An organism may have multiple NC* ID's for chromosomes (e.g. human genome ones are NC000001 through NC000024).

Key differences between GenBank (GCA) and RefSeq (GCF) assemblies are summarized on this page.

ADD REPLY
0
Entering edit mode

OK so I thought I was going from RefSeq to GenBank by converting from e.g. NC_000913.1 to GCF_000005845.2 but apparently GCF is RefSeq as well so everything is in RefSeq land. I guess by going from NC_ to GCF_ I am going from "Complete genomic molecule, usually reference assembly" to "RefSeq genome assembly".

ADD REPLY
0
Entering edit mode

If you wanted to go to Genbank then you should do:

$ esearch -db nuccore -query NC_000913 | elink -target assembly | esummary | xtract -pattern DocumentSummary -element Genbank
GCA_000005845.2
ADD REPLY
0
Entering edit mode

Thanks, I think GCF is fine for me. I just want to be able to download either .gb or .gff files given a NC_ accession, from the command line. So my current strategy is

  1. Take e.g. NC_000913.3
  2. Drop whatever is after the dot so it's NC_000913
  3. Do esearch -db nuccore -query NC_000913 | elink -target assembly | esummary | xtract -pattern DocumentSummary -element RefSeq which yields GCF_000005845.2
  4. Do datasets download genome accession GCF_000005845.2 which produces a zip file containing fasta and GFF among other things
ADD REPLY
1
Entering edit mode
3.1 years ago
ben83 ▴ 50

I was ultimately successful in using a combination of @vkkodali 's suggestion to use the NCBI datasets tool and @GenoMax's suggested use of EntrezDirect tools to autmate go from an NC_* style annotation to a GCF_* annotation, and then download fasta and gff. A big thanks to both from me.

For an accession such as NC_000913.3, I did esearch -db nuccore -query NC_000913 | elink -target assembly | esummary | xtract -pattern DocumentSummary -element RefSeq (note the dropping of whatever is after the dot) which yielded e.g. GCF_000005845.2. Then datasets download genome accession GCF_000005845.2 would download a file ncbi_dataset.zip containing what I needed. This didn't always work but out of about 3400 accessions it definitely worked the large majority of the time, which is all I needed.

There might have been more direct ways. For a given NC_* accession, it seems like pointing a browser to e.g. https://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3/ works, so I considered trying to automate clicking through "Send to", maybe with something like AutoHotKey, but ultimately decided what I had here was good enough.

ADD COMMENT
0
Entering edit mode

You can get the GenBank format file by simply using

$ efetch -db nuccore -id "NC_000913" -format gb | head -5
LOCUS       NC_000913            4641652 bp    DNA     circular CON 11-OCT-2018
DEFINITION  Escherichia coli str. K-12 substr. MG1655, complete genome.
ACCESSION   NC_000913
VERSION     NC_000913.3
DBLINK      BioProject: PRJNA57779

What you can't get is a GFF file.

You could get the feature table and make your own GFF file.

$ efetch -db nuccore -id "NC_000913" -format ft | head -50
>Feature ref|NC_000913.3|
190 255 gene
            gene    thrL
            gene_syn    ECK0001
            locus_tag   b0001
            db_xref ASAP:ABE-0000006
            db_xref ECOCYC:EG11277
            db_xref EcoGene:EG11277
            db_xref GeneID:944742
190 255 CDS
            product thr operon leader peptide
            transl_table    11
            protein_id  ref|NP_414542.1|
            transcript_id   gnl|b0001|mrna.b0001
            db_xref UniProtKB/Swiss-Prot:P0AD86
ADD REPLY
0
Entering edit mode

Oh wow. Would probably do it that way if doing it again. Oh well!

ADD REPLY
0
Entering edit mode

I updated my answer to include another method to download directly from the FTP. You can download GFF3 files also using this method.

ADD REPLY
0
Entering edit mode

Thanks. Unless I'm mistaken, your updated answer's FTP method requires a GCF_* accession to start. But I'm starting from a list of NC_* (or NZ_*) accessions (http://tubic.org/doric/public/index.php/index/browse/bacteria?page=1).

ADD REPLY
0
Entering edit mode

Yes, you would have to get the GCF accessions for the nucleotide accessions using elink. Entrez Direct works fine for this because you are not downloading entire records but only a summary. In fact, you don't even have to extract the GCF accessions list if you don't care about them. There is an FtpPath_RefSeq attribute in the output of esummary that you can extract and use that to download files from FTP.

ADD REPLY
0
Entering edit mode
3.2 years ago
GenoMax 141k

Are these all genome accessions? Using ncbi-genome-download (LINK) would likely be the easiest option.

ADD COMMENT
0
Entering edit mode

Thanks that looks useful. Yes they are all accessions

Looking through the example syntax for the tool, it unfortunately doesn't look like one can pass it accession numbers. I guess I could just download all bacterial genomes and filter afterwards...

ADD REPLY
0
Entering edit mode

Do you understand the difference between

"To download only completed bacterial RefSeq genomes in GenBank format, run:

ncbi-genome-download --assembly-levels complete bacteria"

and

"To download only bacterial reference genomes from RefSeq in GenBank format, run:

ncbi-genome-download --refseq-categories reference bacteria"

is one a subset of the other?

ADD REPLY

Login before adding your answer.

Traffic: 2517 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6