Retrieve GFF3 file from NCBI
3.4 years ago
john ▴ 70

I want to download the annotation file in GFF3 format for a given genome. This is fairly easy on the NCBI web page, but I can't find a way to do the same with efetch or a similar tool.

I hoped I could use something like this:

esearch -db nuccore -query "$genome_id" | efetch -format gff3  > "$path_data/$genome_id.gff"
3.4 years ago
tdmurphy ▴ 190

There are a couple of strategies you can try, depending on what you mean by $genome_id. In each case, it's a matter of finding the right FTP path, and then using wget to get the *genomic.gff.gz file in that path:

  1. If you have assembly accessions, you can get FTP paths for each from the assembly_summary.txt file, and loop through them with wget. See "Download All The Bacterial Genomes From Ncbi" for a good post on the approach.
  2. If you have nucleotide sequence accessions for chromosomes, you can use esearch to directly query the Assembly database, and get the FTP path from the document summary:

    esearch -db assembly -query NC_000913.3 | esummary | xtract -pattern DocumentSummary -element Taxid,Organism,AssemblyAccession,FtpPath_RefSeq

  3. If you have nucleotide sequence accessions that don't directly work for queries in the Assembly database (e.g. contigs or scaffolds), you can query in nucleotide first and link to assembly:

    esearch -db nuccore -query NZ_GL379776.1 | elink -target assembly | esummary | xtract -pattern DocumentSummary -element Taxid,Organism,AssemblyAccession,FtpPath_RefSeq
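
Once one of the queries above has returned an FTP path, the remaining step is to build the file URL and hand it to wget. The following is a minimal sketch, not a tested pipeline: the helper name gff_url_from_ftp_path is hypothetical, it assumes the annotation file is named after the last component of the assembly directory followed by _genomic.gff.gz, and it assumes the same path is also served over HTTPS.

```shell
#!/bin/sh
# Hypothetical helper: turn an assembly FTP path (as returned in
# FtpPath_RefSeq) into the URL of its *_genomic.gff.gz file.
# Assumption: the file name is "<directory name>_genomic.gff.gz".
gff_url_from_ftp_path() {
    ftp_path=${1%/}            # drop a trailing slash, if any
    base=${ftp_path##*/}       # last path component (the assembly directory name)
    case $ftp_path in
        # Assumption: the same path is reachable over HTTPS.
        ftp://*) ftp_path="https://${ftp_path#ftp://}" ;;
    esac
    printf '%s/%s_genomic.gff.gz\n' "$ftp_path" "$base"
}

# Usage (needs EDirect and network access, so it is left commented out):
#   ftp_path=$(esearch -db assembly -query NC_000913.3 | esummary |
#       xtract -pattern DocumentSummary -element FtpPath_RefSeq)
#   wget "$(gff_url_from_ftp_path "$ftp_path")"
```

If a query returns more than one assembly, the xtract output has several lines, and each path would need to be passed to the helper separately.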


Number 2 is what I was looking for. Thanks!

3.4 years ago

I don't think NCBI offers GFF-formatted files through efetch (yet). Probably the best you can do is either to download them manually from the website (if you don't have many to do) or to efetch GenBank format and convert that to GFF.

Otherwise, depending on the organism(s) you look for, there might be 'dedicated' databases that offer direct gff download.


Thanks, I'll try converting it.

3.2 years ago
ucpete ▴ 70

Unfortunately, GFF3 still hasn't been added to NCBI's E-utilities as a valid return type, despite having been added to the web tool a year or more ago. That said, we can take advantage of the web-based GFF retrieval tool directly – after inspecting network traffic while pulling GFFs from the NCBI web portal and playing around with the parameters, I was able to reverse engineer how to retrieve a GFF file given an accession number. The results can be retrieved using your favorite file retrieval tool (wget, cURL, etc.). Here's how I do it using wget:

wget -O /path/to/your.gff "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=<acc[.ver]>"

[<acc[.ver]> in the example query string above should be replaced with your accession.version or plain accession, e.g. KC145265.1.]

N.B.: It's relatively straightforward to pull multiple GFFs from separate entries using a comma-separated list of identifiers, but I haven't stress tested this, nor have I slammed NCBI with so many queries that NCBI would feel compelled to block this type of web request. Here's a multi-identifier example:

wget -O Human_picobirnavirus.gff "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=NC_007026.1,NC_007027.1"
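
For more than a handful of accessions, the same URL pattern can be wrapped in a small script. This is only a sketch under assumptions: the function names sviewer_gff3_url and fetch_gff3_each are hypothetical, and the one-second pause between requests is an arbitrary choice rather than a documented NCBI limit.

```shell
#!/bin/sh
# Hypothetical helper: build the sviewer URL for one or more accessions,
# joined with commas as in the multi-identifier example above.
sviewer_gff3_url() {
    ids=$(IFS=,; printf '%s' "$*")   # join all arguments with commas
    printf 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=%s\n' "$ids"
}

# Hypothetical helper: fetch one GFF3 file per accession, pausing between
# requests. The 1-second delay is an assumption, not an NCBI requirement.
fetch_gff3_each() {
    for acc in "$@"; do
        wget -O "${acc}.gff" "$(sviewer_gff3_url "$acc")"
        sleep 1
    done
}

# Usage (needs network access, so it is left commented out):
#   fetch_gff3_each NC_007026.1 NC_007027.1
```

Fetching one accession per request keeps each GFF in its own file, at the cost of more requests than the comma-separated form.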

Nice Solution to this problem!

13 months ago

To download GFF files in batch, prepare a list of accession numbers and go to Batch Entrez. From the dropdown menu choose "Assembly", upload the accession number list, and search. To retrieve the GFFs, click "Download Assemblies" and choose file type GFF. This downloads a separately zipped GFF file for each accession number. The files are named after their projects, so if you want them in accession_name.gff format, here is a simple trick: list all the unzipped files in a list.txt file and use the following code.

while read -r p; do name=$(head -n 8 "$p" | tail -n 1 | cut -f 1); mv "$p" "${name}.gff"; done < list.txt

Tada!! here you have your GFF3 files in your desired name format.
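
The one-liner above relies on the accession sitting in the first column of line 8, which depends on how many header lines the file has. A slightly more defensive variant, sketched here with hypothetical function names (first_seqid, rename_by_seqid), takes the first column of the first line that is not a comment instead. It assumes each file has at least one feature line and that naming the file after its first sequence ID is what you want.

```shell
#!/bin/sh
# Hypothetical helper: print the sequence ID (column 1) of the first
# non-comment, non-empty line of a GFF3 file.
first_seqid() {
    awk -F '\t' '!/^#/ && NF > 0 { print $1; exit }' "$1"
}

# Hypothetical helper: rename every file listed in list.txt (one path per
# line) to <first sequence ID>.gff; files without a feature line are skipped.
rename_by_seqid() {
    while IFS= read -r f; do
        name=$(first_seqid "$f")
        if [ -n "$name" ]; then
            mv -- "$f" "${name}.gff"
        fi
    done < "$1"
}

# Usage, left commented out because it renames files in place:
#   rename_by_seqid list.txt
```

For assemblies with several sequences, only the first sequence ID ends up in the file name.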
