Retrieve GFF3 file from NCBI
4
2
4.4 years ago
john ▴ 80

I want to download the annotation file in GFF3 format for a given genome. This is fairly easy on the NCBI web page, but I can't find a way to do the same with efetch or similar tools.

I hoped I could use something like this:

esearch -db nuccore -query "$genome_id" | efetch -format gff3 > "$path_data/$genome_id.gff"

gff3 ncbi efetch • 7.3k views

2
4.4 years ago
tdmurphy ▴ 190

There are a couple of strategies you can try, depending on what you mean by $genome_id. In each case, it's a matter of finding the right FTP path, and then using wget to get the *genomic.gff.gz file in that path:

1. If you have assembly accessions, you can get FTP paths for each from the assembly_summary.txt file, and loop through them with wget. See Download All The Bacterial Genomes From Ncbi for a good post on the approach.
2. If you have nucleotide sequence accessions for chromosomes, you can use esearch to directly query the Assembly database, and get the FTP path from the document summary:

esearch -db assembly -query NC_000913.3 | esummary | xtract -pattern DocumentSummary -element Taxid,Organism,AssemblyAccession,FtpPath_RefSeq

3. If you have nucleotide sequence accessions that don't directly work for queries in the Assembly database (e.g. contigs or scaffolds), you can query in nucleotide first and link to assembly:

esearch -db nuccore -query NZ_GL379776.1 | elink -target assembly | esummary | xtract -pattern DocumentSummary -element Taxid,Organism,AssemblyAccession,FtpPath_RefSeq
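
Once one of the commands above has returned an FtpPath_RefSeq value, the GFF URL can be derived from it. Here is a minimal sketch, assuming the usual layout of the NCBI genomes FTP site, where the annotation file is named after the last component of the path plus _genomic.gff.gz. The function name gff_url_from_ftp_path is made up for illustration and is not part of EDirect.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn an assembly FTP directory into the URL of its
# *_genomic.gff.gz file.
# Assumption: the file is named <last path component>_genomic.gff.gz.
gff_url_from_ftp_path() {
    local ftp_path="${1%/}"        # drop a trailing slash, if any
    local base="${ftp_path##*/}"   # last path component (the assembly directory name)
    printf '%s/%s_genomic.gff.gz\n' "$ftp_path" "$base"
}

# Example usage (needs EDirect and network access, so it is left commented out):
# ftp_path=$(esearch -db assembly -query NC_000913.3 | esummary |
#            xtract -pattern DocumentSummary -element FtpPath_RefSeq)
# wget "$(gff_url_from_ftp_path "$ftp_path")"
```

If the query matches more than one assembly, xtract prints one path per line, so you would loop over the lines instead of passing the whole output at once.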

0

Number 2 is what I was looking for. Thanks!

1
4.4 years ago

I don't think NCBI offers GFF-formatted files through efetch (yet). Probably the best you can do is either to do it manually on the website (if you don't have many to do) or to efetch the GenBank format and convert that to GFF.

Otherwise, depending on the organism(s) you look for, there might be 'dedicated' databases that offer direct gff download.

0

Thanks, I'll try to convert it.

12
4.2 years ago
ucpete ▴ 120

Unfortunately, GFF3 still hasn't been added to NCBI's E-utilities as a valid return type, despite having been added to the web tool a year or more ago. That said, we can take advantage of the web-based GFF retrieval tool directly – after inspecting network traffic while pulling GFFs from the NCBI web portal and playing around with the parameters, I was able to reverse engineer how to retrieve a GFF file given an accession number. The results can be retrieved using your favorite file retrieval tool (wget, cURL, etc.). Here's how I do it using wget:

wget -O /path/to/your.gff "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=<acc[.ver]>"


[<acc[.ver]> in the example query string above should be replaced with your accession.version or plain accession, e.g. KC145265.1.]

N.B.: It's relatively straightforward to pull multiple GFFs from separate entries using a comma-separated list of identifiers, but I haven't stress tested this, nor have I slammed NCBI with so many queries that NCBI would feel compelled to block this type of web request. Here's a multi-identifier example:

wget -O Human_picobirnavirus.gff "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=NC_007026.1,NC_007027.1"
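
If you have the accessions in a shell variable or as script arguments, the comma-separated URL can be built with a small helper. This is only a sketch: the function name sviewer_gff_url is made up, and the URL is simply the one from the wget examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the sviewer GFF3 URL for one or more accessions.
# The base URL and its parameters are copied from the wget examples above.
sviewer_gff_url() {
    local ids
    ids=$(IFS=,; printf '%s' "$*")   # join all arguments with commas
    printf 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=%s\n' "$ids"
}

# Example usage (needs network access, so it is left commented out):
# wget -O Human_picobirnavirus.gff "$(sviewer_gff_url NC_007026.1 NC_007027.1)"
```

Keeping the list of identifiers short and pausing between requests seems prudent, since this endpoint is not a documented API.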

0

Nice solution to this problem!

0
2.2 years ago

To download GFF files in batch, prepare a list of accession numbers and go to Batch Entrez. From the dropdown menu choose "Assembly", upload the accession number list, and search. To retrieve the GFFs, click on "Download Assemblies" and choose file type GFF. This downloads a separately zipped GFF file for each accession number. The files are named after their projects, so if you want them named in accession_name.gff format instead, here is a simple trick: list all the unzipped files in a list.txt file and use the following code.

while read -r p; do name=$(head -n 8 "$p" | tail -n 1 | cut -f 1); mv "$p" "${name}.gff"; done < list.txt
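
The loop above assumes that the first feature line is line 8 of every file. If the header length varies, a variation that takes the first column of the first non-comment line may be more robust. This is a sketch, not the original poster's method, and the function names first_seqid and rename_gffs are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the sequence ID (column 1) of the first
# non-comment, non-empty line of a GFF file.
first_seqid() {
    awk -F '\t' '!/^#/ && NF { print $1; exit }' "$1"
}

# Rename every file listed (one path per line) in the given list file to
# <seqid>.gff, keeping it in its original directory. Files in which no
# sequence ID is found are left untouched.
rename_gffs() {
    local p name
    while IFS= read -r p; do
        name=$(first_seqid "$p")
        if [ -n "$name" ]; then
            mv -- "$p" "$(dirname "$p")/${name}.gff"
        fi
    done < "$1"
}

# Example usage:
# rename_gffs list.txt
```

As with the one-liner, a file that contains several sequences is named after the first one only.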