Question: Retrieve GFF3 file from NCBI
john wrote (3.0 years ago):

I want to download the annotation file in GFF3 format for a given genome. This is fairly easy on the NCBI website, but I can't find a way to do the same with efetch or similar tools.

I hoped I could use something like this:

esearch -db nuccore -query "$genome_id" | efetch -format gff3  > "$path_data/$genome_id.gff"
Tags: efetch, ncbi, gff3

tdmurphy wrote (3.0 years ago):

There are a couple of strategies you can try, depending on what you mean by $genome_id. In each case, it's a matter of finding the right FTP path, and then using wget to get the *genomic.gff.gz file in that path:

  1. If you have assembly accessions, you can get FTP paths for each from the assembly_summary.txt file, and loop through them with wget. See the post "Download All The Bacterial Genomes From Ncbi" for a good description of the approach.
  2. If you have nucleotide sequence accessions for chromosomes, you can use esearch to directly query the Assembly database, and get the FTP path from the document summary:

    esearch -db assembly -query NC_000913.3 | esummary | xtract -pattern DocumentSummary -element Taxid,Organism,AssemblyAccession,FtpPath_RefSeq

  3. If you have nucleotide sequence accessions that don't directly work for queries in the Assembly database (e.g. contigs or scaffolds), you can query in nucleotide first and link to assembly:

    esearch -db nuccore -query NZ_GL379776.1 | elink -target assembly | esummary | xtract -pattern DocumentSummary -element Taxid,Organism,AssemblyAccession,FtpPath_RefSeq
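
Once you have an FTP path from any of the strategies above, the remaining download step can be sketched roughly as follows. This is only a sketch: it assumes the GFF file is named after the last component of the FTP path plus _genomic.gff.gz, and the function names gff_url_from_ftp_path and download_gffs are made up for illustration.

```shell
# Build the URL of the *_genomic.gff.gz file from an assembly FTP path.
# Assumption: file name = last path component + "_genomic.gff.gz".
gff_url_from_ftp_path() {
    local ftp_path="${1%/}"          # drop a trailing slash, if any
    local asm_dir="${ftp_path##*/}"  # last path component (the assembly directory name)
    printf '%s/%s_genomic.gff.gz\n' "$ftp_path" "$asm_dir"
}

# Hypothetical helper: read FTP paths (one per line) on stdin and fetch each GFF.
download_gffs() {
    local ftp_path
    while IFS= read -r ftp_path; do
        [ -n "$ftp_path" ] || continue   # skip empty lines (e.g. no RefSeq path)
        wget "$(gff_url_from_ftp_path "$ftp_path")"
    done
}
```

For example, the FTP paths could be piped straight in from the query in strategy 2:

    esearch -db assembly -query NC_000913.3 | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq | download_gffs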

Reply by john (3.0 years ago):

Number 2 is what I was looking for. Thanks.

lieven.sterck wrote (3.0 years ago):

I don't think NCBI offers GFF-formatted files through efetch (yet). Probably the best you can do is either download them manually from the website (if you don't have many to do) or efetch the GenBank format and convert that to GFF.

Otherwise, depending on the organism(s) you are looking for, there might be 'dedicated' databases that offer direct GFF download.

Reply by john (3.0 years ago):

Thanks, I'll try to convert it.

ucpete wrote (2.8 years ago):

Unfortunately, GFF3 still hasn't been added to NCBI's E-utilities as a valid return type, despite having been added to the web tool a year or more ago. That said, we can take advantage of the web-based GFF retrieval tool directly – after inspecting network traffic while pulling GFFs from the NCBI web portal and playing around with the parameters, I was able to reverse engineer how to retrieve a GFF file given an accession number. The results can be retrieved using your favorite file retrieval tool (wget, cURL, etc.). Here's how I do it using wget:

wget -O /path/to/your.gff "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=<acc[.ver]>"

[<acc[.ver]> in the example query string above should be replaced with your accession.version or bare accession, e.g. KC145265.1.]

N.B.: It's relatively straightforward to pull multiple GFFs from separate entries using a comma-separated list of identifiers, but I haven't stress tested this, nor have I slammed NCBI with so many queries that NCBI would feel compelled to block this type of web request. Here's a multi-identifier example:

wget -O Human_picobirnavirus.gff "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=NC_007026.1,NC_007027.1"
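
To get one file per accession instead of a single combined file, the wget call above can be wrapped in a loop. A rough sketch follows; the function names and the one-second pause are assumptions for illustration, not anything NCBI documents for this URL.

```shell
# Build the sviewer GFF3 URL (same query string as above) for one accession.
sviewer_gff3_url() {
    printf 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=%s\n' "$1"
}

# Hypothetical helper: read accessions (one per line) on stdin and
# save each GFF3 as <out_dir>/<accession>.gff.
fetch_gff3_list() {
    local out_dir="${1:-.}"   # output directory, defaults to the current one
    local acc
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue   # skip empty lines
        wget -O "$out_dir/$acc.gff" "$(sviewer_gff3_url "$acc")"
        sleep 1   # arbitrary pause between requests (an assumption, not a documented limit)
    done
}
```

Usage, with accessions.txt standing in for your own list of accessions:

    fetch_gff3_list /path/to/out_dir < accessions.txt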

Reply by microfuge (12 months ago):

Nice solution to this problem!

rohitsatyam102 wrote (9 months ago):

To download GFF files in batch, prepare a list of accession numbers and go to Batch Entrez. From the dropdown menu choose "Assembly", upload the accession number list, and search. To retrieve the GFFs, click "Download Assemblies" and choose the file type GFF. This downloads a separately zipped GFF file for each accession number. The files are named after their projects, so if you want them named as accession_name.gff, here is a simple trick: list all the unzipped files in a list.txt file and use the following code. It takes the first column of line 8 of each file as the new name.

while read -r p; do name=$(head -n 8 "$p" | tail -n 1 | cut -f 1); mv "$p" "${name}.gff"; done < list.txt

Tada! Now you have your GFF3 files in your desired name format.
