How To Retrieve A Gff With Sequence Identifier Different Than Gb With Bioperl
11.2 years ago
fbrundu ▴ 350

Hi all,

I need to retrieve the GFF for a specific accession number. In a FASTA file I have headers such as:

>gi|345090966|ref|NG_029839.1|:195425-195447 Homo sapiens c-Maf inducing protein (CMIP), RefSeqGene on chromosome 16
>gi|343168829|gb|AC245437.1|:21551-21573 Homo sapiens FOSMID clone ABC14-947514C10 from chromosome unknown, complete sequence
>gi|340523118|ref|NG_029471.1|:35678-35700 Homo sapiens hemopoietic cell kinase (HCK), RefSeqGene on chromosome 20

I am able to retrieve the GFF related to lines with gb as sequence identifier, such as:

>gi|343168829|gb|AC245437.1|:21551-21573 Homo sapiens FOSMID clone ABC14-947514C10 from chromosome unknown, complete sequence

with a BioPerl script, in this way:

./ -accession AC245437.1 -stdout > AC245437.1.gff

I am unable to get the GFF for headers with a different sequence identifier (for example, ref). What am I doing wrong?


11.2 years ago

The gb symbol stands for GenBank, while the ref symbol stands for RefSeq. From the script's usage statement, it appears only GenBank accessions are supported for remote download.

[standage@lappy ~] 

Usage: [options] [<gff file 1> <gff file 2>] ...
Load a Bio::DB::GFF database from GFF files.

   --create                 Force creation and initialization of database
   --dsn       <dsn>        Data source (default dbi:mysql:test)
   --user      <user>       Username for mysql authentication
   --pass      <password>   Password for mysql authentication
   --proxy     <proxy>      Proxy server to use for remote access
   --stdout                 direct output to STDOUT
   --adaptor   <adaptor>    adaptor to use (eg dbi::mysql, dbi::pg, dbi::oracle)
   --viral                  the genome you are loading is viral (changes tag
   --source    <source>     source field for features ['genbank']
    EITHER --file           Arguments that follow are Genbank/EMBL file names
    OR --gb_folder          What follows is a folder full of gb files to process
    OR --accession          Arguments that follow are genbank accession numbers
                                 (not gi!)
    OR --acc_file           Accession numbers (not gi!) in a file (one per line,
                                 no punc.) 
    OR --acc_pipe           Accession numbers (not gi!) from a STDIN pipe (one
                                 per line)   

This script loads a Bio::DB::GFF database with the features contained
in a either a local genbank file or an accession that is fetched from
genbank.  Various command-line options allow you to control which
database to load and whether to allow an existing database to be

[standage@lappy ~]
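
Since the usage statement says `--acc_file` expects accession numbers (not gi numbers) one per line, one way to prepare such a file is to pull the accessions out of the FASTA headers and keep only those tagged `gb`. The following is a minimal sketch, assuming the headers follow the `>gi|<gi>|<tag>|<accession>|` layout shown in the question; the function name `split_accessions` and the regular expression are illustrative and not part of BioPerl.

```python
import re

# Assumed header layout, taken from the question:
# >gi|343168829|gb|AC245437.1|:21551-21573 description...
HEADER_RE = re.compile(r"^>gi\|(\d+)\|([a-z]+)\|([^|]+)\|")


def split_accessions(header_lines):
    """Group accessions by database tag (e.g. 'gb' or 'ref')."""
    by_tag = {}
    for line in header_lines:
        match = HEADER_RE.match(line)
        if match is None:
            continue  # skip lines that are not headers in the assumed layout
        _gi, tag, accession = match.groups()
        by_tag.setdefault(tag, []).append(accession)
    return by_tag


headers = [
    ">gi|345090966|ref|NG_029839.1|:195425-195447 Homo sapiens c-Maf inducing protein (CMIP), RefSeqGene on chromosome 16",
    ">gi|343168829|gb|AC245437.1|:21551-21573 Homo sapiens FOSMID clone ABC14-947514C10 from chromosome unknown, complete sequence",
    ">gi|340523118|ref|NG_029471.1|:35678-35700 Homo sapiens hemopoietic cell kinase (HCK), RefSeqGene on chromosome 20",
]

groups = split_accessions(headers)

# One GenBank accession per line, as the --acc_file option describes.
acc_file_text = "\n".join(groups.get("gb", [])) + "\n"
print(acc_file_text, end="")
```

The `ref` group collected in `groups` lists the RefSeq accessions that, going by the usage statement, would need to be fetched some other way.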

However, if you have a GenBank formatted file downloaded to your local machine, you can use this script to extract all of the features in that file and convert to GFF3 format.
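
For orientation, a GFF3 record is a single tab-delimited line with nine columns: seqid, source, type, start, end, score, strand, phase, and attributes. The sketch below only illustrates that layout and is not what the BioPerl script does internally. The helper name `gff3_line`, the feature type `region`, and the `ID` value are hypothetical, the coordinates are borrowed from the FASTA header in the question, and escaping of reserved characters in attributes is not handled.

```python
def gff3_line(seqid, source, ftype, start, end,
              score=".", strand=".", phase=".", attributes=None):
    """Format one feature as a nine-column, tab-delimited GFF3 line."""
    # Attributes are written as key=value pairs separated by semicolons;
    # a lone "." marks an empty column.
    attrs = ";".join(f"{key}={value}" for key, value in (attributes or {}).items())
    columns = [seqid, source, ftype, str(start), str(end),
               score, strand, phase, attrs or "."]
    return "\t".join(columns)


# Hypothetical feature: type and ID are made up for illustration.
example = gff3_line("AC245437.1", "genbank", "region", 21551, 21573,
                    attributes={"ID": "example_region"})
print(example)
```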

Unfortunately, your question does not make it very clear what information you need. I ran the example you gave, and the GFF3 that was generated contained only two (redundant and uninformative) features and the fosmid sequence. Perhaps it would make it easier for us to help you if you could edit your question to make it clearer what exactly you're looking for.


Thanks for your effort. I am a little confused. Is it possible to extract more features that aren't displayed in this case? I have to extract all of them. Also, I need to retrieve information from RefSeq too; do you know how to do it? Thanks again.


I have done some research and found this, which explains how to do the conversion. Anyway, with this script, as you pointed out, it is not possible to download by RefSeq accession number.

