Question

How To Retrieve A Gff With Sequence Identifier Different Than Gb With Bioperl

12.2 years ago

fbrundu ▴ 350

Hi all,

I need to retrieve the GFF for specific accession numbers. My FASTA file contains entries such as:

>gi|345090966|ref|NG_029839.1|:195425-195447 Homo sapiens c-Maf inducing protein (CMIP), RefSeqGene on chromosome 16
TGCAAAAGTAATTGCAGTTTTTG
>gi|343168829|gb|AC245437.1|:21551-21573 Homo sapiens FOSMID clone ABC14-947514C10 from chromosome unknown, complete sequence
CAAAAACTGCAATTACTTTTGCA
>gi|340523118|ref|NG_029471.1|:35678-35700 Homo sapiens hemopoietic cell kinase (HCK), RefSeqGene on chromosome 20
CAAAAACTGCAATTACTTTTGCA

I am able to retrieve the GFF for entries whose sequence identifier carries the gb tag, such as:

>gi|343168829|gb|AC245437.1|:21551-21573 Homo sapiens FOSMID clone ABC14-947514C10 from chromosome unknown, complete sequence
CAAAAACTGCAATTACTTTTGCA

with a BioPerl script, in this way:

./bp_genbank2gff.pl -accession AC245437.1 -stdout > AC245437.1.gff

I am unable to get a GFF for entries with a different sequence identifier, such as ref. What am I doing wrong?

Thanks

bioperl gff gff3 • 4.1k views

updated 12.2 years ago by Daniel Standage 4.1k • written 12.2 years ago by fbrundu ▴ 350

score 2 · Answer 1 · 2013-04-27

The gb tag stands for GenBank, while the ref tag stands for RefSeq. Judging from the script's usage statement, only GenBank accessions appear to be supported for remote download.
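To see which of your entries fall into each group, here is a minimal Python sketch that pulls the database tag and accession out of headers like the ones in your question. The function names (`parse_header`, `split_by_source`) are made up for illustration, and treating only `gb` as GenBank is a simplification I'm assuming here.

```python
import re

# Matches old-style NCBI FASTA headers such as:
# >gi|345090966|ref|NG_029839.1|:195425-195447 Homo sapiens ...
# Groups: gi number, database tag (gb, ref, ...), accession.version
HEADER_RE = re.compile(r"^>gi\|(\d+)\|([a-z]+)\|([^|]+)\|")


def parse_header(line):
    """Return gi, db tag and accession for a header line, or None."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    gi, db, accession = match.groups()
    return {"gi": gi, "db": db, "accession": accession}


def split_by_source(fasta_lines):
    """Split accessions into (GenBank, everything else).

    Assumption: only the 'gb' tag is treated as GenBank here.
    """
    genbank, other = [], []
    for line in fasta_lines:
        record = parse_header(line)
        if record is None:
            continue  # sequence line or unrecognised header
        if record["db"] == "gb":
            genbank.append(record["accession"])
        else:
            other.append(record["accession"])
    return genbank, other
```

The first list could be written one accession per line and passed to the script via `--acc_file`; the second list holds the accessions that would need a locally downloaded file.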

[standage@lappy ~] bp_genbank2gff.pl 

Usage: bp_genbank2gff.pl [options] [<gff file 1> <gff file 2>] ...
Load a Bio::DB::GFF database from GFF files.

 Options:
   --create                 Force creation and initialization of database
   --dsn       <dsn>        Data source (default dbi:mysql:test)
   --user      <user>       Username for mysql authentication
   --pass      <password>   Password for mysql authentication
   --proxy     <proxy>      Proxy server to use for remote access
   --stdout                 direct output to STDOUT
   --adaptor   <adaptor>    adaptor to use (eg dbi::mysql, dbi::pg, dbi::oracle)
   --viral                  the genome you are loading is viral (changes tag
                                 choices)
   --source    <source>     source field for features ['genbank']
    EITHER --file           Arguments that follow are Genbank/EMBL file names
    OR --gb_folder          What follows is a folder full of gb files to process
    OR --accession          Arguments that follow are genbank accession numbers
                                 (not gi!)
    OR --acc_file           Accession numbers (not gi!) in a file (one per line,
                                 no punc.) 
    OR --acc_pipe           Accession numbers (not gi!) from a STDIN pipe (one
                                 per line)   


This script loads a Bio::DB::GFF database with the features contained
in a either a local genbank file or an accession that is fetched from
genbank.  Various command-line options allow you to control which
database to load and whether to allow an existing database to be
overwritten.

[standage@lappy ~]

However, if you have a GenBank-formatted file downloaded to your local machine, you can use this script to extract all of the features in that file and convert them to GFF3 format.
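One way to get such a local file is NCBI's E-utilities `efetch` endpoint. The sketch below is only an assumption about how you might download the record; the helper names (`efetch_url`, `download_genbank`) are hypothetical, and I have not checked how well the RefSeqGene features convert.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint (my assumption for the download step;
# any other way of saving the GenBank flat file would work as well).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build the efetch URL for a nucleotide record in GenBank format."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "gbwithparts",  # GenBank flat file including sequence
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def download_genbank(accession, path):
    """Download the record and save it to 'path' (needs network access)."""
    with urlopen(efetch_url(accession)) as response, open(path, "wb") as out:
        out.write(response.read())


# Example (not run here):
# download_genbank("NG_029839.1", "NG_029839.1.gb")
```

With the file saved, something like `bp_genbank2gff.pl --stdout --file NG_029839.1.gb > NG_029839.1.gff` should then work, going by the `--file` option in the usage statement above.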

Unfortunately, your question does not make it very clear what information you need. I ran the example you gave, and the GFF3 that was generated contained only two (redundant and uninformative) features and the fosmid sequence. It would be easier for us to help if you edited your question to state exactly what you're looking for.