Fast parsing genbank files
Entering edit mode
8 days ago
rubic ▴ 210


Does anyone know of a fast parser for genbank files that contains hundres of entries (e.g., all vertebrate_mammlian proteins from refseq)? Ive tried R's genbankr's readGenBank function and biofile's gbRecord function and both are very slow and insufficient for genbank files of a size of 100M.

My purpose is simply to parse for each protein it's transcript accession, gene accession, taxonomy ID, and all its conserved domain IDs (CDDs).

genbankr does have a faster parsing function: parseGenBank but it simply contains all features in an array from which it does not seem possible to map them back to their respective proteins.

Genbank parsing R • 273 views
Entering edit mode

There is probably a cool EntrezDirect answer for this but for now you should look around on the RefSeq Functional Elements page to see if you may be able to download an interesting file that can get you partway to what you need.

Entering edit mode
8 days ago

The biopython GenBank parser is as good as I have seen in high-level programming languages. It can parse the human chromosome 1 in GenBank format in about 15 seconds.

The file clocks in at about 350 MB and has almost 5 million lines.

$ wc -l


-rw-r--r-- 1 ialbert ialbert 347M Apr  7 19:52

What I would recommend is to parse the GenBank into a JSON file as I do below with the bio package, once you do that the resulting JSON file can be loaded on demand into a full data structure in less than 5 seconds!

time bio view --json > out.json

real    0m18.480s

for details on how that conversion takes place see this:

Entering edit mode

Thanks a lot @Istvan Albert.

I installed the bio package (following this documentation) but then running

bio view <> --json > <my_out>.json

I get:

bio: making bioinformatics fun again

Valid commands:

   bio data    : list or rename data
   bio fetch   : downloads data from repositories
   bio align   : performs sequence alignments
   bio taxon   : displays NCBI taxonomies
   bio define  : explains biological terms
   bio convert : converts data to different formats
   bio runinfo : prints sequencing run information

So it seems that the view option was not included in the installation.

Any idea?

Entering edit mode

I guess it should be convert instead of view according to the documentation:

bio convert <> --json > <my_out>.json

But when I run this command on one of the protein genbank files (vertebrate_mammalian.10.protein.gpff.gz - gunzipped of course), I get this error:

*** file format not recognized: vertebrate_mammalian.10.protein.gpff

And the same happens for vertebrate_mammalian.10.genomic.gpff

Entering edit mode

at this time the tool recognizes the format by file extension, the extension should be gb.gz, the conversion takes about 30 seconds and you can run it on the gziped file directly:

time bio convert --json > out.json

real    0m32.184s

PS I guess I forgot to push the latest update (view/convert are the same commands). Maybe I should keep calling it convert. I keep going back and forth.


Login before adding your answer.

Traffic: 1795 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6