Fast parsing genbank files
1
0
Entering edit mode
8 days ago
rubic ▴ 210

Hi,

Does anyone know of a fast parser for genbank files that contains hundres of entries (e.g., all vertebrate_mammlian proteins from refseq)? Ive tried R's genbankr's readGenBank function and biofile's gbRecord function and both are very slow and insufficient for genbank files of a size of 100M.

My purpose is simply to parse for each protein it's transcript accession, gene accession, taxonomy ID, and all its conserved domain IDs (CDDs).

genbankr does have a faster parsing function: parseGenBank but it simply contains all features in an array from which it does not seem possible to map them back to their respective proteins.

Genbank parsing R • 273 views
0
Entering edit mode

There is probably a cool EntrezDirect answer for this but for now you should look around on the RefSeq Functional Elements page to see if you may be able to download an interesting file that can get you partway to what you need.

2
Entering edit mode
8 days ago

The biopython GenBank parser is as good as I have seen in high-level programming languages. It can parse the human chromosome 1 in GenBank format in about 15 seconds.

The file clocks in at about 350 MB and has almost 5 million lines.

\$ wc -l NC_000001.gb
4937618 NC_000001.gb


and

-rw-r--r-- 1 ialbert ialbert 347M Apr  7 19:52 NC_000001.gb


What I would recommend is to parse the GenBank into a JSON file as I do below with the bio package, once you do that the resulting JSON file can be loaded on demand into a full data structure in less than 5 seconds!

time bio view NC_000001.gb --json > out.json

real    0m18.480s


for details on how that conversion takes place see this:

https://github.com/ialbert/bio/blob/master/biorun/models/jsonrec.py#L503

0
Entering edit mode

Thanks a lot @Istvan Albert.

I installed the bio package (following this documentation) but then running

bio view <my_file.gb> --json > <my_out>.json

I get:

bio: making bioinformatics fun again

Valid commands:

bio data    : list or rename data
bio align   : performs sequence alignments
bio taxon   : displays NCBI taxonomies
bio define  : explains biological terms
bio convert : converts data to different formats
bio runinfo : prints sequencing run information


So it seems that the view option was not included in the installation.

Any idea?

0
Entering edit mode

I guess it should be convert instead of view according to the documentation:

bio convert <my_file.gb> --json > <my_out>.json

But when I run this command on one of the protein genbank files (vertebrate_mammalian.10.protein.gpff.gz - gunzipped of course), I get this error:

*** file format not recognized: vertebrate_mammalian.10.protein.gpff

And the same happens for vertebrate_mammalian.10.genomic.gpff

1
Entering edit mode

at this time the tool recognizes the format by file extension, the extension should be gb.gz, the conversion takes about 30 seconds and you can run it on the gziped file directly:

time bio convert vertebrate_mammalian.10.protein.gb.gz --json > out.json

real    0m32.184s


PS I guess I forgot to push the latest update (view/convert are the same commands). Maybe I should keep calling it convert. I keep going back and forth.