Hi,
Does anyone know of a fast parser for GenBank files that contain hundreds of entries (e.g., all vertebrate_mammalian proteins from RefSeq)?
I've tried the readGenBank function from R's genbankr and the gbRecord function from biofiles, and both are too slow to be practical for GenBank files of around 100M.
My goal is simply to extract, for each protein, its transcript accession, gene accession, taxonomy ID, and all of its conserved domain IDs (CDDs).
genbankr does have a faster parsing function, parseGenBank, but it simply returns all features in one array, and it does not seem possible to map them back to their respective proteins.
There is probably a cool Entrez Direct answer for this, but for now you could look around the RefSeq Functional Elements page to see whether there is a file you can download that gets you partway to what you need.