Does anyone know of a fast parser for GenBank files that contain hundreds of entries (e.g., all vertebrate_mammalian proteins from RefSeq)?
I have tried the readGenBank function and the gbRecord function, and both are very slow and not adequate for GenBank files of around 100M in size.
My goal is simply to extract, for each protein, its transcript accession, gene accession, taxonomy ID, and all of its conserved domain IDs (CDDs).
genbankr does have a faster parsing function, parseGenBank, but it simply puts all features in a single array, and it does not seem possible to map them back to their respective proteins.