Now that NCBI has given up assigning subspecies-level taxonomy IDs (taxids), I am revisiting our method of building genomes from the *.gbk files at ftp.ncbi.nih.gov/genomes/Bacteria. In the past I hoped that taxids would eventually allow exact matching of a sequence to its source organism, but that won't happen now.
So I'm looking for a new method. The files are organized into subdirectories by bioproject_id, which appears as a suffix on the name of one organism within the project. In this directory of completed genomes, most projects refer to one organism, but some have multiple strains of the same species, and some have multiple organisms found in the same sample.
I thought I could match on the organism and/or strain fields of the source feature (the first feature in each record), despite the pitfalls of matching text fields. But I see examples in which these fields differ slightly, even though they refer to the same organism:
Haloquadratum_walsbyi_C23_uid162019/NC_017457.gbk: /strain="DSM 16854" -- (a plasmid)
Haloquadratum_walsbyi_C23_uid162019/NC_017459.gbk: /strain="DSM 16854 = C23" -- (a chromosome)
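One way to tolerate this particular kind of mismatch is to treat a /strain value as a set of aliases separated by "=". This is only a sketch of a heuristic, not anything NCBI documents: the function names are made up for illustration, and it assumes that "=" in a strain value always separates synonyms, which may not hold for every record.

```python
import re


def strain_aliases(strain):
    """Split a /strain value such as 'DSM 16854 = C23' into a set of aliases.

    Assumption: '=' separates synonyms for the same strain.
    Whitespace is collapsed and case is ignored.
    """
    parts = re.split(r"\s*=\s*", strain.strip())
    return {" ".join(p.split()).lower() for p in parts if p.strip()}


def same_strain(a, b):
    """Treat two /strain values as the same strain if they share any alias."""
    return bool(strain_aliases(a) & strain_aliases(b))


print(same_strain("DSM 16854", "DSM 16854 = C23"))  # True
```

This would pair the plasmid and chromosome above, but it will not catch typos or differences in how a culture collection number is written.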
The upshot seems to be that there is no unique key to identify source organisms. Am I wrong?
I'll probably write some code to identify these situations and manually sort them out, but ugh.
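A rough sketch of that kind of check, for anyone who wants a starting point. It assumes a local mirror laid out as root/project_dir/*.gbk, pulls the first /organism and /strain qualifiers out of each file with a regular expression rather than a real GenBank parser, and reads every file in full, so it is slow on a complete mirror. The names source_fields and conflicting_dirs are made up for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches /organism="..." and /strain="..." qualifiers, including values
# that wrap across lines in the flat file.
QUALIFIER = re.compile(r'/(organism|strain)="([^"]*)"')


def source_fields(text):
    """Return the first /organism and /strain values found in GenBank text.

    Assumption: the first occurrence of each qualifier belongs to the
    source feature, since that is the first feature in the record.
    A missing qualifier is returned as None.
    """
    fields = {"organism": None, "strain": None}
    for key, value in QUALIFIER.findall(text):
        if fields[key] is None:
            fields[key] = " ".join(value.split())
    return fields["organism"], fields["strain"]


def conflicting_dirs(root):
    """Map each project directory name to its distinct (organism, strain)
    pairs, keeping only directories with more than one distinct pair."""
    seen = defaultdict(set)
    for path in sorted(Path(root).glob("*/*.gbk")):
        text = path.read_text(errors="replace")
        seen[path.parent.name].add(source_fields(text))
    return {name: pairs for name, pairs in seen.items() if len(pairs) > 1}


# Example use, with "Bacteria" standing in for the local mirror path:
#     for name, pairs in conflicting_dirs("Bacteria").items():
#         print(name, pairs)
```

Directories that come back from this are the ones needing manual review; it does not distinguish a harmless alias difference from a genuinely different organism.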
Any other suggestions for grouping the *.gbk files in a single directory by source organism? Ideally a solution would work for the more chaotic Bacteria_DRAFT ftp directory as well.