Question

How to unequivocally assign a specific organism to *.gbk files from NCBI Bacteria genomes

10.1 years ago

Richard Llewellyn ▴ 180

Now that NCBI has given up assigning subspecies-level taxonomy IDs (taxids), I am revisiting our method of building genomes from the *.gbk files at ftp.ncbi.nih.gov/genomes/Bacteria. In the past I hoped that taxids would eventually allow exact matching of a sequence to its source organism, but this won't happen now.

So I'm looking for a new method. The files in the directory are organized by bioproject_id (as a suffix to one organism within the project). In this directory of completed genomes, most projects refer to one organism, but some have multiple strains of the same species, and some have multiple organisms found in the same sample.

I thought I could match either the organism and/or the strain fields of the source feature (the first Feature), despite the pitfalls of matching text fields. But I see examples in which these fields differ slightly, even though they refer to the same organism:

e.g.:

Haloquadratum_walsbyi_C23_uid162019/NC_017457.gbk: /strain="DSM 16854" -- (a plasmid)

Haloquadratum_walsbyi_C23_uid162019/NC_017459.gbk: /strain="DSM 16854 = C23" -- (a chromosome)

The upshot seems to be that there is no unique key to identify source organisms. Am I wrong?

I'll probably write some code to identify these situations and manually sort them out, but ugh.

Any other suggestions to group the *.gbk files in a single directory by source organism? Ideally a solution would work for the more chaotic Bacteria_DRAFT ftp directory as well.

parsing ncbi • 4.3k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 10.1 years ago by Richard Llewellyn ▴ 180

score 1 · Answer 1 · 2014-04-03

10.1 years ago

pld 5.1k

Each .gbk should have a source definition that will specify if it is a plasmid or not.

Using your examples, the plasmid (NC_017457.gbk):

source          1..100258
                 /organism="Haloquadratum walsbyi C23"
                 /mol_type="genomic DNA"
                 /strain="DSM 16854"
                 /db_xref="taxon:768065"
                 /plasmid="PL100"

Genome (NC_017459.gbk):

source          1..3148033
                 /organism="Haloquadratum walsbyi C23"
                 /mol_type="genomic DNA"
                 /strain="DSM 16854 = C23"
                 /db_xref="taxon:768065"

So you can filter on whether a given file has a /plasmid tag. The DEFINITION entry for a plasmid should also contain the word "plasmid", giving you another field to filter on. If both of these fail, you could pick the longest sequence, on the assumption that a bacterial chromosome will always be longer than a plasmid.
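
The three checks above could be sketched roughly as follows. This is only an illustration using the Python standard library, not a full GenBank parser: the function names are made up for this example, the sample text is just the two source excerpts quoted above, and it assumes the source feature is the first feature in the file.

```python
import re

# The two source excerpts quoted in this answer.
PLASMID = """\
source          1..100258
                 /organism="Haloquadratum walsbyi C23"
                 /mol_type="genomic DNA"
                 /strain="DSM 16854"
                 /db_xref="taxon:768065"
                 /plasmid="PL100"
"""
CHROMOSOME = """\
source          1..3148033
                 /organism="Haloquadratum walsbyi C23"
                 /mol_type="genomic DNA"
                 /strain="DSM 16854 = C23"
                 /db_xref="taxon:768065"
"""


def source_qualifiers(gbk_text):
    """Collect the qualifiers of the first source feature into a dict."""
    qualifiers, last_key, in_source = {}, None, False
    for raw_line in gbk_text.splitlines():
        line = raw_line.strip()
        if not in_source:
            # Skip everything up to the (lowercase) source feature line.
            in_source = re.match(r"source\s+\S", line) is not None
            continue
        if line.startswith("/"):
            last_key, _, value = line[1:].partition("=")
            qualifiers[last_key] = value
        elif last_key and qualifiers[last_key].count('"') == 1:
            qualifiers[last_key] += " " + line  # wrapped qualifier value
        else:
            break  # reached the next feature
    return {key: value.strip('"') for key, value in qualifiers.items()}


def is_plasmid(gbk_text):
    """Check the /plasmid qualifier first, then the DEFINITION text."""
    if "plasmid" in source_qualifiers(gbk_text):
        return True
    definition = re.search(r"^DEFINITION\s+(.*?)(?=^\S)", gbk_text, re.M | re.S)
    return definition is not None and "plasmid" in definition.group(1).lower()


def sequence_length(gbk_text):
    """Read the end coordinate of the source feature (0 if not found)."""
    match = re.search(r"^\s*source\s+\S*?\.\.>?(\d+)", gbk_text, re.M)
    return int(match.group(1)) if match else 0


def longest_record(records):
    """Last-resort heuristic: name of the longest record in {name: text}."""
    return max(records, key=lambda name: sequence_length(records[name]))
```

With the two excerpts, `is_plasmid(PLASMID)` is true, `is_plasmid(CHROMOSOME)` is false, and `longest_record` picks NC_017459.gbk when given both. The length fallback is only as good as the assumption that the chromosome is the longest molecule in the folder.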

ADD COMMENT • link 10.1 years ago by pld 5.1k

Thanks for that thought -- I agree the /plasmid tag can be useful. By an organism's genome, I mean all the DNA carried by an organism (plasmids, chromosomes, and prophages), so I want to match all of these gbk files to the organism (host, if you prefer) from which they were sequenced. Separately, I agree that it is sometimes unclear from the source tags which file is a chromosome (or 'complete genome').

ADD REPLY • link 10.1 years ago by Richard Llewellyn ▴ 180

So what is the problem then? Each folder contains only the information for a single strain.

The DSM 16854 = C23 entry under /strain simply says that these are equivalent names for the same strain. C23 is the name given to the strain; DSM 16854 is the identifier given to strain C23 by the DSMZ Bacteria Collection. They both point to the same thing.

https://www.dsmz.de/catalogues/details/culture/DSM-16854.html
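
If the "=" in /strain really does separate equivalent names, one way to compare two values in code is to treat each as a set of aliases and call them the same strain when the sets overlap. A minimal sketch, assuming "=" is the only separator (other records may well use something else) and that case and spacing differences can be ignored; the function names are made up for this example:

```python
def strain_aliases(strain):
    """Split a /strain value on '=' and normalize each alias.

    Normalization here is an assumption: whitespace is removed and the
    text is upper-cased, so 'DSM 16854' and 'DSM16854' compare equal.
    """
    return {
        "".join(part.split()).upper()
        for part in strain.split("=")
        if part.strip()
    }


def same_strain(strain_a, strain_b):
    """Two /strain values match if they share at least one alias."""
    return bool(strain_aliases(strain_a) & strain_aliases(strain_b))
```

Here `same_strain("DSM 16854", "DSM 16854 = C23")` is true, but `same_strain("DSM 16854", "C23")` is false: two records that each carry only one of the names can only be linked through a third record that lists both.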

ADD REPLY • link 10.1 years ago by pld 5.1k

Nope, that's what I had hoped (and then it would be simple), but folders sometimes contain multiple strains, or even multiple unrelated organisms, such as:

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Vibrio_parahaemolyticus_O1_K33_CDC_K4557_uid212977

And yes, in the example it is easy enough to read the differing strain info for the chromosome and plasmid and understand that they are the same, but getting code to do that is not so easy...

ADD REPLY • link 10.1 years ago by Richard Llewellyn ▴ 180

You should be able to use taxid to handle the case of multiple species in a single directory. If the taxid for a file fails to match that of the directory species, toss it. I guess you could limit to one genome per folder. Parsing by organism name is probably more work than it's worth.
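
As a rough sketch of that taxid check: the directory name itself carries no taxid, so this example assumes the "directory taxid" is simply the most common taxid among the files in the folder. That is a guess, and the function names are made up here. It returns the mismatching files rather than deleting them, so they can still be reviewed by hand.

```python
import re
from collections import Counter

# The taxon cross-reference of the source feature, e.g. /db_xref="taxon:768065"
TAXON = re.compile(r'/db_xref="taxon:(\d+)"')


def taxid_of(gbk_text):
    """Return the first taxon id found in the file text, or None."""
    match = TAXON.search(gbk_text)
    return match.group(1) if match else None


def split_by_majority_taxid(records):
    """Split {filename: text} into files matching the majority taxid and the rest.

    Returns (majority_taxid, kept_files, flagged_files).
    """
    if not records:
        return None, [], []
    taxids = {name: taxid_of(text) for name, text in records.items()}
    majority, _count = Counter(taxids.values()).most_common(1)[0]
    kept = sorted(name for name, taxid in taxids.items() if taxid == majority)
    flagged = sorted(name for name, taxid in taxids.items() if taxid != majority)
    return majority, kept, flagged
```

Since taxids no longer go below species level, this only separates species within a folder. It does nothing for multiple strains of the same species.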

For handling strain naming problems there's no obvious option other than downloading from all the databases and making a lookup table. Even this might not give full confidence, since it is clear that the strain annotations are inconsistently implemented.

I'm not sure that there is a single point to filter on that will provide you with what you need. I've never had a good experience filtering .gbk files, especially when pulling them from the bacteria ftp (the last time I did was a few years ago). I've never been fully confident in whatever filtering kludge I worked up, and I usually end up reading through the data set just to be sure. Unless you're writing software you plan to distribute, or you need frequent updates, I'd wager that you will save time by just manually checking the files.

ADD REPLY • link 10.1 years ago by pld 5.1k

Yeah, similar history here. One kludge after another. I find it disheartening that NCBI has given up on subspecies taxids, as there is no controlled vocabulary for these. I could toss files, but my goal is to parse all available prok genomes, and that Vibrio example may be a harbinger of what is to come -- a mix of species and strains all within one bioproject directory. I'm still hoping someone will have a solution we haven't thought of ...

ADD REPLY • link 10.1 years ago by Richard Llewellyn ▴ 180

http://jgi.doe.gov/ has done a much better job with sequence annotation and curation, so you might want to check there instead. However, they do seem to be having some database issues currently.

ADD REPLY • link 10.1 years ago by pld 5.1k

Thanks. I've used them in the past, but need to be more current (want to have at least 95% of completed genomes already deposited in NCBI). I'll check there again though.

ADD REPLY • link 10.1 years ago by Richard Llewellyn ▴ 180

score 1 · Answer 2 · 2014-04-03

10.1 years ago

Neilfws 49k

I had a different, but related issue recently, where I wanted to identify which organisms in the FTP "Bacteria" directory are in fact Archaea, then download GBK files. My blog post has more details which might help with your problem.

I solved that issue by downloading a file containing the required information from this location:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt

That file contains full organism names, taxids, bioproject ids and sequence ids. Here's the full list of column headers when read into R:

 [1] "Organism/Name"       "TaxID"                "BioProject Accession"
 [4] "BioProject ID"        "Group"                "SubGroup"            
 [7] "Size (Mb)"            "GC%"                  "Chromosomes/RefSeq"  
[10] "Chromosomes/INSDC"    "Plasmids/RefSeq"      "Plasmids/INSDC"      
[13] "WGS"                  "Scaffolds"            "Genes"               
[16] "Proteins"             "Release Date"         "Modify Date"         
[19] "Status"               "Center"               "BioSample Accession" 
[22] "Assembly Accession"   "Reference"            "FTP Path"
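
For anyone who prefers Python to R, a lookup from sequence accession to organism could be built along these lines. Several things here are assumptions about the file layout rather than documented facts: that it is tab-delimited, that the first header may carry a leading "#", that missing values are written as "-", and that the RefSeq columns hold comma-separated accessions with a version suffix. The sample row is made up from identifiers mentioned in this thread, not copied from the real file.

```python
import csv
import io

# Illustrative sample only; the real file has all the columns listed above.
SAMPLE = (
    "#Organism/Name\tTaxID\tBioProject ID\tChromosomes/RefSeq\tPlasmids/RefSeq\n"
    "Haloquadratum walsbyi C23\t768065\t162019\tNC_017459.1\tNC_017457.1\n"
    "Hypothetical organism\t0\t0\t-\t-\n"
)


def accession_lookup(handle, columns=("Chromosomes/RefSeq", "Plasmids/RefSeq")):
    """Map each accession (version stripped) to (organism, taxid, bioproject id)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    lookup = {}
    for row in reader:
        organism = row.get("Organism/Name") or row.get("#Organism/Name")
        for column in columns:
            for accession in (row.get(column) or "").split(","):
                accession = accession.strip().split(".")[0]  # drop ".1" etc.
                if accession and accession != "-":
                    lookup[accession] = (
                        organism,
                        row.get("TaxID"),
                        row.get("BioProject ID"),
                    )
    return lookup


lookup = accession_lookup(io.StringIO(SAMPLE))
```

On the real file you would pass an open file handle instead of `io.StringIO(SAMPLE)`, then look up each .gbk file by the accession in its name (e.g. `lookup.get("NC_017457")`).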

ADD COMMENT • link 10.1 years ago by Neilfws 49k

That file looks promising (+1 for the reminder and showing the headers). I'll parse it and see if it is complete and accurate enough to list all the molecules in the *.gbk files unequivocally with their source organisms.

ADD REPLY • link 10.1 years ago by Richard Llewellyn ▴ 180

I haven't given up on using the GENOME_REPORTS prokaryotes.txt file, but there is an added wrinkle -- the bioproject_ids are 'GenBank' ids, while the genomes in the ftp "Bacteria" are 'RefSeq' bioproject_ids, so I'll need to cross-ref using ftp://ftp.ncbi.nlm.nih.gov/bioproject/refseq-genbank.csv.

UPDATE: the prokaryotes.txt file is of little help for this endeavor: it seems as error prone as the Bacteria directory ('complete' genomes of bacteria composed only of plasmids or phages, missing genomes of composite directories, mix of refseq and genbank bioproject ids).

I've resorted to trying to match strain names, falling back on attempting to join gbk files to a chromosome (a long molecule) within a Bacteria directory when that fails. Not a real solution, so I'm leaving this question open.
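
Roughly, the heuristic looks like the sketch below. It is simplified and makes several assumptions: "=" in /strain separates equivalent names, each file has already been reduced to a (filename, strain, length) tuple, and a file with no usable strain belongs with the longest molecule in the directory. That last guess can be wrong in a mixed directory. The names and the made-up extra files in the usage note are for illustration only.

```python
def strain_aliases(strain):
    """Split a /strain value on '=' into normalized aliases (assumed convention)."""
    return {
        "".join(part.split()).upper()
        for part in strain.split("=")
        if part.strip()
    }


def group_directory(records):
    """Group (filename, strain_or_None, length) tuples from one directory.

    Files are processed longest first, so chromosomes tend to seed groups.
    Files whose strain aliases overlap are put in the same group; files with
    no strain at all fall back to the group seeded by the longest molecule.
    """
    groups = []  # each group: {"aliases": set of str, "files": list of str}
    for name, strain, _length in sorted(records, key=lambda r: r[2], reverse=True):
        aliases = strain_aliases(strain or "")
        matches = [group for group in groups if group["aliases"] & aliases]
        if not matches:
            if aliases or not groups:
                groups.append({"aliases": set(aliases), "files": [name]})
            else:
                groups[0]["files"].append(name)  # fallback: longest molecule
            continue
        # Join the first matching group and fold in any others this file bridges.
        target = matches[0]
        target["aliases"] |= aliases
        target["files"].append(name)
        for other in matches[1:]:
            target["aliases"] |= other["aliases"]
            target["files"] += other["files"]
            groups.remove(other)
    return [sorted(group["files"]) for group in groups]
```

For example, given NC_017459.gbk (strain "DSM 16854 = C23", 3148033 bp), NC_017457.gbk (strain "DSM 16854", 100258 bp), a made-up second chromosome with an unrelated strain name, and a made-up short file with no strain, this yields two groups: the two Haloquadratum files plus the unlabeled one, and the unrelated chromosome on its own.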

ADD REPLY • link 10.1 years ago by Richard Llewellyn ▴ 180

Ram · Answer 3 · 2014-04-08

10.1 years ago

umer.zeeshan.ijaz ★ 1.8k

Speaking of extracting data from a folder containing GBK files, here are some one-liners to get specific information:

Best Wishes,
Umer

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 10.1 years ago by umer.zeeshan.ijaz ★ 1.8k