Question: Converting assembled genomes (.fna) and predicted genes (.faa) to genbank or gff file?
11 months ago, bioguy50 wrote:

Does anyone have a simple command-line tool that can take 1) assembled bacterial genomes (FASTA contigs) and 2) predicted gene sequences (FASTA ORFs) from said genomes and generate either GFF or GenBank formatted files?

modified 9 months ago by Biostar ♦♦ 20 • written 11 months ago by bioguy50

What you're asking for isn't really a 'conversion'. What you've described is annotation, which is an analysis task all of its own, unless you know definitively all your genes have specific matches to the genome of interest.

One of the best options (IMO) is prokka. If you have a list of proteins you already trust, you can pass that file to prokka and it will begin annotation from there.
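
As a rough illustration of passing a trusted protein file to prokka, here is a minimal sketch that builds the command line from Python. The helper name `prokka_command` and the file names in the usage comment are made up for this example; the `--proteins`, `--outdir`, `--prefix` and `--cpus` options are the ones documented for stock prokka, and a stripped-down fork may not keep all of them. Check the prokka README for the FASTA header format it expects in the trusted protein file.

```python
import subprocess


def prokka_command(contigs_fasta, trusted_faa, outdir, prefix, cpus=1):
    """Build a prokka invocation that consults trusted_faa first.

    Hypothetical helper for illustration only; it does not run anything.
    """
    return [
        "prokka",
        "--proteins", trusted_faa,  # trusted proteins are searched first
        "--outdir", outdir,
        "--prefix", prefix,
        "--cpus", str(cpus),
        contigs_fasta,              # the assembled contigs go last
    ]


# Usage (assumes prokka is on PATH; file names are placeholders):
# cmd = prokka_command("sample1.fna", "sample1.faa", "sample1_out", "sample1", cpus=4)
# subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to print the commands for a few genomes and check them before launching a large batch.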

written 11 months ago by Joe17k

Right so that's kind of the issue – this output came from my own fork of prokka where I stripped out some of the extra stuff that, at the time, I didn't need. On top of that, for space, I had to delete gff files (I know that was a mistake, it's a long story).

Anyway, I now have a little side project that requires gff or gbk inputs, and because I'm a bit too lazy to rerun prokka a few hundred thousand times, I'm looking for a way to turn what I have into what I need.

Sounds like I'll probably just need to redo it, though, if the predicted genes and assembled contigs can't be turned into a gff (which would make sense, as I guess getting gene coordinates would be nigh impossible).

modified 11 months ago • written 11 months ago by bioguy50

If you can provide protein sequences that cover the majority of the CDSs prokka detects, it should run quite quickly. It normally iterates over the CDSs, applying progressively looser matches until every CDS has a match (or is otherwise labelled a "hypothetical protein"). That's what I would expect, at least.

I'm not aware of a simple 'lift over' tool myself. You could roll one with BioPython or something, but as you say, you either need coordinates or need to go through an alignment process, and if you have to realign hundreds of thousands of proteins to hundreds of thousands of genomes, you're probably just as well re-running prokka.
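
For anyone who does want to try rolling their own, here is a minimal sketch of the "recover the coordinates" step in plain Python (no BioPython needed for this part). It translates each contig in six frames and looks for exact matches to each predicted protein, then prints GFF3 CDS lines. Everything here is an assumption for illustration: the function names, the `liftover` source label, and the idea that the proteins are exact translations of the contigs. It skips the first residue when matching because GTG/TTG starts are written as M in protein files, it counts the stop codon as part of the CDS, and it does not handle partial genes at contig edges or ambiguity codes.

```python
# Standard genetic code in TCAG order. Bacterial table 11 encodes the same
# amino acids and differs only in which codons may act as starts.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of accepted start codons


def read_fasta(path):
    """Return {record_id: sequence} using the first word of each header."""
    records, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None and line:
                records[name].append(line)
    return {key: "".join(parts).upper() for key, parts in records.items()}


def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def locate(protein, contig):
    """Yield (start, end, strand), 1-based inclusive, for exact matches."""
    body = protein.rstrip("*")[1:]  # drop the leading M before matching
    if not body:
        return
    n = len(contig)
    for strand, dna in (("+", contig), ("-", revcomp(contig))):
        for frame in range(3):
            aa = translate(dna[frame:])
            pos = aa.find(body)
            while pos != -1:
                lo = frame + 3 * (pos - 1)  # 0-based start of the start codon
                if pos >= 1 and dna[lo:lo + 3] in START_CODONS:
                    hi = min(frame + 3 * (pos + len(body)) + 3, n)  # with stop
                    if strand == "+":
                        yield lo + 1, hi, strand
                    else:
                        yield n - hi + 1, n - lo, strand
                pos = aa.find(body, pos + 1)


def to_gff3(contigs, proteins):
    """Return GFF3 lines with one CDS feature per exact match."""
    lines = ["##gff-version 3"]
    for prot_id, prot in proteins.items():
        for contig_id, contig in contigs.items():
            for start, end, strand in locate(prot, contig):
                lines.append("\t".join([contig_id, "liftover", "CDS",
                                        str(start), str(end), ".", strand,
                                        "0", "ID=" + prot_id]))
    return lines


# Usage (file names are placeholders):
# print("\n".join(to_gff3(read_fasta("sample1.fna"), read_fasta("sample1.faa"))))
```

This re-translates every contig for every protein, so it is only practical as a spot check on a handful of genomes. At the scale of hundreds of thousands of genomes you would at least want to cache the six-frame translations per contig, and any protein without an exact match would still need an alignment step.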

written 11 months ago by Joe17k

Got it, makes sense. Thanks for the advice – I like that first idea, but honestly it'll probably be less of a headache to just rerun it.

written 11 months ago by bioguy50

prokka does this using tbl2asn.

written 10 months ago by Mensur Dlakic6.0k

Word of warning: tbl2asn is a major pain that requires frequent updating, and according to Torsten it will likely be removed in subsequent versions, so I wouldn't advise coming to rely on it.

written 10 months ago by Joe17k