Question: Converting assembled genomes (.fna) and predicted genes (.faa) to genbank or gff file?
bioguy40 wrote:

Does anyone know of a simple command-line tool that can take 1) assembled bacterial genomes (FASTA contigs) and 2) predicted gene sequences (FASTA ORFs) from those genomes, and generate either GFF or GenBank formatted files?

written 29 days ago by bioguy40

What you're asking for isn't really a 'conversion'. What you've described is annotation, which is an analysis task all of its own, unless you know definitively all your genes have specific matches to the genome of interest.

One of the best options (IMO) is prokka. If you have a list of proteins you already trust, you can pass that file to prokka and it will begin annotation from there.
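As a rough sketch of what that call could look like when scripted over many genomes: the snippet below only builds the prokka argument list. `--proteins`, `--outdir` and `--prefix` are standard prokka options; the file names and the `prokka_command` helper are made up for illustration, and prokka is assumed to be on your PATH.

```python
import shlex


def prokka_command(contigs_fna, trusted_faa, outdir, prefix):
    """Build a prokka call that uses trusted proteins as first-priority annotations.

    All four arguments are placeholder paths/names, not values from this thread.
    """
    return [
        "prokka",
        "--outdir", outdir,        # directory prokka writes its results to
        "--prefix", prefix,        # basename for the output files
        "--proteins", trusted_faa, # FASTA of proteins you already trust
        contigs_fna,               # assembled contigs to annotate
    ]


cmd = prokka_command("genome.fna", "trusted.faa", "genome_out", "genome")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list makes it easy to loop over a directory of assemblies and hand each one to `subprocess.run`.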

written 29 days ago by Joe14k

Right so that's kind of the issue – this output came from my own fork of prokka where I stripped out some of the extra stuff that, at the time, I didn't need. On top of that, for space, I had to delete gff files (I know that was a mistake, it's a long story).

Anyway, I now have a little side project that requires gff or gbk inputs, and because I'm a bit too lazy to rerun prokka a few hundred thousand times, I'm looking for a way to turn what I have into what I need.

Sounds like I'll probably just need to redo it, though, if the predicted genes and assembled contigs can't be turned into a gff (which would make sense, as I guess getting gene coordinates would be nigh impossible).

modified 29 days ago • written 29 days ago by bioguy40

If you can provide protein sequences that cover the majority of the CDSs prokka detects, it should run quite quickly. It normally iterates over the CDSs, applying progressively looser matches until every CDS has a match (or is otherwise labelled a "hypothetical protein"). That's what I would expect, at least.

I'm not aware of a simple 'lift over' tool myself. You could roll one with BioPython or something, but as you say, you either need coordinates or need to go through an alignment process, and if you have to realign hundreds of thousands of proteins to hundreds of thousands of genomes, you're probably just as well re-running prokka.
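For what it's worth, here is a minimal pure-Python sketch of the "roll your own" idea, under some loud assumptions: the proteins were predicted from exactly these contigs, so an exact six-frame translation match is enough and no real alignment is needed; the first residue is ignored when matching because alternative start codons (GTG, TTG) are written as M in the protein; only the first hit is reported, so duplicated genes will be ambiguous. The function names and the `exact_match` source label are invented for the example, and it does brute-force string search, so it would be slow at the scale of hundreds of thousands of genomes.

```python
from itertools import product

# Standard/bacterial codon table in NCBI order (bases T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_fasta(text):
    """Parse FASTA text into {id: sequence}; id is the first word of the header."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif line and name is not None:
            records[name].append(line.upper())
    return {k: "".join(v) for k, v in records.items()}


def translate(dna):
    """Translate in frame 0; codons with ambiguous bases become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def locate(protein, contig):
    """Return (start, end, strand), 1-based inclusive on the forward strand, or None."""
    body = protein.rstrip("*")[1:]  # skip residue 1: alt starts are written as M
    n = len(contig)
    for strand, seq in (("+", contig), ("-", contig.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            hit = translate(seq[frame:]).find(body)
            if hit < 1:  # no match, or no room for a start codon before it
                continue
            s = frame + 3 * (hit - 1)       # 0-based start of the start codon
            e = s + 3 * (len(body) + 1)     # 0-based exclusive end of last residue
            if CODON_TABLE.get(seq[e:e + 3]) == "*":
                e += 3                      # include the stop codon in the CDS
            return (s + 1, e, "+") if strand == "+" else (n - e + 1, n - s, "-")
    return None


def gff3_lines(contigs, proteins):
    """Yield GFF3 lines for every protein that can be placed on a contig."""
    yield "##gff-version 3"
    for pid, prot in proteins.items():
        for cid, contig in contigs.items():
            hit = locate(prot, contig)
            if hit:
                start, end, strand = hit
                yield "\t".join([cid, "exact_match", "CDS", str(start),
                                 str(end), ".", strand, "0", f"ID={pid}"])
                break


contigs = read_fasta(">c1 toy contig\nCCATGAAA\nCCCGGGTAATT\n")
proteins = read_fasta(">g1\nMKPG\n")
print("\n".join(gff3_lines(contigs, proteins)))
```

Proteins that can't be placed are silently skipped here; in practice you'd want to log them, since they are exactly the cases that would need a proper alignment or a prokka rerun.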

written 29 days ago by Joe14k

Got it, makes sense. Thanks for the advice – I like that first idea, but honestly it'll probably be less of a headache to just rerun it.

written 29 days ago by bioguy40