Question

Programs for bacterial gene prediction/annotation for nanopore assembly?

2

4.9 years ago

BioinformaticsLad ▴ 200

I've assembled a known E. coli genome using only nanopore reads. I want to check whether the predicted genes match the ground truth. I tried Prokka, but it predicts far too many CDS (~10000). Most of them are 'hypothetical proteins', so I'm guessing they are false-positive ORFs caused by frameshifts resulting from nanopore's systematic indel errors.

Are there programs that work well specifically with nanopore errors in mind? Glimmer and GeneMark both have long histories, but I'm guessing they're optimized for short reads.

gene prediction annotation nanopore prokka glimmer • 1.9k views

ADD COMMENT • link updated 4.9 years ago by predeus ★ 1.9k • written 4.9 years ago by BioinformaticsLad ▴ 200

score 3 · Accepted Answer · 2019-05-28

3

4.9 years ago

predeus ★ 1.9k

Basically, what you have is a very bad assembly, and there's no reason to look at gene predictions. There are two things you can do here:

if you want a good assembly, you also need to polish yours with Illumina reads (or do a hybrid assembly with e.g. Unicycler) - that would ensure you have the fewest errors;
if you just want to get an idea of which genes are present, take a reliable protein database (e.g. the 18k well-curated proteins used by Prokka) and run tblastn on your genome assembly. You won't be able to distinguish genes from pseudogenes, but otherwise you'll get a pretty good idea of gene presence/absence.
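
For the tblastn route, a minimal sketch of turning the hits into a presence/absence call could look like this. It assumes tblastn was run with a custom tabular format such as `-outfmt "6 qseqid sseqid pident qstart qend qlen evalue"`; the function name `present_proteins` and the 80% identity / 80% coverage cut-offs are illustrative choices of mine, not anything Prokka or BLAST prescribes.

```python
from collections import defaultdict


def present_proteins(tblastn_lines, min_identity=80.0, min_coverage=0.8):
    """Return {protein_id: covered_fraction} for proteins passing the cut-offs.

    Expected tab-separated columns (an assumption about how tblastn was run):
    qseqid sseqid pident qstart qend qlen evalue
    """
    intervals = defaultdict(list)  # protein id -> [(qstart, qend), ...]
    lengths = {}
    for line in tblastn_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        qseqid, _sseqid, pident, qstart, qend, qlen, _evalue = line.split("\t")
        if float(pident) < min_identity:
            continue
        intervals[qseqid].append((int(qstart), int(qend)))
        lengths[qseqid] = int(qlen)

    result = {}
    for qseqid, spans in intervals.items():
        # Merge hits on the protein coordinate: a frameshifted gene usually
        # shows up as several partial hits rather than one full-length hit.
        covered, cur_start, cur_end = 0, None, None
        for start, end in sorted(spans):
            if cur_end is None or start > cur_end:
                if cur_end is not None:
                    covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        covered += cur_end - cur_start + 1
        fraction = covered / lengths[qseqid]
        if fraction >= min_coverage:
            result[qseqid] = fraction
    return result
```

Note that this merges hits from anywhere in the assembly, so paralogs at different loci can inflate the covered fraction; grouping by `sseqid` and position first would be stricter.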

ADD COMMENT • link 4.9 years ago by predeus ★ 1.9k

0

Thanks, nanopore-only assemblies are indeed problematic. However, what confuses me is this: if we consider that 90% of most bacterial genomes consists of actual genes (5000), how can Prokka find the same number of genes again (5000 pseudogenes) in the remaining 10% of the genome? Unless it counts overlapping reading frames.

ADD REPLY • link 4.9 years ago by BioinformaticsLad ▴ 200

1

I think overlapping CDS account for most of the CDS that you observe. You should be able to tell for sure by simply looking at the GFF file in a genome browser.
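
If you would rather get a number than eyeball it in a browser, something along these lines would count how many CDS features overlap another CDS on the same contig. This is only a sketch: it assumes a standard 9-column GFF3 like the one Prokka writes, ignores strand, and treats any shared base as an overlap; the function name `count_overlapping_cds` is made up for this example.

```python
def count_overlapping_cds(gff_lines):
    """Return (total CDS, CDS overlapping at least one other CDS on the same contig)."""
    by_contig = {}
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section appended after the annotations
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        by_contig.setdefault(fields[0], []).append((int(fields[3]), int(fields[4])))

    total = overlapping = 0
    for spans in by_contig.values():
        spans.sort()
        total += len(spans)
        flagged = [False] * len(spans)
        max_end, max_idx = -1, -1  # furthest-reaching CDS seen so far
        for i, (start, end) in enumerate(spans):
            if start <= max_end:
                # Current CDS starts inside an earlier one: flag both.
                flagged[i] = True
                flagged[max_idx] = True
            if end > max_end:
                max_end, max_idx = end, i
        overlapping += sum(flagged)
    return total, overlapping
```

Real bacterial genes do overlap their neighbours by a few bases fairly often, so a minimum overlap length would be a sensible refinement before reading too much into the count.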

ADD REPLY • link 4.9 years ago by predeus ★ 1.9k

1

That's a good suggestion.

ADD REPLY • link 4.9 years ago by BioinformaticsLad ▴ 200

1

With sufficient depth, Nanopore assemblies can be sufficient for bacterial work (I know a number of labs which don't even bother polishing with Illumina reads any more).

They are problematic if the bare minimum of coverage for a single contig is all you have. That said, as discussed in the other thread, I do still think this is a problem with the assembly.

Since the genome is the right size, it's unlikely to be a contaminant, but the massive depth you have could be causing problems of its own.

What tool did you do the assembly with?

ADD REPLY • link 4.9 years ago by Joe 21k

1

Sorry, but no amount of polishing can make Nanopore-only assemblies good enough for decent gene prediction, even if you have 1000x coverage (which happens often since bacterial genomes are small).

State-of-the-art nanopolish with methylation modelling only gives you 99.9% accuracy - one error per 1000 nt. Most of these are homopolymer indels, meaning your CDS prediction would be butchered in most cases.
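
To put a rough number on that: the back-of-the-envelope sketch below assumes errors are independent and evenly spread (in reality they cluster in homopolymers) and takes 1000 nt as a typical CDS length, which is my assumption rather than a figure from this thread. Under those assumptions only a bit over a third of 1 kb CDS would come out with no error at all, though not every error is a frameshifting indel.

```python
def error_free_probability(cds_length_nt, per_base_accuracy=0.999):
    """Probability that a CDS of the given length contains no errors at all,
    assuming independent per-base errors (a simplification)."""
    return per_base_accuracy ** cds_length_nt


def expected_errors(cds_length_nt, per_base_accuracy=0.999):
    """Expected number of errors in a CDS of the given length."""
    return cds_length_nt * (1.0 - per_base_accuracy)


if __name__ == "__main__":
    for length in (300, 1000, 3000):
        print(length, round(error_free_probability(length), 3), expected_errors(length))
```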

ADD REPLY • link 4.9 years ago by predeus ★ 1.9k

0

Flye, the new kid on the block. Benchmarks say it's the 'best' for nanopore bacterial genomes. I tried Canu as well and it was okay but didn't match up to the reference as well as Flye (and also took way longer).

Sure, let me try downsampling and reassembling to see if that's the problem.
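
For the downsampling, a simple random subsample to a target depth could be sketched as follows. It assumes the reads are already parsed into `(name, sequence, quality)` tuples; the function name `downsample_reads` and the fixed seed are hypothetical choices for illustration, and dedicated tools may also filter by length or quality.

```python
import random


def downsample_reads(reads, genome_size, target_coverage, seed=0):
    """Return a random subset of reads whose total bases just reach
    genome_size * target_coverage.

    reads: list of (name, sequence, quality) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    shuffled = reads[:]
    rng.shuffle(shuffled)
    target_bases = genome_size * target_coverage
    kept, total = [], 0
    for read in shuffled:
        if total >= target_bases:
            break
        kept.append(read)
        total += len(read[1])
    return kept
```

For an E. coli-sized genome this might be called as `downsample_reads(reads, genome_size=4_600_000, target_coverage=100)`, where both numbers are placeholders to adjust.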

ADD REPLY • link 4.9 years ago by BioinformaticsLad ▴ 200