Question: Programs for bacterial gene prediction/annotation for nanopore assembly?
12 months ago by
BioinformaticsLad150 wrote:

I've assembled a known E. coli genome using only nanopore reads, and I want to check whether the predicted genes match the ground truth. I tried Prokka, but it predicts far too many CDS (~10000). Most of them are 'hypothetical proteins', so I'm guessing they are false-positive ORFs caused by frameshifts resulting from nanopore's systematic indel errors.

Are there programs designed specifically with nanopore errors in mind? Glimmer and GeneMark both have long histories, but I'm guessing they're optimized for short reads.

modified 12 months ago by predeus1.3k • written 12 months ago by BioinformaticsLad150

12 months ago by
predeus1.3k (Russia) wrote:

Basically, what you have is a very bad assembly, and there's no point in looking at gene predictions from it. There are two things you can do here:

  • if you want a good assembly, you also need to polish yours with Illumina reads (or do a hybrid assembly with e.g. Unicycler); that would ensure you have the fewest errors;
  • if you just want an idea of which genes are present, take a reliable protein database (e.g. the 18k well-curated proteins used by Prokka) and run tblastn against your genome assembly. You won't be able to distinguish genes from pseudogenes, but otherwise you'll get a pretty good idea of gene presence/absence.
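
One way to turn the tblastn hits from the second option into a presence/absence call is sketched below. It assumes tblastn was run with a custom tabular format, for example `tblastn -query proteins.faa -subject assembly.fasta -evalue 1e-10 -outfmt "6 qseqid sseqid pident qlen qstart qend evalue" > hits.tsv`. The file names, the function names, and both thresholds are illustrative assumptions, not recommendations from BLAST or Prokka. Because frameshifts tend to split one gene into several partial hits, the sketch sums the query coverage over all hits of a protein rather than requiring a single full-length hit.

```python
import csv
from collections import defaultdict

# Column order must match the -outfmt string used for tblastn (an assumption here).
COLUMNS = ["qseqid", "sseqid", "pident", "qlen", "qstart", "qend", "evalue"]


def covered_fraction(intervals, qlen):
    """Fraction of the query covered by the union of 1-based inclusive intervals."""
    covered, last_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, last_end + 1)  # skip the part already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / qlen


def present_proteins(lines, min_identity=80.0, min_query_cov=0.8):
    """Return IDs of query proteins whose hits jointly cover enough of the query.

    Caveat: hits from different loci (e.g. paralogs) are pooled, so this
    answers "is something like this protein present", not "is it intact".
    """
    hits = defaultdict(list)
    qlens = {}
    for row in csv.DictReader(lines, fieldnames=COLUMNS, delimiter="\t"):
        if float(row["pident"]) < min_identity:
            continue
        hits[row["qseqid"]].append((int(row["qstart"]), int(row["qend"])))
        qlens[row["qseqid"]] = int(row["qlen"])
    return {
        q for q, intervals in hits.items()
        if covered_fraction(intervals, qlens[q]) >= min_query_cov
    }
```

Usage would be something like `present_proteins(open("hits.tsv"))`; proteins missing from the returned set are candidates for being absent or too diverged to pass the thresholds.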
written 12 months ago by predeus1.3k

Thanks, nanopore-only assemblies are indeed problematic. What confuses me, though, is this: if about 90% of a typical bacterial genome consists of actual genes (5000 here), how can Prokka find the same number of genes again (5000 pseudogenes) in the remaining 10% of the genome? Unless it also counts overlapping reading frames.

written 12 months ago by BioinformaticsLad150

I think overlapping CDS account for most of the CDS that you observe. You should be able to tell for sure by simply looking at the GFF file in a genome browser.
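
As a complement to eyeballing the browser, the overlap can also be counted directly from the GFF. The following is a minimal sketch, assuming a standard tab-separated GFF3 like the one Prokka writes (with the sequences appended after a `##FASTA` line); the function names and the `min_overlap` parameter are made up for illustration.

```python
def read_cds(gff_lines):
    """Collect (contig, start, end, strand) for every CDS feature in a GFF3 file."""
    cds = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section starts here, no more features
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        cds.append((fields[0], int(fields[3]), int(fields[4]), fields[6]))
    return cds


def count_overlapping(cds, min_overlap=1):
    """Count CDS sharing at least min_overlap bp with another CDS (either strand)."""
    order = sorted(range(len(cds)), key=lambda i: (cds[i][0], cds[i][1]))
    overlapping = set()
    for pos, i in enumerate(order):
        contig, _start, end, _strand = cds[i]
        for j in order[pos + 1:]:
            contig2, start2, end2, _strand2 = cds[j]
            if contig2 != contig or start2 > end:
                break  # sorted by start, so nothing further can overlap
            if min(end, end2) - start2 + 1 >= min_overlap:
                overlapping.update((i, j))
    return len(overlapping)
```

Neighbouring bacterial genes often overlap by a few bases, so a `min_overlap` of a few dozen bp is probably more informative than the default of 1 when looking for stacked spurious ORFs.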

written 12 months ago by predeus1.3k

That's a good suggestion.

written 12 months ago by BioinformaticsLad150

With sufficient depth, Nanopore-only assemblies can be good enough for bacterial work (I know a number of labs that don't even bother polishing with Illumina reads any more).

They are problematic if all you have is the bare minimum of coverage for a single contig. That said, as discussed in the other thread, I do still think this is a problem with the assembly.

Since the genome is the right size, it's unlikely to be a contaminant, but the massive depth you have could be causing problems of its own.

Which tool did you use for the assembly?

written 12 months ago by Joe16k

Sorry, but no amount of polishing can make Nanopore-only assemblies good enough for decent gene prediction, even if you have 1000x coverage (which happens often since bacterial genomes are small).

State-of-the-art nanopolish with methylation modelling only gives you 99.9% accuracy, i.e. one error per 1000 nt. Most of these errors are homopolymer indels, meaning your CDS prediction would be butchered in most cases.
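
To put that number in perspective, here is a back-of-the-envelope sketch. It assumes errors are independent and uniformly spread, a genome of about 4.6 Mb (roughly E. coli sized) and a typical gene length of about 1000 nt; these figures and the function names are assumptions for illustration, and since not every error is a frameshifting indel the result is an upper bound on damaged genes.

```python
def expected_errors(genome_length, error_rate):
    """Expected number of erroneous positions in the whole assembly."""
    return genome_length * error_rate


def fraction_of_genes_hit(gene_length, error_rate):
    """Probability that a gene of the given length contains at least one error."""
    return 1.0 - (1.0 - error_rate) ** gene_length


if __name__ == "__main__":
    rate = 1 - 0.999  # 99.9% per-base accuracy
    print(f"errors per genome: {expected_errors(4_600_000, rate):.0f}")
    print(f"genes with >= 1 error: {fraction_of_genes_hit(1000, rate):.0%}")
```

Under these assumptions that is several thousand errors per genome, and well over half of the ~1 kb genes carry at least one of them.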

written 12 months ago by predeus1.3k

Flye, the new kid on the block. Benchmarks say it's the 'best' for nanopore bacterial genomes. I tried Canu as well and it was okay but didn't match up to the reference as well as Flye (and also took way longer).

Sure, let me try downsampling and reassembling to see if that's the problem.
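
For reference, random downsampling to a target depth can be sketched in a few lines. This assumes a plain four-line-per-record FASTQ that fits in memory; the function names, the seed, and the example genome size and depth are illustrative assumptions, and it samples reads uniformly without looking at length or quality.

```python
import random


def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it, "")
        next(it, "")  # the '+' separator line
        qual = next(it, "")
        yield header.rstrip(), seq.rstrip(), qual.rstrip()


def downsample(records, genome_size, target_depth, seed=0):
    """Keep randomly chosen reads until genome_size * target_depth bases are reached."""
    records = list(records)
    random.Random(seed).shuffle(records)  # fixed seed keeps the subset reproducible
    target_bases = genome_size * target_depth
    kept, total = [], 0
    for record in records:
        if total >= target_bases:
            break
        kept.append(record)
        total += len(record[1])
    return kept


def write_fastq(records, handle):
    """Write records back out in four-line FASTQ format."""
    for header, seq, qual in records:
        handle.write(f"{header}\n{seq}\n+\n{qual}\n")
```

For example, `downsample(read_fastq(open("reads.fastq")), genome_size=4_600_000, target_depth=100)` would keep roughly 100x worth of reads for an E. coli sized genome.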

written 12 months ago by BioinformaticsLad150