Question: Programs for bacterial gene prediction/annotation for nanopore assembly?
1
7 weeks ago, BioinformaticsLad (130) wrote:

I've assembled a known E. coli genome using only nanopore reads, and I want to check whether the predicted genes match the ground truth. I tried Prokka, but it predicts far too many CDS (~10000). Most of them are 'hypothetical proteins', so I'm guessing they are false-positive ORFs caused by frameshifts resulting from nanopore's systematic indel errors.

Are there programs designed specifically with nanopore errors in mind? Glimmer and GeneMark both have long histories, but I'm guessing they're optimized for short-read assemblies.

modified 7 weeks ago by predeus (1.1k) • written 7 weeks ago by BioinformaticsLad (130)
3
7 weeks ago, predeus (1.1k), Russia, wrote:

Basically, what you have is a very bad assembly, so there's no point in looking at gene predictions yet. There are two things you can do here:

  • if you want a good assembly, you also need to polish yours with Illumina reads (or do a hybrid assembly with e.g. Unicycler); that would ensure you have the fewest errors;
  • if you just want an idea of which genes are present, take a reliable protein database (e.g. the 18k well-curated proteins used by Prokka) and run tblastn against your genome assembly. You won't be able to distinguish genes from pseudogenes, but otherwise you'll get a pretty good idea of gene presence/absence.
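
For the tblastn route, here is a minimal sketch of how the hits could be turned into a presence/absence list. It assumes tblastn was run with tabular output, something like `tblastn -query proteins.faa -subject assembly.fasta -outfmt 6` (the file names are placeholders). The function name and the identity/e-value cutoffs are my own illustrative choices, not recommended values.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def proteins_with_hits(lines, min_pident=80.0, max_evalue=1e-10):
    """Return the set of query protein IDs with at least one passing hit.

    The thresholds are illustrative assumptions, not recommended values.
    """
    found = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(FIELDS, line.split("\t")))
        if float(row["pident"]) >= min_pident and float(row["evalue"]) <= max_evalue:
            found.add(row["qseqid"])
    return found


# Two made-up hits: protA is a strong match, protB is weak and gets filtered out.
example = [
    "protA\tcontig_1\t98.5\t300\t4\t1\t1\t300\t1200\t2099\t1e-150\t550",
    "protB\tcontig_1\t35.0\t80\t50\t2\t10\t89\t5000\t5239\t0.002\t40.1",
]
print(sorted(proteins_with_hits(example)))
```

Proteins from the database that never show up in the returned set are the candidates for "absent". Since frameshifts split a gene into several shorter hits, a real analysis would probably also want to merge hits per protein before judging coverage.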
written 7 weeks ago by predeus (1.1k)

Thanks, nanopore-only assemblies are indeed problematic. What still confuses me, though: if about 90% of a typical bacterial genome consists of real genes (roughly 5000 here), how can Prokka find the same number of additional genes (5000 pseudogenes) in the remaining 10% of the genome? Unless it is counting overlapping reading frames.

written 7 weeks ago by BioinformaticsLad (130)
1

I think overlapping CDS account for most of the extra CDS you observe. You should be able to tell for sure by simply loading the GFF file into a genome browser.
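
If you'd rather have a number than eyeball it in a browser, here is a rough sketch that counts how many CDS features in a GFF3 file overlap another CDS on the same contig. The function name is made up for illustration, and overlap is judged on coordinates alone (strand and frame are ignored), which is an assumption you may want to change.

```python
def count_overlapping_cds(gff_lines):
    """Return (overlapping, total): CDS features that share at least one
    base with another CDS on the same contig, and the total CDS count.

    Overlap is judged on coordinates only, regardless of strand.
    """
    by_contig = {}
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows this directive
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        # GFF3 coordinates are 1-based and inclusive.
        by_contig.setdefault(cols[0], []).append((int(cols[3]), int(cols[4])))

    overlapping = 0
    total = 0
    for intervals in by_contig.values():
        intervals.sort()
        total += len(intervals)
        flagged = [False] * len(intervals)
        max_end, max_idx = 0, -1  # furthest end seen so far and its owner
        for i, (start, end) in enumerate(intervals):
            if max_idx >= 0 and start <= max_end:
                flagged[i] = True
                flagged[max_idx] = True
            if end > max_end:
                max_end, max_idx = end, i
        overlapping += sum(flagged)
    return overlapping, total


# Made-up features: the first two CDS overlap, the third does not.
example_gff = [
    "##gff-version 3",
    "contig_1\tPrediction\tCDS\t1\t100\t.\t+\t0\tID=cds1",
    "contig_1\tPrediction\tCDS\t90\t200\t.\t-\t0\tID=cds2",
    "contig_1\tPrediction\tCDS\t300\t400\t.\t+\t0\tID=cds3",
    "contig_1\tPrediction\tgene\t1\t100\t.\t+\t.\tID=gene1",
]
print(count_overlapping_cds(example_gff))
```

Usage on a real file would be something like `count_overlapping_cds(open("annotation.gff"))`, where the path is a placeholder. A large overlapping fraction would support the idea that the extra CDS come from overlapping frames.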

written 7 weeks ago by predeus (1.1k)
1

That's a good suggestion.

written 7 weeks ago by BioinformaticsLad (130)
1

With sufficient depth, Nanopore-only assemblies can be good enough for bacterial work (I know a number of labs which don't even bother polishing with Illumina reads any more).

They become problematic if all you have is the bare minimum of coverage needed to get a single contig. That said, as discussed in the other thread, I do still think this is a problem with the assembly.

Since the genome is the right size, it's unlikely to be a contaminant, but the massive depth you have could be causing problems of its own.

What tool did you do the assembly with?

written 7 weeks ago by jrj.healey (13k)
1

Sorry, but no amount of polishing can make Nanopore-only assemblies good enough for decent gene prediction, even if you have 1000x coverage (which happens often since bacterial genomes are small).

State-of-the-art nanopolish with methylation modelling only gives you 99.9% accuracy - one error per 1000 nt. Most of these are homopolymer indels, meaning your CDS prediction would be butchered in most cases.
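
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes errors are independent and evenly spread, a genome of about 4.6 Mb, and a typical CDS length of 1000 nt; the genome size, the CDS length, and the function names are illustrative assumptions on my part, and real errors cluster in homopolymers rather than landing uniformly.

```python
def expected_errors(genome_length, accuracy):
    """Expected number of erroneous bases in the whole assembly."""
    return genome_length * (1.0 - accuracy)


def fraction_of_genes_hit(gene_length, accuracy):
    """Probability that a gene of this length contains at least one error,
    assuming independent, uniformly distributed errors."""
    return 1.0 - accuracy ** gene_length


genome_length = 4_600_000  # assumed approximate E. coli genome size (nt)
gene_length = 1000         # assumed typical bacterial CDS length (nt)
accuracy = 0.999           # per-base accuracy quoted above

print(round(expected_errors(genome_length, accuracy)))         # about 4600 errors
print(round(fraction_of_genes_hit(gene_length, accuracy), 2))  # about 0.63
```

So under these assumptions, roughly two out of three genes would carry at least one error, and if most of those errors are indels, most of those genes get a frameshift.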

written 7 weeks ago by predeus (1.1k)

Flye, the new kid on the block. Benchmarks say it's the 'best' for nanopore bacterial genomes. I tried Canu as well and it was okay but didn't match up to the reference as well as Flye (and also took way longer).

Sure, let me try downsampling and reassembling to see if that's the problem.

written 7 weeks ago by BioinformaticsLad (130)