Question

Nanopore sequencing bacteria and problems with NCBI PGAP annotation

0

2.6 years ago

GGG_Alex ▴ 20

Dear community,

We recovered and published a few draft genomes from nanopore sequencing. All were uploaded and annotated with the NCBI PGAP pipeline.

For our latest genome, NCBI PGAP reports 30% frameshifted genes. However, gene prediction and annotation with prodigal and eggNOG did not show a substantial change in the number of neighboring genes with the same annotation (spanning the range of the PGAP-predicted genes). NCBI concludes that the sequence is not correct, but we have not changed our procedure and coverage is fine (>60x).

Has anyone observed this before?

Any ideas on how to fix this, or any other analysis that might be useful?

Thank you!

annotation ncbi PGAP nanopore gene • 1.2k views

ADD COMMENT • link updated 2.6 years ago by colindaven 7.7k • written 2.6 years ago by GGG_Alex ▴ 20

1

I would trust NCBI PGAP because, in my experience, prodigal will still call frameshifted genes.

ADD REPLY • link 2.6 years ago by andres.firrincieli 3.9k

1

I think we need to know the following before we can help:

Which nanopore flowcell or kit was used, 9.4.1 or 10.4?
Which assembler was used? (flye is good)
Which long-read polishing pipeline was used? (medaka, racon?)
Are Illumina reads available for short-read polishing? (hypo, pilon etc.)

If you haven't put a lot of effort into polishing then yes, 30% frameshifted genes is plausible, because the dominant error type in ONT reads is indels.
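To illustrate why indels matter so much for annotation, here is a minimal sketch in plain Python (no dependencies). The toy gene sequence and the `translate` helper are made up for illustration; only the standard codon table is real. A single missing base in a homopolymer run changes every codon downstream:

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate in frame 1, ignoring a trailing incomplete codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE[dna[i:i + 3]])
    return "".join(protein)


# Hypothetical toy gene: ATG GCT AAA GGT CAT TGA
gene = "ATGGCTAAAGGTCATTGA"
# Same gene with one A missing from the AAA homopolymer (a typical ONT-style error).
gene_with_deletion = "ATGGCTAAGGTCATTGA"

print(translate(gene))                # MAKGH*
print(translate(gene_with_deletion))  # MAKVI  (frame shifted, stop codon lost)
```

In a real genome the shifted frame usually hits a premature stop or runs past the true one, which is what PGAP flags as a frameshifted gene.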

ADD REPLY • link 2.6 years ago by colindaven 7.7k

0

Thank you for your reply,

We used 9.4.1, more specifically a Flongle, and the bacterial species is an Arthrobacter. I did super-accuracy (SUP) basecalling using guppy v6.x.

I used flye (also the best in my experience) and it gave a single closed chromosome. I did not polish with another tool; I just trusted the flye polishing.

No Illumina reads are available so far.

I am wondering because the 6 other genomes we generated with the same pipeline were all fine so far (~2-10% pseudogenes from PGAP). I also tried using only a fraction of the reads with better average quality, but that gave lower coverage and I did not get a fully closed genome (which I would prefer).

ADD REPLY • link 2.6 years ago by GGG_Alex ▴ 20

1

Ah, sounds good.

I would definitely polish though. This very minimal polishing pipeline might help (paths to the programs will need editing, but it gives you an idea of how to run long-read-only polishing).

https://github.com/Colorstorm/assembly_polishing_racon_medaka

I would polish all assemblies before submission. People used to do 2-3 rounds of racon, followed by medaka, and then Illumina polishing if short reads are available. See if you can improve the base quality.
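As a rough sketch of the racon-then-medaka order described here (this is not the linked repository's code), the steps could be chained from Python roughly like this. File names, the output directory, the thread count and the `polishing_steps`/`run_steps` helpers are all hypothetical. The medaka model name `r941_min_sup_g507` is my assumption for 9.4.1 SUP data and should be checked against `medaka tools list_models` for your guppy version:

```python
import subprocess
from pathlib import Path


def polishing_steps(reads, draft, outdir="polish", racon_rounds=2,
                    threads=8, medaka_model="r941_min_sup_g507"):
    """Build (command, stdout_file) pairs; stdout_file is None when the
    tool writes its own output directory (medaka)."""
    out = Path(outdir)
    steps = []
    current = str(draft)
    for i in range(1, racon_rounds + 1):
        paf = str(out / f"round{i}.paf")
        polished = str(out / f"racon{i}.fasta")
        # Map reads to the current assembly; minimap2 prints PAF to stdout.
        steps.append((["minimap2", "-x", "map-ont", "-t", str(threads),
                       current, str(reads)], paf))
        # racon <reads> <overlaps> <target> prints polished FASTA to stdout.
        steps.append((["racon", "-t", str(threads),
                       str(reads), paf, current], polished))
        current = polished
    # Final long-read consensus with medaka on the last racon output.
    steps.append((["medaka_consensus", "-i", str(reads), "-d", current,
                   "-o", str(out / "medaka"), "-t", str(threads),
                   "-m", medaka_model], None))
    return steps


def run_steps(steps, outdir="polish"):
    """Execute the steps in order, redirecting stdout where needed."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    for cmd, stdout_file in steps:
        if stdout_file is None:
            subprocess.run(cmd, check=True)
        else:
            with open(stdout_file, "w") as handle:
                subprocess.run(cmd, check=True, stdout=handle)


steps = polishing_steps("reads.fastq", "flye_assembly.fasta")
for cmd, _ in steps:
    print(" ".join(cmd))
```

Calling `run_steps(steps)` would actually execute the tools, assuming minimap2, racon and medaka are on your PATH. Resubmitting the medaka consensus to PGAP would then show whether the frameshift count drops.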

ADD REPLY • link 2.6 years ago by colindaven 7.7k