Bacterial genome assembly, best uses of reads at excessively high sequencing depth?
9 weeks ago
ian.will ▴ 30

Hi all,

I've received some data for a bacterial genome project, where I'll be assembling two de novo genomes of bacteria with expected 4-5 Mbp genomes using PacBio HiFi reads (most reads ~15 kb). These two bugs got an entire SMRT cell to themselves, so the expected coverage is around 2,000-3,000x. This is my first go at prokaryote genome assembly; I've done a few fungi before.

My expected flow is some sanity checks with Jellyfish + GenomeScope, assembly with hifiasm, and then annotation, maybe with a prokaryote-tweaked Funannotate pipeline (we have a little RNA-seq data to help there too). (Happy to hear about your favorite bacterial genome tools as well.) I'm taking a look at k-mer counts with GenomeScope right now.

With such massive coverage, the GenomeScope models fail to converge [k-mer = 21, read length = 15,000, max coverage = 10,000]. But eyeballing what look like reasonable "true" peaks in that 2,000-3,000x coverage range, I simply dropped anything below 1,000x and got something more familiar. (The "1kb+" title in the plot should actually read "1,000x+ coverage"; it has nothing to do with read length.) [GenomeScope k-mer profile plot]
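As a sanity check on where that peak should sit, the usual relation is Ck = C * (L - k + 1) / L, so with reads this long the k-mer coverage is essentially the read coverage. A quick sketch (the 2,500x depth, 15 kb read length, and k = 21 are assumed values matching the numbers above):

```shell
# Expected k-mer coverage: Ck = C * (L - K + 1) / L
# (assumed: C = 2500x read depth, L = 15000 bp reads, K = 21)
C=2500; L=15000; K=21
CK=$(( C * (L - K + 1) / L ))
echo "$CK"   # 2496, i.e. the k-mer peak sits almost exactly at the read depth
```

So a peak near 2,500x in the k-mer plot is consistent with the expected read depth, which supports treating the sub-1,000x tail as noise.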

I've also seen recommendations (from years ago) for BBNorm to perhaps more smartly reduce my input data. Most of what I've seen about over-sequenced genomes was based on short reads and SPAdes or similar. Since I expect to use hifiasm, that also lets me introduce a k-mer coverage cutoff during assembly to mimic sort of what I did with these GenomeScope plots.

So my question is: for these data (2,000x+ coverage PacBio HiFi on bacteria), any specific recommendations for managing (reducing) my input data, and maybe even using it to my advantage somehow? Having so many reads seems like an opportunity to make a very nicely polished and complete assembly.
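For scale, a back-of-envelope read count for a given target depth (the genome size and read length are the assumed values from above):

```shell
# Reads needed for a target depth = depth * genome_size / mean_read_length
GENOME_BP=4500000     # assumed ~4.5 Mbp genome
MEAN_READ_BP=15000    # ~15 kb HiFi reads
TARGET_DEPTH=100
N_READS=$(( TARGET_DEPTH * GENOME_BP / MEAN_READ_BP ))
echo "$N_READS"   # 30000 reads for 100x
```

By the same arithmetic, 2,000-3,000x implies roughly 600k-900k reads on hand, so any downsampling would keep only a few percent of the data.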

Thanks!

bacteria pacbiohifi assembly coverage genome

While someone will have a more informed answer, downsampling (to 100x or less) the data may be a good place to start. Too much data can be a bad thing for assemblies.

colindaven is knowledgeable about long reads.


For sure, I think downsampling is the right call. I've been debating whether to simply do this randomly (seqtk or BBMap should make that easy) or to try to squeeze the most out of these reads and apply some minimum-length or Q-score filters to be choosy.
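A minimal sketch of the "choosy" option, filtering a FASTQ by read length with plain awk (the toy file and the 5 bp cutoff are placeholders; in practice a dedicated tool like seqkit or filtlong would do this, and seqtk sample covers the random option):

```shell
# Toy FASTQ: r1 is 10 bp, r2 is 3 bp
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACG\n+\nIII\n' > toy.fq

# Keep records whose sequence line is >= MIN bp (FASTQ = 4 lines per record)
awk -v MIN=5 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
              NR%4==0{if (length(s) >= MIN) print h"\n"s"\n"p"\n"$0}' toy.fq > kept.fq

grep -c '^@r' kept.fq   # 1 (only r1 survives the cutoff)
```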

My only worry: if read quality isn't random across the genome, I could probably introduce some weird bias into my assembly. I think Canu includes a subsampling step that takes a pile of "nice" reads plus random/smaller reads to help with this ... but I've switched over to hifiasm these days for assembly.

9 weeks ago

As GenoMax says, I would just subsample to 60-100x using either head -n 10000 or whatever (check the file size), or seqtk sample.
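One footgun with the head approach: the line count has to be a multiple of 4, since a FASTQ record is 4 lines; e.g. 60x of a 4.5 Mb genome at 15 kb/read is ~18,000 reads = 72,000 lines. A toy demo of that arithmetic (file contents are placeholders):

```shell
# 3-record toy FASTQ; keep the first 2 records = 8 lines
printf '@a\nAC\n+\nII\n@b\nGT\n+\nII\n@c\nTT\n+\nII\n' > toy.fq
head -n 8 toy.fq > first2.fq
wc -l < first2.fq   # 8
```

Note that head takes the first reads in the file, which may not be a random sample; seqtk sample avoids any ordering bias.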

Then assemble with Flye, Unicycler, or hifiasm, or maybe Raven (if it handles HiFi too)?


OK great, thanks for the advice. Keeping it easy and pulling random reads sounds like a plan. Might run the subsample+assembly a couple times just to check for consistency.
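That consistency check could be scripted as a loop over seeds. This sketch fakes the sampling step with a per-read coin flip in awk; with real data the awk line would be replaced by something like seqtk sample -s $SEED (followed by a hifiasm run on each subsample):

```shell
# Toy FASTQ to stand in for the real reads
printf '@a\nAC\n+\nII\n@b\nGT\n+\nII\n' > toy.fq

for SEED in 1 2 3; do
  # keep ~50% of reads at random, reproducibly per seed
  awk -v seed=$SEED 'BEGIN{srand(seed)} NR%4==1{keep=(rand()<0.5)} keep' \
      toy.fq > sub_$SEED.fq
done

ls sub_*.fq | wc -l   # 3 subsample files, one per seed
```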
