Using Nonpareil to target subsampling
2.9 years ago
andrew ▴ 10

Hello,

I'd be very interested to know what recommendations there are for subsampling ahead of co-assembly, when computational resources are not available for the full dataset. In my use case, it would be for de novo assembly with MEGAHIT (single node) or MetaHipMer (multi-node).

I have read about normalisation-based approaches, but since these distort coverage I know many would discourage them, and the MetaHipMer developers definitely do.

Random subsampling seems reasonable, but I worry that, since my depth varies wildly between samples (due to the varying proportion of microbial reads in our human samples) and complexity/coverage will also vary, it may not be best to subsample all samples to the same extent.

A possible improvement would be subsampling down to an absolute maximum depth per sample, such that low-depth samples are not subsampled and high-depth samples are subsampled more aggressively. However, this would still not take into account that, at the same depth, one sample may be well covered (due to low complexity) and another poorly covered (due to high complexity).

This leads me to the idea of using Nonpareil curves to guide subsampling. I am considering an approach whereby, for each sample, I estimate the total base pairs required to achieve (say) 0.95 coverage and express that as a proportion of the base pairs actually sequenced. Samples with proportion >= 1 are not subsampled, and samples with proportion < 1 are subsampled down to that proportion of reads. Thus I reduce the total number of reads, but more aggressively in the better-covered samples.
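To make the idea concrete, here is a rough sketch of the bookkeeping I have in mind. The inputs are assumed to come from elsewhere: for each sample, the observed sequencing effort in bp and Nonpareil's estimated effort to reach ~0.95 coverage (e.g. LRstar from the Nonpareil R package). All names and numbers below are illustrative, not a finished implementation:

```python
"""Sketch of Nonpareil-guided subsampling fractions plus a naive pair-aware
Bernoulli subsampler. Inputs (observed bp, required bp for ~0.95 coverage)
are assumed to be extracted from Nonpareil output beforehand."""
import random

def subsample_fractions(effort):
    """effort: {sample: (observed_bp, required_bp)} ->
    {sample: fraction of read pairs to keep}."""
    fractions = {}
    for sample, (observed_bp, required_bp) in effort.items():
        # Samples that have not reached the target coverage are left untouched;
        # over-sequenced samples are cut back to roughly the required effort.
        fractions[sample] = min(1.0, required_bp / observed_bp)
    return fractions

def bernoulli_subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=42):
    """Keep each read pair with probability `fraction` (FASTQ, 4 lines/record).
    The same decision is applied to both mates so pairing stays intact."""
    rng = random.Random(seed)
    with open(r1_in) as f1, open(r2_in) as f2, \
         open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        while True:
            rec1 = [f1.readline() for _ in range(4)]
            rec2 = [f2.readline() for _ in range(4)]
            if not rec1[0]:
                break
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)

if __name__ == "__main__":
    # Illustrative numbers only.
    effort = {
        "sampleA": (12e9, 4e9),   # over-sequenced: keep ~33% of pairs
        "sampleB": (2e9, 5e9),    # under the 0.95 target: keep everything
    }
    print(subsample_fractions(effort))
    # The fractions could equally be passed to an existing tool such as seqtk sample.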

In my head, it feels like this strategy might provide an efficient way of subsampling for assembly. I appreciate that the time and memory usage of de novo assemblers depends not primarily on the number of sequences but on the number of unique k-mers and the graph structure; thus, subsampling every sample to 50% and subsampling to 50% overall with the Nonpareil strategy would perform differently. That doesn't stop targeting by coverage from seeming the more appropriate approach.
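One quick way to sanity-check that intuition would be to count distinct k-mers before and after each scheme and see how much the k-mer space actually shrinks. A toy exact-set version is below (only practical on small test subsets; a real run would use a dedicated counter such as KMC or ntCard), with illustrative file names:

```python
"""Toy check of how much a subsampling scheme shrinks the k-mer space.
An exact Python set is only workable for small test subsets."""

def distinct_kmers(fastq_path, k=31):
    """Count distinct canonical k-mers in an uncompressed FASTQ file."""
    comp = str.maketrans("ACGT", "TGCA")
    kmers = set()
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 != 1:          # sequence lines only
                continue
            seq = line.strip().upper()
            for j in range(len(seq) - k + 1):
                kmer = seq[j:j + k]
                if "N" in kmer:
                    continue
                rc = kmer.translate(comp)[::-1]   # reverse complement
                kmers.add(min(kmer, rc))          # canonical form
    return len(kmers)

# e.g. compare distinct_kmers("original.fq") with
# distinct_kmers("np_subsampled.fq") and distinct_kmers("random_subsampled.fq")
```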

I'd be really grateful for any thoughts! I will be experimenting in tandem.

Best wishes,

Andrew

nonpareil coassembly metagenomics
2.9 years ago
andrew ▴ 10

In case anyone finds it interesting: I tried the Nonpareil-targeted subsampling on 5 sets of read pairs and compared it with the original reads and with even subsampling to the same total depth. I ignored any contigs shorter than 1,000 base pairs.

Original: 309 million bases assembled, largest contig 452,000 bp, N50 3,877
NP subsampling: 163 million bases assembled, largest contig 230,000 bp, N50 8,990
Random subsampling: 129 million bases assembled, largest contig 476,000 bp, N50 7,759
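
For anyone wanting to reproduce these metrics (total assembled bases, largest contig, N50, over contigs >= 1,000 bp) on their own assemblies, a minimal sketch is below; the contig FASTA path is illustrative (e.g. MEGAHIT's final.contigs.fa):

```python
"""Minimal sketch of the summary metrics reported above, applied to a
contig FASTA with a 1,000 bp minimum length cutoff."""

def contig_lengths(fasta_path, min_len=1000):
    lengths, current = [], 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current >= min_len:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current >= min_len:
        lengths.append(current)
    return lengths

def n50(lengths):
    """Smallest contig length such that contigs at least that long
    cover >= 50% of the total assembly length."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    return 0

lengths = contig_lengths("final.contigs.fa")
print(sum(lengths), max(lengths), n50(lengths))
```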


Add digital normalization to the comparison; you've got me curious.

Also, do not forget: the metrics you are reporting are useful, but they aren't the only ones to consider when evaluating which method / final assembly is better.

2.9 years ago
h.mon 35k

Especially for metagenomics, subsampling is bad, as it will discard reads irrespective of whether they are rare (from an uncommon, low-frequency organism) or common. Thus, you will be throwing away data you don't want to throw away. From what I understood, your method will still discard uncommon data, thus worsening coverage of already poorly represented organisms.

For co-assembly of all datasets together, I think there is no problem in performing digital normalization. True, the literature has examples of assemblers that perform better and of assemblers that perform worse after digital normalization (I suspect this also depends on the particular data set at hand), but I would guess the former is more common than the latter.
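
For anyone unfamiliar with what digital normalization actually does: the core idea is to keep a read only if the median abundance of its k-mers, among the reads kept so far, is below a coverage threshold C. A toy sketch of that logic is below, using an exact counting dict; real tools (khmer's normalize-by-median.py, BBNorm) do the same thing with probabilistic counting so memory stays bounded. Parameters here are illustrative:

```python
"""Toy sketch of digital normalization: keep a read only if the median
abundance of its k-mers, among reads kept so far, is below `cutoff`."""
from collections import Counter
from statistics import median

def diginorm(reads, k=20, cutoff=20):
    """reads: iterable of sequence strings -> list of kept sequences."""
    counts = Counter()
    kept = []
    for seq in reads:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        if not kmers:
            continue
        # Median k-mer abundance estimates this read's current coverage.
        if median(counts[km] for km in kmers) < cutoff:
            kept.append(seq)
            counts.update(kmers)   # only kept reads add to the counts
    return kept
```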


Thanks h.mon. It's a great point -- it feels like balancing two evils: information loss and coverage distortion. Random subsampling loses more information but preserves coverage, whereas digital normalisation preserves more data but disrupts the coverage signals (which I understand are important for breaking up the assembly graph). If, theory aside, most assemblers perform better despite digital normalisation, then perhaps I should just bite that bullet. In theory, at least, I hoped that my approach was a worthwhile compromise -- preserve the coverage signal, and lose data preferentially from well-covered samples, where the loss should matter least.

In searching for benchmarks of the above, I have come across the idea of graph based partitioning with khmer -- to date I was only aware of partitioning based on things like GC content. The PNAS paper showing that the same contigs were generated after partitioning is very encouraging: https://www.pnas.org/content/109/33/13272


Though I note that read partitioning with khmer is said to be deprecated and not recommended (http://ivory.idyll.org/blog/2016-partitioning-no-more.html)

