Question: Problem improving mammalian contig assemblies
memory_donk (Australia) wrote, 2.2 years ago:

Hi Biostars, I was hoping someone might be able to give some guidance on troubleshooting a mammalian genome assembly. My species has a genome size similar to human, and I have ~150X coverage of 2x150bp Illumina reads for building contigs.

I've previously had modest luck assembling contigs for a closely related species with much lower coverage (~35-40X of 2x150bp Illumina). For that species I used SOAPdenovo2, so I gave it another shot. This time, though, my contig sizes were pathetically small despite the considerably higher coverage available for this species. I've verified that the data really is from the right species and that the library isn't strongly biased toward any one part of the genome: BLASTs of randomly selected reads turn up hits from close relatives, and the reads map at a very high rate to my previous related species' assembly with a fairly even coverage histogram across the genome. I've also tried deduplicating the read data, which didn't change the results significantly.

I'm basically at a loss as to why contig assembly comes out so much worse for this new species when the underlying genome is very similar to my previous species, the library isn't very biased, and the coverage is significantly better.

As a secondary problem, I've also tried SPAdes, but my dataset seems to be crashing the program despite giving it ~900 GB of memory. From what I've read, SPAdes appears to load the whole dataset into memory, and mine is larger than the available memory (about 950 GB). Is there a good strategy for dividing up a dataset, assembling the parts, and then combining the resulting assemblies?

Tags: assembly, genome

You can subsample the dataset for SPAdes assembly.

Reply by Sej Modha, 2.2 years ago

Thanks for your reply. I'm just not certain how subsampling is a strategy for dividing the dataset and combining the resulting assemblies. Writing a script to pull random reads out of a file is easy; making a high-quality assembly out of multiple smaller assemblies is a rather different problem.

Reply by memory_donk, 2.2 years ago

I tend to use subsampling, and it almost always gives a better assembly when the depth is good. I'd generate an assembly from the subsampled reads, then align all of the reads back to the contigs and call a consensus to make sure the assembly represents the original data.

Reply by Sej Modha, 2.2 years ago
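
As a rough illustration of the subsampling step Sej describes (not his exact workflow), here is a minimal Python sketch, assuming gzipped paired-end FASTQ files; the file names and fraction are placeholders. It keeps or drops each read pair with a fixed probability and a fixed seed so the two mate files stay in sync.

```python
import gzip
import random

def subsample_pairs(in1, in2, out1, out2, fraction, seed=100):
    """Keep each read pair with probability `fraction`.

    Making one RNG decision per pair keeps R1 and R2 synchronised.
    """
    rng = random.Random(seed)
    with gzip.open(in1, "rt") as f1, gzip.open(in2, "rt") as f2, \
         gzip.open(out1, "wt") as o1, gzip.open(out2, "wt") as o2:
        while True:
            rec1 = [f1.readline() for _ in range(4)]  # one FASTQ record = 4 lines
            rec2 = [f2.readline() for _ in range(4)]
            if not rec1[0] or not rec2[0]:            # end of either file
                break
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)

# e.g. keep roughly 50X out of 150X coverage (placeholder file names)
subsample_pairs("reads_1.fq.gz", "reads_2.fq.gz",
                "sub_1.fq.gz", "sub_2.fq.gz", fraction=50 / 150)
```

In practice seqtk sample, run with the same -s seed on each mate file, does the same job much faster; the consensus step Sej mentions would then be done by mapping all reads back to the contigs with a standard aligner and calling a consensus from the alignments.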

To second Sej: when coverage is very high, sequencing errors recur often enough to look like genuine k-mers, and contigs get broken at those positions. Maybe you can change the parameters and set the minimum coverage cutoff to a higher value.

Reply by Asaf, 2.2 years ago
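
To make Asaf's point concrete, here is a rough diagnostic sketch (not an assembler command; the file name, k and the read limit are placeholder choices) that builds a k-mer frequency histogram from a chunk of the reads. With clean high-coverage data you'd expect a tall error peak at multiplicity 1-2 and a main peak near the k-mer coverage; the valley between the two is a sensible minimum-coverage cutoff.

```python
import gzip
from collections import Counter
from itertools import islice

def kmer_histogram(fastq_gz, k=21, max_reads=200_000):
    """Count k-mers in the first `max_reads` reads and return a
    frequency-of-frequencies histogram (multiplicity -> number of k-mers)."""
    counts = Counter()
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(islice(fh, 4 * max_reads)):
            if i % 4 != 1:              # the sequence is the 2nd line of each record
                continue
            seq = line.strip().upper()
            for j in range(len(seq) - k + 1):
                kmer = seq[j:j + k]
                if "N" not in kmer:
                    counts[kmer] += 1
    return Counter(counts.values())

hist = kmer_histogram("reads_1.fq.gz")   # placeholder file name
for multiplicity in sorted(hist)[:60]:
    print(multiplicity, hist[multiplicity])
```

Pure Python like this is slow and memory-hungry, and on a small chunk of a ~3 Gb genome most genomic k-mers appear only once, so the coverage peak won't show up; for a real histogram you'd run a dedicated counter such as Jellyfish or KMC over most of the data and inspect the same frequency-of-frequencies table.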

Hi Asaf, I think you and Sej are on the right track. Looking back at my previous assemblies, I can see the total assembly size is quite a bit larger than it ought to be (suggesting many extra contigs, possibly due to errors). I also ran one of my smaller sets of reads (~50X), and it improved the assembly somewhat over the total set. I might try a random sampling of the total set next. Do you think I should try error correction before or after subsampling?

Reply by memory_donk, 2.2 years ago
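
Since the inflated total size and the contig N50 keep coming up in this thread, a quick sanity check is to compute total size, contig count, largest contig and N50 directly from the contig FASTA. A minimal sketch (contigs.fa is a placeholder path):

```python
def assembly_stats(fasta_path):
    """Return total size, contig count, largest contig and N50
    from a (possibly multi-line) FASTA of contigs."""
    lengths, current = [], 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
        if current:
            lengths.append(current)

    lengths.sort(reverse=True)
    total = sum(lengths)
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running >= total / 2:   # N50: length at which half the assembly is reached
            n50 = length
            break
    return {"total": total, "contigs": len(lengths),
            "largest": lengths[0], "N50": n50}

print(assembly_stats("contigs.fa"))   # placeholder path
```

Comparing the total against the expected ~3 Gb genome size, before and after raising the coverage cutoff, makes it easy to see whether the extra contigs are going away.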

Thank you Sej, I'll give that a try.

Reply by memory_donk, 2.2 years ago

If anyone around has other suggestions, I'd really appreciate it. I've tried downsampling and error correction, but neither made more than a trivial difference in contig N50. Everything I can tell about this dataset says it's excellent quality, but I can't get even half-decent contigs.

Reply by memory_donk, 2.1 years ago

So I've tried downsampling as people suggested in the comments, without any success. The difference in contig size was basically trivial. I've tried 200X, 150X, 70X, 50X, 40X and 30X with very little difference in the result (a few hundred bp or so), and the largest contig has hardly changed in size (~28kb). I've also tried error correction, which again made a trivial difference. If anyone has other suggestions, I'm getting a little desperate.

Reply by memory_donk, 2.1 years ago
h.mon (Brazil) wrote, 2.2 years ago:

From the SPAdes site:

SPAdes is not intended for larger genomes (e.g. mammalian size genomes). For such purposes you can use it at your own risk.

It seems you are stretching your luck using SPAdes with your genome.

SGA has a pre-assembly quality-check module (sga preqc) that runs k-mer-based diagnostics; it could give you some hints about why this genome is harder to assemble.

Anyway, de novo assembling mammalian-sized genomes without mate-pairs or long reads will never result in a decent genome.


I have long-insert libraries, but my question was about contig assembly, so they aren't relevant here; my main issue was with SOAPdenovo2, as I said. I'll try SGA and see if that provides any new info. Thanks.

Reply by memory_donk, 2.2 years ago