Question

Problem improving mammal contig assemblies

1

7.2 years ago

memory_donk ▴ 360

Hi Biostars, I was hoping someone may be able to give some guidance about troubleshooting a mammalian genome assembly. My species has a similar genome size to humans and I have ~150X coverage of 2x150bp Illumina reads for building contigs.
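For reference, the coverage figures in this thread follow the usual back-of-the-envelope arithmetic: total sequenced bases divided by genome size. Below is a minimal sketch of that calculation; the function names are hypothetical, and the 3 Gb genome size in the usage note is only an assumption standing in for "similar to humans".

```python
import math


def coverage(n_pairs, read_length, genome_size):
    """Nominal depth of coverage for paired-end reads (two reads per pair)."""
    return 2 * n_pairs * read_length / genome_size


def pairs_for_coverage(target, read_length, genome_size):
    """Number of read pairs needed to reach a target nominal coverage."""
    return math.ceil(target * genome_size / (2 * read_length))
```

With an assumed 3 Gb genome and 2x150bp reads, `pairs_for_coverage(150, 150, 3_000_000_000)` comes to 1.5 billion read pairs.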

I've previously had modest luck assembling contigs for a closely related species with much lower coverage (~35-40X of 2x150bp Illumina). For the previous species I used SOAPdenovo2, so I gave that another shot. This time, though, my contig sizes were pathetically small despite the considerably greater coverage available for this species. I've verified that the data is indeed from the right species and that the library isn't very biased toward one part of the genome or another (BLAST searches of randomly selected reads turn up hits from close relatives, and I get a very high mapping rate to my previous related species' assembly, with a fairly level coverage histogram across the genome). I've also tried deduplicating the read data, which didn't change the results significantly.

I'm basically at a loss for why contig assembly should come out so much worse for this new species when the underlying genome itself is very similar to my previous species, when the library isn't very biased and when the coverage is significantly better.

As a secondary problem, I've also tried SPAdes, but my dataset seems to be crashing the program (despite giving it ~900 GB of memory). From what I've read, SPAdes must be loading the total dataset into memory (the dataset is about 950 GB, which is larger than the available memory). Is there a good strategy for dividing up a dataset, assembling, then combining the assemblies?

Assembly genome • 2.2k views

ADD COMMENT • link updated 7.2 years ago by h.mon 35k • written 7.2 years ago by memory_donk ▴ 360

0

You can subsample the dataset for SPAdes assembly.

ADD REPLY • link 7.2 years ago by Sej Modha 5.3k

1

Thanks for your reply. I'm just not certain how subsampling is a strategy for dividing the dataset and combining the resulting assemblies. Writing a script to take random reads out of a file is easy; making high-quality assemblies from multiple smaller assemblies is a somewhat different problem, though.

ADD REPLY • link 7.2 years ago by memory_donk ▴ 360

2

I tend to use subsampling, and it almost always gives a better assembly for data with good depth. I'd generate an assembly from the subsampled reads, then align all reads back to the contigs and call a consensus to ensure that the assembly represents the original data.
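The subsampling step can be sketched roughly as follows. This is only an illustration, not a specific tool's method: it assumes uncompressed, unwrapped 4-line FASTQ records with the two mate files in the same order, and the function names are made up for the example. Each pair is kept or dropped as a unit so the mates stay in sync.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=42):
    """Keep each read pair with probability `fraction`; return the number kept."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = 0
    for rec1, rec2 in zip(read_fastq(r1_in), read_fastq(r2_in)):
        if rng.random() < fraction:
            # Write both mates together so the output files stay paired.
            r1_out.writelines(rec1)
            r2_out.writelines(rec2)
            kept += 1
    return kept
```

To go from 150X down to roughly 50X, `fraction` would be about 1/3. For a dataset of this size a dedicated subsampling tool will be much faster than plain Python.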

ADD REPLY • link 7.2 years ago by Sej Modha 5.3k

2

To second Sej: when coverage is very high, sequencing errors start to recur often enough to look like real sequence, and contigs get split as a result. Maybe you can change the parameters and set the minimum coverage to a higher value.

ADD REPLY • link 7.2 years ago by Asaf 10k

0

Hi Asaf, I think you and Sej are on the right track. I looked back at my previous assemblies and can see the total assembly size is quite a bit larger than it ought to be (suggesting many extra contigs, possibly due to errors). I also ran one of my smaller sets of reads (~50X) and it improved the assembly somewhat over the total set. I might try a random sampling of the total set next. Do you think I should try error correction before or after subsampling?

ADD REPLY • link 7.2 years ago by memory_donk ▴ 360

0

Thank you Sej, I'll give that a try.

ADD REPLY • link 7.2 years ago by memory_donk ▴ 360

0

If anyone around has other suggestions, I'd really appreciate it. I've tried downsampling and error correcting, but neither made more than a trivial difference in contig N50. Everything I can tell about this dataset says it's of excellent quality, but I can't get even half-decent contigs.
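For anyone comparing runs the same way, here is a minimal sketch of how contig N50 is usually computed from a contig FASTA. The helper names are hypothetical, and it uses the common convention that N50 is the length of the contig at which the running total, taken from the longest contig down, first reaches half the assembly size.

```python
def contig_lengths(fasta_handle):
    """Return the length of every sequence in a FASTA file handle."""
    lengths = []
    current = None
    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the sorted running sum reaches half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly
```

Summing the output of `contig_lengths` also gives the total assembly size, which is worth tracking next to N50 since an inflated total can point to error-derived contigs.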

ADD REPLY • link 7.1 years ago by memory_donk ▴ 360

0

So I've tried downsampling as people suggested in the comments, without any success. The difference in contig size was basically trivial. I've tried 200X, 150X, 70X, 50X, 40X and 30X with very little difference in the result (a few hundred bp or so), and the largest contig has hardly changed in size (~28kb). I've also tried error correcting, which again made only a trivial difference. If anyone has other suggestions, I'm getting a little desperate.

ADD REPLY • link 7.1 years ago by memory_donk ▴ 360

score 0 · Answer 1 · 2017-01-30

0

7.2 years ago

h.mon 35k

From SPAdes site:

SPAdes is not intended for larger genomes (e.g. mammalian size genomes). For such purposes you can use it at your own risk.

It seems you are stretching your luck using SPAdes with your genome.

SGA has a pre-assembly quality check module that runs k-mer-based diagnostics, which could give you some hints about why this genome is harder to assemble.
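To give an idea of what a k-mer-based diagnostic looks at, below is a toy sketch of a k-mer count spectrum (how many distinct k-mers occur once, twice, and so on). This is not SGA's implementation, just an illustration with made-up function names; it holds every k-mer in memory, so it is only usable on a small sample of reads.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k):
    """Map each occurrence count to the number of distinct canonical k-mers with that count."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return Counter(counts.values())
```

In a clean library the spectrum typically shows a spike of low-count k-mers from sequencing errors plus a main peak near the k-mer coverage; extra peaks or a heavy tail tend to indicate heterozygosity or repeats.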

Anyway, de novo assembling mammalian-sized genomes without mate-pairs or long reads will never result in a decent genome.

ADD COMMENT • link 7.2 years ago by h.mon 35k

0

I have long-insert libraries, but my question was about contig assembly, so they are irrelevant here. Also, as I said, my main issue was with SOAPdenovo2. I'll try SGA and see if that provides any new info. Thanks.

ADD REPLY • link 7.2 years ago by memory_donk ▴ 360