Hi Biostars, I was hoping someone might be able to give me some guidance on troubleshooting a mammalian genome assembly. My species has a genome size similar to that of humans, and I have ~150X coverage of 2x150 bp Illumina reads for building contigs.
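For a sense of scale, here is a rough back-of-the-envelope sketch of what ~150X implies. The 3.1 Gb genome size is only an assumption standing in for "similar to humans", and the function name is just illustrative.

```python
def estimate_read_count(genome_size_bp: float, coverage: float, read_length_bp: int) -> float:
    """Number of individual reads needed to reach a given mean coverage."""
    return genome_size_bp * coverage / read_length_bp


# Assumed values: the genome size is a guess at "roughly human-sized".
GENOME_SIZE_BP = 3.1e9
COVERAGE = 150
READ_LENGTH_BP = 150

total_bases = GENOME_SIZE_BP * COVERAGE
reads = estimate_read_count(GENOME_SIZE_BP, COVERAGE, READ_LENGTH_BP)
pairs = reads / 2  # 2x150 bp: two reads per sequenced fragment

print(f"total sequenced bases: {total_bases:.3g}")
print(f"individual reads:      {reads:.3g}")
print(f"read pairs:            {pairs:.3g}")
```

Under those assumptions this comes to several hundred gigabases of sequence and on the order of billions of reads, which is the scale the assemblers have to cope with.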
I previously had modest success assembling contigs for a closely related species with much lower coverage (~35-40X of 2x150 bp Illumina reads). I used SOAPdenovo2 for that species, so I gave it another shot here. This time, though, my contigs were pathetically small despite the considerably greater coverage available for this species. I've verified that the data are indeed from the right species and that the library isn't strongly biased toward one part of the genome or another: BLAST searches of randomly selected reads return hits from close relatives, and a very high percentage of reads map to my assembly of the previous, related species, with a fairly level coverage histogram across the genome. I've also tried deduplicating the read data, which didn't change the results significantly.
I'm basically at a loss as to why contig assembly should come out so much worse for this new species when the underlying genome is very similar to that of my previous species, the library isn't strongly biased, and the coverage is significantly higher.
As a secondary problem, I've also tried SPAdes, but my dataset seems to crash the program (despite my giving it ~900 GB of memory). From what I've read, SPAdes must be loading the total dataset into memory (which is larger than available memory, about 950 GB). Is there a good strategy for dividing up a dataset, assembling the pieces separately, and then combining the assemblies?
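To make the "dividing up" part concrete, here is a minimal sketch of the kind of split I have in mind. It assumes the reads are in two mate files with standard 4-line FASTQ records in the same order. The function name, output naming scheme, and seed are all hypothetical. It streams the files, so it never holds the dataset in memory, and it keeps mates together by sending each pair to the same subset. It does not address how to combine the resulting assemblies, and each subset naturally has only a fraction of the original coverage.

```python
import gzip
import random
from itertools import islice, zip_longest


def open_text(path):
    """Open a plain or gzip-compressed file for reading as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "r")


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_pairs(r1_path, r2_path, out_prefix, n_parts, seed=0):
    """Randomly assign each read pair to one of n_parts subsets.

    Writes <out_prefix>.part<i>_R1.fastq and <out_prefix>.part<i>_R2.fastq
    and returns the number of pairs written to each subset.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    counts = [0] * n_parts
    outs = [
        (open(f"{out_prefix}.part{i}_R1.fastq", "w"),
         open(f"{out_prefix}.part{i}_R2.fastq", "w"))
        for i in range(n_parts)
    ]
    try:
        with open_text(r1_path) as h1, open_text(r2_path) as h2:
            pairs = zip_longest(fastq_records(h1), fastq_records(h2))
            for rec1, rec2 in pairs:
                if rec1 is None or rec2 is None:
                    raise ValueError("R1 and R2 contain different numbers of records")
                i = rng.randrange(n_parts)
                outs[i][0].writelines(rec1)  # both mates go to the same subset
                outs[i][1].writelines(rec2)
                counts[i] += 1
    finally:
        for out1, out2 in outs:
            out1.close()
            out2.close()
    return counts
```

For example, `split_pairs("reads_R1.fastq.gz", "reads_R2.fastq.gz", "subset", 4)` (file names made up) would produce four subsets at roughly a quarter of the original coverage each.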