I have a question about genome assembly.
I am trying to assemble a genome de novo with SPAdes. The data come from a culture and include some bacterial contamination, because the organism does not survive without bacteria and even filtration does not remove all of them.
For sequencing we used three libraries with insert sizes of 500, 5,000 and 10,000 bp. They were sequenced on a MiSeq (2x300 bp).
Reads were trimmed, and then the genome was assembled with SPAdes.
The problem is the average insert size estimated by SPAdes. I always get this warning:
Estimated mean insert size 316.923 is very small compared to read length 300
I used default parameters. I know that SPAdes is not the best assembler for long reads, but I wanted to try it.
For the 500 bp library the estimate is very close (475 bp), but for the 5 kb library the estimated average insert size is around 316 bp, and for the 10 kb library it is around 426 bp.
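To make the size of the mismatch concrete, here is a minimal Python sketch that compares the estimated insert sizes with the nominal ones. It is not part of SPAdes. The library labels, the function names and the 0.5 cutoff are assumptions chosen only for illustration, and the 5 kb and 10 kb estimates are the approximate values quoted above.

```python
# Nominal insert sizes (library design) and the estimates reported above.
# Labels are hypothetical; the 316 and 426 values are approximate.
LIBRARIES = {
    "500bp": (500, 475.0),
    "5kb": (5000, 316.0),
    "10kb": (10000, 426.0),
}


def estimate_ratio(nominal_bp, estimated_bp):
    """Return the estimated insert size as a fraction of the nominal size."""
    return estimated_bp / nominal_bp


def flag_suspicious(libraries, min_ratio=0.5):
    """Return the labels of libraries whose estimate falls below
    min_ratio of the nominal size. The cutoff is an arbitrary assumption."""
    return [
        name
        for name, (nominal, estimated) in libraries.items()
        if estimate_ratio(nominal, estimated) < min_ratio
    ]


def report(libraries):
    """Print one line per library with nominal size, estimate and ratio."""
    for name, (nominal, estimated) in libraries.items():
        ratio = estimate_ratio(nominal, estimated)
        print(f"{name}: nominal {nominal} bp, estimated {estimated:.0f} bp, "
              f"ratio {ratio:.1%}")
```

Calling `report(LIBRARIES)` prints the three ratios, and `flag_suspicious(LIBRARIES)` lists the libraries whose estimate is far below the design value, which here are the two long-insert libraries.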
As a result, I don't get long scaffolds. The number of N's in the final assembly is also extremely low (4613 bases marked N in a 70 Mb assembly). We intentionally prepared the libraries this way to get a good assembly with long scaffolds. The longest scaffold is 150 kb, and only 133 scaffolds are longer than 50 kb, which together account for roughly 10% of the entire 70 Mb.
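For anyone who wants to check the same numbers on their own assembly, the figures above (N count, longest scaffold, scaffolds above 50 kb and their share of the assembly) can be computed with a short Python sketch like the one below. The function names are hypothetical, the 50 kb cutoff simply mirrors the one used above, and the sketch assumes that the 10% figure refers to the fraction of assembled bases.

```python
def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def scaffold_stats(records, min_len=50_000):
    """Summarise scaffold lengths and N content.

    min_len is the length cutoff for a "long" scaffold (assumed 50 kb)."""
    lengths = []
    n_bases = 0
    for _, seq in records:
        lengths.append(len(seq))
        n_bases += seq.upper().count("N")
    total = sum(lengths)
    long_lengths = [length for length in lengths if length > min_len]
    return {
        "total_bp": total,
        "n_bases": n_bases,
        "longest_bp": max(lengths, default=0),
        "num_long": len(long_lengths),
        "frac_in_long": sum(long_lengths) / total if total else 0.0,
    }
```

To use it, open the scaffolds FASTA file written by the assembler and pass the handle through `read_fasta` into `scaffold_stats`; the returned dictionary holds the total length, the number of N bases, the longest scaffold, and the count and base fraction of scaffolds above the cutoff.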
Should I try a different assembler? Is this normal? I doubt that the sequencing company prepared three libraries with 500 bp inserts. Can you suggest another good assembler for multi-cell data? I have little experience in genomics, and most of my previous work was on single-cell genomes, but I know that SPAdes can also be used for genomes sequenced from multi-cell cultures.