Good morning. I am having several problems with a WGS assembly. I have to assemble the genome of a yeast strain. Several genome assemblies already exist for that species, but we need a de novo assembly to capture all the variability. The estimated genome size is around 11-12 Mb (as shown in the published assemblies). Unfortunately, I don't have information about the specific adapters or indexes used. I only know that the library was prepared with NEBNext Ultra, the reads are paired-end, 151 bp long, and the sequencer is an Illumina HiSeq 2500.
As I didn't know the adapters, my workflow was:
- fastp with automatic adapter detection on and default parameters
- Merging of the paired reads
- SPAdes assembly with default parameters and k-mer sizes 21, 33, 55, 77 and 99
- QUAST for quality assessment
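For reference, the workflow above corresponds roughly to the commands built by the sketch below. This is only an illustration, assuming the unmerged variant of the run. The file names, the output directory and the `build_workflow` helper are hypothetical, and only commonly documented options of fastp, SPAdes and QUAST are used.

```python
import shlex


def build_workflow(r1, r2, outdir="asm", kmers=(21, 33, 55, 77, 99)):
    """Return the fastp, SPAdes and QUAST command lines as argument lists.

    All file names and the output directory are hypothetical placeholders.
    """
    kmer_arg = ",".join(str(k) for k in kmers)
    fastp = [
        "fastp",
        "-i", r1, "-I", r2,                      # raw paired-end input
        "-o", "trim_R1.fastq.gz",                # trimmed read 1
        "-O", "trim_R2.fastq.gz",                # trimmed read 2
        "--detect_adapter_for_pe",               # adapter auto-detection for paired-end data
    ]
    spades = [
        "spades.py",
        "-1", "trim_R1.fastq.gz",
        "-2", "trim_R2.fastq.gz",
        "-k", kmer_arg,                          # explicit k-mer sizes
        "-o", outdir,
    ]
    quast = ["quast.py", f"{outdir}/contigs.fasta", "-o", f"{outdir}/quast"]
    return [fastp, spades, quast]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for cmd in build_workflow("reads_R1.fastq.gz", "reads_R2.fastq.gz"):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine where the tools are not installed.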
I tried different trimming options in fastp, with and without merging the reads, and I always obtained an assembly with a good N50 (around 200,000 bp) but a total length of 20.5 Mb, which is nearly double what I expected.
As additional information, I used a k-mer count histogram and GenomeScope to estimate ploidy and genome size, and this is how it looks:
I am confused and don't know what the problem could be or what to try next, but I think the GenomeScope plot looks abnormal. Could someone suggest something I missed?
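As a rough cross-check of the genome size independent of GenomeScope's model, the usual back-of-the-envelope estimate divides the total number of k-mers by the coverage of the main peak. The sketch below assumes the histogram is available as a mapping from k-mer multiplicity to the number of distinct k-mers. The `min_cov` cutoff for discarding error k-mers and the toy numbers are assumptions, not values from my data.

```python
def estimate_genome_size(hist, min_cov=5):
    """Estimate genome size from a k-mer count histogram.

    hist maps multiplicity -> number of distinct k-mers seen that many times.
    K-mers below min_cov are treated as sequencing errors (an assumed cutoff).
    Returns (peak_coverage, estimated_size).
    """
    solid = {cov: n for cov, n in hist.items() if cov >= min_cov}
    if not solid:
        raise ValueError("no k-mers at or above min_cov")
    # The tallest remaining bin is taken as the main coverage peak.
    peak = max(solid, key=solid.get)
    total_kmers = sum(cov * n for cov, n in solid.items())
    return peak, total_kmers / peak


if __name__ == "__main__":
    # Toy histogram: an error spike at low multiplicity and one peak at 50x.
    toy = {1: 1_000_000, 2: 50_000, 49: 100, 50: 200, 51: 100}
    print(estimate_genome_size(toy))
```

If the tallest peak happens to be the heterozygous (half-coverage) peak rather than the homozygous one, this estimate comes out at about twice the haploid size, so it is worth checking which peak is being picked.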
Thank you very much in advance
Have you checked, using a reciprocal best hit search, how many of the assembled contigs have counterparts in the assemblies of the existing strains? What about BUSCO scores? Do they show a high proportion of duplicated BUSCOs? Have you tried assembling without merging the reads beforehand?
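A minimal sketch of the reciprocal best hit step, assuming you have run BLAST in both directions (your contigs against a reference assembly and the reverse) with the standard 12-column tabular output, where the bit score is the last column. The function names and the toy records are hypothetical.

```python
def best_hits(blast_tab_lines):
    """Return {query: subject} keeping the highest bit score per query.

    Expects standard 12-column BLAST tabular lines (bit score in column 12).
    """
    best = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    row = "{}\t{}\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t{}"
    contigs_vs_ref = [row.format("c1", "r1", 180), row.format("c2", "r1", 150)]
    ref_vs_contigs = [row.format("r1", "c1", 180)]
    print(reciprocal_best_hits(contigs_vs_ref, ref_vs_contigs))
```

Contigs that hit a reference sequence but are not its best hit in return (like `c2` in the toy data) are candidates for redundant or duplicated sequence, which is worth counting when the assembly is about twice the expected size.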
Generally, for genome assemblies one would make libraries with insert sizes larger than the number of sequencing cycles. It seems odd that you are able to merge the reads when you have 151 bp reads. What is the average insert size in your libraries?
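To make the arithmetic concrete: two reads of length L from a fragment of size I overlap by about 2L - I bases, so merging only works when the insert is shorter than roughly twice the read length. The sketch below illustrates this. The `min_overlap` threshold of 30 bases and the example insert sizes are assumptions for illustration.

```python
def expected_overlap(read_len, insert_size):
    """Number of bases shared by both reads of a pair.

    The overlap is 2 * read_len - insert_size, clamped between 0 and
    insert_size (the reads cannot share more bases than the fragment has).
    """
    return min(insert_size, max(0, 2 * read_len - insert_size))


def classify_pair(read_len, insert_size, min_overlap=30):
    """Describe what to expect when trying to merge a read pair."""
    if insert_size < read_len:
        return "read-through into adapter"
    if expected_overlap(read_len, insert_size) >= min_overlap:
        return "mergeable"
    return "not mergeable"


if __name__ == "__main__":
    for insert in (100, 250, 400):
        print(insert, expected_overlap(151, insert), classify_pair(151, insert))
```

With 151 bp reads, a library where most pairs merge must have inserts well below about 270 bp, and anything under 151 bp means the reads run into the adapter.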