I would greatly appreciate some help with my problem.

I have just assembled a genome de novo from Illumina 100 bp paired-end reads, using SOAPdenovo2 followed by GapCloser.

My total scaffold length is 1,062,995,336 bp (from 207,528 scaffolds), and my haploid genome size is approximately 1.2 Gb. From this I calculate a percentage coverage of 104%.

Have I calculated coverage incorrectly, or should I have filtered out short scaffolds? I am unsure why the coverage is greater than 100%.

Thanks very much for any help.


How did you calculate 104%? From what you've said, your assembly is 1.06 Gb in size and you are expecting 1.2 Gb, so wouldn't your coverage be about 88% (1.06/1.2)?
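
To make the arithmetic explicit, here is a quick sketch using the exact scaffold total from the question. The function name `assembly_fraction` is just something I made up for illustration, and 1.2 Gb is taken as exactly 1,200,000,000 bp, which is an assumption since the genome size is only approximate.

```python
def assembly_fraction(assembly_bp: int, expected_bp: int) -> float:
    """Return assembled length as a fraction of the expected genome size."""
    return assembly_bp / expected_bp


# Numbers from the question; the expected size is an approximation.
assembly_bp = 1_062_995_336
expected_bp = 1_200_000_000

fraction = assembly_fraction(assembly_bp, expected_bp)
print(f"{fraction:.1%}")  # prints 88.6%
```

Either way you round it, the assembly comes out below the expected size, not above it.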

What I would do is align the raw reads back to your scaffolds and then genotype to compute your coverage.

For an assembly I am currently working on, I ran into a similar problem with SGA. Jared Simpson recommended that I remove anything shorter than 2x the read length, to avoid polymorphic or repetitive sequence being over-counted. After I removed those short scaffolds, the total assembly size came close to what I got from other assemblers.
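
In case it is useful, here is a minimal sketch of that kind of length filter in plain Python. This is not what SGA or SOAPdenovo2 do internally, just one way to drop short scaffolds from a FASTA file and re-total the assembly. The function names and the file name `scaffolds.fa` are hypothetical, and the 100 bp read length is taken from the original question.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_scaffolds(records, read_length=100, factor=2):
    """Keep only scaffolds at least factor * read_length bases long."""
    min_len = read_length * factor
    return [(h, s) for h, s in records if len(s) >= min_len]


def total_length(records):
    """Sum the sequence lengths of all records."""
    return sum(len(s) for _, s in records)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python filter_scaffolds.py scaffolds.fa
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            records = list(read_fasta(handle))
        kept = filter_scaffolds(records, read_length=100, factor=2)
        print(f"before: {len(records)} scaffolds, {total_length(records)} bp")
        print(f"after:  {len(kept)} scaffolds, {total_length(kept)} bp")
```

This loads every scaffold into memory, which should be fine for an assembly of around 1 Gb on a typical analysis machine, but you could stream the records instead if memory is tight.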

