So here's the dilemma. I have raw Illumina reads from a new, previously undescribed species of bacteria, and I'd like to assemble them into a draft genome. However, I don't know whether my sequencing runs covered 100% of the genome. I suspect the assembly may be only 98% complete, with gaps and artifacts in regions that my runs missed or could not sequence. I want an exact number, because this 98% is just a rough guess. However, all I have are a bunch of raw reads. I think they cover the genome at an average depth of 25x, which is good. But is it even computationally possible to determine the quality/completeness of an assembly based on just raw reads? How should I change my approach to this problem?
The 'completeness' of a genome is an abstract concept that is not easy to check.
On the one hand, you can assemble your reads de novo and extract general statistics to get a rough idea of the assembly (number of scaffolds, mean scaffold length, N50...). On the other hand, you can compare the size of your de novo assembly to the genome size of a phylogenetically close bacterial species with a well-assembled genome and see whether they are similar. I'd also try mapping the assembled scaffolds against that closely related genome and calculating the percentage of the genome covered.
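To make those checks a bit more concrete, here is a minimal Python sketch, not a replacement for a proper assembly-evaluation tool. It assumes you already have the scaffolds as FASTA text and a list of reference intervals hit by your scaffolds, which you would have to parse yourself from whatever whole-genome aligner you use. The function names (`read_fasta_lengths`, `n50`, `covered_fraction`) and the interval format (0-based, half-open `(start, end)` on a single reference sequence) are my own assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def read_fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence found in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Largest length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def covered_fraction(intervals: List[Tuple[int, int]], reference_length: int) -> float:
    """Fraction of the reference covered by the union of half-open [start, end) intervals."""
    covered = 0
    last_end = 0
    for start, end in sorted(intervals):
        start = max(start, last_end)  # do not count overlapping bases twice
        if end > start:
            covered += end - start
            last_end = end
    return covered / reference_length


if __name__ == "__main__":
    # Toy data, purely illustrative: two tiny scaffolds and a made-up reference size.
    fasta = [">scaffold_1", "ACGTACGT", "ACGT", ">scaffold_2", "GGGCCC"]
    lengths = read_fasta_lengths(fasta)
    reference_length = 20  # hypothetical genome size of the related species
    print("scaffolds:", len(lengths))
    print("total length:", sum(lengths))
    print("N50:", n50(lengths))
    print("size ratio vs reference:", sum(lengths) / reference_length)
    print("reference covered:", covered_fraction([(0, 12), (10, 16)], reference_length))
```

Keep in mind that both the size ratio and the covered fraction are only as informative as the reference is close to your species: regions that are genuinely absent from your bacterium will look like missing assembly, and novel regions will not be counted at all.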
This is what I'd do in your case... but there are certainly other things you could do, and as I said, the completeness of an assembly is not easy to determine. It is also important to consider 'genome mappability', which varies from genome to genome and affects the assembly.