I came across a list of criteria that NCBI uses to select reference genomes, which includes CheckM completeness among other factors, but I couldn't find whether there is a defined cutoff value that an assembly must meet to be considered eligible.
Does anyone know if NCBI uses a specific CheckM completeness threshold for reference genome selection? Or is it purely comparative across the available assemblies for a given species?
It appears there is no strict threshold; instead, the cutoff depends on how many other assemblies have already been submitted for a given taxon. I think their logic is that it's okay to accept a lower-quality genome if it fills gaps in taxonomy until better ones come along.
In their prokaryotic release notes, this is what it says:
Added CheckM completeness cut-offs to validate annotation. An annotated assembly will only be added to the RefSeq collection if it meets the following criteria:
For species with more than 1000 assemblies in RefSeq, the completeness is higher than the species Average Completeness - 3 times the standard deviation.
For species with 10-1000 assemblies in RefSeq, the completeness is higher than the smaller of 90% or the species Average Completeness - 3 times the standard deviation.
No CheckM cutoff is applied if there are less than 10 assemblies in the species.
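To make the three rules above concrete, here is a minimal Python sketch of how I read them. The function and parameter names are my own (hypothetical), not anything from NCBI, and it assumes you already have the per-species mean and standard deviation of completeness from somewhere else.

```python
from typing import Optional


def checkm_cutoff(n_assemblies: int,
                  mean_completeness: float,
                  sd_completeness: float) -> Optional[float]:
    """Return the completeness cutoff (in percent) implied by the quoted
    release notes, or None when no cutoff applies.

    All names here are illustrative; this is not NCBI code.
    """
    # Fewer than 10 assemblies in the species: no CheckM cutoff is applied.
    if n_assemblies < 10:
        return None

    # Species average minus three standard deviations.
    stat_cutoff = mean_completeness - 3 * sd_completeness

    # More than 1000 assemblies: use the statistical cutoff directly.
    if n_assemblies > 1000:
        return stat_cutoff

    # 10-1000 assemblies: the smaller of 90% or the statistical cutoff.
    return min(90.0, stat_cutoff)


def passes_checkm(completeness: float,
                  n_assemblies: int,
                  mean_completeness: float,
                  sd_completeness: float) -> bool:
    """True if an assembly's completeness is higher than the cutoff,
    or if no cutoff applies for this species."""
    cutoff = checkm_cutoff(n_assemblies, mean_completeness, sd_completeness)
    return cutoff is None or completeness > cutoff
```

For example, with a species average of 99% and a standard deviation of 1%, an assembly at 92% would pass in a species with 500 assemblies (cutoff 90%) but not in one with 2000 assemblies (cutoff 96%).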
According to their documentation on selecting a genome, CheckM does not appear to be considered for eukaryotic genomes. Where it is used, the documentation says:
In order, assemblies with the highest quantized level of completeness
(98 to 100) are preferred over assemblies in the 95-98, 90-95, 85-90,
70-85, 50-70, and under 50 percent level of completeness, as
determined by CheckM.
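The binning described in that passage could be sketched roughly like this in Python. The names are hypothetical, and the quoted text does not say which bin a boundary value such as exactly 98 falls into, so I am assuming each bin includes its lower bound. This only models the completeness criterion, not the other factors NCBI weighs.

```python
from typing import List, Tuple

# Lower bounds (percent) of the bins 98-100, 95-98, 90-95, 85-90, 70-85, 50-70.
# Anything below the last bound falls into the "under 50" bin.
LOWER_BOUNDS = [98.0, 95.0, 90.0, 85.0, 70.0, 50.0]


def completeness_tier(completeness: float) -> int:
    """Return 0 for the most preferred bin (98 to 100) up to 6 for under 50.

    Assumption: a bin includes its lower bound (e.g. 98.0 is in tier 0).
    """
    for tier, lower in enumerate(LOWER_BOUNDS):
        if completeness >= lower:
            return tier
    return len(LOWER_BOUNDS)


def rank_by_tier(assemblies: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Order (name, completeness) pairs by tier only. Assemblies in the same
    tier keep their input order, since the sort is stable."""
    return sorted(assemblies, key=lambda item: completeness_tier(item[1]))


# Made-up example values, not real assemblies.
example = [("asm_a", 96.5), ("asm_b", 99.1), ("asm_c", 98.2)]
ranked = rank_by_tier(example)
```

The point of quantizing is that 99.1% and 98.2% are treated as equivalent here, so the choice between them would fall to whatever other criteria apply rather than to the small difference in completeness.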
Thank you very much for the clarification!