I am trying to understand the CEGMA report file named .completeness_report.
I am focusing on the #Prots column (#Prots = the number of the 248 ultra-conserved CEGs present in the genome). For example, I obtained 237 (in the partial row), which means that 237 of the 248 ultra-conserved CEGs are predicted in my genome.
In the other output files the number of proteins is much larger, because they contain all of the KOGs (458 proteins, not 237). I am only interested in the ultra-conserved CEGs, so I filtered all of my files (the 248 reference ids come from the file completeness_cutoff.tbl in cegma/data). I expected to get an output with 237 ids, but I obtained 234 proteins! How is this possible?
My second question is: what is the number after the KOG id (e.g. the .2 in KOG0002.2) in the CEGMA output?
There are two sets of core eukaryotic genes (CEGs). The larger set (458 CEGs) is designed to help train a gene finder in novel genomes. All of the CEGMA output except the completeness report file refers to this larger set of core genes.
A subset of the 458 CEGs can be used to assess the completeness of the gene space of your target genome. These 248 CEGs are taken from the larger set, but CEGMA uses slightly different filtering criteria to determine whether they are present. So it is possible for CEGMA to report a CEG as present in the set of 458 CEGs but NOT in the subset of 248 CEGs.
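To see exactly which of the 248 reference ids are missing from a filtered output file, a set comparison is usually enough. The following is only a minimal sketch: the function names are hypothetical, the ids other than KOG0002 are made-up toy values, and it assumes that the part after the dot (as in KOG0002.2) can be ignored when matching ids, which you should check against your own files. It works on lines of text, so you would need to read completeness_cutoff.tbl and the CEGMA output files yourself.

```python
import re

# Matches the bare KOG id and stops before any ".N" suffix (assumption:
# the suffix is not needed to match ids between files).
KOG_ID = re.compile(r"KOG\d+")

def extract_kog_ids(lines):
    """Collect the bare KOG ids (e.g. 'KOG0002') found in an iterable of text lines."""
    ids = set()
    for line in lines:
        ids.update(KOG_ID.findall(line))
    return ids

def compare_with_reference(reference_ids, found_ids):
    """Split ids into those shared, those missing from the output, and extras."""
    reference_ids = set(reference_ids)
    found_ids = set(found_ids)
    return {
        "in_both": reference_ids & found_ids,
        "missing_from_output": reference_ids - found_ids,
        "outside_reference": found_ids - reference_ids,
    }

# Toy data standing in for the 248 reference ids and a CEGMA output file.
reference = ["KOG0002", "KOG0003", "KOG0018"]
output_lines = [">KOG0002.2 example", ">KOG0019.1 example", ">KOG0003.5 example"]

result = compare_with_reference(reference, extract_kog_ids(output_lines))
print(sorted(result["missing_from_output"]))  # ['KOG0018']
```

Listing the ids in "missing_from_output" for your real data should show which reference CEGs are counted in the completeness report but absent from the other files.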
Your original question refers to partial CEGs. These are candidate core genes that exceed a score threshold but do not exceed the length threshold needed to be considered 'complete' (this is a somewhat arbitrary threshold: is 95% of a gene complete? How about 85%?). Your genome may contain many partial core genes that are not complete, and so none of them will be present in the other CEGMA output files. That could account for a filtered count that is lower than the partial count in the report, such as 234 instead of 237.
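The distinction between partial and complete can be illustrated with a small sketch. This is not CEGMA's actual code: the function name, the parameters, and all threshold values are hypothetical and only follow the description above (a score threshold plus a length threshold).

```python
def classify_candidate(score, aligned_fraction, score_cutoff, complete_fraction):
    """Label a candidate core gene as 'absent', 'partial' or 'complete'.

    All arguments are illustrative: score_cutoff stands in for the score
    threshold and complete_fraction for the length threshold (as a fraction
    of the gene that must be covered).
    """
    if not score > score_cutoff:
        return "absent"      # does not exceed the score threshold
    if aligned_fraction > complete_fraction:
        return "complete"    # exceeds both thresholds
    return "partial"         # exceeds the score threshold only

# The same candidate changes category when the length threshold moves,
# which is why the threshold is somewhat arbitrary.
print(classify_candidate(120.0, 0.90, 100.0, 0.95))  # partial
print(classify_candidate(120.0, 0.90, 100.0, 0.85))  # complete
```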