But I can't figure out what the columns (% in sequence), (% in genome), (% in genus), (% in Cyanobacteria) and (% in Bacteria) in this table each refer to. I can't go on with the statistical analysis without knowing the exact meaning of those data. I hope someone can help me with this one.
Unfortunately, there is very little help or documentation available for the COG database, so we are reduced to educated guesswork.
Taking the top row, COG category J, I'd guess that:
% in genus = percentage of proteins from the genus Acaryochloris that are assigned to COG category J
% in Cyanobacteria = percentage of proteins from the phylum Cyanobacteria that are assigned to COG category J
% in Bacteria = percentage of proteins from the domain Bacteria that are assigned to COG category J
The first two columns are less obvious. I'd guess that "% in sequence" might be based on a sum of sequence lengths (coding sequence only?) and that "% in genome" is the percentage of proteins from that one genome assigned to the category, but it is not clear at all.
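To make the difference between those two guesses concrete, here is a minimal Python sketch. The protein IDs, category assignments and lengths are made-up toy data, and the two function names are my own; this only illustrates the two possible calculations and is not how the table is known to have been produced.

```python
# Hypothetical toy data: (protein ID, COG category, length in amino acids).
# None of these values come from the real table.
proteins = [
    ("p1", "J", 300),
    ("p2", "J", 100),
    ("p3", "K", 400),
    ("p4", "L", 200),
]


def percent_by_count(proteins, category):
    """Percentage of proteins assigned to the category (guess for '% in genome')."""
    hits = sum(1 for _, cat, _ in proteins if cat == category)
    return 100.0 * hits / len(proteins)


def percent_by_length(proteins, category):
    """Percentage of total sequence length in the category (guess for '% in sequence')."""
    total = sum(length for _, _, length in proteins)
    hits = sum(length for _, cat, length in proteins if cat == category)
    return 100.0 * hits / total


print(percent_by_count(proteins, "J"))   # 2 of 4 proteins -> 50.0
print(percent_by_length(proteins, "J"))  # 400 of 1000 residues -> 40.0
```

If you have the underlying protein list for the genome, computing both numbers for one category and comparing them with the table should tell you which interpretation (if either) matches.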
Having said all that: I would not use COG - it is a very old database and is no longer maintained by the NCBI. You can get similar information from KEGG or the IMG (Integrated Microbial Genomes).