I have 8 metagenomic samples (bacterial DNA) for which I have generated count matrices (I used Prokka for annotation). This means that I now have the abundance of every gene in each of my 8 samples. I then normalized the counts with TPM (Transcripts Per Million), as my samples are of different sizes, so that I can compare them better. I want to group my genes into COGs (Clusters of Orthologous Groups). My goal is to look at the relative gene abundance in the different samples, and to be able to compare specific COGs to other COGs across samples. But now I have run into a problem.
When I group my genes into COGs, the samples that have the most functionally annotated proteins (as opposed to "hypothetical proteins") will naturally have a higher total of reads mapped to the COGs. Say sample "A" has 60% hypothetical proteins and sample "B" has 40% hypothetical proteins. Then sample "B" has more functionally annotated proteins (60% versus 40% in "A"), and thus more proteins (and consequently more reads) will group into each COG. So when I compare the same COG across samples, in most cases the COG from sample "B" will have more reads mapped to it.
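To make the problem concrete, here is a minimal sketch with made-up numbers (not my real data). The COG letters, the TPM values, and the helper name `cog_totals` are all just assumptions for illustration. Both toy samples have exactly the same 3:1 ratio between the two COGs, and they only differ in how much TPM falls on hypothetical proteins.

```python
# Toy data (hypothetical, NOT from my samples): (TPM, COG) per gene.
# cog=None marks a "hypothetical protein" that has no COG assignment.

def cog_totals(genes):
    """Sum TPM per COG, skipping genes without a COG assignment."""
    totals = {}
    for tpm, cog in genes:
        if cog is None:
            continue
        totals[cog] = totals.get(cog, 0.0) + tpm
    return totals

# Sample A: 60% of the TPM is on hypothetical proteins.
sample_a = [(600_000.0, None), (300_000.0, "C"), (100_000.0, "E")]
# Sample B: 40% of the TPM is on hypothetical proteins.
sample_b = [(400_000.0, None), (450_000.0, "C"), (150_000.0, "E")]

totals_a = cog_totals(sample_a)
totals_b = cog_totals(sample_b)

# Every COG looks 1.5x more abundant in B, although the composition of
# the annotated genes (C:E = 3:1) is identical in both samples.
for cog in sorted(totals_a):
    print(cog, totals_a[cog], totals_b[cog], totals_b[cog] / totals_a[cog])
```

In this toy case the 1.5x difference comes entirely from the different share of hypothetical proteins, not from any difference in the COGs themselves.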
How do I solve this problem? If I re-calculate my TPM using only the genes assigned to COGs (disregarding hypothetical proteins), would that give me a false picture of the relative gene abundance?
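To be clear about what I mean by re-calculating, this is roughly the idea, as a sketch on made-up per-COG TPM sums (the numbers, the COG letters, and the function name `renormalize_annotated` are assumptions, not my real data or an established method). Each COG is rescaled so that the annotated genes alone sum to one million.

```python
def renormalize_annotated(cog_tpm):
    """Rescale per-COG TPM so that the annotated fraction sums to 1e6.

    cog_tpm maps a COG label to its summed TPM; hypothetical proteins
    are assumed to have been dropped before this step.
    """
    total = sum(cog_tpm.values())
    return {cog: value / total * 1_000_000 for cog, value in cog_tpm.items()}

# Toy per-COG sums (hypothetical): sample A has 40% of its TPM annotated,
# sample B has 60%, and both have the same C:E ratio of 3:1.
cog_tpm_a = {"C": 300_000.0, "E": 100_000.0}
cog_tpm_b = {"C": 450_000.0, "E": 150_000.0}

renorm_a = renormalize_annotated(cog_tpm_a)
renorm_b = renormalize_annotated(cog_tpm_b)

print(renorm_a)
print(renorm_b)
```

After this step both toy samples give identical values per COG, but each value now means "share of the annotated genes" rather than "share of all genes in the sample", and that change of meaning is what I am unsure about.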