I am interested in studying the variation in gene family size across different mammalian species. I obtained gene family annotation data from the PANTHER database and counted the number of genes belonging to each gene family in each species. However, I suspect that there may be annotation bias in the gene family assignments. For example, a gene family named XXX might appear to have more genes in human than in platypus because: 1. the human genome sequence is of higher quality than the platypus genome, and/or 2. the platypus proteome is less well studied, so many platypus proteins may be unannotated. In fact, when I made boxplots of the distribution of gene family sizes in different species, some species showed a greater median family size than others.
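To make the counting step concrete, here is a minimal sketch of what I am doing. The records, species, and family names are made up for illustration and are not the actual PANTHER file format. The total gene count assumes each gene is assigned to exactly one family.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy records: (species, gene_id, family_id).
# Real annotation files have a different layout; this is only illustrative.
records = [
    ("human", "h1", "FAM_A"),
    ("human", "h2", "FAM_A"),
    ("human", "h3", "FAM_A"),
    ("human", "h4", "FAM_B"),
    ("platypus", "p1", "FAM_A"),
    ("platypus", "p2", "FAM_B"),
    ("platypus", "p3", "FAM_B"),
]


def family_sizes(rows):
    """Return {species: {family_id: number of distinct genes}}."""
    genes = defaultdict(lambda: defaultdict(set))
    for species, gene_id, family_id in rows:
        genes[species][family_id].add(gene_id)
    return {
        species: {family: len(ids) for family, ids in families.items()}
        for species, families in genes.items()
    }


sizes = family_sizes(records)

# Use the union of families so a family missing in a species counts as size 0.
all_families = sorted({fam for fams in sizes.values() for fam in fams})

median_size = {}
total_genes = {}
for species, fams in sizes.items():
    counts = [fams.get(family, 0) for family in all_families]
    median_size[species] = median(counts)
    # Assumes one family per gene, so summing family sizes gives the gene total.
    total_genes[species] = sum(counts)

for species in sorted(sizes):
    print(species, median_size[species], total_genes[species])
```

Comparing the median family size with the total number of annotated genes per species is one rough way to see whether the species with larger families are simply the ones with more complete annotation.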
So, is it right to say that there is annotation bias in gene family size due to the two factors above, or due to some other factor that I didn't mention? Secondly, is there any way to correct for this possible annotation bias? I have also tried using HOGENOM data, and a similar bias is evident there too. Would defining gene families myself, by doing an all-against-all alignment of the genomes of my species of interest, help eliminate the possible biases?
Any input, suggestions, or direction would be very helpful, as I am relatively new to this field.