I'd like to group the genes of a single genome (tomato, in my case) into gene families. I am clustering by sequence similarity (not by domain content), using OrthoFinder.
The input is the whole proteome of the species. Interestingly, only ~18% of genes end up grouped into gene families; the rest are singletons. This contrasts with most studies and databases I have seen, where most genes (sometimes ~80%) belong to gene families. In most of those cases, though, people use multiple genomes from various species rather than a single genome. I suspect this explains the low level of clustering, since adding more genomes can create new edges in the similarity graph.
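To make that suspicion concrete, here is a toy sketch of the effect I have in mind. It is not OrthoFinder's actual algorithm (as far as I understand, OrthoFinder runs MCL on normalized similarity scores); I use plain connected components as a simplification, and the gene IDs and edges are hypothetical, made up purely for illustration.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into clusters by following similarity edges (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    return list(clusters.values())


def fraction_in_families(clusters, genes_of_interest):
    """Fraction of genes_of_interest sharing a cluster with another such gene."""
    grouped = 0
    for cluster in clusters:
        members = cluster & genes_of_interest
        if len(members) >= 2:
            grouped += len(members)
    return grouped / len(genes_of_interest)


# Hypothetical tomato genes; only Sl1 and Sl2 hit each other directly.
tomato = {"Sl1", "Sl2", "Sl3", "Sl4"}
tomato_edges = [("Sl1", "Sl2")]

single = connected_components(tomato, tomato_edges)
print(fraction_in_families(single, tomato))  # 0.5

# Hypothetical protein from another species that hits both Sl3 and Sl4,
# although those two do not pass the similarity threshold against each other.
other = {"St1"}
extra_edges = [("Sl3", "St1"), ("Sl4", "St1")]

multi = connected_components(tomato | other, tomato_edges + extra_edges)
print(fraction_in_families(multi, tomato))  # 1.0
```

In this toy case the second species does not change the tomato genes themselves; it only supplies a bridging node, which is enough to turn two singletons into a family.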
My question is: what is the right way to achieve my goal, which is simply to cluster the genes of a single genome? Should I:
1) Stick with what I'm doing, since this is correct from the perspective of a single genome?
2) Include proteins from other species in the analysis? This seems a bit strange, and it would make my results depend on which species I choose to include.
3) Use some strategy that assigns genes to pre-defined gene families rather than clustering from scratch?
4) Something else?