Hello, in my laboratory we are looking for orthologs of a specific gene family, so I built a manually curated database for those genes. I extracted them as amino acid FASTA files from UniProtKB. https://www.uniprot.org/uniprot/
I built the database with an unequal number of representatives per gene. The priority was to retrieve as many variants as possible for each gene. The issue is that some genes ended up represented by only a single amino acid sequence.
Here is an example of genes and the number of variants retrieved from UniProtKB:
Here the "nar" gene is represented by only a single sequence, in contrast to "nagG", which has 182 representative sequences.
When mapping two environmental metagenomic read samples against this database (with mmseqs2), matches against nagG or xyLE are more likely to occur than matches against "nar", simply because those genes have more representatives. And that is exactly what I saw when I plotted the matches of the metagenomic reads against this database as a heat map.
So although these results may reflect a real biological signal, it would be premature to conclude that the metagenomic communities have a lower abundance of the "nar" gene, or that the "nagG" gene is more abundant in the samples, given the large difference in the number of variants used to build the database (1 for nar and 121 for nagG). So how should I deal with this difference? Should I standardize the abundances by the number of variants that represent each gene?
Something like this?
Abundance = N(gene) / n(gene_db)
where N(gene) is the number of reads mapping to the specific gene in the database and n(gene_db) is the number of variants (or orthologs) that represent that gene in the database.
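As a minimal sketch, the proposed normalization could be computed in Python as below. The function name `normalize_by_db_size` and the read counts are hypothetical and made up for illustration only; the database sizes reuse the 1 and 182 figures mentioned above. It assumes the reads have already been counted per gene from the mmseqs2 results.

```python
def normalize_by_db_size(read_counts, db_sizes):
    """Divide the reads mapped to each gene by the number of
    database sequences (variants) that represent that gene."""
    normalized = {}
    for gene, n_reads in read_counts.items():
        n_variants = db_sizes.get(gene, 0)
        if n_variants <= 0:
            # A gene with no database representatives cannot be normalized.
            raise ValueError(f"no database variants recorded for {gene!r}")
        normalized[gene] = n_reads / n_variants
    return normalized


# Hypothetical read counts for one sample (made-up numbers, not real results).
read_counts = {"nar": 5, "nagG": 364}
# Number of database sequences per gene, as described in the post.
db_sizes = {"nar": 1, "nagG": 182}

abundance = normalize_by_db_size(read_counts, db_sizes)
print(abundance)  # {'nar': 5.0, 'nagG': 2.0}
```

With these made-up counts, nagG has far more raw hits than nar, but after dividing by the number of variants its value is lower. Whether this simple division is the right correction is exactly the open question here.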