Question

statistical question (different number of gene orthologs on database)

23 months ago

v.berriosfarias ▴ 140

Hello, in my laboratory we are looking for orthologs of a specific gene family, so I built a manually curated database for those genes. I extracted them as amino acid FASTA files from UniProtKB: https://www.uniprot.org/uniprot/

I built the database with an unequal number of representatives per gene. The priority was to retrieve as many variants as possible for each gene. The issue is that some genes are represented by only a single amino acid sequence.

Here is an example of genes and the number of variants retrieved from UniProtKB:

example

Here the "nar" gene is represented by only a single sequence, in contrast to "nagG", which has 182 representative sequences.

When mapping two environmental metagenomic read samples against this database (with MMseqs2), matches against nagG or xylE are more likely to occur than matches against "nar", because those genes have more representatives. And that is exactly what I saw when plotting the matches of the metagenomic reads against this database as a heat map.

heatmap

So although these results may reflect a real biological signal, it would be premature to say that the metagenomic communities have a lower abundance of the "nar" gene, or that the "nagG" gene is more abundant in the samples, given the large difference in the number of variants used to build the database (1 for nar and 121 for nagG). So how should I deal with this difference? Should I standardize the abundances by the number of variants that represent each gene?

Something like this?

Abundance = N(gene) / n(gene_db)

where N(gene) is the number of reads mapping to the specific gene in the database and n(gene_db) is the number of variants (orthologs) that represent that gene in the database.
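To make the proposed formula concrete, here is a minimal Python sketch of it. The function name `normalize_by_variants` and all the numbers are placeholders made up for illustration; they are not my real counts.

```python
def normalize_by_variants(read_counts, variants_per_gene):
    """Divide each gene's mapped-read count by its number of database sequences."""
    normalized = {}
    for gene, n_reads in read_counts.items():
        n_variants = variants_per_gene[gene]
        if n_variants <= 0:
            raise ValueError(f"gene {gene!r} has no sequences in the database")
        normalized[gene] = n_reads / n_variants
    return normalized


# Hypothetical values, for illustration only.
variants_per_gene = {"nar": 1, "nagG": 182}   # n(gene_db)
read_counts = {"nar": 12, "nagG": 910}        # N(gene) for one sample

abundance = normalize_by_variants(read_counts, variants_per_gene)
print(abundance)
```

Note that this divides every sample's count for a given gene by the same constant, so it only changes comparisons between genes, not comparisons of the same gene between samples.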

database statistics • 617 views

updated 23 months ago by Asaf 10k • written 23 months ago by v.berriosfarias ▴ 140

score 1 · Answer 1 · 2022-05-22

I would avoid comparing read counts between the nar and nagG genes altogether. There are many potential biases when counting reads mapped to a gene, from DNA extraction and library preparation to gene length and biases in post-processing and mapping. The questions you _can_ answer are along the lines of: do samples in group A have a higher nar/nagG ratio than samples in group B?
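A rough sketch of that kind of comparison, assuming you have per-sample read counts for both genes. The counts, the group labels, the `log2_ratio` helper and the pseudocount of 1 (added to avoid dividing by zero) are all assumptions for illustration, not values from your data.

```python
import math
from statistics import mean


def log2_ratio(counts, numerator="nar", denominator="nagG", pseudocount=1.0):
    """log2 of the numerator/denominator read-count ratio for one sample.

    The pseudocount is an assumption added here so that zero counts
    do not produce a division by zero or log of zero.
    """
    return math.log2(
        (counts[numerator] + pseudocount) / (counts[denominator] + pseudocount)
    )


# Hypothetical per-sample read counts, for illustration only.
group_a = [{"nar": 15, "nagG": 63}, {"nar": 7, "nagG": 31}]
group_b = [{"nar": 3, "nagG": 127}, {"nar": 1, "nagG": 63}]

ratios_a = [log2_ratio(sample) for sample in group_a]
ratios_b = [log2_ratio(sample) for sample in group_b]

# Difference in mean log2(nar/nagG) between the two groups.
difference = mean(ratios_a) - mean(ratios_b)
print(ratios_a, ratios_b, difference)
```

Without the pseudocount, any fixed per-gene factor (such as the number of database variants) shifts every sample's log-ratio by the same constant, so it cancels out of the between-group difference. That is why this comparison does not depend on how many sequences represent each gene. With only a handful of samples per group you would still need a proper statistical test before drawing conclusions.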