Statistical question (different number of gene orthologs in database)
5 weeks ago

Hello, in my laboratory we are looking for orthologs of a specific gene family, so I built a manually curated database for those genes. I extracted them as amino acid FASTA files from UniProtKB: https://www.uniprot.org/uniprot/

I built the database with an unequal number of representatives per gene. The priority was to retrieve as many variants as possible for each gene. The issue is that some genes ended up represented by only a single amino acid sequence.

Here is an example of genes and the number of variants retrieved from UniProtKB:

example

Here the "nar" gene is represented by only a single sequence, in contrast to "nagG", which has 182 representative sequences.

When mapping two environmental metagenomic read samples against this database of unequally represented genes (with mmseqs2), matches against nagG or xylE are more likely to occur than matches against "nar". And that is exactly what I saw when plotting the matches of the metagenomic reads against this database on a heat map.

heatmap

So although these results may show a real biological signal, it would be premature to say that the metagenomic communities have a lower abundance of the "nar" gene, or that the "nagG" gene is more abundant in the samples, given the large difference in the number of variants used to build the database (1 for nar and 121 for nagG). So how should I deal with this difference? Should I standardize the abundances by the number of variants that represent each gene?

Something like this?

Abundance = N(gene) / n(gene_db)

where N(gene) is the number of reads mapping to the specific gene in the database and n(gene_db) is the number of variants (or orthologs) that represent that gene in the database.
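To make the proposed formula concrete, here is a minimal Python sketch of that normalization. The function name `normalize_by_variants` and the read counts are hypothetical and only for illustration; the variant counts would come from the database and the read counts from the mmseqs2 hits.

```python
def normalize_by_variants(read_counts, variants_per_gene):
    """Abundance = N(gene) / n(gene_db) for every gene in the database."""
    abundance = {}
    for gene, n_variants in variants_per_gene.items():
        if n_variants <= 0:
            raise ValueError(f"{gene} has no variants in the database")
        # Genes with no mapped reads get an abundance of 0.
        abundance[gene] = read_counts.get(gene, 0) / n_variants
    return abundance


# Illustrative numbers only (not real results).
variants_per_gene = {"nar": 1, "nagG": 182}
read_counts = {"nar": 10, "nagG": 910}

abundance = normalize_by_variants(read_counts, variants_per_gene)
print(abundance)
```

With these made-up counts, nagG has far more mapped reads than nar, but fewer reads per database variant.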

database statistics • 211 views
5 weeks ago
Asaf 8.9k

I would avoid comparing read counts between the nar and nagG genes altogether. There are a lot of potential biases when counting reads mapped to a gene, from DNA extraction and library preparation to gene length and biases in post-processing and mapping. The questions you _can_ answer are along the lines of: do samples in group A have a higher nar/nagG ratio than samples in group B?
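A rough Python sketch of that kind of comparison, with made-up read counts and hypothetical helper names (`log2_ratio`, `group_difference`, `rescale`). It also illustrates why the ratio comparison is less sensitive to database size: a per-gene constant, such as the number of nagG variants, shifts every sample's log-ratio by the same amount, so the difference between groups does not change. This only holds if the bias is the same in every sample, and a real analysis would need replicates and a proper statistical test.

```python
import math
from statistics import mean


def log2_ratio(sample, num="nar", den="nagG"):
    """log2(num/den) for one sample; both counts must be positive."""
    return math.log2(sample[num] / sample[den])


def group_difference(group_a, group_b):
    """Mean log2 ratio in group A minus mean log2 ratio in group B."""
    return mean(log2_ratio(s) for s in group_a) - mean(log2_ratio(s) for s in group_b)


def rescale(group, factor):
    """Divide nagG counts by a per-gene constant (e.g. number of variants)."""
    return [{"nar": s["nar"], "nagG": s["nagG"] / factor} for s in group]


# Hypothetical read counts per sample (not real data).
group_a = [{"nar": 12, "nagG": 300}, {"nar": 20, "nagG": 410}]
group_b = [{"nar": 3, "nagG": 350}, {"nar": 5, "nagG": 420}]

raw_diff = group_difference(group_a, group_b)
# Normalizing nagG by its 182 database variants leaves the difference unchanged.
scaled_diff = group_difference(rescale(group_a, 182), rescale(group_b, 182))
print(raw_diff, scaled_diff)
```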
