Forum:[Detailed Question] How to Normalize of Metagenomics Gene Abundances (with HMMs models)
Entering edit mode
11 months ago
Davide • 0

Hi everyone,

Me and my colleagues have been thinking about normalization for quite some time, and we have a hard time finding a consensus both within us, and within the literature, for this reason, this is going to be a long and detailed post. Talking with other researchers we feel that they are also experiencing the same problems/questions, and therefore, we really hope we can get detailed answers, or instead, that we can start a dialogue on how to proceed with the normalization of metagenomic data. Our doubts are divided into two significant cases, normalization on metagenomes, and on MAGS.

CASE 1) Starting with the case of the metagenomes (shot-gun metagenomics):

Let us assume we start with metagenomes of interest, and we want to know the abundance of multiple genes of interest (as hmm models). Until now, our workflow is illustrated below:

  • Trimming (fastp)
  • Assembly (Metaspades)
  • ORF prediction (prodigal)
  • Read-mapping (bbmap) and ( to get the average coverage)
  • hmmsearch ( with hmm models)
  • Normalization?

Now, we will let you know what has been our procedure, and our major assumptions, and we would like to know, if they are correct and if not, why and how could we proceed in the best way possible.

To simplify we explain our workflow with just two genes (gene X and gene Y) and multiple metagenomes (Sample A, Sample B, and Sample C).

After running prodigal on our metagenomes, we mapped the reads with bbmap, and we get the average coverage of each contig.

First assumption: We are considering the average coverage of each identified orf as the average coverage of the contig where the orf was predicted, in order to reduce the bias of differential gene length. For example:

orf1 contig_1 17.5
orf2 contig_1 17.5
orf3 contig_2 5
orf4 contig_3 26.8

Then we run hmmsearch for our gene of interest (geneX) on the predicted orf protein sequences.

After we get the output from hmmsearch, we did our second assumption:

Second assumption: We defined the total coverage of gene X, as the sum of all the average coverage of the multiple hits obtained from hmm search.

gene X hits:

orf1 contig_1 31.6
orf2 contig_1 45.6
orf3 contig_2 25.9

gene Y hits:

orf6 contig_8 6.9
orf9 contig_11 3.4
orf11 contig_11 3.4

total coverage of gene X = 31.6 + 45.6 + 25.9 = 103.1 total coverage of gene Y = 6.9 + 3.4 + 3.4 = 13.7

Now, imagine that we have computed the abundance of gene X and gene Y in multiple metagenomes (all the samples above)

           sampleA      sampleB      sampleC       ...

geneX      103.1         53.2         99.2            
geneY      13.7          23.3         28.2              

We now need to perform some normalization to account for the differences between the metagenomes (for example, metagenome length).

Our temporary normalization goes as follows for each of the genes: Normalization= (Total coverage/total size of the metagenome) * 1000000 (for improved readability)

Checking the literature ( we saw that there are other types of normalization such as TMM, RLE, etc.

Which type of normalization would you consider, and why?. We think that we are following the right approach, but we are not sure if also the TMM or RLE can be applied to our procedure. We read also about normalization based on housekeeping genes. In this latter case, how will you define which housekeeping genes to use? and also, what if in the case that the assembler does not assemble the housekeeping genes due to their conservative nature? Do you have any suggestions for alternative normalization methods?

CASE 2) The second case we want to highlight regards the same procedure but without the average coverage. So after the HMMSEARCH, we sum the hits, and not the average coverage reported. This way instead of having (as shown above):

total coverage of **gene X** = 31.6 + 45.6 + 25.9 = 103.1
total coverage of **gene Y** = 6.9 + 3.4 + 3.4 = 13.7

we would have:

total hit counts of **gene X** = 3
total hit counts of **gene Y** = 3

This, in our opinion, illustrates how the two approaches can be biased: we have the same number of hits but with very different results.

Can this approach be used as well? what are the pros and the fallbacks of using this?

CASE 3) Now, regarding the MAGS approach (never tested):

Starting with the obtained high-confidence MAGS (after 3 binners, DAStool, and checkM). Do we need to think of another type of quantification and normalization specific to the MAGS? Such as the one employed by (, where we can do the assumption as the gene abundance can be considered the same as the MAG abundance, computed as follows:

For each MAG, the total length of mapped reads for individual scaffolds (mapped reads using BWA algorithm) is summed up and the total is then divided by the MAG size in bp. This number is then divided by the total number of reads to obtain the relative abundance.

Additionally, similarly to the previous case, when one does not have access to the raw reads, is it accurate to compute the abundance as (similarly to the second case):

gene abundance = the total number of hits/total number of orf predicted from the MAG?

can we use our approach also for the MAGs case?

Do you know another package (or integrated method) that is able to compute what we want?

We look forward to a detailed answer, and we hope we can start a productive dialogue about these issues.

Thank you very much.

normalization • 1.3k views
Entering edit mode
11 months ago
Mensur Dlakic ★ 27k

This isn't going to be a long post, because I don't see a point for calculating gene abundance.

MAG abundance can be calculated directly by CheckM. You would need all the bins from a given assembly and a mapping of all sequencing reads against the same assembly. Not sure how and why one would decouple the gene abundance from its MAG abundance.

Entering edit mode
11 months ago
Asaf 10k

My take of it would be not go through assembly, just map the reads to gene families, this might not be very accurate but I think that counting on the assembly will introduce more bias.

There is no good way to normalize it, all the assumptions about normalization fall apart in microbiome samples. I would go in the path of ANCOM which basically compares two genes (or genomes, the entity doesn't matter) and asks whether the abundance of one gene compared to the other is different in group A vs group B. You then ask the question about all the gene pairs (or gene pairs of interest) and if a gene is different in a lot of such cases then something is happening.

The advantage of ANCOM, besides avoiding normalization, is that you can run it with every statistical test you want, you can introduce fixed and random effects, run parametric or non-parametric tests etc.


Login before adding your answer.

Traffic: 1286 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6