(1) Choosing cutoffs for filtering low-abundance or noisy taxa
When deciding on cutoffs for filtering low-abundance or noisy taxa in a relative abundance table—especially for co-occurrence network construction in shotgun metagenomics—several key factors come into play. These help balance noise reduction with retention of biologically relevant signals, while accounting for your dataset's specifics (e.g., number of taxa and samples). Here's a breakdown:
- Dataset characteristics: Sparsity is a big one in microbiome data; most taxa are rare or absent in many samples, and the data are compositional (proportions, not absolute counts). With fewer samples (e.g., <50), looser cutoffs (e.g., >=0.01% in >=5% of samples) prevent over-filtering and the loss of rare but potentially interacting taxa. Larger datasets (e.g., >100 samples, thousands of taxa) tolerate stricter thresholds (>=0.1% in >=10-20% of samples) to curb computational load and false edges in networks.
- Downstream analysis goals: For co-occurrence networks, aggressive filtering reduces spurious correlations from low-count taxa (which amplify noise in Pearson/Spearman correlations). Studies show that filtering rare taxa stabilizes network topology (e.g., modularity, degree distribution) by focusing on "core" interactors. However, if you're interested in rare-taxa dynamics, use milder cutoffs or run sensitivity analyses (test multiple thresholds and compare network metrics like the clustering coefficient).
- Biological and technical context: Human gut data show high interpersonal variability, so consider prevalence across your cohort (e.g., retain taxa present in >=10% of individuals for "common" members). Sequencing depth matters: shallower runs (<5M reads/sample) inflate apparent rarity, so align cutoffs with your read counts. Taxonomic resolution (MetaPhlAn4 here) also matters; species-level bins may need higher thresholds than genus-level ones.
- Validation approaches: No universal cutoff exists, so iterate: plot taxon prevalence vs. abundance histograms to visualize the "long tail" of rare taxa, then evaluate post-filter network properties (e.g., via igraph or NetCoMi in R). Tools like phyloseq or the microbiome package can automate threshold sweeps.
In practice, start with 0.01-0.1% abundance in 5-20% samples, then adjust based on retaining ~20-50% of taxa—enough for robust networks without sparsity overload.
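The iterative threshold sweep suggested above can be sketched in Python. A minimal sketch, assuming a taxa-by-samples matrix of proportions; the simulated counts and the `prevalence_filter` helper are illustrative, not from any specific package:

```python
import numpy as np

def prevalence_filter(rel_ab, min_ab=1e-4, min_prev=0.10):
    """Keep taxa whose relative abundance exceeds min_ab in at least
    min_prev (a fraction) of samples.
    rel_ab: (n_taxa, n_samples) array of proportions (columns sum to 1)."""
    prevalence = (rel_ab > min_ab).mean(axis=1)  # fraction of samples where taxon is "present"
    return rel_ab[prevalence >= min_prev, :]

# Toy sparse data: 300 taxa x 40 samples of negative-binomial counts
rng = np.random.default_rng(0)
counts = rng.negative_binomial(1, 0.02, size=(300, 40))
rel_ab = counts / counts.sum(axis=0, keepdims=True)

# Threshold sweep: count how many taxa survive each cutoff pair
for min_ab in (1e-4, 1e-3):
    for min_prev in (0.05, 0.10, 0.20):
        kept = prevalence_filter(rel_ab, min_ab, min_prev).shape[0]
        print(f"abundance>={min_ab:.2%} in >={min_prev:.0%} samples: {kept}/300 taxa kept")
```

Plotting the kept-taxa count (or downstream network metrics) against these cutoffs makes the "20-50% of taxa retained" target easy to check.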
(2) Clarifying the "relative abundance >= x% in at least y% of samples" criterion
Yes, your interpretation is spot-on: this is a prevalence-based filter, not a mean-based one. It means a taxon is considered "present" only in samples where its relative abundance exceeds the x% threshold (e.g., 0.01%), and it's retained if such presences occur in >= y% of all samples (e.g., 10%). Taxa below x% in a sample are effectively zeroed out (absent) for that sample before applying the prevalence check. This targets consistent, detectable signals while discarding transients or contaminants.
- Why this over mean relative abundance? Means dilute sporadic detections (e.g., a taxon at 1% in one sample but 0% elsewhere averages ~0.01% across 100 samples, risking false retention). Prevalence emphasizes ecological consistency, which is crucial for networks where weak/sporadic taxa inflate false co-occurrences.
Prevalence filtering is far more common in microbiome studies and appears in most recent papers on gut metagenomics (e.g., for DAA or networks). Mean-based cutoffs (e.g., global mean >=0.05%) appear occasionally for simplicity but are criticized for bias toward ubiquitous low-abundance taxa; they're rarer in network-focused work. If using means, always pair them with a minimum-count filter (e.g., >=10 reads total). In R's phyloseq it's a one-liner: filter_taxa(physeq, function(x) sum(x > 0.0001) >= (0.1 * nsamples(physeq)), TRUE) for x=0.01%, y=10% (assuming the table holds proportions).
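If your table lives in Python rather than phyloseq, the same x%/y% rule translates directly to pandas. A sketch with an invented helper name and toy table, samples as rows:

```python
import pandas as pd

def presence_prevalence_filter(df, x=0.0001, y=0.10):
    """df: samples x taxa table of relative abundances (rows sum to 1).
    A taxon counts as 'present' in a sample only if its abundance > x;
    it is retained if present in >= y (fraction) of all samples."""
    present = df > x                   # per-sample presence after the x cutoff
    keep = present.mean(axis=0) >= y   # prevalence across samples
    return df.loc[:, keep]

table = pd.DataFrame(
    {"taxonA": [0.30, 0.25, 0.40],        # common, always above cutoff
     "taxonB": [0.00002, 0.0, 0.00001],   # never exceeds x -> dropped
     "taxonC": [0.05, 0.0, 0.0]},         # present in 1 of 3 samples
    index=["s1", "s2", "s3"])

filtered = presence_prevalence_filter(table, x=0.0001, y=0.10)
print(list(filtered.columns))  # -> ['taxonA', 'taxonC']
```

Note how taxonC survives: its single detection gives a prevalence of 1/3, above the 10% bar, while taxonB's sub-threshold values never count as presences at all.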
(3) Renormalizing after filtering
Absolutely—yes, re-scale the remaining relative abundances to sum to 1 per sample before correlation calculations. Filtering removes taxa (and their contributions to the total), so the table becomes "sub-compositional," distorting pairwise correlations (e.g., inflating positives among survivors). This is standard to maintain the proportional integrity needed for methods like SparCC or SPIEC-EASI, which assume closed-sum data.
Quick how-to: in R, after phyloseq::filter_taxa(), re-close each sample with physeq <- transform_sample_counts(physeq, function(x) x / sum(x)), which divides every sample's abundances by their new total regardless of table orientation. In Python, divide each row (sample) by its new sum.
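The Python side is one broadcast division. A minimal numpy sketch with toy numbers, samples as rows:

```python
import numpy as np

# Two samples x two surviving taxa; rows summed to 1 before
# other taxa were filtered out, so they no longer do.
filtered = np.array([[0.50, 0.30],
                     [0.10, 0.60]])

# Divide each sample (row) by its new total to restore the closed sum
renorm = filtered / filtered.sum(axis=1, keepdims=True)
print(renorm.sum(axis=1))  # each row sums to 1 again
```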
Caveat: if you instead take a log-ratio route (e.g., CLR-transformed counts for SPIEC-EASI), skip this step and apply the transform to the raw counts post-filter; CLR is invariant to each sample's total, so renormalizing first gains nothing. But for relative-abundance networks, renormalization prevents artifacts.
(4) Typical number of bacterial species in human gut (MetaPhlAn4)
In shotgun metagenomics of the human gut, MetaPhlAn4 typically detects 150-400 bacterial species per sample, depending on sequencing depth, host factors (e.g., diet, age), and cohort diversity.
- At standard depths (5-10M reads/sample), expect ~200-300 species on average: roughly 40-50 "core" (ubiquitous) species plus 100-250 variable ones. Deeper sequencing (>20M reads) can push this to 400-500, including unknown species-level genome bins (uSGBs) absent from reference catalogs. Across cohorts (e.g., HMP or ILO studies), totals span ~100 (low-diversity, e.g., infants) to >500 (high-fiber diets), with medians around ~250.
This is at species level; genus-level tallies are smaller (~50-100 genera). For your network, this resolution suits co-occurrence analysis well; focus on the taxa that survive filtering.