(1) Choosing cutoffs for filtering low-abundance or noisy taxa
When deciding on cutoffs for filtering low-abundance or noisy taxa in a relative abundance table—especially for co-occurrence network construction in shotgun metagenomics—several key factors come into play. These help balance noise reduction with retention of biologically relevant signals, while accounting for your dataset's specifics (e.g., number of taxa and samples). Here's a breakdown:
- Dataset characteristics: Sparsity is a big one in microbiome data; most taxa are rare or absent in many samples, and the data are compositional (proportions, not absolute counts). With fewer samples (e.g., <50), looser cutoffs (e.g., >=0.01% in >=5% of samples) prevent over-filtering and the loss of rare but potentially interacting taxa. Larger datasets (e.g., >100 samples, thousands of taxa) tolerate stricter thresholds (>=0.1% in >=10-20% of samples) to curb computational load and false edges in networks.
- Downstream analysis goals: For co-occurrence networks, aggressive filtering reduces spurious correlations from low-count taxa (which amplify noise in Pearson/Spearman correlations). Studies show that filtering rare taxa stabilizes network topology (e.g., modularity, degree distribution) by focusing on "core" interactors. However, if you're interested in rare-taxa dynamics, use milder cutoffs or run sensitivity analyses (test multiple thresholds and compare network metrics like the clustering coefficient).
- Biological and technical context: Human gut data show high interpersonal variability, so consider prevalence across your cohort (e.g., retain taxa present in >=10% of individuals for "common" members). Sequencing depth matters: shallower runs (<5M reads/sample) inflate apparent rarity, so align cutoffs with your read counts. Taxonomic resolution (MetaPhlAn4 here) also matters; species-level bins may need higher thresholds than genus-level ones.
- Validation approaches: No universal cutoff exists, so iterate: plot taxon prevalence vs. abundance histograms to visualize the "long tail" of rare taxa, then evaluate post-filter network properties (e.g., via igraph or NetCoMi in R). Tools like phyloseq or the microbiome package can automate threshold sweeps.
In practice, start with 0.01-0.1% abundance in 5-20% samples, then adjust based on retaining ~20-50% of taxa—enough for robust networks without sparsity overload.
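The iterative threshold sweep suggested above can be sketched in Python. A minimal sketch, assuming a taxa-by-samples matrix of proportions; the simulated counts and the `prevalence_filter` helper are illustrative, not from any specific package:

```python
import numpy as np

def prevalence_filter(rel_ab, min_ab=1e-4, min_prev=0.10):
    """Keep taxa whose relative abundance exceeds min_ab in at least
    min_prev (a fraction) of samples.
    rel_ab: (n_taxa, n_samples) array of proportions (columns sum to 1)."""
    prevalence = (rel_ab > min_ab).mean(axis=1)  # fraction of samples where taxon is "present"
    return rel_ab[prevalence >= min_prev, :]

# Toy sparse data: 300 taxa x 40 samples of negative-binomial counts
rng = np.random.default_rng(0)
counts = rng.negative_binomial(1, 0.02, size=(300, 40))
rel_ab = counts / counts.sum(axis=0, keepdims=True)

# Threshold sweep: count how many taxa survive each cutoff pair
for min_ab in (1e-4, 1e-3):
    for min_prev in (0.05, 0.10, 0.20):
        kept = prevalence_filter(rel_ab, min_ab, min_prev).shape[0]
        print(f"abundance>={min_ab:.2%} in >={min_prev:.0%} samples: {kept}/300 taxa kept")
```

Plotting the kept-taxa count (or downstream network metrics) against these cutoffs makes the "20-50% of taxa retained" target easy to check.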
(2) Clarifying the "relative abundance >= x% in at least y% of samples" criterion
Yes, your interpretation is spot-on: this is a prevalence-based filter, not a mean-based one. It means a taxon is considered "present" only in samples where its relative abundance exceeds the x% threshold (e.g., 0.01%), and it's retained if such presences occur in >= y% of all samples (e.g., 10%). Taxa below x% in a sample are effectively zeroed out (absent) for that sample before applying the prevalence check. This targets consistent, detectable signals while discarding transients or contaminants.
- Why this over mean relative abundance? Means dilute sporadic detections (e.g., a taxon at 1% in one sample but 0% elsewhere averages ~0.01% across 100 samples, risking false retention). Prevalence emphasizes ecological consistency, which is crucial for networks where weak/sporadic taxa inflate false co-occurrences.
Prevalence filtering is far more common in microbiome studies and appears in most recent papers on gut metagenomics (e.g., for DAA or networks). Mean-based cutoffs (e.g., global mean >=0.05%) appear occasionally for simplicity but are criticized for bias toward ubiquitous low-abundance taxa; they're rarer in network-focused work. If using means, always pair them with a minimum-count filter (e.g., >=10 reads total). In R's phyloseq it's a one-liner: filter_taxa(physeq, function(x) sum(x > 0.0001) >= (0.1 * nsamples(physeq)), TRUE) for x=0.01%, y=10% (assuming the table holds proportions).
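If your table lives in Python rather than phyloseq, the same x%/y% rule translates directly to pandas. A sketch with an invented helper name and toy table, samples as rows:

```python
import pandas as pd

def presence_prevalence_filter(df, x=0.0001, y=0.10):
    """df: samples x taxa table of relative abundances (rows sum to 1).
    A taxon counts as 'present' in a sample only if its abundance > x;
    it is retained if present in >= y (fraction) of all samples."""
    present = df > x                   # per-sample presence after the x cutoff
    keep = present.mean(axis=0) >= y   # prevalence across samples
    return df.loc[:, keep]

table = pd.DataFrame(
    {"taxonA": [0.30, 0.25, 0.40],        # common, always above cutoff
     "taxonB": [0.00002, 0.0, 0.00001],   # never exceeds x -> dropped
     "taxonC": [0.05, 0.0, 0.0]},         # present in 1 of 3 samples
    index=["s1", "s2", "s3"])

filtered = presence_prevalence_filter(table, x=0.0001, y=0.10)
print(list(filtered.columns))  # -> ['taxonA', 'taxonC']
```

Note how taxonC survives: its single detection gives a prevalence of 1/3, above the 10% bar, while taxonB's sub-threshold values never count as presences at all.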
(3) Renormalizing after filtering
Absolutely—yes, re-scale the remaining relative abundances to sum to 1 per sample before correlation calculations. Filtering removes taxa (and their contributions to the total), so the table becomes "sub-compositional," distorting pairwise correlations (e.g., inflating positives among survivors). This is standard to maintain the proportional integrity needed for methods like SparCC or SPIEC-EASI, which assume closed-sum data.
Quick how-to: in R, after phyloseq::filter_taxa(), re-close each sample with physeq <- transform_sample_counts(physeq, function(x) x / sum(x)), which divides every sample's abundances by their new total regardless of table orientation. In Python, divide each row (sample) by its new sum.
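The Python side is one broadcast division. A minimal numpy sketch with toy numbers, samples as rows:

```python
import numpy as np

# Two samples x two surviving taxa; rows summed to 1 before
# other taxa were filtered out, so they no longer do.
filtered = np.array([[0.50, 0.30],
                     [0.10, 0.60]])

# Divide each sample (row) by its new total to restore the closed sum
renorm = filtered / filtered.sum(axis=1, keepdims=True)
print(renorm.sum(axis=1))  # each row sums to 1 again
```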
Caveat: if you instead take a log-ratio route (e.g., CLR-transformed counts for SPIEC-EASI), skip this step and apply the transform to the raw counts post-filter; CLR is invariant to each sample's total, so renormalizing first gains nothing. But for relative-abundance networks, renormalization prevents artifacts.
(4) Typical number of bacterial species in human gut (MetaPhlAn4)
In shotgun metagenomics of the human gut, MetaPhlAn4 typically detects 150-400 bacterial species per sample, depending on sequencing depth, host factors (e.g., diet, age), and cohort diversity.
- At standard depths (5-10M reads/sample), expect ~200-300 species on average: roughly 40-50 "core" (ubiquitous) species plus 100-250 variable ones. Deeper sequencing (>20M reads) can push this to 400-500, including unknown species-level genome bins (uSGBs) absent from reference catalogs. Across cohorts (e.g., HMP or ILO studies), totals span ~100 (low-diversity, e.g., infants) to >500 (high-fiber diets), with medians around ~250.
This is at species level; genus-level tallies are smaller (~50-100 genera). For your network, this resolution suits co-occurrence analysis well; focus on the taxa that survive filtering.