I'm currently analyzing single-cell multiome (ATAC + RNA) data from the 10x Genomics platform.
Standard practice seems to suggest filtering out peaks overlapping repetitive elements to reduce technical noise and mapping artifacts. However, I'm concerned this might be throwing out the baby with the bathwater, as emerging evidence suggests that many repetitive elements contain functional regulatory sequences. The Signac/Seurat tutorials, on the other hand, don't seem to include any filtering criteria related to repeats.
Questions for the Community
- What's the current best practice? Are you filtering all repeat-overlapping peaks, or using a more nuanced approach?
- Cell-type specificity concerns: Has anyone observed cell-type-specific accessibility patterns in repetitive elements that would be lost with aggressive filtering?
Downstream analysis impact: How much does repeat filtering affect:
- Cell type identification and clustering?
- Differential accessibility analysis?
- Integration with scRNA-seq data?
- Trajectory analysis?
My Proposed Tiered Approach
Based on literature review, I'm considering a tiered filtering strategy:
Tier 1: Always Remove (High Confidence); Rationale: These are likely technical artifacts with minimal regulatory potential.
- Simple repeats (microsatellites, tandem repeats)
- Low complexity sequences
- Satellite DNA
- RNA genes (tRNA, rRNA, snRNA, etc.)
Tier 2: Context-Dependent (Moderate Confidence); Rationale: These can contain regulatory elements but are also sources of noise.
- LINEs (Long Interspersed Nuclear Elements)
- SINEs (including Alu elements)
- LTR retrotransposons
Tier 3: Usually Keep (Low Confidence for Removal); Rationale: Often contain regulatory sequences and show tissue-specific patterns.
- DNA transposons
- Rolling circle elements
- Unknown/unclassified repeats
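To make the tiers concrete, here is a rough sketch of how I would encode them. It assumes the repeat annotation uses RepeatMasker-style class labels (e.g. `Simple_repeat`, `LINE`, `RC`), with a trailing `?` marking uncertain calls; the label spellings, the `TIER_BY_CLASS` table, and both helper functions are my own illustrative assumptions, not part of any existing tool.

```python
# Hypothetical mapping from RepeatMasker-style class labels to the tiers above.
# The label spellings are an assumption about the annotation file being used.
TIER_BY_CLASS = {
    # Tier 1: always remove
    "Simple_repeat": 1, "Low_complexity": 1, "Satellite": 1,
    "tRNA": 1, "rRNA": 1, "snRNA": 1,
    # Tier 2: context-dependent
    "LINE": 2, "SINE": 2, "LTR": 2,
    # Tier 3: usually keep
    "DNA": 3, "RC": 3, "Unknown": 3,
}


def repeat_tier(rep_class):
    """Return the tier (1-3) for a repeat class, or None if it is not in the scheme."""
    # Treat uncertain annotations such as "DNA?" like their base class.
    return TIER_BY_CLASS.get(rep_class.rstrip("?"))


def strictest_tier(rep_classes):
    """Return the lowest-numbered (most aggressive) tier among the classes a peak overlaps."""
    tiers = [t for t in map(repeat_tier, rep_classes) if t is not None]
    return min(tiers) if tiers else None
```

A peak overlapping both a DNA transposon and a simple repeat would then be treated as Tier 1, which is the conservative choice; whether that is the right tie-break is part of what I'm asking.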
Specific Technical Questions
- Overlap threshold: What percentage overlap should trigger filtering? 50%? 80%? Any overlap?
- Peak strength: Should we consider the accessibility signal strength when deciding whether to filter repeat-overlapping peaks?
- Repeat categories: Should the repeat class or family be taken into account when filtering?
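For the overlap-threshold question, this is roughly how I would compute the quantity being thresholded: the fraction of a peak's bases covered by repeats. It is a minimal sketch assuming half-open (BED-style) coordinates on a single chromosome; `covered_fraction`, `should_filter`, and the default `min_overlap=0.5` are hypothetical names and placeholder values, not recommendations.

```python
def covered_fraction(peak, repeats):
    """Fraction of a peak (start, end) covered by repeat intervals.

    Assumes half-open coordinates and that all intervals are on the
    same chromosome as the peak. Overlapping repeats are counted once.
    """
    p_start, p_end = peak
    # Clip each overlapping repeat to the peak boundaries, then sort by start.
    clipped = sorted(
        (max(s, p_start), min(e, p_end))
        for s, e in repeats
        if s < p_end and e > p_start
    )
    covered = 0
    cur_end = p_start  # right edge of the region already counted
    for s, e in clipped:
        s = max(s, cur_end)  # skip bases already counted
        if e > s:
            covered += e - s
            cur_end = e
    return covered / (p_end - p_start)


def should_filter(peak, repeats, min_overlap=0.5):
    """Flag a peak when at least `min_overlap` of it is repeat-covered."""
    return covered_fraction(peak, repeats) >= min_overlap
```

With this framing, "any overlap" corresponds to testing `covered_fraction(...) > 0`, and the 50% and 80% options are just different values of `min_overlap`.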
What I'm Looking For
- Experiences from the community with different filtering strategies
- References to papers that have systematically evaluated this question
- Practical advice on balancing noise reduction with biological signal retention
- Tool recommendations for sophisticated repeat filtering
Thanks for your response. I think it may be better to keep the repetitive regions and filter out only the blacklisted regions. This is a single-cell ATAC + RNA multiome analysis.
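For reference, the blacklist-only filter I have in mind would look something like the sketch below. It assumes peak names in `chrom-start-end` form and a blacklist given as `(chrom, start, end)` tuples (for example, parsed from a BED file); `parse_peak` and `drop_blacklisted` are hypothetical helpers, and the code treats all coordinates as half-open, ignoring any 0-based versus 1-based difference between the two sources.

```python
from collections import defaultdict


def parse_peak(name):
    """Split a 'chrom-start-end' peak name into (chrom, start, end)."""
    # rsplit keeps chromosome names that themselves contain '-' intact.
    chrom, start, end = name.rsplit("-", 2)
    return chrom, int(start), int(end)


def drop_blacklisted(peak_names, blacklist):
    """Return the peaks that do not overlap any blacklist interval.

    blacklist: iterable of (chrom, start, end) tuples, treated as half-open.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in blacklist:
        by_chrom[chrom].append((start, end))

    kept = []
    for name in peak_names:
        chrom, start, end = parse_peak(name)
        # A simple linear scan per peak; fine for a sketch, slow for large inputs.
        hit = any(s < end and e > start for s, e in by_chrom.get(chrom, []))
        if not hit:
            kept.append(name)
    return kept
```

Peaks overlapping repeats are left untouched here; only blacklist overlaps are removed.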