QC of 10x Genomics single-cell multiome (ATAC + RNA) data: should peaks overlapping repetitive elements be filtered?
16 days ago
gynecoloji • 0

I have recently been analyzing single-cell multiome (ATAC + RNA) data from the 10x Genomics platform.

Standard practice seems to suggest filtering out peaks overlapping repetitive elements to reduce technical noise and mapping artifacts. However, I'm concerned this might be throwing out the baby with the bathwater, as emerging evidence suggests many repetitive elements contain functional regulatory sequences. The Signac/Seurat tutorials, on the other hand, do not appear to include any repeat-related filtering criteria.

Questions for the Community

  • What's the current best practice? Are you filtering all repeat-overlapping peaks, or using a more nuanced approach?
  • Cell-type specificity concerns: Has anyone observed cell-type-specific accessibility patterns in repetitive elements that would be lost with aggressive filtering?
  • Downstream analysis impact: How much does repeat filtering affect:

    • Cell type identification and clustering?
    • Differential accessibility analysis?
    • Integration with scRNA-seq data?
    • Trajectory analysis?

My Proposed Tiered Approach

Based on a literature review, I'm considering a tiered filtering strategy:

  • Tier 1: Always Remove (High Confidence); Rationale: These are likely technical artifacts with minimal regulatory potential.

    • Simple repeats (microsatellites, tandem repeats)
    • Low complexity sequences
    • Satellite DNA
    • RNA genes (tRNA, rRNA, snRNA, etc.)
  • Tier 2: Context-Dependent (Moderate Confidence); Rationale: These can contain regulatory elements but are also sources of noise.

    • LINEs (Long Interspersed Nuclear Elements)
    • SINEs (including Alu elements)
    • LTR retrotransposons
  • Tier 3: Usually Keep (Low Confidence for Removal); Rationale: Often contain regulatory sequences and show tissue-specific patterns.

    • DNA transposons
    • Rolling circle elements
    • Unknown/unclassified repeats
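The tiers above can be sketched as a simple lookup. This is only an illustration, not an established tool: it assumes the class labels follow the UCSC RepeatMasker (rmsk) `repClass` naming, and the names `TIER_BY_CLASS`, `peak_tier`, and `tier_action` are hypothetical. Two further assumptions are made: a peak overlapping several repeat classes takes the strictest (lowest) tier, and any class not listed defaults to Tier 3.

```python
# Assumption: labels follow the UCSC RepeatMasker (rmsk) "repClass" column.
TIER_BY_CLASS = {
    # Tier 1: always remove
    "Simple_repeat": 1, "Low_complexity": 1, "Satellite": 1,
    "tRNA": 1, "rRNA": 1, "snRNA": 1, "scRNA": 1, "srpRNA": 1,
    # Tier 2: context-dependent
    "LINE": 2, "SINE": 2, "LTR": 2,
    # Tier 3: usually keep
    "DNA": 3, "RC": 3, "Unknown": 3,
}

ACTION_BY_TIER = {1: "remove", 2: "review", 3: "keep"}


def peak_tier(rep_classes):
    """Return the strictest (lowest) tier among the repeat classes
    overlapping one peak, or None if the peak overlaps no repeats.

    A trailing "?" (uncertain annotation, e.g. "DNA?") is ignored.
    Classes missing from the table default to Tier 3 (an assumption).
    """
    tiers = [TIER_BY_CLASS.get(c.rstrip("?"), 3) for c in rep_classes]
    return min(tiers) if tiers else None


def tier_action(rep_classes):
    """Map a peak's overlapping repeat classes to remove/review/keep."""
    tier = peak_tier(rep_classes)
    return "keep" if tier is None else ACTION_BY_TIER[tier]


if __name__ == "__main__":
    for classes in (["LINE", "Simple_repeat"], ["SINE"], ["DNA?"], []):
        print(classes, "->", tier_action(classes))
```

Taking the strictest tier is the conservative choice; taking the class with the largest overlap would be a reasonable alternative.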

Specific Technical Questions

  • Overlap threshold: What percentage overlap should trigger filtering? 50%? 80%? Any overlap?
  • Peak strength: Should we consider the accessibility signal strength when deciding whether to filter repeat-overlapping peaks?
  • Repetitive categories: Should the repeat class or family be taken into account when filtering?
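To make the overlap-threshold question concrete, here is a minimal sketch that computes the fraction of a peak covered by repeat annotations and compares it to a cutoff. It assumes half-open `[start, end)` coordinates on a single chromosome. The function names and the `min_fraction` parameter are hypothetical, and 0.5 is just one of the candidate cutoffs from the question, not a recommendation.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def repeat_fraction(peak, repeats):
    """Fraction of the peak's bases covered by at least one repeat.

    Repeats are merged first so overlapping annotations are not
    counted twice.
    """
    p_start, p_end = peak
    covered = 0
    for r_start, r_end in merge_intervals(repeats):
        covered += max(0, min(p_end, r_end) - max(p_start, r_start))
    return covered / (p_end - p_start)


def should_filter(peak, repeats, min_fraction=0.5):
    """True if the covered fraction reaches the (hypothetical) cutoff."""
    return repeat_fraction(peak, repeats) >= min_fraction


if __name__ == "__main__":
    peak = (100, 600)
    repeats = [(50, 200), (150, 300), (550, 700)]
    print(repeat_fraction(peak, repeats))
    print(should_filter(peak, repeats, min_fraction=0.5))
    print(should_filter(peak, repeats, min_fraction=0.8))
```

An "any overlap" rule corresponds to a very small positive `min_fraction`. On real data the same calculation would normally be done with an interval tool rather than plain Python loops.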

What I'm Looking For

  • Experiences from the community with different filtering strategies
  • References to papers that have systematically evaluated this question
  • Practical advice on balancing noise reduction with biological signal retention
  • Tool recommendations for sophisticated repeat filtering
16 days ago

I am going to guess you've seen/thoroughly read this paper about the ENCODE blacklist regions and how they were generated.

In short, they were generated using the input samples from pretty much every ChIP-seq input control in ENCODE:

This defines a comprehensive and cell-type agnostic signal across the genome that is unaffected by high signal from a particular cell-line (eg. CNVs) or low signal due to differential processing of input data.

ChIP-seq, ATAC-seq, CUT&RUN, etc., are all limited in some ways just by the nature of the assay. You're correct that regulatory elements are often repetitive, but if you've got a read mapping to 20 different regions equally well, you've got no way to resolve that with short-read sequencing (and long-read sequencing just isn't helpful or sensible given the fragment sizes in these assays). That said, the whole element is often not repetitive, so some reads end up mapping uniquely to it, which can still enable analysis.

In general, the regions with crazy high artificial signal are often skipped by peak callers anyway if input controls are provided (though I realize such controls are not in play with the multiome kit). As for your questions, well, it depends. I haven't looked at the impact on cell type identification and clustering, but I imagine doing no filtering would result in poorer separation of populations due to increased noise. Perhaps not, if you're only using the top X most variable peak regions for dimensionality reduction and clustering anyhow. But I don't see leaving them in as having any benefit. Not removing these regions can have large impacts on normalization, as differences between samples in the proportion of reads mapping to these areas will skew counts for "real" signal.
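The normalization point can be illustrated with toy arithmetic. The numbers below are invented purely for illustration and are not from any dataset. Two samples have the same number of reads in a genuine peak, but one also carries extra reads in artifact regions, which inflates its library size and lowers the peak's counts-per-million (CPM). The function names are hypothetical.

```python
def cpm(peak_count, library_size):
    """Counts per million for one peak given a library size."""
    return peak_count / library_size * 1_000_000


def cpm_with_and_without_artifacts(peak_reads, real_reads, artifact_reads):
    """Return (CPM with artifact reads in the library size,
    CPM after excluding them)."""
    with_artifacts = cpm(peak_reads, real_reads + artifact_reads)
    without_artifacts = cpm(peak_reads, real_reads)
    return with_artifacts, without_artifacts


if __name__ == "__main__":
    # Toy values: identical genuine signal, different artifact load.
    samples = {"sample_A": 0, "sample_B": 250_000}
    for name, artifact_reads in samples.items():
        before, after = cpm_with_and_without_artifacts(
            peak_reads=100, real_reads=1_000_000, artifact_reads=artifact_reads
        )
        print(name, round(before, 1), round(after, 1))
```

In this toy case the peak looks about 20% weaker in sample_B (CPM 80 versus 100) until the artifact reads are excluded from the library size, even though the underlying signal is identical.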

Differential analyses of such regions will largely be pointless - they'll be high in all samples, but potentially quite variable (as data/assay quality tends to trend with the proportion of reads mapping to such regions). And given those regions will also be high in input controls, it's an uphill battle to convince anyone that such differences are anything beyond noise (again, I recognize there are no input controls here).

For bulk ATAC, if you don't want to blanket remove regions based on a blacklist, you can make your own greylist from your input samples, but that's not a possibility here.

To get solid answers to your questions, I think you'd have to investigate your data carefully (or set up experiments to actually investigate these questions more directly, whatever those may look like).


Thanks for your response. I think it may be better to keep the repetitive regions and filter out only the blacklisted regions. This is a single-cell ATAC + RNA multiome analysis.

16 days ago
LChart 5.0k

It should be noted that concerns about mapping artifacts in ChIP-seq really come from an era when reads were 36bp (or even 25bp!) in length and single-ended. The 10X data will be 2x50 at a minimum, often 2x76 or even 2x150; in these cases mapping issues are much less of a concern. In my own analysis, if I stratify basic peak quality control plots by "peak overlaps ENCODE blacklist region" I see very minor differences. Personally, I stick to IDR (https://github.com/nboley/idr) for filtering called peaks.


IDR addresses a different problem than the removal of reads aligning to blacklist regions, though it's unquestionably useful for deriving more robust peak calls. It will not help remove false signal, though.
