Question

Appropriate normalization of skewed data for "Translating multiomics single-cell data" with MOSCOT


20 days ago

npont ▴ 20

Hi all,

I have two sets of cells defined by a distinct modality as follows:

one set is defined by RNA expression from scRNA-seq
one set is defined by custom values reflecting methylation states

I would like to use the MOSCOT tutorial "Translating multiomics single-cell data" ( https://moscot.readthedocs.io/en/latest/notebooks/tutorials/600_tutorial_translation.html ) to align these two modalities, so that downstream I can perform clustering on the two groups of cells.

My problem is the following: more than 80% of the methylation values are negative, and those negative values correspond to noise (methylation that did not work well). So I am mostly interested in the positive values; genes with high values are very interesting to me and are expected to also have high RNA expression. While the RNA expression is normalized by library size and log-transformed, I am struggling to find a way to normalize the methylation values. I'm worried that a z-score would hide the importance of my positive values, since everything (including the negative values) would be shifted and scaled. So I need a normalization for the methylation data that takes into account the importance of this minority of positive values and keeps them comparable with log-normalized RNA expression values (because MOSCOT is based on optimal transport, which is sensitive to scale and implicitly assumes that the distributions of the two modalities are comparable).

PS: I can set the negative methylation values to zero if that helps.

Any help on this topic would be greatly appreciated. Thank you very much!

normalization biostatistics scrna-seq anndata moscot • 342 views

updated 19 days ago by Kevin Blighe ★ 90k • written 20 days ago by npont ▴ 20

score 0 · Answer 1 · 2025-11-12

Hi,

I'd suggest first clipping the negative methylation values to zero (as you mentioned), so that they are treated as absent signal/noise, and then applying a log1p transformation, analogous to your RNA normalization. Because log1p is applied element by element and leaves zeros at zero, it compresses the range of the positive values without shifting them the way a global z-score would. The result is still a sparse, right-skewed distribution, which makes it more comparable to log-normalized RNA for MOSCOT's optimal transport (which is indeed scale-sensitive).

Next, compute a low-dimensional embedding such as PCA (e.g., 50 components) on the transformed methylation matrix, L2-normalize it, and use it as the source attribute in the TranslationProblem (as the tutorial does for ATAC), with the RNA PCA as the target attribute.

For the fused setting, if the two modalities have overlapping features, you could build a joint embedding with scVI on those features (using the positive methylation values) to guide the alignment. This should preserve the biological relevance of your high-methylation sites while enabling robust downstream clustering on the translated embeddings.
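As a rough illustration of the preprocessing steps above (clip, log1p, PCA, L2-normalize), here is a minimal NumPy-only sketch. The function name `preprocess_methylation` and the toy matrix are hypothetical and only for illustration; the PCA is done with a plain SVD rather than scanpy, and no MOSCOT call is made here.

```python
import numpy as np


def preprocess_methylation(X, n_components=50):
    """Hypothetical helper: clip negatives, log1p, PCA via SVD, L2-normalize rows."""
    X = np.asarray(X, dtype=float)
    X = np.clip(X, 0.0, None)          # negative values are treated as noise -> 0
    X = np.log1p(X)                    # element-wise log(1 + x); zeros stay zero
    Xc = X - X.mean(axis=0)            # center each feature before PCA
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, S.size)      # cannot keep more components than exist
    Z = U[:, :k] * S[:k]               # PCA scores, shape (cells, k)
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, 1e-12)  # unit L2 norm per cell


# Toy data (an assumption, not real methylation): 200 cells x 300 features,
# drawn so that most values are negative, mimicking the situation described.
rng = np.random.default_rng(0)
meth = rng.normal(loc=-1.0, scale=1.0, size=(200, 300))

emb = preprocess_methylation(meth, n_components=50)
frac_negative = float((meth < 0).mean())
```

With real data, the resulting array would typically be stored in the methylation AnnData object's `.obsm` under a key of your choice and then referenced as the source attribute when preparing the TranslationProblem, in the same way the tutorial references its ATAC embedding; please check the tutorial for the exact argument names.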

Kevin