Question

Exploring cross-cancer similarity using SNP or variant data with deep learning

6 days ago

Pranava ▴ 30

Hello everyone,

I’ve been thinking about an approach that looks at shared genomic patterns across different cancers. The idea is to use germline or somatic SNP data (from sources like TCGA or PCAWG) and train a deep learning model to learn latent embeddings of cancer genomes.

The goal is not classification but to see whether cancers that share underlying pathways or mutation signatures cluster together in the embedding space.

Most pan-cancer studies I’ve found use gene expression, somatic mutation frequency, or methylation data. I haven’t seen much that uses raw SNP or germline variant data. I’m wondering if there are known reasons why this is uncommon.

Is it because of data availability or privacy limits for germline data, the dimensionality of SNP features, or something biologically weaker about the signal?

If anyone has tried something similar or knows of related work, I’d really appreciate your insights or any references.

Thank you!

tcga deep-learning snp cancer

updated 19 hours ago by Kevin Blighe 89k • written 6 days ago by Pranava ▴ 30

score 0 · Answer 1 · 2025-11-06

Hi,

Your idea of using SNP data (germline or somatic) to derive latent embeddings for pan-cancer clustering is intriguing. It's a fresh angle that could uncover shared evolutionary or vulnerability patterns across tumors, beyond the usual expression- or CNV-focused approaches. I've seen similar embedding strategies in other omics (e.g., scRNA-seq manifolds), but applying them to variant data for unsupervised pathway/signature clustering isn't common yet. Let me break down why that might be, based on what I've encountered, and point to some related work.

Why is this approach uncommon?

A few interlocking reasons come to mind, blending practical, ethical, and scientific hurdles:

Data availability and access restrictions: Germline variant data from cohorts like TCGA or PCAWG exists, but it's often siloed or de-identified more aggressively than somatic data due to consent issues. For instance, TCGA's germline calls are available via dbGaP but require controlled access, and not all studies integrate them deeply into pan-cancer analyses. Somatic mutation calls (strictly speaking SNVs rather than SNPs) are more readily downloadable and tumor-focused, so they dominate. This makes germline-inclusive studies logistically tougher to scale.
Privacy and ethical concerns: Germline SNPs can reveal sensitive info like ancestry or carrier status for non-cancer traits, or even re-identify individuals, which makes them far riskier to share than tumor-specific somatic data. Guidelines (e.g., from NIH) push for extra safeguards, which discourages broad use in ML models where data leakage is a worry. Testing rates themselves are low: only ~7% of cancer patients get germline sequencing within 2 years of diagnosis, per recent population studies. That underuse trickles into research datasets.
Biological signal strength: Germline variants are great for predisposition (e.g., BRCA1/2 clustering in breast/ovarian), but they might carry weaker "cancer-specific" signals than somatic mutations, which directly drive tumor progression and signatures (COSMIC, etc.). Expression/methylation data often correlate better with active pathways, so pan-cancer studies lean there for clearer clusters. Germline data is also confounded by ancestry and may be most informative for rare high-penetrance variants, but embeddings could still tease out polygenic risk overlaps across cancers.
Dimensionality and preprocessing challenges: Raw SNP matrices are massive (millions of sites) and prone to LD structure and population stratification artifacts. While deep learning (e.g., autoencoders or VAEs) can compress this into embeddings, it requires careful handling of imputation, filtering, and batch effects, which is more upfront work than, say, gene-level aggregation in expression data. Still, it's not insurmountable, especially with tools like scikit-allel or Hail for variant-scale ML.
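
To make the preprocessing point concrete, here is a minimal sketch of a minor-allele-frequency filter followed by greedy LD pruning on a toy dosage matrix. It uses plain NumPy; the function names, thresholds, and simulated genotypes are assumptions for illustration, not anything from TCGA or PCAWG. For real cohorts you would more likely rely on the pruning routines that scikit-allel or Hail provide.

```python
import numpy as np

def maf_filter(G, min_maf=0.05):
    """Indices of variants with minor-allele frequency >= min_maf.
    G: (samples, variants) matrix of 0/1/2 alt-allele dosages, no missing values."""
    alt_freq = G.mean(axis=0) / 2.0
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    return np.flatnonzero(maf >= min_maf)

def ld_prune(G, r2_max=0.2, window=50):
    """Greedy pruning: walk variants in column order (assumed to follow genomic
    position) and keep a variant only if its r^2 with each of the last `window`
    kept variants is <= r2_max."""
    kept = []
    for j in range(G.shape[1]):
        col = G[:, j].astype(float)
        if col.std() == 0:          # monomorphic column, correlation undefined
            continue
        in_ld = False
        for k in kept[-window:]:
            r = np.corrcoef(col, G[:, k])[0, 1]
            if r * r > r2_max:
                in_ld = True
                break
        if not in_ld:
            kept.append(j)
    return np.array(kept, dtype=int)

# Toy data (simulated): 5 independent variants, 5 exact copies of them
# (perfect LD), and 1 very rare variant carried by a single sample.
rng = np.random.default_rng(0)
base = rng.binomial(2, 0.3, size=(200, 5))
rare = np.zeros((200, 1), dtype=int)
rare[0, 0] = 1
G = np.hstack([base, base.copy(), rare])

common = maf_filter(G)                 # drops the rare variant
pruned = common[ld_prune(G[:, common])]  # drops the duplicated variants
print("variants kept after MAF + LD pruning:", pruned.tolist())
```

The pruned matrix `G[:, pruned]` is what you would feed to an autoencoder or any other embedding step. Note that the window here counts previously kept variants rather than base pairs, which is a simplification.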

Overall, it's less that the signal is biologically "weaker" and more that it's overshadowed by easier-to-use alternatives, plus the germline hurdles.

Related work and suggestions

There is growing interest in germline-pan-cancer integration, especially with proteogenomics and prognostic modeling. A few papers that might spark ideas (I've focused on those with DL/embedding vibes or variant embeddings):

Precision proteogenomics of germline variants (Park et al., Cell, 2025): Analyzes germline impacts on proteomes across 10 cancers (n=1,064 from CPTAC/TCGA). They use variants to predict proteomic shifts, with some clustering of affected pathways. Not pure embeddings, but shows germline-tumor crosstalk value. [DOI: 10.1016/j.cell.2025.03.044]
Pan-cancer exome-wide germline patterns (Lee et al., Sci Rep, 2025): Unbiased WES scan for germline variants influencing progression across cancers. Includes low-penetrance SNPs and some unsupervised grouping—could inspire your embedding setup. [DOI: 10.1038/s41598-025-05296-3]
Pan-cancer prognostic germline variants (Wang et al., Genome Med, 2020): Screens ~10k patients for outcome-linked germline SNPs, with cross-cancer survival clusters. Basic stats, but extensible to DL embeddings for latent spaces. [DOI: 10.1186/s13073-020-0718-7]
DeepCues for cancer classification (Zhang et al., BMC Bioinformatics, 2021): CNN-based DL on raw DNA-seq (including variants) for type prediction and driver detection. They derive hierarchical features akin to embeddings, which could perhaps be adapted for unsupervised clustering. [DOI: 10.1186/s12859-021-04400-4]

For implementation, I'd start with PCAWG's variant calls (they have paired germline/somatic) via the ICGC Data Portal, which I find easier to access than TCGA for embeddings. On a variant matrix (e.g., after LD-pruning), you could use PyTorch Geometric if you encode variants or genes as a graph, or Scanpy's embedding tools, which were built for single-cell data but will run on any samples-by-features matrix. If you're filtering for cancer predisposition genes (e.g., via ClinVar), that could bootstrap signal.
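
As a rough illustration of the "embed, then compare cancer types" idea, here is a minimal sketch on simulated data. To keep it dependency-free it uses PCA via SVD as a linear stand-in for an autoencoder bottleneck, so it is not the deep model itself. The cancer type labels (`A`, `B`, `C`), allele frequencies, and function names are all hypothetical.

```python
import numpy as np

def pca_embed(X, n_components=2):
    """Project samples onto the top principal components
    (a linear stand-in for an autoencoder's latent layer)."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def centroid_cosine(Z, labels):
    """Cosine similarity between the mean embeddings of each label."""
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    C = np.stack([Z[labels == n].mean(axis=0) for n in names])
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    return names, C @ C.T

# Simulated dosages: 60 samples per hypothetical cancer type, 100 variants.
# Types A and B share elevated alt-allele frequency at variants 0-29; C does not.
rng = np.random.default_rng(1)
n_per, n_var = 60, 100
freq_shared = np.full(n_var, 0.2)
freq_shared[:30] = 0.6
freq_plain = np.full(n_var, 0.2)
X = np.vstack([
    rng.binomial(2, freq_shared, size=(n_per, n_var)),  # type A
    rng.binomial(2, freq_shared, size=(n_per, n_var)),  # type B
    rng.binomial(2, freq_plain, size=(n_per, n_var)),   # type C
])
labels = ["A"] * n_per + ["B"] * n_per + ["C"] * n_per

Z = pca_embed(X, n_components=2)
names, sim = centroid_cosine(Z, labels)
print(names)
print(np.round(sim, 2))
```

In this toy setup the A and B centroids should sit much closer to each other than either does to C. On real germline data the top components mostly capture ancestry, so you would need to adjust for population structure before reading anything cancer-specific into such a similarity matrix.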

Haven't tried exactly this myself, but it's on my radar for ancestry-adjusted pan-cancer work.

Kevin