Dear Biostars community,
I am working on an unsupervised multi-omics data integration approach to detect molecular subtypes in a specific cancer type. I have different omics layers for the same 360 patients: RNA-seq expression data, CNV, and somatic point mutations.
All of the omics layers are at the gene level, with around 20k features for both gene expression and mutations. Before fitting the model, I would like to perform feature reduction to reduce the number of features:
For the expression data I could apply a non-specific intensity and/or variance filter, but how should I handle the filtering of the mutational data? For example, the CNV values range from -2 to 2 (GISTIC values), and the somatic point mutations are coded as 0 for silent mutations and 1 otherwise. One possible approach would be to filter the gene expression data first and then keep only the genes in the mutational data that overlap with the retained expression genes. Would this be reasonable, given that it keeps mutated genes that are expressed in at least a minimal number of samples?
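To make the overlap idea concrete, here is a minimal sketch of what I have in mind, in plain Python with toy values. The function names (`top_variance_genes`, `restrict_to_genes`), the gene names, and the cutoff `n_keep` are made up for illustration only; with the real data each layer would be a genes x samples matrix rather than a small dictionary.

```python
from statistics import pvariance

def top_variance_genes(expr, n_keep):
    """Return the n_keep genes with the highest variance across samples.

    expr: dict mapping gene name -> list of expression values (one per sample).
    """
    ranked = sorted(expr, key=lambda g: pvariance(expr[g]), reverse=True)
    return set(ranked[:n_keep])

def restrict_to_genes(layer, genes):
    """Keep only the rows of another omics layer whose gene is in `genes`."""
    return {g: values for g, values in layer.items() if g in genes}

# Toy data (hypothetical): 3 genes x 4 samples of expression,
# and a binary somatic mutation layer for a partly different gene set.
expr = {
    "GENE_A": [1.0, 5.0, 9.0, 2.0],
    "GENE_B": [3.0, 3.1, 2.9, 3.0],   # almost constant, low variance
    "GENE_C": [0.0, 8.0, 0.0, 8.0],
}
mut = {
    "GENE_A": [0, 1, 0, 0],
    "GENE_B": [1, 0, 0, 1],
    "GENE_D": [0, 0, 1, 0],           # not present in the expression layer
}

kept = top_variance_genes(expr, n_keep=2)   # variance filter on expression
mut_filtered = restrict_to_genes(mut, kept) # overlap with the mutation layer

print(sorted(kept))
print(mut_filtered)
```

Note that in this sketch GENE_B is dropped from the mutation layer only because its expression is flat, even though it is mutated in two samples, which is part of what I am unsure about.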
Alternatively, could a different filtering approach be applied to the mutational data? One major concern, especially for the somatic point mutations, is that if I filter based on the frequency of 0s (i.e. no mutation events), I might lose genes that are mutated in only a small number of samples but within a specific subtype.
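As an illustration of the frequency-based filter I am worried about, here is a minimal sketch that keeps a gene only if it has a non-zero value in at least a given number of samples. The function name `filter_by_min_events`, the toy values, and the threshold `min_samples` are hypothetical; the same rule is applied to the binary mutation layer and to the GISTIC-style CNV layer, where any non-zero value is counted as an event (which is itself an assumption).

```python
def filter_by_min_events(layer, min_samples):
    """Keep genes with a non-zero value in at least `min_samples` samples.

    layer: dict mapping gene name -> list of values (one per sample).
    Works for 0/1 mutation calls and for CNV values in -2..2,
    since any non-zero entry is counted as one event.
    """
    kept = {}
    for gene, values in layer.items():
        n_events = sum(1 for v in values if v != 0)
        if n_events >= min_samples:
            kept[gene] = values
    return kept

# Toy data (hypothetical): 3 genes x 6 samples.
mut = {
    "GENE_A": [0, 0, 0, 0, 0, 0],   # never mutated
    "GENE_B": [1, 1, 0, 0, 0, 0],   # mutated in 2 samples
    "GENE_C": [1, 0, 0, 0, 0, 0],   # mutated in 1 sample only
}
cnv = {
    "GENE_A": [0, 0, 0, 0, 0, 0],
    "GENE_B": [2, 0, -1, 0, 0, 0],  # one amplification, one loss
    "GENE_C": [0, 0, 0, 0, -2, 0],  # a single deep deletion
}

mut_kept = filter_by_min_events(mut, min_samples=2)
cnv_kept = filter_by_min_events(cnv, min_samples=2)

print(sorted(mut_kept))
print(sorted(cnv_kept))
```

With an absolute count like this, GENE_C is removed from both layers, which is exactly the situation I would like to avoid if its single event falls inside a small subtype.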
Thank you in advance,
Efstathios