I have single-cell RNA-seq data from 150 patients, categorized into 10 tumor subtypes. I performed non-negative matrix factorization (NMF), which identified 13 metaprograms. Each metaprogram has an associated activity score per cell.
To simplify the data, I calculated the average score of each metaprogram per patient, resulting in a matrix where rows represent patients and columns represent mean scores for the 13 metaprograms.
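For illustration, here is a minimal sketch of that per-patient averaging in base R. The data frame `cells`, its column names (`patient`, `MP1`, `MP2`), and the simulated scores are all hypothetical stand-ins, not the real data. Only two metaprograms are shown to keep it short.

```r
# Hypothetical per-cell table: one row per cell, one column per metaprogram score.
set.seed(1)
cells <- data.frame(
  patient = rep(paste0("P", 1:6), each = 50),  # 6 simulated patients, 50 cells each
  MP1 = runif(300),                            # simulated scores in [0, 1]
  MP2 = runif(300)
)

# Average each metaprogram score within each patient:
# the result has one row per patient and one column per metaprogram.
patient_means <- aggregate(cbind(MP1, MP2) ~ patient, data = cells, FUN = mean)
print(patient_means)
```

With the real data, the `cbind(...)` term would list all 13 metaprogram columns.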
I visualized the distribution of each metaprogram across the 10 subtypes using boxplots, to explore whether any metaprogram is specifically enriched in a given subtype. The results looked promising.
Now I want to apply a statistical test to determine which metaprograms show significant enrichment in specific subtypes. The metaprogram scores are already normalized between 0 and 1 at the single-cell level. Which test would be most appropriate for this, why that test, and how would I run it in R?
I have also attached a dummy image for reference. It displays boxplots of metaprogram enrichment scores across nine hypothetical tumor subtypes. Each panel represents a distinct metaprogram, showing its distribution and variability across the different subtypes.
Why not stick to established analyses? Each metaprogram is essentially a gene set. I would test for differential expression with limma (or similar, not Wilcoxon or any of this single-cell nonsense) and then run competitive gene set tests, e.g. camera from limma. In my view that is 100x more robust than testing for enrichment between these boxplots. Proper DE, ideally after pseudobulking cells per cell type and patient, ensures that you have a statistically sound analysis that is biologically reproducible.
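A rough sketch of what that workflow could look like in R, assuming the limma and edgeR Bioconductor packages are installed. Everything here is simulated and hypothetical: the counts, the three subtypes `A`/`B`/`C`, the gene names, and the two gene sets `MP1`/`MP2`. It also assumes a single cell type; with real data you would pseudobulk per cell type and patient, and the gene sets would be the top genes of each NMF metaprogram.

```r
library(limma)
library(edgeR)

set.seed(1)
n_genes <- 500; n_patients <- 30; cells_per_patient <- 20

# Simulated cell-level metadata and raw counts (genes x cells).
patient_ids <- paste0("P", seq_len(n_patients))
patient <- rep(patient_ids, each = cells_per_patient)
subtype_of_patient <- setNames(rep(c("A", "B", "C"), length.out = n_patients),
                               patient_ids)
counts <- matrix(rnbinom(n_genes * length(patient), mu = 5, size = 2),
                 nrow = n_genes,
                 dimnames = list(paste0("g", seq_len(n_genes)), NULL))

# 1. Pseudobulk: sum raw counts over all cells of each patient (genes x patients).
pb <- t(rowsum(t(counts), group = patient))

# 2. Design matrix with one column per subtype, matched to the pseudobulk columns.
subtype <- factor(subtype_of_patient[colnames(pb)])
design <- model.matrix(~ 0 + subtype)
colnames(design) <- levels(subtype)

# 3. Filter lowly expressed genes, normalize, and compute voom precision weights.
dge <- DGEList(counts = pb)
dge <- dge[filterByExpr(dge, design), , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge)
v <- voom(dge, design)

# 4. Contrast: subtype A versus the average of the other subtypes.
contr <- makeContrasts(A_vs_rest = A - (B + C) / 2, levels = design)

# 5. Competitive gene set test with camera, one gene set per metaprogram.
metaprograms <- list(MP1 = paste0("g", 1:40), MP2 = paste0("g", 41:80))
idx <- ids2indices(metaprograms, rownames(v))
res <- camera(v, index = idx, design = design, contrast = contr[, "A_vs_rest"])
print(res)
```

Each subtype would get its own contrast against the rest, and the resulting p-values should be adjusted for multiple testing across all subtype and metaprogram combinations.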