Good day, I have been working with a dataset of tumor samples that were initially MACS-sorted negatively against very consistent markers of non-tumor cells (i.e., T cells, macrophages, NK cells, etc.). This should yield samples enriched for tumor cells, though the purification is not perfect, and we accepted that going in. While the reasons for this purification method are irrelevant to the question, a negative sort was necessary because of the large sample size and to leave the tumor cells untouched during the sort. Also because of the large sample size, post-sort quantification of tumor content was not feasible. Our primary goal was to define transcriptional programs specific to the tumor, independent of infiltrating immune cells.
Following the sort, RNA-seq was run on the purified tumor cell samples and analyzed in edgeR. After running PCA and k-means, it is clear that the largest source of variation is related to sort quality, as the genes loading on PC1 are associated with the populations that should have been removed by MACS sorting. Standard gene filtering (e.g., minimum counts, minimum CPM/TPM) does not remove these genes, likely because a few samples express them at high levels.
I think of this as a problem similar to batch effects. However, the sort batches themselves are not associated with PC1. Instead, the pre-sort tumor content is what tracks most closely with the expression of the contaminating genes. So this is more of a "continuous" batch effect than a "categorical" one. I am quite perplexed by this question, and find it enticing for a future project. I am not aware of methods for adjusting for continuous sources of variability, although it sounds like a job for some kind of linear model. I have contemplated using ComBat-seq or SVA but am unsure what the input should be. I don't want just differential expression; I also want adjusted count data for PCA and for methods related to supervised clustering.
Has anyone encountered such a problem and found a successful method? I am most familiar with R, and functional in Python or MATLAB, if that helps. If there is another source of data that would be useful for solving the issue, feel free to suggest it.
Thank you for reading!
-J
Related to the idea of a continuous batch variable: https://support.bioconductor.org/p/99042/
ComBat-seq can accept a standard `model.matrix`, so you can input any discrete or continuous variable and design you like. It operates on the raw counts and yields a batch-corrected "raw" count matrix, which can then go into edgeR and company.
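If what you mainly need is adjusted values for PCA or clustering, the "linear model" intuition from the question can be sketched directly. The Python example below is only an illustration, not a ComBat-seq replacement. It assumes you have a log-scale expression matrix (e.g. log2-CPM) and one continuous per-sample covariate such as pre-sort tumor content. The function name `remove_continuous_covariate` and the simulated `purity` values are made up for the example. In R, `limma::removeBatchEffect` with its `covariates` argument does essentially the same job.

```python
import numpy as np


def remove_continuous_covariate(logexpr, covariate, design=None):
    """Regress a continuous nuisance covariate out of log-expression values.

    logexpr   : (genes x samples) array on a log scale, e.g. log2-CPM
    covariate : (samples,) continuous nuisance variable (hypothetical here:
                pre-sort tumor content or a contamination score)
    design    : optional (samples x p) matrix of effects to preserve,
                given WITHOUT an intercept column (one is added below)
    """
    logexpr = np.asarray(logexpr, dtype=float)
    cov = np.asarray(covariate, dtype=float)
    cov = cov - cov.mean()  # centre so per-gene means are left unchanged
    n_samples = logexpr.shape[1]

    # Model matrix: intercept, effects to keep, then the nuisance covariate
    columns = [np.ones(n_samples)]
    if design is not None:
        design = np.asarray(design, dtype=float).reshape(n_samples, -1)
        columns.extend(design.T)
    columns.append(cov)
    X = np.column_stack(columns)

    # One least-squares fit per gene (genes are the columns of logexpr.T)
    beta, *_ = np.linalg.lstsq(X, logexpr.T, rcond=None)

    # Subtract only the fitted covariate term (the last coefficient)
    return logexpr - np.outer(beta[-1], cov)


# Simulated data purely for illustration: every gene has its own slope
# against a made-up per-sample "purity" value.
rng = np.random.default_rng(0)
n_genes, n_samples = 50, 12
purity = rng.uniform(0.2, 0.9, n_samples)
slopes = rng.normal(0.0, 2.0, n_genes)
noise = rng.normal(0.0, 0.1, (n_genes, n_samples))
logexpr = 5.0 + np.outer(slopes, purity) + noise

adjusted = remove_continuous_covariate(logexpr, purity)
```

The adjusted matrix is meant for visualisation and clustering only. For differential expression it is generally better to keep the raw counts and include the covariate as a term in the design matrix.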