Hi, everyone. I have expression matrices for different cell types, each representing the expression of individual cells of that type. They were learned through a generative model, so I am confident they approximate the expression patterns of the specific cell types. Now I want to implement bulk sequencing deconvolution using these expression patterns. The pseudo-bulk I use is the sum of a large number of single-cell sequencing (sc-seq) profiles. That's the background.
My first approach was to design an optimization process that fits a set of weights, so that the product of the weights and the cell type expression approximates the pseudo-bulk. I was advised to use Poisson loss as the loss function because it matches the count nature of RNA-seq data. However, I couldn't add non-negativity constraints to the weights during optimization, so the results contained negative weights, which are meaningless. Then I found the scipy.optimize.nnls function in the SciPy package, which enforces non-negativity, but it uses Euclidean distance (least squares) to compare the pseudo-bulk and the sc-seq combination. I obtained some good results with this method, but I have the following questions.
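A minimal sketch of the NNLS approach described above, run on simulated toy data. The names `signatures`, `true_weights`, and `pseudo_bulk`, as well as the matrix sizes, are hypothetical placeholders, and the sketch assumes each cell type is summarized by a single expression vector (for example, the mean over the learned cells of that type).

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 genes, 4 cell types.
n_genes, n_types = 200, 4

# One expression vector per cell type (genes x cell types); assumed, not real data.
signatures = rng.gamma(shape=2.0, scale=5.0, size=(n_genes, n_types))

# Number of cells of each type summed into the pseudo-bulk (assumed for illustration).
true_weights = np.array([50.0, 30.0, 0.0, 20.0])

# Simulated pseudo-bulk counts with Poisson noise around the mixture.
pseudo_bulk = rng.poisson(signatures @ true_weights).astype(float)

# Solve min_w ||signatures @ w - pseudo_bulk||_2 subject to w >= 0.
weights, residual_norm = nnls(signatures, pseudo_bulk)

# Normalize to cell-type proportions.
proportions = weights / weights.sum()
print(np.round(proportions, 3))
```

`nnls` returns the non-negative weight vector together with the residual norm, so the constraint is handled inside the solver rather than by clipping the result afterwards.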
1. Can I use Euclidean distance to compare the pseudo-bulk with the combination of sc-seq profiles? To me, this looks like a linear regression problem, i.e., combining sc-seq profiles to approximate the pseudo-bulk. At this stage there don't seem to be any biological distribution assumptions involved, so I guess it's feasible.
2. If the answer to the previous question is no, what biological assumptions does Poisson loss rely on, and what am I ignoring when I use Euclidean distance for the comparison?
3. If I want to keep using Poisson loss to optimize the weights, how should I impose the non-negativity constraints on the weights? I have tried approaches from machine learning such as ReLU and softmax, but the results were not good.
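For question 3, one option is to keep the Poisson loss and pass the non-negativity constraint to the optimizer as box bounds instead of transforming the weights. The following is a minimal sketch under assumptions, not a definitive implementation: it uses `scipy.optimize.minimize` with `method="L-BFGS-B"`, simulated toy data, and hypothetical names (`S` for the genes x cell types signature matrix, `y` for the pseudo-bulk, `poisson_nll` for the loss). The loss is the Poisson negative log-likelihood with mean `S @ w`, dropping the term that does not depend on `w`.

```python
import numpy as np
from scipy.optimize import minimize


def poisson_nll(w, S, y, eps=1e-8):
    """Poisson negative log-likelihood for mean S @ w, up to a constant in y."""
    mu = S @ w + eps  # eps keeps log() finite when a predicted mean is zero
    return np.sum(mu - y * np.log(mu))


def poisson_nll_grad(w, S, y, eps=1e-8):
    """Gradient of poisson_nll with respect to w."""
    mu = S @ w + eps
    return S.T @ (1.0 - y / mu)


# Hypothetical toy data: 200 genes, 4 cell types (assumed, not real data).
rng = np.random.default_rng(0)
S = rng.gamma(2.0, 5.0, size=(200, 4))
true_w = np.array([50.0, 30.0, 0.0, 20.0])
y = rng.poisson(S @ true_w).astype(float)

# Strictly positive starting point on roughly the right scale.
w0 = np.full(S.shape[1], y.sum() / S.sum())

# Bounds (0, None) enforce w >= 0 inside the optimizer.
result = minimize(
    poisson_nll,
    w0,
    args=(S, y),
    jac=poisson_nll_grad,
    method="L-BFGS-B",
    bounds=[(0.0, None)] * S.shape[1],
)

w_hat = result.x
proportions = w_hat / w_hat.sum()
print(np.round(proportions, 3))
```

With bounds, the weights stay non-negative throughout the optimization, so no ReLU or softmax is needed; proportions can be obtained afterwards by normalizing `w_hat`. The sketch assumes the signature matrix itself is non-negative, so that the predicted mean `S @ w` is never negative.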