For the initial analysis of our data set we would like to convert our single-cell RNA-Seq data into bulk RNA-Seq data by summarizing the number of reads per gene per sample.

I was wondering if anyone already has some experience with this kind of analysis.

Would it make sense to calculate the average expression for each gene in each sample (by dividing the summed counts by the number of cells in the sample), or should we just take the `sum()` of all the cells in each sample `as.is`?

With the count matrix created with this methodology we would like to apply standard RNA-Seq analyses such as `DESeq2` (for differential expression) or `Mfuzz` (for time-series analysis).

The term you're looking for is "pseudo bulk" and you'll want to sum values across cells.
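As a rough illustration of what "summing across cells" means, here is a minimal Python sketch that uses only the standard library. The function name `pseudobulk_sum`, the dictionary layout, and the toy barcodes and gene names are all assumptions made for this example; in practice you would do the same aggregation on your sparse count matrix in R.

```python
from collections import defaultdict


def pseudobulk_sum(counts, cell_to_sample):
    """Sum raw counts per gene over all cells belonging to each sample.

    counts: dict mapping cell barcode -> {gene: raw integer count}
    cell_to_sample: dict mapping cell barcode -> sample label
    Returns a dict mapping sample label -> {gene: summed count}.
    """
    bulk = defaultdict(lambda: defaultdict(int))
    for cell, gene_counts in counts.items():
        sample = cell_to_sample[cell]
        for gene, n in gene_counts.items():
            bulk[sample][gene] += n
    return {sample: dict(genes) for sample, genes in bulk.items()}


# Hypothetical toy data: three cells from two samples.
counts = {
    "cell1": {"GeneA": 3, "GeneB": 0},
    "cell2": {"GeneA": 1, "GeneB": 2},
    "cell3": {"GeneA": 5, "GeneB": 4},
}
cell_to_sample = {"cell1": "S1", "cell2": "S1", "cell3": "S2"}

bulk = pseudobulk_sum(counts, cell_to_sample)
print(bulk)
```

Each sample ends up as one column of integer counts, which is the form that count-based tools expect as input.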

Hi,

Regarding your question, what I did in the past was to compare different clusters of scRNA-seq versus bulk RNA-seq using correlation indexes (it did not work as expected!). For that purpose, we averaged the read counts per cluster per condition/sample using functions from the `Seurat` R package.

Of course this will give you average read counts per cluster per `stim` variable condition. This is not exactly what you want, but if you have scRNA-seq data I would do differential gene expression analysis between different cell populations/clusters rather than on the whole thing.

António

Thanks. After searching for the term "pseudo bulk" I found more information. But it seems to me that, as António mentioned above, it all relates to calculating DE between clusters.

What we would like to do, though, is a differential expression analysis on the complete data set. The problem is that we are not yet sure about the correctness of the clustering results. For that reason we would like to first do a "standard" pseudo bulk RNA-Seq analysis on the complete data set by converting each sample (of course with differing numbers of cells) into a single column in the new count matrix. Some of our samples differ hugely in the total number of cells (up to 10-fold, 9K vs. 90K). So I'm not sure whether just calculating the sum over all cells would create too big a difference between the samples.

This is why I was hoping that taking the average over all cells would give a better value for each gene across all samples.

Does it make sense? Or do you still think I should take the `sum` across all cells in each sample?

Although a 10-fold difference is quite big, the normalization procedure of `DESeq2` should mitigate the different read depth and, therefore, this difference. I believe that a PCA or sample-to-sample heatmap should show whether this approach has removed any potential bias caused by differences in sample read depth/coverage.

Do you know why you have such a great difference? I guess it is related to the number of cells in one sample versus another, but it is still a full order of magnitude.
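To make the normalization point concrete, here is a minimal standard-library Python sketch of median-of-ratios size factors, the idea behind the default size factor estimation in `DESeq2`. It is an illustration under assumptions, not the package's code: the function name `size_factors` and the toy counts are made up for this example. Note also that `DESeq2` expects raw integer counts, which is another reason to sum rather than average.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample -> list of raw counts (same gene order).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each usable gene across samples.
    log_geo_means = {}
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means[g] = sum(math.log(v) for v in values) / len(values)

    # Size factor = median ratio of a sample's counts to the geometric means.
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - lgm
                      for g, lgm in log_geo_means.items()]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Hypothetical pseudobulk sums: "large" has about 10x the depth of "small".
# The fourth gene is truly 3x higher in "large"; the fifth has a zero.
pseudobulk = {
    "small": [10, 20, 40, 30, 0],
    "large": [100, 200, 400, 900, 5],
}

factors = size_factors(pseudobulk)
normalized = {s: [c / factors[s] for c in pseudobulk[s]] for s in pseudobulk}
print(factors)
print(normalized)
```

In this toy case the two size factors differ by a factor of about 10, so the first three genes come out equal after normalization while the fourth keeps its 3-fold difference. Real data will be noisier, so the PCA or sample-to-sample heatmap is still worth checking.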

António