I am struggling with a statistical question related to RNA-seq data.
I have RNA-seq data on 4 different cell types, all collected from the same person, and I have such sets (batches) of cells from 3 individuals. Each batch of 4 cell types was prepared for RNA sequencing separately.
My goals are 1) to identify the relationships between these cell types (which ones are most similar and which ones are more different) and 2) to find genes that are differentially expressed between the cell types.
After receiving the sequencing data, I noticed a clear batch/donor effect between the 3 sets of samples.
Standard normalisation procedures were not effective, as the batch effect was different for genes with a high number of reads than for those with low read counts. Samples clustered according to donor, not cell type.
What does work pretty nicely is to divide the expression values of each sample by the mean expression over all 4 cell types from that donor (i.e. to scale out the batch difference per gene) and then use the scaled values, after log transformation, as input for clustering. After doing this for each donor separately, I get a nice clustering according to cell type.
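To make the procedure concrete, here is a minimal sketch of the scaling step in Python with NumPy. It is not my actual pipeline: the function name `scale_by_donor_mean`, the pseudocount of 1 (added only to avoid taking the log of zero), the simulated toy data, and the use of sample-to-sample correlation as a stand-in for clustering are all assumptions made for illustration.

```python
import numpy as np

def scale_by_donor_mean(expr, donors, pseudocount=1.0):
    """Divide each gene by its mean over all samples of the same donor, then log2.

    expr   : genes x samples array of normalised expression values
    donors : donor label for each sample (column)
    The pseudocount is an assumption, used only to avoid log(0).
    """
    expr = np.asarray(expr, dtype=float) + pseudocount
    donors = np.asarray(donors)
    scaled = np.empty_like(expr)
    for d in np.unique(donors):
        cols = donors == d
        # Per-gene mean over the cell types of this donor
        donor_mean = expr[:, cols].mean(axis=1, keepdims=True)
        scaled[:, cols] = expr[:, cols] / donor_mean
    return np.log2(scaled)

# Toy data (simulated, not real): 200 genes, 3 donors x 4 cell types,
# with a gene-specific multiplicative donor effect.
rng = np.random.default_rng(0)
n_genes = 200
cell_types = np.tile(np.array(["A", "B", "C", "D"]), 3)
donors = np.repeat(np.array(["d1", "d2", "d3"]), 4)
profile = rng.lognormal(mean=5.0, sigma=1.0, size=(n_genes, 4))
donor_effect = rng.lognormal(mean=0.0, sigma=0.5, size=(n_genes, 3))
noise = rng.lognormal(mean=0.0, sigma=0.05, size=(n_genes, 12))
expr = np.tile(profile, (1, 3)) * np.repeat(donor_effect, 4, axis=1) * noise

log_scaled = scale_by_donor_mean(expr, donors)

# Sample-to-sample Pearson correlation on the scaled values
corr = np.corrcoef(log_scaled.T)
off_diag = corr.copy()
np.fill_diagonal(off_diag, -np.inf)
nearest = off_diag.argmax(axis=1)  # most similar other sample
print(list(zip(cell_types, cell_types[nearest])))
```

In this toy setting the most similar sample to each sample comes from the same cell type in another donor, which mirrors what I see in my real data after scaling.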
My question now is whether this is a legitimate thing to do. I cannot find any literature on a similar case.
Secondly, I would like to know whether I can use the scaled values in a statistical test such as ANOVA to find differentially expressed genes (DEGs). I realize that by centering on the donor mean I have made the samples from each donor interdependent, and that I am removing variation in gene expression levels, so it does not feel completely right. But because the clustering performs so well, I am tempted to continue, also because genes that are consistently higher in one cell type than in another would be interesting to me.
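For reference, the test I have in mind is roughly a per-gene one-way ANOVA across cell types on the scaled values. The sketch below only computes the F statistic with NumPy. The function name `one_way_anova_f` and the tiny hand-made input are hypothetical, and the degrees of freedom treat all 12 samples as independent, which is exactly the assumption I am unsure about after centering per donor.

```python
import numpy as np

def one_way_anova_f(values, groups):
    """Per-gene one-way ANOVA F statistic.

    values : genes x samples array (e.g. the log-transformed scaled values)
    groups : group label (here: cell type) for each sample
    """
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = values.mean(axis=1)
    ss_between = np.zeros(values.shape[0])
    ss_within = np.zeros(values.shape[0])
    for g in labels:
        block = values[:, groups == g]
        group_mean = block.mean(axis=1)
        ss_between += block.shape[1] * (group_mean - grand_mean) ** 2
        ss_within += ((block - group_mean[:, None]) ** 2).sum(axis=1)
    # Degrees of freedom assume independent samples (questionable here)
    df_between = len(labels) - 1
    df_within = values.shape[1] - len(labels)
    return (ss_between / df_between) / (ss_within / df_within)

# Hand-made toy input: 2 genes, 3 donors x 4 cell types
cell_types = np.tile(np.array(["A", "B", "C", "D"]), 3)
donor_idx = np.repeat(np.arange(3), 4)
type_idx = np.tile(np.arange(4), 3)
offsets = np.array([-0.1, 0.0, 0.1])      # small residual variation per donor
gene_changing = type_idx + offsets[donor_idx]  # differs between cell types
gene_flat = offsets[donor_idx]                 # same in every cell type
toy = np.vstack([gene_changing, gene_flat])

f_stats = one_way_anova_f(toy, cell_types)
print(f_stats)
```

With 4 cell types and 12 samples, the F statistic would be compared against an F distribution with 3 and 8 degrees of freedom if the samples really were independent.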
Any feedback on possible mistakes I am introducing and/or alternative methods I could use is very much appreciated!