Dear all, could you please advise us on the following:
Suppose we have 2 batches of scRNA-seq data for 2 conditions (WT_batch1, WT_batch2, DISEASE_batch1, DISEASE_batch2). Would the following approach be statistically legitimate for accounting/correcting for the batch effect:
1 -- use CCA or MNNcorrect to account for the batch effects
2 -- followed by t-SNE and network_based_clustering, in order to place the cells correctly in the CORRECT CLUSTERS
3 -- and perform differential expression (with the Wilcoxon test, limma, edgeR, etc.) between the CLUSTERS
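To make step 3 concrete, here is a minimal sketch of a per-gene Wilcoxon rank-sum (Mann-Whitney U) comparison between two clusters. It is plain Python with a normal approximation, only meant as an illustration: the gene names, the toy values and the helper `rank_sum_test` are hypothetical, and a real analysis would rather use an established package and correct for multiple testing.

```python
from math import sqrt
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test.

    Uses the normal approximation with average ranks for ties and a
    tie-corrected variance (no continuity correction).
    Returns (U statistic of x, p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Label each value with its group: 0 for x, 1 for y.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of the ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all values identical: no evidence of a difference
        return u, 1.0
    z = (u - mean_u) / sqrt(var_u)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p


# Hypothetical toy data: normalized expression per gene, one value per cell.
cluster_0 = {
    "GeneA": [5.1, 4.8, 5.6, 5.3, 4.9, 5.7],
    "GeneB": [1.0, 1.2, 0.9, 1.1, 1.3, 0.8],
}
cluster_1 = {
    "GeneA": [1.2, 0.9, 1.5, 1.1, 0.7, 1.4],
    "GeneB": [1.1, 0.95, 1.25, 0.85, 1.05, 1.15],
}

results = {gene: rank_sum_test(cluster_0[gene], cluster_1[gene]) for gene in cluster_0}
for gene, (u, p) in results.items():
    print(f"{gene}: U={u:.1f}, p={p:.4f}")
```

In this toy example GeneA is clearly higher in cluster_0 while GeneB is not; the test is applied to the uncorrected normalized values, with the cluster labels taken from the batch-corrected clustering.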
As far as we understand, CCA or MNNcorrect only place the cells in more "correct" clusters after batch correction, and do NOT provide batch-corrected expression values.
In this case, considering for instance cluster_0, could we combine:
a -- the matrix of cells : normalized_expression in cluster_0 in WT_batch1
with the matrix of cells : normalized_expression in cluster_0 in WT_batch2
(let's call this matrix WT_batch1_batch2)
b -- the matrix of cells : normalized_expression in cluster_0 in DISEASE_batch1
with the matrix of cells : normalized_expression in cluster_0 in DISEASE_batch2
(let's call this matrix DISEASE_batch1_batch2)
c -- and use limma, edgeR or DESeq2 on WT_batch1_batch2 versus DISEASE_batch1_batch2 in order to get the differential expression?
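To illustrate the pooling described in a to c, here is a minimal sketch in plain Python on toy data. The cell records, gene names and the helper `pool_cluster_by_condition` are hypothetical and not taken from any package. The sketch keeps the batch label of every pooled cell, on the assumption that it may still be useful downstream (for example as a covariate), although whether that is needed is part of our question.

```python
from collections import Counter, defaultdict

# Hypothetical per-cell records:
# (cell_id, condition, batch, cluster, normalized expression per gene)
cells = [
    ("c1", "WT", "batch1", 0, {"GeneA": 2.1, "GeneB": 0.0}),
    ("c2", "WT", "batch2", 0, {"GeneA": 1.8, "GeneB": 0.3}),
    ("c3", "WT", "batch2", 1, {"GeneA": 0.2, "GeneB": 3.1}),
    ("c4", "DISEASE", "batch1", 0, {"GeneA": 4.0, "GeneB": 0.1}),
    ("c5", "DISEASE", "batch2", 0, {"GeneA": 3.7, "GeneB": 0.0}),
    ("c6", "DISEASE", "batch1", 1, {"GeneA": 0.4, "GeneB": 2.8}),
]


def pool_cluster_by_condition(cells, cluster):
    """Collect the cells of one cluster, grouped by condition across batches.

    Each pooled cell keeps its batch label so it is not lost by the pooling.
    """
    pooled = defaultdict(list)
    for cell_id, condition, batch, cell_cluster, expr in cells:
        if cell_cluster == cluster:
            pooled[condition].append({"cell": cell_id, "batch": batch, "expr": expr})
    return dict(pooled)


pooled = pool_cluster_by_condition(cells, cluster=0)
WT_batch1_batch2 = pooled["WT"]
DISEASE_batch1_batch2 = pooled["DISEASE"]

# Number of cluster_0 cells contributed by each batch, per condition.
for name, group in (("WT", WT_batch1_batch2), ("DISEASE", DISEASE_batch1_batch2)):
    per_batch = Counter(cell["batch"] for cell in group)
    print(name, len(group), dict(per_batch))
```

The two pooled groups would then be passed to the differential expression step, using the uncorrected normalized values.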
We would prefer to combine the batches into WT_batch1_batch2 and DISEASE_batch1_batch2, respectively, because the number of cells in a cluster may sometimes be small (i.e. fewer than 200 cells).
Or is there any other approach that you would recommend?