Question: Does gene raw read count mean and dispersion preserve batch effects if I used raw counts from different batches to calculate them?
Context
I ask this because I need these values as inputs to the ssizeRNA_vary function of the ssizeRNA package. Alternatively, I can remove batch effects with removeBatchEffect() function of the limma package before calculating mean and dispersion. However, this function returns some counts as negative values. Consequently, the mean values of some genes become negative. Although, negative values are not allowed by ssizeRNA_vary function. Hence, I must use raw counts to calculate those two statistics.
I used the ssizeRNA package to calculate the sample size. This package has the main function ssizeRNA_vary, where you specify parameters to compute a sample size. The example parameters are as follows:
- Total number of genes: G=10000;
- The proportion of non-DE genes: pi0=0.8;
- FDR level to control: fdr = 0.05;
- Desired average power to achieve: power = 0.8;
- Average read count for each gene in the control group: mu = 10;
- Dispersion parameter for each gene: disp=0.1;
- Fold change for each gene: fc=2.
However, you can replace mu and disp parameters with lists of mu and disp values of your dataset genes. The example of how I calculated mu and disp is below.
Code according to ssizeRNA vignette
#Load counts
counts <- read.csv("RCC_counts_v2.csv", row.names = 1)
counts[1:5,1:5]
SRR22708058 SRR22708059 SRR22708063 SRR22708068 SRR22708069
ENSG00000000003.15 583 233 375 571 290
ENSG00000000005.6 4 0 9 1 0
ENSG00000000419.14 385 564 230 366 333
ENSG00000000457.14 112 204 47 180 213
ENSG00000000460.17 59 95 30 89 109
#Load coldata
coldata <- read.csv("subtype_coldata_purity.csv", row.names = 1)
#Remove zero count genes
counts <- counts[rowSums(counts)> 0,]
#Obtain names of control group samples
normal_samples <- coldata %>% .[.$subtype == "normal",] %>%
rownames(.)
#Select control group counts
normal_counts <- counts[, normal_samples]
#Calculate mean value for each gene
mu <- apply(counts, 1, mean)
mu[1:3]
ENSG00000000003.15 ENSG00000000005.6 ENSG00000000419.14
362.240741 7.888889 232.629630
#Calculate dispersion for each gene
d <- DGEList(counts)
d <-calcNormFactors(d)
d <-estimateCommonDisp(d)
d <-estimateTagwiseDisp(d)
disp <- d$tagwise.dispersion
disp[1:3]
[1] 1.226187 4.109061 1.103636
n <- nrow(normal_counts)
size2 <-ssizeRNA_vary(nGenes = n, pi0 = 0.8, m = 200, mu = mu, disp = disp,
fc = 2,fdr = 0.05, power = 0.8, maxN = 300, replace = T)