Question

Does gene raw read count mean and dispersion preserve batch effects if I used raw counts from different batches to calculate them?

0

Entering edit mode

19 months ago

aldas02027 • 0

Question: Does gene raw read count mean and dispersion preserve batch effects if I used raw counts from different batches to calculate them?

Context
I ask this because I need these values as inputs to the ssizeRNA_vary function of the ssizeRNA package. Alternatively, I can remove batch effects with removeBatchEffect() function of the limma package before calculating mean and dispersion. However, this function returns some counts as negative values. Consequently, the mean values of some genes become negative. Although, negative values are not allowed by ssizeRNA_vary function. Hence, I must use raw counts to calculate those two statistics.

I used the ssizeRNA package to calculate the sample size. This package has the main function ssizeRNA_vary, where you specify parameters to compute a sample size. The example parameters are as follows:

Total number of genes: G=10000;
The proportion of non-DE genes: pi0=0.8;
FDR level to control: fdr = 0.05;
Desired average power to achieve: power = 0.8;
Average read count for each gene in the control group: mu = 10;
Dispersion parameter for each gene: disp=0.1;
Fold change for each gene: fc=2.

However, you can replace mu and disp parameters with lists of mu and disp values of your dataset genes. The example of how I calculated mu and disp is below.

Code according to ssizeRNA vignette

#Load counts
counts <- read.csv("RCC_counts_v2.csv", row.names = 1)
counts[1:5,1:5]
                   SRR22708058 SRR22708059 SRR22708063 SRR22708068 SRR22708069
ENSG00000000003.15         583         233         375         571         290
ENSG00000000005.6            4           0           9           1           0
ENSG00000000419.14         385         564         230         366         333
ENSG00000000457.14         112         204          47         180         213
ENSG00000000460.17          59          95          30          89         109

#Load coldata
coldata <- read.csv("subtype_coldata_purity.csv", row.names = 1)

#Remove zero count genes
counts <- counts[rowSums(counts)> 0,]

#Obtain names of control group samples
normal_samples <- coldata %>% .[.$subtype == "normal",] %>% 
  rownames(.)

#Select control group counts 
normal_counts <- counts[, normal_samples]

#Calculate mean value for each gene
mu <- apply(counts, 1, mean)
mu[1:3]
ENSG00000000003.15  ENSG00000000005.6 ENSG00000000419.14 
        362.240741           7.888889         232.629630

#Calculate dispersion for each gene
d <- DGEList(counts)
d <-calcNormFactors(d) 
d <-estimateCommonDisp(d) 
d <-estimateTagwiseDisp(d) 
disp <- d$tagwise.dispersion
disp[1:3]
[1] 1.226187 4.109061 1.103636

n <- nrow(normal_counts)
size2 <-ssizeRNA_vary(nGenes = n, pi0 = 0.8, m = 200, mu = mu, disp = disp,
              fc = 2,fdr = 0.05, power = 0.8, maxN = 300, replace = T)

sample size ssizeRNA • 944 views

ADD COMMENT • link updated 19 months ago by GenoMax 152k • written 19 months ago by aldas02027 • 0