I want to identify differential genes (DEG) in TCGA dataset (cancer samples vs normal samples) by combination of
sva (get rid of batch effect by
sva). The following is my code and I am not sure if it is ok or not. Could you please help me to check the code and give me some suggestions?
I mainly referred to the following three posts and the manual of these two packages.
Below is my code:
the first step is constructing file and filtering out lowly expressed genes. expres.filter is the matrix of reads counts for each gene in each sample. In this matrix, each row is a gene and each column is a sample.
y=DGEList(counts=exprs, genes=rownames(exprs)) keep <- rowSums(cpm(y)>0.1) >= 20 y <- y[keep, , keep.lib.sizes=FALSE] y$samples$lib.size <- colSums(y$counts) y=calcNormFactors(y) y.cpm.values <- cpm(y$counts, log = F)
the second step is batch effect detection by sva, there are 120 cancer samples and 20 normal samples in my dataset.
Tissue <- factor(c(rep("T",120),rep("N",20)) mod <- model.matrix(~Tissue, data=Tissue) mod0 <- model.matrix(~1, data=Tissue) svseq = svaseq(y.cpm.values, mod, mod0) design <- cbind(mod, svseq$sv)
the third step is identifing DEG by edgeR
y <- estimateDisp(y, design, robust=TRUE) fit <- glmQLFit(y, design) lrt <- glmTreat(fit,lfc=log2(1.5))
Finally, I get my DEG list.