Question: strange output from edgeR when searching differential expressed genes
0
gravatar for tujuchuanli
6 months ago by
tujuchuanli40
tujuchuanli40 wrote:

HI, I want to find differential expressed genes in TCGA datasets by edgeR. I used RNA-seq data measured by RPKM as input. The below code is basically followed edgeR manual section4.2 (https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) combined with a post on Biostars (https://www.biostars.org/p/212200/).

#reading files
file1="cancer.exprs.txt"

file2="normal.exprs.txt"

cancer.exprs = read.table(file = file1,header=T,sep="\t")

normal.exprs = read.table(file = file2,header=T,sep="\t")

#combine cancer and normal expression data and filter by median of expression value
exprs=cbind(cancer.exprs,normal.exprs)

rownames(exprs) = rownames(cancer.exprs)

median = apply(exprs,1,median)

exprs.filter = exprs[which(median >= 0.1),]

#put the expression data into edgeR object
y = DGEList(counts = exprs.filter, genes = rownames(exprs.filter))

#construct design model
Tissue = factor(c(rep("T",dim(cancer.exprs)[2]),rep("N",dim(normal.exprs)[2])))

data.frame(Sample = colnames(y),Tissue)

design = model.matrix(~0+Tissue)

#compute differential expressed genes

y = estimateDisp(y, design, robust=TRUE)

y$common.dispersion

fit = glmFit(y, design)

lrt = glmLRT(fit)

topTags(lrt,n=100)

cancer.exprs.txt and normal.exprs.txt are two files containing expression value measured by RPKM from cancer and normal samples. In these two files each column represents each sample and each row represents each gene. The first column and row are sample id and gene id. After running the code, I find that all genes are up-regulated genes with very huge logFC (I think it is log foldchange). Even I used cancer.exprs.txt as both cancer and normal input file, this code also consider all genes are up-regulated genes with very huge logFC. What is the wrong with me? Thanks

rna-seq edger • 300 views
ADD COMMENTlink modified 6 months ago by Michael Dondrup46k • written 6 months ago by tujuchuanli40

I will try raw counts. But I think it is unreasonable to ban to use RPKM as input. Under this situation, I can do nothing but use raw counts? Is there some ways to use RPKM as input? Thanks

ADD REPLYlink written 6 months ago by tujuchuanli40

RPKM has not been "banned" as input. edgeR uses a negative binomial distribution to model read counts, because read counts follow reasonably well the negative binomial distribution (whereas RPKMs do not) . So it is unreasonable to use data that do not follow the assumptions of the model, and expect sensible results.

ADD REPLYlink written 6 months ago by h.mon24k

Well, I tried raw counts as input. However the outcome still show that all genes are up-regulated genes. I think it is not because of RPKM (Although using raw counts is correct). It is probably because of my code. Is there something wrong about my code? Thanks.

ADD REPLYlink written 6 months ago by tujuchuanli40
1
gravatar for swbarnes2
6 months ago by
swbarnes25.2k
United States
swbarnes25.2k wrote:

cancer.exprs.txt and normal.exprs.txt are two files containing expression value measured by RPKM from cancer and normal samples.

That's your problem. The EdgeR documentation is clear that the software expects raw read counts.

edgeR works on a table of integer read counts, with rows corresponding to genes and columns to independent libraries

ADD COMMENTlink written 6 months ago by swbarnes25.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1279 users visited in the last hour