Question: Edger Results Without Replicates, Fdr Looks Unnormal
3
gravatar for xiaojuhu13
5.0 years ago by
xiaojuhu13150
China
xiaojuhu13150 wrote:

I only get two samples without replicates for the edgeR analysis,but the results look unnormal,I set the common dispersion value equal to 0.4,most FDR equal to 1.

> raw.data<-read.table(file="48_50_1",header=T)
> d<-raw.data[,2:3]
> rownames(d)<-raw.data[,1]
> group<-factor(c("L","H"))
> design<-model.matrix(~group)
> d<-DGEList(counts=d,group=group)
> dim(d)
[1] 22928     2
> d <- calcNormFactors(d)
> keep <- rowSums(cpm(d)>100) >= 2
> d <- d[keep,]
> dim(d)
[1] 184   2
> d<-estimateGLMCommonDisp(d,design,method="deviance",robust="TRUE",subset=NULL)
Warning message:
In estimateGLMCommonDisp.default(y = y$counts, design = design,  :
  No residual df: setting dispersion to NA
> d$samples$lib.size <- colSums(d$counts)
> d
An object of class "DGEList"
$counts
        low high
AATK     45   77
ABCB1    56  120
ARHGAP4 261  383
C3      920  628
COL20A1  26   33
179 more rows ...

$samples
     group lib.size norm.factors
low      L    62017     1.046646
high     H    83398     0.955433

$AveLogCPM
[1]  9.742665 10.243595 12.110485 13.423755  8.758667
179 more elements ...

$common.dispersion
[1] NA

$design
  (Intercept) groupL
1           1      1
2           1      0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"

> d$common.dispersion=0.4
> et <- exactTest(d)
> top<-topTags(et)
> top
Comparison of groups:  L-H 
                 logFC    logCPM      PValue       FDR
LOC101119889 -4.609729 10.669772 0.004828116 0.8883734
LOC101112298  1.983919  8.613781 0.201633226 1.0000000
LOC101121082 -1.577096 12.194037 0.257687080 1.0000000
PER2          1.556200 11.766520 0.265899128 1.0000000
GPT          -1.374615 12.820088 0.321135663 1.0000000
LOC101121333 -1.354515 10.613348 0.336138933 1.0000000
LOC101103050 -1.416573  8.305962 0.355318326 1.0000000
LOC101112300  1.420906  7.982203 0.380212348 1.0000000
LOC101110133 -1.174812 10.631434 0.406782402 1.0000000
LOC101108558 -1.141834 12.802543 0.407262691 1.0000000
> summary(de <- decideTestsDGE(et))
   [,1]
-1    0
0   184
1     0
> plotMDS(d)
Error in plotMDS.DGEList(d) : Need at least 3 columns of data

This is what the sessionInfo() gives

> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.2.4  limma_3.16.8
edger • 6.0k views
ADD COMMENTlink modified 5.0 years ago by Devon Ryan86k • written 5.0 years ago by xiaojuhu13150

So what exactly is so "unnormal" about the results? Without replicates, it'd be odd to find much of anything.

ADD REPLYlink written 5.0 years ago by Devon Ryan86k

I'm using the same data with cufflinks, DESeq, edgeR to find the differential expression. The results I find the latter two have similar results,but totally don't overlap with cufflinks. So I think it may be edgeR problems, but I don't know the reason.

ADD REPLYlink written 5.0 years ago by xiaojuhu13150
1

I wouldn't expect any of the programs to have completely overlapping results, they use different algorithms. Why are you bothering doing a comparison when you have no replicates. The results are only questionably meaningful to begin with. I would not trust cuffdiff results if they're much different from those of DESeq.

ADD REPLYlink written 5.0 years ago by Devon Ryan86k

Thanks, but I just get two samples without for practice. And I'd like to know if the alignment rate affects the differential expression results. I used tophat 2.0.8b, and checked the tophat/logs [huxj@LoginNode logs]$cat bowtie.left_kept_reads.log 43476250 reads; of these: 43476250 (100.00%) were unpaired; of these: 18745980 (43.12%) aligned 0 times 19250394 (44.28%) aligned exactly 1 time 5479876 (12.60%) aligned >1 times 56.88% overall alignment rate

ADD REPLYlink written 5.0 years ago by xiaojuhu13150

This is single-ended then? That would seem to be a low overall alignment rate. Do you have a reference annotation or are you stuck aligning only to the genome? You'll generally get better results with an annotation file (GTF or GFF).

ADD REPLYlink written 5.0 years ago by Devon Ryan86k

Paired-ended data. But I'm not sure tophat results (logs file ) are a right way to check the alignment rate. Somebody said we should use the flagstat bam file to check, if in that way, the alignment rate is higher than 70%. How do you check the alignment rate?

ADD REPLYlink written 5.0 years ago by xiaojuhu13150

Yeah, you're better off using flagstat.

ADD REPLYlink written 5.0 years ago by Devon Ryan86k

OK, thanks. So I at least realise that the alignment rate is not the problem.

ADD REPLYlink written 5.0 years ago by xiaojuhu13150

It was difficult for me to read the data you pasted. However, it is not uncommon that most FDR values are equal to one. It depends on the number of tests and on pvalues. You can find details on computation of FDR here A: How to estimate the false discovery rate (FDR) using bootstrap

ADD REPLYlink written 5.0 years ago by Fabio Marroni2.0k

I'm sorry, I'm just a newbie to edgeR. Because there are no replicates, after the step (d<-estimateGLMCommonDisp(d,design,method="deviance",robust="TRUE",subset=NULL) I don't get the dispersion value, so I set it handly to 0.4 as suggested by edgeR manual.But I get only 10 differential expression genes, basically no significant one gene. But the cufflinks results have 69 significant genes. So my question is that if there are some problems with this edgeR analysis.

ADD REPLYlink written 5.0 years ago by xiaojuhu13150

I'm a newbie to edgeR, too. However, after a careful reading of your code, I think there might be a little misunderstanding of the manual, which led to your problem. As shown in code below, you directly set the dispersion value to 0.4,which actually should be given to BCV , since dispersion value is equal to BCV^2. In this way, your dispersion value is equal to 0.16, which will guarantee you a list of more DE genes.

> d$common.dispersion=0.4
> et <- exactTest(d)

According to solutions offered in chapter 2.11"What to do if you have no replicates", the standard way to manually set dispersion value includes 2 steps(,code below is copied directly from the manual,option 2):

> bcv <- 0.2
> counts <- matrix( rnbinom(40,size=1/bcv^2,mu=10), 20,2)
> y <- DGEList(counts=counts, group=1:2)
> et <- exactTest(y, dispersion=bcv^2)

In your case, you can try adding the first & last line into your script, and see if a larger number of DE genes are detected. Hope this can help.

ADD REPLYlink written 18 months ago by siyuanfeng.bioinfo0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1238 users visited in the last hour