Question: Too Many Deg With EdgeR Output?
gravatar for biotech
7.1 years ago by
United States
biotech540 wrote:

Hi there,

I'm dealing with bacterial RNA-seq analysis. My experiment is very simple. Two samples to compare and no replicates. Reads were generated with Ion Torrent PGM using 316 chip. One for each sample and performed in different days.

Since I had too many differentially expressed genes, ┬┐Should I be more conservative assigning edgeR dispersion value? Also, there are considerable more up-regulated genes in exvivo than in plate sample.

See logFC_vs_logCPM figure:

Thanks for you help, Bernardo

  • tmap code:

tmap map2 -f HPNK_clean.fsa -r exvivo.fastq -i fastq -s exvivo.sam --verbose

tmap map2 -f HPNK_clean.fsa -r plate.fastq -i fastq -s plate.sam --verbose

  • Flasgstat:


3240242 + 0 in total (QC-passed reads + QC-failed reads)

2132481 + 0 mapped (65.81%:nan%)


3774075 + 0 in total (QC-passed reads + QC-failed reads)

3510438 + 0 mapped (93.01%:nan%)

  • count:

    python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID exvivo.sam HPNK.gff > exvivo.counts

    python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID plate.sam HPNK.gff > plate.counts

  • count stats:

ex-vivo stats

no_feature 777946

ambiguous 1

too_low_aQual 0

not_aligned 1107761

alignment_not_unique 0

plate stats

no_feature 776707

ambiguous 47

too_low_aQual 0

not_aligned 263637

alignment_not_unique 0

  • edgeR code:


files <- dir(pattern="*\.counts$")

RG <- readDGE(files, header=FALSE)


keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one count per million (cpm) in at least three samples

RG <- RG[keep,]


RG <- calcNormFactors(RG)



bcv <- 0.2 #Assigned dispersion value of 0.2

m <- as.matrix(RG)

d <- DGEList(counts=m, group=(1:2)) #modify 'group' depending on sample number. Also can be adapted to replicated samples, see'?DGEList'.


et <- exactTest(d, pair=(1:2),dispersion=bcv^2) #exactTest(RG, pair=(1:2),dispersion=bcv^2)


top <- topTags(et)


cpm(RG)[rownames(top), ] #Check the individual cpm values for the top genes:

summary(de <- decideTestsDGE(et)) #The total number of DE genes at 5% FDR is given by'decideTestsDGE'.


-1 200

0 1176

1 769

Of the 'number' tags identified as DE, 769 are up-regulated ex-vivo and 200 are down-regulated.

detags <- rownames(RG)[as.logical(de)] #detags contains the DE genes at 5% FDR

plotSmear(et, de.tags=detags) #plot all genes and highlight DE genes at 5% FDR

abline(h=c(-1, 1), col="blue") #The blue lines indicate 2-fold changes.

title("plate vs ex-vivo")

dev.copy2pdf(file = "Figure_1.pdf") #Save as .pdf##

ADD COMMENTlink modified 6.6 years ago • written 7.1 years ago by biotech540
gravatar for Sean Davis
7.1 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

First, you simply need to do replicates. There is no way around this fact if you want a robust, reproducible result. That said, a gene list can be ranked by FDR, so you are free to choose an FDR of 0.0001 or 0.01, or 0.1 if you like to generate a list of reasonable size; 0.05 is not a magical value. If you choose not to do replicates, then plan on a lot of PCR to validate your findings.

ADD COMMENTlink written 7.1 years ago by Sean Davis26k

How many PCRs should I perform if I don't do replicates?

ADD REPLYlink written 7.1 years ago by biotech540

When publishing, is it possible to avoid PCR validation having some replicates or is always necessary to do some PCRs?

ADD REPLYlink written 7.1 years ago by biotech540

There is no rule here. The important message is that to believe data, one needs some level of replication. How important a role PCR will play depends on the experimental system and hypotheses being tested. The goal of both PCR and replication is to reduce the cost of false positive results; without one or the other (or both), the cost could be high.

ADD REPLYlink written 7.1 years ago by Sean Davis26k
gravatar for Irsan
6.6 years ago by
Irsan7.2k wrote:
The amount of DEGs you get out is reallly dependent on the dispersion you manually put in. Have you put all your samples in one group, only keep 250 housekeeping genes and estimate the common dispersion? That dispersion estimate is the one you should put in. Also have a look at the edgeR manual to see a more detailed explanation what to do when you have no replicates (which is a bad idea already mentioned by Sean. If budget was the reason for not including replicates, its better to reduce sequencing yield per sample and include replicates).
ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by Irsan7.2k

Hi Irsan, thank you for your answer, it's really helpful, specially because you pointed out the detail about how to pick the dispersion value. I'm much obliged to you.

ADD REPLYlink written 6.6 years ago by biotech540
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2027 users visited in the last hour