Question: edgeR: very low p-value and very high variance within the group of replicates. What's my problem?
7.4 years ago by
valentina60 wrote:

I'm using edgeR to perform differential expression analysis on data from an RNA-seq experiment.

I have 6 tumor cell samples, all from the same tumor type and with the same treatment: 3 patients with good prognosis and 3 patients with bad prognosis. I want to compare gene expression between the two groups.

I ran the edgeR package as follows:

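library(edgeR)

# Raw read counts, one row per gene
x <- read.delim("my_reads_count.txt", row.names="GENE")
group <- factor(c(1,1,1,2,2,2))   # 1 = first three samples, 2 = last three
y <- DGEList(counts=x,group=group)
y <- calcNormFactors(y)           # TMM normalization factors (the default method)
y <- estimateCommonDisp(y)
y <- estimateTagwiseDisp(y)
et <- exactTest(y)                # by default compares group 2 against group 1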

I obtained very odd results: in some cases I got a very low p-value and FDR, but looking at the raw data it seems obvious that the difference between the two groups can't be significant. Here is an example from my_reads_count.txt:

GENE               sample1_1  sample1_2  sample1_3  sample2_1  sample2_2  sample2_3
ENSG00000198842    0          3          2          2          6666       3
ENSG00000257017    3          3          25         2002       29080      4

And from my_edgeR_resulta.txt:

GENE               logFC         logCPM        PValue        FDR
ENSG00000198842    9.863211e+00  5.4879462930  5.368843e-07  1.953612e-04
ENSG00000257017    9.500927e+00  7.7139869397  8.072384e-10  7.171947e-07

I would like the variance within each group to be taken into account. Can anyone help me? Any suggestions?

ADD COMMENTlink modified 7.4 years ago by Steve Lianoglou5.1k • written 7.4 years ago by valentina60

Is your raw data normalized?

ADD REPLYlink written 7.4 years ago by Damian Kao15k

The raw data are the counts of reads mapping within exons (obtained by running htseq-count). The normalization is performed with calcNormFactors(y). Am I correct?

ADD REPLYlink modified 7.4 years ago • written 7.4 years ago by valentina60
7.4 years ago by
Steve Lianoglou5.1k wrote:

The variance is considered, but your signal is apparently a lot higher than the variance.

These two genes have monster levels of expression in your "Group 2": you're looking at loci with a handful of reads in group 1 (25 at most), and thousands to tens of thousands of reads in some group 2 samples.
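To make the comparison concrete, here is a minimal base R sketch that summarises the two example rows from the question by group. The helper name `summarise_by_group` is hypothetical, and this is only a descriptive summary, not a reproduction of what edgeR computes internally.

```r
# Counts copied from the two example rows in the question.
counts <- rbind(
  ENSG00000198842 = c(0, 3, 2, 2, 6666, 3),
  ENSG00000257017 = c(3, 3, 25, 2002, 29080, 4)
)
colnames(counts) <- c("sample1_1", "sample1_2", "sample1_3",
                      "sample2_1", "sample2_2", "sample2_3")
group <- factor(c(1, 1, 1, 2, 2, 2))

# Hypothetical helper: apply a summary function f to each gene within
# each group; returns a genes x groups matrix.
summarise_by_group <- function(m, g, f) {
  t(apply(m, 1, function(row) tapply(row, g, f)))
}

group_mean   <- summarise_by_group(counts, group, mean)
group_median <- summarise_by_group(counts, group, median)

print(round(group_mean, 1))
print(group_median)
```

Comparing the per-group mean with the per-group median shows how much of the group 2 signal comes from a single sample.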

Are the library sizes wildly different between samples?

You might consider filtering out genes that do not exhibit minimal expression in at least 3 samples, which should remove your first gene (ENSG00000198842), and possibly your second gene.
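As a rough illustration of that kind of filter, here is a base R sketch applied to the two example rows. The cutoff of 10 raw reads (`min_count`) is an assumption chosen for illustration, not a value from this thread; on a full data set a cutoff on counts per million (for example via edgeR's `cpm()`) is usually preferable because it accounts for library size.

```r
# Counts copied from the two example rows in the question.
counts <- rbind(
  ENSG00000198842 = c(0, 3, 2, 2, 6666, 3),
  ENSG00000257017 = c(3, 3, 25, 2002, 29080, 4)
)

min_count   <- 10  # hypothetical cutoff, chosen only for illustration
min_samples <- 3   # "at least 3 samples", as suggested above

# Keep a gene only if it reaches min_count in at least min_samples samples.
keep     <- rowSums(counts >= min_count) >= min_samples
filtered <- counts[keep, , drop = FALSE]

print(keep)
```

With this particular cutoff the first gene is dropped and the second is kept; a stricter cutoff could remove both.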

ADD COMMENTlink written 7.4 years ago by Steve Lianoglou5.1k

Should I remove these genes before the edgeR analysis, or after?

ADD REPLYlink written 7.4 years ago by valentina60

You should remove them before you call the DGEList function. I don't think there's a standard way of choosing the cutoff. However, the edgeR user's guide describes several ways of removing genes based on low expression levels.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by Jason900