Question

Confusing edgeR result for reads without count

0

Entering edit mode

7.0 years ago

polag03 • 0

Hi all, Please I need clarification on this output from edgeR on differential expression analysis. Two samples with three replicates each are being compared. The output is confusing as these genes are reported by edgeR as being regulated (down) whereas the counts of the genes in the count matrix for all the three replicates of the second sample are 0. How can this be? How does edgeR calculate log of fold change? how does edgeR handle this situation?

A sample output is as shown below

ATTTTGGCGATCCAACACTTAAGT        -6.88592754685162       -0.45091919423578       0.000288587012249043    0.00161769069658904     0.729454376591938       -1      8       8       19      0       0       0
CCAGTGGTCTCGTGGTAGAATA  -7.69674781160641       0.229731373817187       2.42554724140601e-05    0.000203815577681898    0.691945707994644       -1      21      29      7       0       0       0
GCGACCCAACACTTAAGTCCC   -7.45304418170364       0.0251105405118642      0.000127331469322578    0.000825658411235952    0.816342020435996       -1      17      3       34      0       0       0

differential expression edgeR count matrix • 2.3k views

ADD COMMENT • link 7.0 years ago by polag03 • 0

0

Entering edit mode

What is your null hypothesis? If you set the default (LFC = 0) then a LFC of 0.01 is already a differential expression for the algorithm.

Rule of thumb, even if there aren't any in this field, is:

LFC = 0.5 corresponds to 50% change in gene expression. It could be a good threshold to determine true DEGs, however thresholds are only a lie so use with caution :D

ADD REPLY • link 7.0 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Thank you Macspider for the quick response. The null hypothesis is that the genes are not differentially expressed in the two samples.

However, that is not why I am confused.

The last 6 columns of the output are the raw counts of the genes in the two samples ( 3 replicates each). From the output, it is clear that there were no counts for the three sequences in the second sample for the 3 replicates (meaning the sequences are present in sample A but not in sample B). Yet, a LFC of -6.88592754685162, -7.69674781160641 and -7.45304418170364 were obtained as shown in column 2. Since LFC is a ratio, I would expect the LFC for these to be infinity or at least not a figure.

Please how does edgeR calculate the LFC between two samples with multiple replicates? i guess this will help shed some light into this. Thank you

ADD REPLY • link 7.0 years ago by polag03 • 0

1

Entering edit mode

There is the concept of pseudocounts that is important to understand here:

https://support.bioconductor.org/p/32751/

You can find a clear description of how they do it also in the EdgeR manual:

https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

Just open it and CTRL+F "pseudocounts" ;)

ADD REPLY • link 7.0 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Ah, I missed that! I had gone through the manual before posting the question. let me just make some confirmations in my data. I will revert.

Thanks a lot Macspider

ADD REPLY • link 7.0 years ago by polag03 • 0

0

Entering edit mode

You're welcome! If you realize that the answer solved your problem, marked it as "answer" so the thread gets bookmarked as closed and future readers will benefit from it. :)

ADD REPLY • link 7.0 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

I have checked with the input data, the count matrix that was passed as input to edgeR is the same as represented in the output. The output counts are not pseudocounts.

ADD REPLY • link 7.0 years ago by polag03 • 0

1

Entering edit mode

Pseudocounts are added by EdgeR during the calculation but are not shown at the end. It's just a mathematical trick to avoid ending in the no fly zone for Logarithms!

ADD REPLY • link 7.0 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Okay I see. Just one more thing... how is the LFC calculated for the three replicates? Is the average of the counts for each sample taken and used to compute the ratio (like average of rep1, rep2 and rep3 computed for each sample and used), or is there any other formula used? what is it?

Thanks a lot for your help

ADD REPLY • link 7.0 years ago by polag03 • 0

0

Entering edit mode

As far as I know, it computes the average between replicates. However, it strongly depends on the design variable. The number of replicates will also affect the p-value.

ADD REPLY • link 7.0 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Another comment on something you mentioned here:

The null hypothesis is that the genes are not differentially expressed in the two samples.

This is the empirical null hypothesis, but the differential expression function has certainly a default value to determine IF they are differentially expressed or not. I am more of a DESeq2 user, but they are definitely compatible as examples:

If I test against the null hypothesis "abs(LFC)>0.5" I will have a different list of DEGs, with respect to testing against "abs(LFC)!=0". This because I fixed a different function to be satisfied by the testing. Usually, the default is LFC!=0, so you get differentially expressed genes which actually have a LFC of 0.001 or something like that.

ADD REPLY • link 7.0 years ago by Matteo Schiavinato ★ 3.6k