Question: Understanding the statistics behind edgeR
vm.higareda10 wrote (5 months ago):

I am confused about the normalization and statistics behind DE programs. I am using edgeR to analyze two conditions.

Example for one gene (raw counts), with four replicates per condition, control (C) and treatment (T):

Gene = FBgn0034710

Controls = 820, 1618, 1728, 1007

Treatments = 7195, 1252, 1312, 1291

edgeR result:

logFC = 1.10, logCPM = 6.5, LR = 9.77, PValue = 0.0017, FDR = 0.02

Why is the gene FBgn0034710 statistically significant when a single replicate has far more raw counts (7195) than the others? I know that library size could be a factor, but it is similar across the replicates.

modified 5 months ago by Kevin Blighe • written 5 months ago by vm.higareda10

Try removing such outliers within a group and rerunning the statistical test. I do not think edgeR has any mechanism to prune such data. One should filter out such discrepancies in expression level, within and across groups, and then feed the data to edgeR.

Kevin Blighe (London / Brazil) wrote (5 months ago):

Hey,

It's significant because that single outlier inflates the treatment mean, which makes the difference in means between the two groups large. However, you should note that the log fold change (logFC) is just 1.10, so I would not consider this gene at all for downstream analyses. Usually we filter genes on a combination of the FDR-adjusted P value (i.e. the FDR Q value) and the logFC.
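As a rough illustration of how much that one replicate drives the result, here is a minimal Python sketch using the raw counts from the question. It only takes the log2 ratio of the group means, assuming equal library sizes (the question says they are similar). This is not edgeR's calculation, which fits a negative binomial model with library-size offsets, and the function name is made up for the example.

```python
import math

# Raw counts for FBgn0034710 as posted in the question.
control = [820, 1618, 1728, 1007]
treatment = [7195, 1252, 1312, 1291]


def log2_ratio_of_means(treated, ctrl):
    """Naive log2(mean treated / mean control).

    Assumption: equal library sizes, no dispersion modelling.
    This is a back-of-the-envelope check, not edgeR's logFC.
    """
    mean_t = sum(treated) / len(treated)
    mean_c = sum(ctrl) / len(ctrl)
    return math.log2(mean_t / mean_c)


# All four treatment replicates, including the 7195 one.
with_outlier = log2_ratio_of_means(treatment, control)

# Same comparison after dropping the single largest treatment replicate.
trimmed = [c for c in treatment if c != max(treatment)]
without_outlier = log2_ratio_of_means(trimmed, control)

print(f"log2 ratio, all replicates:        {with_outlier:.3f}")
print(f"log2 ratio, largest one dropped:   {without_outlier:.3f}")
```

With all four treatment replicates the naive ratio comes out at roughly 1.1, close to the reported logFC, and it falls to about 0 once the 7195 replicate is left out, so the apparent change rests almost entirely on that one sample.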

Hope that this helps.

Yes, it was useful, thank you for your answer. I am still confused about why this kind of program does not take outlier replicates into account.

You could try DESeq2, which does deal with outliers. I have not used edgeR.

cpad's suggestion (above) to remove outliers is valid only if the sample is a genuinely problematic sample whose values are not related to the biological condition being studied.


edgeR's sensitivity to outliers is a long-standing issue (https://support.bioconductor.org/p/45417/), and some people have moved to DESeq2 for that reason (https://support.bioconductor.org/p/89526/), whether justified or not. A few suggestions were to filter out outliers either programmatically (e.g. using the median) or manually. An addition to edgeR for handling outliers is discussed in this paper. I guess trying DESeq2 should shed some light on this issue.
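For the "programmatic (median)" route, one possible sketch in Python is below. It flags replicates within a group using the median and the median absolute deviation (MAD). The modified z-score with the 0.6745 constant and the 3.5 cutoff is a common convention that I am assuming here; it is not something edgeR or DESeq2 does, and the function name is hypothetical. With only four replicates per group, any such rule is crude.

```python
import statistics


def flag_outliers(counts, threshold=3.5):
    """Flag values whose modified z-score exceeds `threshold`.

    Assumptions (not from this thread): modified z-score
    0.6745 * |x - median| / MAD, with the conventional 3.5 cutoff.
    """
    med = statistics.median(counts)
    mad = statistics.median([abs(c - med) for c in counts])
    if mad == 0:
        # All deviations collapse to zero; nothing can be scored.
        return [False] * len(counts)
    return [0.6745 * abs(c - med) / mad > threshold for c in counts]


# Raw counts for FBgn0034710 from the question.
control = [820, 1618, 1728, 1007]
treatment = [7195, 1252, 1312, 1291]

control_flags = flag_outliers(control)
treatment_flags = flag_outliers(treatment)

for label, counts, flags in (
    ("control", control, control_flags),
    ("treatment", treatment, treatment_flags),
):
    for count, is_outlier in zip(counts, flags):
        print(label, count, "OUTLIER" if is_outlier else "ok")
```

On these counts only the 7195 treatment replicate is flagged. As noted earlier in the thread, a flag like this is only a prompt to inspect the sample; dropping it is justified only if the sample is genuinely problematic.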