EdgeR: Filtering Counts Relationship to Sigficance.
1
0
Entering edit mode
5.0 years ago
bio.erikson ▴ 20

When I filter my count data with the code in the user guide, the FDR for all my genes drops to 1.0. But, if I lower the CPM cutoff then my FDR creeps up and start to get significant DE genes. The manual states that the CPM is a user preference. And it would seem to be influenced by the library size and number of genes. How does one go about picking the correct cutoff without bias.

My experiment is in Xenopus Laevis, where I have three conditions and four biological replicates. Each sample has a mean count of 8809887.

Running with Filtering

dge=DGEList(edge_data, group = edge_meta$group, genes = column_to_rownames(edge_anno, 'ID'))
keep <- rowSums(cpm(dge)>1) >= 2
dge <- dge[keep, , keep.lib.sizes=FALSE]

design=model.matrix(~Rep+Ploidy, data=edge_meta)
design

dge=calcNormFactors(dge)
dge=estimateDisp(dge,design)
plotBCV(dge)
fit=glmQLFit(dge,design)
qlf=glmQLFTest(fit, coef=6)
topTags(qlf)

Coefficient:  Ploidy3 
               logFC   logCPM        F       PValue       FDR
gene41545  2.0965805 2.448327 35.90115 4.746371e-05 0.9999619
gene49995 -0.7932641 4.473388 26.52267 1.941914e-04 0.9999619
gene15477  2.8440399 4.021164 26.31673 2.010534e-04 0.9999619
gene4260   1.1824923 4.668519 22.27339 4.144636e-04 0.9999619
gene50916  1.1398852 2.681193 19.93906 6.556942e-04 0.9999619
gene16190  1.6979906 5.217549 19.88785 6.625808e-04 0.9999619
gene16461  1.4116720 3.029733 17.98760 9.888933e-04 0.9999619
gene30142  2.2028513 1.304547 17.88017 1.012326e-03 0.9999619
gene9327   1.6603994 3.380958 17.84182 1.020847e-03 0.9999619
gene41018  0.5355154 4.970458 16.86752 1.267972e-03 0.9999619

Running w/o filtering

dge=DGEList(edge_data, group = edge_meta$group, genes = column_to_rownames(edge_anno, 'ID'))
#keep <- rowSums(cpm(dge)>1) >= 2
#dge <- dge[keep, , keep.lib.sizes=FALSE]

design=model.matrix(~Rep+Ploidy, data=edge_meta)
design

dge=calcNormFactors(dge)
dge=estimateDisp(dge,design)
plotBCV(dge)
fit=glmQLFit(dge,design)
qlf=glmQLFTest(fit, coef=6)
topTags(qlf)

Coefficient:  Ploidy3 
              logFC   logCPM         F       PValue          FDR
gene15477 2.8412637 4.012968 147.49207 6.236814e-34 3.189070e-29
gene16190 1.6964975 5.212556 109.25227 1.443137e-25 3.689595e-21
gene42380 1.7616811 4.875927 101.21749 8.312275e-24 1.416772e-19
gene18119 1.8589354 4.609218 100.35729 1.283117e-23 1.640241e-19
gene14278 1.2991257 6.502732  99.87230 1.638995e-23 1.676135e-19
gene45604 1.0203924 7.505369  72.36337 1.797921e-17 1.532218e-13
gene4166  0.9549254 7.469934  61.14997 5.305508e-15 3.875522e-11
gene46483 0.8449512 9.177001  60.20210 8.586239e-15 5.488002e-11
gene18489 2.2123988 2.705238  56.55142 5.489566e-14 3.118867e-10
gene13964 1.3560655 4.548504  55.43617 9.679432e-14 4.949384e-10

Gene CPM in each library Boxplots of Gene CPM per library.

RNA-Seq edgeR • 2.8k views
ADD COMMENT
1
Entering edit mode

The edgeR authors answer questions promptly on Bioconductor and prefer that the same question not to be cross-posted to other forums at the same time, because it takes more of our time to monitor and because questions may get left hanging. I answered your question on Bioconductor about an hour after you asked it.

If you find you do need to cross-post, please explain that as part of your question, otherwise people who take the time to answer your question may find afterwards that they've wasted their time.

ADD REPLY
0
Entering edit mode
ADD COMMENT

Login before adding your answer.

Traffic: 2013 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6