Question

Why 'filterByExpr' does not eliminate low read counts from the expression data?

1

Entering edit mode

3.9 years ago

svp ▴ 680

I estimated DEGs from RNASeq using edgeR method. But the final DEG list has several genes with low counts and are having high Fold change

         logFC   logCPM   F      PValue     FDR       Ctrl1  Ctrl2  Str1    Str2
Gene1   7.673   -2.658  33.203  1.27E-05    0.000118    0     0   0.738954  0.657007
Gene5   7.605   -2.951  26.996  3.36E-05    0.000255    0     0   0.674697  0.657007
Gene35  7.499   -2.072  22.724  8.47E-05    0.0034      0     0   0.642569  0.594435
Gene67  7.462   -2.151  36.849  2.49E-06    3.32E       0     0   0.642569  0.563149

Initialy, I filtered out low count genes using

keep<-filterByExpr(y,group = group)
y<-y[keep,keep.lib.size=FALSE]

But I am not sure why genes normalized count value close to zero are retained and listed as DEGs. How can I solve this? I was thinking to use min.count parameter in filterByExpr. But, is that the right way? Suppose if I use, keep<-filterByExpr(y,group = group, min.count=500), the genes with greater than 1 normalized counts are retained.

To be more specific, where should I apply filterByExpr function, before normaliztion (ie on raw counts) or after normaliztion ?

Code:

group<-factor(paste(targets$Genotype,targets$Time,targets$Treatment,sep="."))
cbind(targets,Group=group)
y<-DGEList(counts = raw_counts, group = group)

#Filterout low count genes
keep<-filterByExpr(y,group = group)
y<-y[keep,keep.lib.size=FALSE]

#Data Normalization
y<-calcNormFactors(y)

#Design matrix
design<-model.matrix(~0+group)
colnames(design)<-levels(group)

#Dsipersion estimation
y<-estimateDisp(y,design)
fit<-glmQLFit(y,design)

edgeR RNA-Seq DEGs filterByExpr • 7.9k views

ADD COMMENT • link updated 3.9 years ago by Gordon Smyth ★ 7.0k • written 3.9 years ago by svp ▴ 680

1

Entering edit mode

3.9 years ago

Gordon Smyth ★ 7.0k

The filtering appears to have worked correctly. The genes you show have highy significant expression differences between groups, so they are not the sort of genes that one would want to remove from the analysis.

The purpose of filtering is to remove genes that have too few reads to establish any worthwhile significant differences.

ADD COMMENT • link 3.9 years ago by Gordon Smyth ★ 7.0k

0

Entering edit mode

Dear all,

As far as I know, CPM threshold 0.5 (corresponded to 10 counts) is recommended for a 20 million reads library, isn't it? Could you kindly clear me which CPM threshold will be used for filtering by filterByExpr function?

ADD REPLY • link 3.6 years ago by seta ★ 1.9k

0

Entering edit mode

See the help page help(filterByExpr).

ADD REPLY • link 3.6 years ago by Gordon Smyth ★ 7.0k

score 3 · Accepted Answer · 2020-05-18

Before normalization (or acutally it does not matter as filterByExp does not consider the normalization since it filters on counts).

What values do you have in the Ctrl1, Ctrl2, Str1 and Str2 coulmns? CPM? In that case the filtering probably works as gene1 have at least 0.657007 * Xe6 reads in two samples. If you have sequenced 40e6 reads that corresponds to at least ~26 reads in two samples.

That said - if you want a more stringent filtering you just have to apply it yourself.