Why 'filterByExpr' does not eliminate low read counts from the expression data?
2
1
Entering edit mode
3.9 years ago
svp ▴ 680

I estimated DEGs from RNASeq using edgeR method. But the final DEG list has several genes with low counts and are having high Fold change

         logFC   logCPM   F      PValue     FDR       Ctrl1  Ctrl2  Str1    Str2
Gene1   7.673   -2.658  33.203  1.27E-05    0.000118    0     0   0.738954  0.657007
Gene5   7.605   -2.951  26.996  3.36E-05    0.000255    0     0   0.674697  0.657007
Gene35  7.499   -2.072  22.724  8.47E-05    0.0034      0     0   0.642569  0.594435
Gene67  7.462   -2.151  36.849  2.49E-06    3.32E       0     0   0.642569  0.563149

Initialy, I filtered out low count genes using

keep<-filterByExpr(y,group = group)
y<-y[keep,keep.lib.size=FALSE]

But I am not sure why genes normalized count value close to zero are retained and listed as DEGs. How can I solve this? I was thinking to use min.count parameter in filterByExpr. But, is that the right way? Suppose if I use, keep<-filterByExpr(y,group = group, min.count=500), the genes with greater than 1 normalized counts are retained.

To be more specific, where should I apply filterByExpr function, before normaliztion (ie on raw counts) or after normaliztion ?

Code:

group<-factor(paste(targets$Genotype,targets$Time,targets$Treatment,sep="."))
cbind(targets,Group=group)
y<-DGEList(counts = raw_counts, group = group)

#Filterout low count genes
keep<-filterByExpr(y,group = group)
y<-y[keep,keep.lib.size=FALSE]

#Data Normalization
y<-calcNormFactors(y)

#Design matrix
design<-model.matrix(~0+group)
colnames(design)<-levels(group)

#Dsipersion estimation
y<-estimateDisp(y,design)
fit<-glmQLFit(y,design)
edgeR RNA-Seq DEGs filterByExpr • 7.9k views
ADD COMMENT
3
Entering edit mode
3.9 years ago

Before normalization (or acutally it does not matter as filterByExp does not consider the normalization since it filters on counts).

What values do you have in the Ctrl1, Ctrl2, Str1 and Str2 coulmns? CPM? In that case the filtering probably works as gene1 have at least 0.657007 * Xe6 reads in two samples. If you have sequenced 40e6 reads that corresponds to at least ~26 reads in two samples.

That said - if you want a more stringent filtering you just have to apply it yourself.

ADD COMMENT
1
Entering edit mode
3.9 years ago
Gordon Smyth ★ 7.0k

The filtering appears to have worked correctly. The genes you show have highy significant expression differences between groups, so they are not the sort of genes that one would want to remove from the analysis.

The purpose of filtering is to remove genes that have too few reads to establish any worthwhile significant differences.

ADD COMMENT
0
Entering edit mode

Dear all,

As far as I know, CPM threshold 0.5 (corresponded to 10 counts) is recommended for a 20 million reads library, isn't it? Could you kindly clear me which CPM threshold will be used for filtering by filterByExpr function?

ADD REPLY
0
Entering edit mode

See the help page help(filterByExpr).

ADD REPLY

Login before adding your answer.

Traffic: 1961 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6