Question: Why 'filterByExpr' does not eliminate low read counts from the expression data?
1
gravatar for svp
6 months ago by
svp240
Bangalore
svp240 wrote:

I estimated DEGs from RNASeq using edgeR method. But the final DEG list has several genes with low counts and are having high Fold change

         logFC   logCPM   F      PValue     FDR       Ctrl1  Ctrl2  Str1    Str2
Gene1   7.673   -2.658  33.203  1.27E-05    0.000118    0     0   0.738954  0.657007
Gene5   7.605   -2.951  26.996  3.36E-05    0.000255    0     0   0.674697  0.657007
Gene35  7.499   -2.072  22.724  8.47E-05    0.0034      0     0   0.642569  0.594435
Gene67  7.462   -2.151  36.849  2.49E-06    3.32E       0     0   0.642569  0.563149

Initialy, I filtered out low count genes using

keep<-filterByExpr(y,group = group)
y<-y[keep,keep.lib.size=FALSE]

But I am not sure why genes normalized count value close to zero are retained and listed as DEGs. How can I solve this? I was thinking to use min.count parameter in filterByExpr. But, is that the right way? Suppose if I use, keep<-filterByExpr(y,group = group, min.count=500), the genes with greater than 1 normalized counts are retained.

To be more specific, where should I apply filterByExpr function, before normaliztion (ie on raw counts) or after normaliztion ?

Code:

group<-factor(paste(targets$Genotype,targets$Time,targets$Treatment,sep="."))
cbind(targets,Group=group)
y<-DGEList(counts = raw_counts, group = group)

#Filterout low count genes
keep<-filterByExpr(y,group = group)
y<-y[keep,keep.lib.size=FALSE]

#Data Normalization
y<-calcNormFactors(y)

#Design matrix
design<-model.matrix(~0+group)
colnames(design)<-levels(group)

#Dsipersion estimation
y<-estimateDisp(y,design)
fit<-glmQLFit(y,design)
rna-seq edger degs filterbyexpr • 812 views
ADD COMMENTlink modified 6 months ago by Gordon Smyth2.0k • written 6 months ago by svp240
3
gravatar for kristoffer.vittingseerup
6 months ago by
European Union
kristoffer.vittingseerup3.4k wrote:

Before normalization (or acutally it does not matter as filterByExp does not consider the normalization since it filters on counts).

What values do you have in the Ctrl1, Ctrl2, Str1 and Str2 coulmns? CPM? In that case the filtering probably works as gene1 have at least 0.657007 * Xe6 reads in two samples. If you have sequenced 40e6 reads that corresponds to at least ~26 reads in two samples.

That said - if you want a more stringent filtering you just have to apply it yourself.

ADD COMMENTlink modified 6 months ago • written 6 months ago by kristoffer.vittingseerup3.4k
1
gravatar for Gordon Smyth
6 months ago by
Gordon Smyth2.0k
Australia
Gordon Smyth2.0k wrote:

The filtering appears to have worked correctly. The genes you show have highy significant expression differences between groups, so they are not the sort of genes that one would want to remove from the analysis.

The purpose of filtering is to remove genes that have too few reads to establish any worthwhile significant differences.

ADD COMMENTlink written 6 months ago by Gordon Smyth2.0k

Dear all,

As far as I know, CPM threshold 0.5 (corresponded to 10 counts) is recommended for a 20 million reads library, isn't it? Could you kindly clear me which CPM threshold will be used for filtering by filterByExpr function?

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by seta1.4k

See the help page help(filterByExpr).

ADD REPLYlink written 8 weeks ago by Gordon Smyth2.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 941 users visited in the last hour