Question: Why 'filterByExpr' does not eliminate low read counts from the expression data?
1
gravatar for svp
7 weeks ago by
svp100
Bangalore
svp100 wrote:

I estimated DEGs from RNASeq using edgeR method. But the final DEG list has several genes with low counts and are having high Fold change

         logFC   logCPM   F      PValue     FDR       Ctrl1  Ctrl2  Str1    Str2
Gene1   7.673   -2.658  33.203  1.27E-05    0.000118    0     0   0.738954  0.657007
Gene5   7.605   -2.951  26.996  3.36E-05    0.000255    0     0   0.674697  0.657007
Gene35  7.499   -2.072  22.724  8.47E-05    0.0034      0     0   0.642569  0.594435
Gene67  7.462   -2.151  36.849  2.49E-06    3.32E       0     0   0.642569  0.563149

Initialy, I filtered out low count genes using

keep<-filterByExpr(y,group = group)
y<-y[keep,keep.lib.size=FALSE]

But I am not sure why genes normalized count value close to zero are retained and listed as DEGs. How can I solve this? I was thinking to use min.count parameter in filterByExpr. But, is that the right way? Suppose if I use, keep<-filterByExpr(y,group = group, min.count=500), the genes with greater than 1 normalized counts are retained.

To be more specific, where should I apply filterByExpr function, before normaliztion (ie on raw counts) or after normaliztion ?

Code:

group<-factor(paste(targets$Genotype,targets$Time,targets$Treatment,sep="."))
cbind(targets,Group=group)
y<-DGEList(counts = raw_counts, group = group)

#Filterout low count genes
keep<-filterByExpr(y,group = group)
y<-y[keep,keep.lib.size=FALSE]

#Data Normalization
y<-calcNormFactors(y)

#Design matrix
design<-model.matrix(~0+group)
colnames(design)<-levels(group)

#Dsipersion estimation
y<-estimateDisp(y,design)
fit<-glmQLFit(y,design)
rna-seq edger degs filterbyexpr • 185 views
ADD COMMENTlink modified 7 weeks ago by Gordon Smyth1.7k • written 7 weeks ago by svp100
2
gravatar for kristoffer.vittingseerup
7 weeks ago by
European Union
kristoffer.vittingseerup3.3k wrote:

Before normalization (or acutally it does not matter as filterByExp does not consider the normalization since it filters on counts).

What values do you have in the Ctrl1, Ctrl2, Str1 and Str2 coulmns? CPM? In that case the filtering probably works as gene1 have at least 0.657007 * Xe6 reads in two samples. If you have sequenced 40e6 reads that corresponds to at least ~26 reads in two samples.

That said - if you want a more stringent filtering you just have to apply it yourself.

ADD COMMENTlink modified 7 weeks ago • written 7 weeks ago by kristoffer.vittingseerup3.3k
1
gravatar for Gordon Smyth
7 weeks ago by
Gordon Smyth1.7k
Australia
Gordon Smyth1.7k wrote:

The filtering appears to have worked correctly. The genes you show have highy significant expression differences between groups, so they are not the sort of genes that one would want to remove from the analysis.

The purpose of filtering is to remove genes that have too few reads to establish any worthwhile significant differences.

ADD COMMENTlink written 7 weeks ago by Gordon Smyth1.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1664 users visited in the last hour