CPM threshold for RNASeq count data

Hi, I have a DGEList object created with edgeR. Its dimensions are 57820 genes × 1013 samples. I have to choose a filtering strategy and I am not sure my choice is correct. The norm.factors in x$samples are all 1, and summary(x$samples$lib.size) gives:

   Min.   1st Qu.   Median   Mean    3rd Qu.     Max. 
 6557050 31326322 36019156 35935285 40766618 79411964

I tried keep.exprs <- rowSums(cpm(x) > 0.4) >= 5 and keep.exprs <- filterByExpr(x). When I run x_filtered <- x[keep.exprs, ], the first filter leaves 52082 × 1013, while the second leaves 24045 × 1013.
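
For completeness, the two approaches side by side (roughly what I ran; the keep.lib.sizes = FALSE option is something I have seen recommended when subsetting after filtering):

    library(edgeR)

    ## Manual threshold: keep genes with CPM > 0.4 in at least 5 samples
    keep_manual <- rowSums(cpm(x) > 0.4) >= 5

    ## edgeR's design-aware filtering with its default thresholds
    keep_fbe <- filterByExpr(x)

    ## Subset the DGEList (recomputing the library sizes after filtering)
    x_manual <- x[keep_manual, , keep.lib.sizes = FALSE]
    x_fbe    <- x[keep_fbe,    , keep.lib.sizes = FALSE]

    ## Number of genes each filter keeps
    c(manual = sum(keep_manual), filterByExpr = sum(keep_fbe))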

Which is the best filtering and why?

filterByExpr is preferred as it filters based on the group information rather than on an arbitrary >= some integer cutoff. You have 1013 samples? Be sure that your DGEList has proper group information for the filter to be meaningful.
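
For example, something along these lines (a sketch; the condition vector is a placeholder for whatever sample annotation you actually have):

    ## Placeholder vector of per-sample conditions, in the same order as the columns of x
    condition <- rep(c("Healthy", "Disease"), length.out = ncol(x))

    ## Store it as the group factor of the DGEList and check the group sizes
    x$samples$group <- factor(condition)
    table(x$samples$group)

    ## filterByExpr() on a DGEList uses x$samples$group by default,
    ## but the group can also be passed explicitly
    keep.exprs <- filterByExpr(x, group = x$samples$group)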

Looks like there are some arbitrary integers set in the filterByExpr call as well: min.count = 10 and min.total.count = 15.

I was referring to the filtering using the group information rather than setting an arbitrary value (here, the 5 samples that need to have a CPM above 0.4). These five samples could be randomly distributed over multiple groups, yet each group could still lack the power for that gene to be called significant; that is why I think filterByExpr is preferred. After all, the aim of the filtering is to remove genes that inflate the multiple-testing burden, so a strategy that respects the size of the groups makes sense to me. The thresholds you mention are probably debatable, I agree.

From the filterByExpr help page:

This function implements the filtering strategy that was intuitively described by Chen et al (2016). Roughly speaking, the strategy keeps genes that have at least min.count reads in a worthwhile number of samples. More precisely, the filtering keeps genes that have count-per-million (CPM) above k in n samples, where k is determined by min.count and by the sample library sizes and n is determined by the design matrix.

n is essentially the smallest group sample size or, more generally, the minimum inverse leverage of any fitted value. If all the group sizes are larger than large.n, then this is relaxed slightly, but with n always greater than min.prop of the smallest group size (70% by default).

In addition, each kept gene is required to have at least min.total.count reads across all the samples.
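
To get a feel for what those defaults mean here, the CPM cutoff implied by min.count scales with the median library size; a rough back-of-the-envelope sketch (not the exact internal code) using the library sizes from the question:

    ## Approximate CPM cutoff implied by min.count = 10 at the median library size
    median_lib <- 36019156                      # median lib.size reported above
    min_count  <- 10                            # filterByExpr default
    min_count / median_lib * 1e6                # ~0.28 CPM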

Thanks! So you suggest using keep.exprs <- filterByExpr(x, group = x$samples$group), where group in my case is the condition (Healthy or not) of my subjects?
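
i.e. something along these lines (assuming the condition labels are already stored in x$samples$group):

    ## Filter on the group (condition) information, then subset and normalise
    keep.exprs <- filterByExpr(x, group = x$samples$group)
    x_filtered <- x[keep.exprs, , keep.lib.sizes = FALSE]
    x_filtered <- calcNormFactors(x_filtered)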
