Question

DESeq vs edgeR: Filtering Read Counts


10.2 years ago

GouthamAtla 12k

The tutorials and Bioconductor manuals for edgeR suggest removing genes that do not have at least 1 read per million in at least 'n' samples (n = size of the smallest group of samples). But the available DESeq tutorials don't include this step. Should we remove those genes or keep them in the DESeq data analysis pipeline? This step drastically reduces the number of genes.

For example, this review paper suggests filtering genes in edgeR but says nothing about doing so in DESeq.

http://www.bioconductor.org/help/course-materials/2013/CSAMA2013/tuesday/afternoon/DESeq_protocol.pdf

edger deseq rnaseq differential-expression rna-seq tophat2 • 13k views

updated 2.5 years ago by Ram 43k • written 10.2 years ago by GouthamAtla 12k


Hello Devon,

I have a few points of confusion that I hope your experience can help resolve.

I want to compare the DE results from DESeq2 and edgeR. What confuses me is that edgeR advises filtering low-count tags by CPM before the DE analysis, while DESeq2 does not. Instead, DESeq2 applies an independent filtering step (via the results function), which does not seem to be available in edgeR?

So my question is:

Is the CPM filtering step necessary in edgeR, or is it just optional? Or should it be applied in both packages?

Case 1 (default functions)

Comparing the DE genes from DESeq2 (no CPM filtering, just the default DESeq and results functions) with those from edgeR (with CPM filtering), I got almost similar results, which is what I was expecting.

Case 2 (CPM filtering applied in both)

When I applied the same edgeR-style CPM filtering in DESeq2 and then ran DESeq and results (with and without independent filtering, just as a check), the results were not similar at all: the DE gene lists are very far apart (using padj 0.05 as the cutoff).

So please clear up my doubt and suggest what I should do.

Thanks

updated 2.5 years ago by Ram 43k • written 9.9 years ago by bioinforupesh2009.au ▴ 140


edgeR suggests manual filtering simply because they never integrated the genefilter package into the edgeR package. Some of the authors of the genefilter package (this is what's used to perform independent filtering) also wrote DESeq2, so that's why there's really nice integration and automatic independent filtering.

Now that doesn't mean that you can't do the exact same independent filtering in edgeR. I believe that in the DESeq (not DESeq2) vignette there's a section on how to perform independent filtering yourself, since this isn't done automatically in that package. You can apply the same principles to edgeR too and arrive at more equivalent results.

In either case, compare only the genes that have an adjusted p-value in both (genes removed by filtering don't get one). Remember that DESeq2 will perform 2 types of filtering. Firstly, it performs independent filtering for power, which is what you're asking about. Secondly, it also filters genes where one of the samples has excessive leverage, since the fit is then unreliable. It's particularly interesting to see how edgeR treats these, since any significant findings in such genes are more likely to be false positives.

updated 2.5 years ago by Ram 43k • written 9.9 years ago by Devon Ryan 104k


Do you mean that I should apply independent filtering in edgeR instead of the CPM filtering?

I guess it works on the p-values, so it is not the same as removing lowly expressed genes by CPM in edgeR.

As mentioned in the manual:

The results function of the DESeq2 package performs independent filtering by default using the mean of normalized counts as a filter statistic. A threshold on the filter statistic is found which optimizes the number of adjusted p values lower than a significance level alpha (we use the standard variable name for significance level, though it is unrelated to the dispersion parameter).
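To make the quoted description concrete, here is a minimal sketch of the idea in plain Python. It is not DESeq2's actual implementation (which works on quantiles of the filter statistic and is written in R); the function names, the toy p-values and the mean counts are all made up for illustration. It tries each candidate threshold on the filter statistic, re-runs the Benjamini-Hochberg adjustment on the genes that pass, and keeps the threshold that gives the most adjusted p-values below alpha.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_rejections(pvals, filter_stat, threshold, alpha=0.05):
    """Number of genes with adjusted p < alpha among genes passing the filter."""
    kept = [p for p, s in zip(pvals, filter_stat) if s >= threshold]
    return sum(q < alpha for q in bh_adjust(kept))


def best_threshold(pvals, filter_stat, alpha=0.05):
    """Threshold giving the most rejections (ties go to the lowest threshold)."""
    candidates = sorted(set(filter_stat))
    return max(
        candidates,
        key=lambda t: (count_rejections(pvals, filter_stat, t, alpha), -t),
    )


# Hypothetical toy data: mean normalized count and raw p-value per gene.
mean_counts = [0.5, 0.8, 1.2, 2.0, 50, 80, 120, 300]
pvals = [0.9, 0.6, 0.7, 0.8, 0.012, 0.022, 0.035, 0.5]

print(count_rejections(pvals, mean_counts, threshold=0.5))  # no filtering → 0
print(best_threshold(pvals, mean_counts))                   # → 50
print(count_rejections(pvals, mean_counts, threshold=50))   # → 3
```

In this toy example the low-count genes carry no signal, so dropping them shrinks the multiple-testing burden and three genes become significant at 0.05. The filter statistic (mean count) is chosen without looking at the p-values, which is what makes the filtering "independent".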

updated 2.5 years ago by Ram 43k • written 9.9 years ago by bioinforupesh2009.au ▴ 140


Hello Devon

I am using edgeR for my RNA-Seq analysis. To filter my low counts I have been using rowSums(cpm(counts)>1) >=2. Could you explain why 1 count per million is the benchmark to filter lowly expressed genes? If possible, could you point to any literature discussing the filtering in depth?

Also, is it necessary to use genefilter for edgeR too? And when you mention adjusting the threshold per-experiment, on what factors should the adjustment be based?

Thanks

8.2 years ago by ygowtha ▴ 20


1 cpm is an arbitrary round number. You should tailor this for each dataset (or better yet, use the genefilter package, which has an associated publication describing it).

You never have to do the filtering, but doing so tends to yield higher power.
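For reference, the filter being discussed, rowSums(cpm(counts)>1) >=2, can be spelled out as below. This is a plain-Python sketch of the arithmetic only, not edgeR code: it uses raw column sums as library sizes (no normalization factors), and the function names and toy counts are invented for illustration. The cutoff and the minimum number of samples are ordinary parameters, which is the point about tailoring them per dataset.

```python
def cpm(counts):
    """Counts per million for a genes x samples table (a list of rows)."""
    lib_sizes = [sum(col) for col in zip(*counts)]
    return [[1e6 * c / n for c, n in zip(row, lib_sizes)] for row in counts]


def keep_expressed(counts, min_cpm=1.0, min_samples=2):
    """True for each gene with CPM above min_cpm in at least min_samples samples."""
    return [
        sum(value > min_cpm for value in row) >= min_samples
        for row in cpm(counts)
    ]


# Hypothetical toy data: 4 genes x 4 samples, each library totals 1,000,000 reads.
counts = [
    [500_000, 480_000, 510_000, 505_000],  # highly expressed
    [499_990, 519_995, 489_990, 494_999],  # highly expressed
    [9, 5, 0, 0],                          # above 1 CPM in two samples
    [1, 0, 10, 1],                         # above 1 CPM in one sample only
]

print(keep_expressed(counts))             # → [True, True, True, False]
print(keep_expressed(counts, min_cpm=5))  # → [True, True, False, False]
```

Raising the cutoff from 1 to 5 CPM drops one more gene here, which shows how sensitive the retained set is to an essentially arbitrary number.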

8.2 years ago by Devon Ryan 104k

score 7 · Answer 1 · 2014-02-19

Not only should you perform independent filtering, but the more recent versions of DESeq2 will do that for you automatically. For the underlying reasons, have a read through this paper, from Wolfgang Huber's group, which also produced DESeq (among other tools). See also the genefilter package in Bioconductor, which can be used in DESeq, edgeR, limma, and anything similar.

On a side note, the exact filtering done in the edgeR vignette is really just an example. I would recommend that you adjust the threshold per-experiment (the genefilter package is useful for this).