Question: DESeq vs edgeR: Filtering Read Counts
4.9 years ago, geek_y wrote:

The edgeR tutorials and Bioconductor manuals suggest removing genes that do not have at least 1 read per million in at least 'n' samples (n = size of the smallest group of samples). But the available DESeq tutorials don't include this step. Should we remove those genes or keep them in the DESeq analysis pipeline? This step drastically reduces the number of genes.

For example, this review paper suggests filtering genes in edgeR but says nothing about filtering in DESeq.

http://www.bioconductor.org/help/course-materials/2013/CSAMA2013/tuesday/afternoon/DESeq_protocol.pdf


Hello Devon,
I have a few points of confusion that I hope your experience can resolve.

I want to compare the DE results from DESeq2 and edgeR. In edgeR it is advisable to filter low-count tags by CPM before the DE analysis, while in DESeq2 it is not. Meanwhile, DESeq2 applies independent filtering (via the results function), a step that is not available in edgeR?

So my question is:
1) Is the CPM filtering step necessary in edgeR, or is it optional? Should it be applied in both tools?

Case 1 (default functions)
Comparing the DE genes from DESeq2 (no CPM filtering, just the default DESeq and results functions) with those from edgeR (with CPM filtering), I got almost identical results, which is what I expected.
Case 2 (CPM filtering applied in both)
After applying the same edgeR CPM filter before DESeq2 and then running DESeq and results (with and without independent filtering, just to check), the results are not similar at all. The DE calls differ widely between the two tools (using a padj of 0.05 as the cutoff).

Could you clarify this and suggest what I should do?

Thanks

 

written 4.6 years ago by bioinforupesh2009.au100

edgeR suggests manual filtering simply because they never integrated the genefilter package into the edgeR package. Some of the authors of the genefilter package (this is what's used to perform independent filtering) also wrote DESeq2, so that's why there's really nice integration and automatic independent filtering.

Now that doesn't mean that you can't do the exact same independent filtering in edgeR. I believe the DESeq (not DESeq2) vignette has a section on how to perform independent filtering yourself, since this isn't done automatically in that package. You can apply the same principles to edgeR and arrive at more equivalent results.
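To make the principle concrete, here is a minimal sketch of independent filtering in plain Python rather than R, so it runs without Bioconductor. It is a simplification and not the code that genefilter or DESeq2 actually use: it tries each observed value of a filter statistic (for example the mean normalized count) as a cutoff, applies Benjamini-Hochberg adjustment to the surviving p-values, and keeps the cutoff that gives the most rejections. The function names and the toy numbers are made up for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def independent_filter(filter_stat, pvalues, alpha=0.05):
    """Return (cutoff, n_rejections) for the cutoff with the most rejections.

    Genes with filter_stat >= cutoff are kept; ties go to the lowest cutoff.
    """
    best_cutoff, best_hits = None, -1
    for cutoff in sorted(set(filter_stat)):
        kept = [p for s, p in zip(filter_stat, pvalues) if s >= cutoff]
        hits = sum(q < alpha for q in bh_adjust(kept))
        if hits > best_hits:
            best_cutoff, best_hits = cutoff, hits
    return best_cutoff, best_hits


# Toy data (hypothetical): mean normalized count and raw p-value per gene.
mean_count = [0.5, 1.0, 2.0, 50.0, 80.0, 120.0, 200.0, 300.0]
pvals = [0.9, 0.7, 0.6, 0.04, 0.02, 0.012, 0.004, 0.001]

print(sum(q < 0.05 for q in bh_adjust(pvals)))  # 4 rejections with no filtering
print(independent_filter(mean_count, pvals))    # (2.0, 5)
```

In this toy case, dropping the two lowest-count genes shrinks the multiple-testing burden enough to rescue one more gene at padj < 0.05. The filter statistic must not depend on the sample grouping for this to be valid, which is why an overall mean is the usual choice.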

In either case, just compare genes with an adjusted p-value in both. Remember that DESeq2 will perform 2 types of filtering. Firstly, it performs independent filtering for power, which is what you're asking about. Secondly, it also filters genes where one of the samples has excessive leverage, since the fit is then unreliable. It's particularly interesting to see how edgeR treats these, since any significant findings in such genes are more likely to be false-positives.
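The comparison of genes that have an adjusted p-value in both tools can be sketched as follows, in plain Python. It assumes each tool's adjusted p-values have been exported as a gene-to-value mapping, with None standing in for NA (DESeq2 reports NA for padj when a gene has been filtered out). The function name and the toy values are hypothetical.

```python
def compare_calls(padj_a, padj_b, alpha=0.05):
    """Compare significant calls, restricted to genes with a padj in both.

    padj_a and padj_b map gene id -> adjusted p-value, or None for NA.
    """
    shared = [
        g for g in padj_a
        if padj_a[g] is not None and padj_b.get(g) is not None
    ]
    sig_a = {g for g in shared if padj_a[g] < alpha}
    sig_b = {g for g in shared if padj_b[g] < alpha}
    return {
        "tested_in_both": len(shared),
        "both": sorted(sig_a & sig_b),
        "only_a": sorted(sig_a - sig_b),
        "only_b": sorted(sig_b - sig_a),
    }


# Toy values (hypothetical); g3 was filtered by the first tool.
deseq2_padj = {"g1": 0.001, "g2": 0.03, "g3": None, "g4": 0.2}
edger_padj = {"g1": 0.004, "g2": 0.08, "g3": 0.01, "g4": 0.04}

print(compare_calls(deseq2_padj, edger_padj))
```

Here g3 is left out of the comparison entirely rather than counted as a disagreement, because only one tool tested it.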

written 4.6 years ago by Devon Ryan

Do you mean I should apply independent filtering in edgeR without considering the CPM filter?

I guess it works on the p-values, so it is not the same as the CPM filter that removes lowly expressed genes in edgeR.

As mentioned in the manual: 'The results function of the DESeq2 package performs independent filtering by default using the mean of normalized counts as a filter statistic. A threshold on the filter statistic is found which optimizes the number of adjusted p values lower than a significance level alpha (we use the standard variable name for significance level, though it is unrelated to the dispersion parameter α).'

written 4.6 years ago by bioinforupesh2009.au100

Hello Devon

I am using edgeR for my RNA-Seq analysis. To filter my low counts I have been using rowSums(cpm(counts)>1) >=2. Could you explain why 1 count per million is the benchmark to filter lowly expressed genes? If possible, could you point to any literature discussing the filtering in depth?

Also, is it necessary to use genefilter for edgeR too? And when you mention adjusting the threshold per-experiment, on what factors should the adjustment be based?

Thanks

written 2.9 years ago by ygowtha20

1 cpm is an arbitrary round number. You should tailor this for each dataset (or better yet, use the genefilter package, which has an associated publication describing it).

You never have to do the filtering, but doing so tends to yield higher power.
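To illustrate why a fixed 1 CPM cutoff is arbitrary, here is a small plain-Python sketch that mirrors the logic of rowSums(cpm(counts) > 1) >= 2. It is not edgeR's implementation: it assumes the library size is the plain column total (no normalization factors) and that no sample has a zero total. The function names and the toy count matrix are hypothetical.

```python
def cpm(counts):
    """Counts per million; counts is a list of gene rows, one value per sample."""
    lib_sizes = [sum(col) for col in zip(*counts)]
    return [[c * 1e6 / n for c, n in zip(row, lib_sizes)] for row in counts]


def keep_by_cpm(counts, min_cpm=1.0, min_samples=2):
    """Keep a gene if it exceeds min_cpm in at least min_samples samples."""
    return [
        sum(v > min_cpm for v in row) >= min_samples
        for row in cpm(counts)
    ]


def raw_count_at_cpm(library_size, cpm_value=1.0):
    """Raw read count that corresponds to a CPM value at a given depth."""
    return cpm_value * library_size / 1e6


# Toy matrix (hypothetical): 4 genes x 3 samples with 1M, 2M and 5M reads.
counts = [
    [0, 1, 0],
    [3, 3, 3],
    [30, 10, 50],
    [999967, 1999986, 4999947],
]

print(keep_by_cpm(counts))           # [False, True, True, True]
print(raw_count_at_cpm(5_000_000))   # 5.0
print(raw_count_at_cpm(20_000_000))  # 20.0
```

The same 1 CPM cutoff means 5 reads in a 5 million read library but 20 reads in a 20 million read library, and the second gene passes with only 3 reads per sample because two of the libraries are shallow. That dependence on depth and on group size is one reason to tune the threshold per dataset.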

written 2.9 years ago by Devon Ryan

4.9 years ago, Devon Ryan (Freiburg, Germany) wrote:

Not only should you perform independent filtering, but the more recent versions of DESeq2 will do that for you automatically. For the underlying reasons, have a read through this paper, from Wolfgang Huber's group, which also produced DESeq (among other tools). See also the genefilter package in Bioconductor, which can be used in DESeq, edgeR, limma, and anything similar.

On a side note, the exact filtering done in the edgeR vignette is really just an example. I would recommend that you adjust the threshold per-experiment (the genefilter package is useful for this).
