Filter out genes with zero counts in differential gene analysis
2.2 years ago
tommy ▴ 30


I have a question about filtering in gene expression analysis. My dataset consists of 3 treatments with 8 samples each (24 samples in total). For each sample, counts were recorded for 10000 genes. Since I want to work with log2 counts in later analysis, I intend to remove genes with zero counts. My intended steps:

  1. Remove all genes that have at least one zero count across the 24 samples.

  2. Use filterByExpr to filter out genes with low counts. This step removes low-count genes based on CPM.

The reasons are:

  1. Filter out all genes with zero counts: if zero counts are left in, they become -Inf after log2 transformation, which is really bad for later analysis. Log2 differences are interpreted as biologically relevant changes, and log2(counts + 1) does not seem sensible to me.
  2. Filter genes by considering all 24 samples together, since gene expression counts will be compared between treatments in later analysis. If I filtered within each group of 8 samples separately, some genes might remain in treatment 1 but not in treatment 2, which would make the comparison impossible.

I am new to the field. Please let me know whether I am on the right track. Thank you.

edgeR • 1.2k views
2.2 years ago
ATpoint 81k

That is a really bad strategy. Eliminating genes with any zeros will remove genuine biological signal, as a gene can be off in one group but active in another. Let expert software such as edgeR handle the differential analysis and the CPM calculation. Simply adding a pseudocount of 1 avoids taking the log of zero. filterByExpr alone is sufficient for filtering. Be sure to follow the manual if you're inexperienced, and only use custom approaches if you have the required background. Everything you need for a standard analysis is in the manual; please read and apply it. Counts of zero are normal and will be handled properly by edgeR (or any other established DE software). I hope you did not plan to run any custom statistics on these log counts anyway, as that is most likely going to be suboptimal. Just use edgeR as instructed in the manual.
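To illustrate the point about on/off genes, here is a minimal sketch comparing the two filters on simulated data. The count matrix, the gene and sample names, and the variables `keep_nozero` and `keep_expr` are made up for illustration; only `DGEList` and `filterByExpr` come from edgeR.

```r
library(edgeR)

set.seed(1)
# Simulated counts (an assumption, not real data): 1000 genes x 24 samples,
# 3 treatments with 8 samples each, mirroring the design in the question.
group <- factor(rep(c("T1", "T2", "T3"), each = 8))
counts <- matrix(rnbinom(1000 * 24, mu = 50, size = 5), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:24)))

# Turn the first gene into an on/off gene: silent in T1, expressed in T2 and T3.
counts[1, group == "T1"] <- 0
counts[1, group != "T1"] <- 200

y <- DGEList(counts = counts, group = group)

# Strategy from the question: drop every gene with a zero in any sample.
keep_nozero <- rowSums(counts == 0) == 0

# Strategy from the manual: let filterByExpr decide, using the group structure.
keep_expr <- filterByExpr(y, group = group)

# Compare the fate of the on/off gene under both filters.
keep_nozero[1]
keep_expr[1]

# Apply the filterByExpr result and recompute library sizes.
y <- y[keep_expr, , keep.lib.sizes = FALSE]
```

In this toy setup the any-zero filter discards the on/off gene because of its zeros in T1, whereas filterByExpr is expected to keep it, since it is well expressed in at least as many samples as the smallest group.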

For downstream analysis, if you need log counts, use edgeR::cpm(y, log=TRUE); again, see the manual.
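As a rough sketch of why this sidesteps the -Inf problem, the toy example below contrasts a plain log2 of raw counts with `cpm(y, log=TRUE)`. The small count matrix and the names `raw_log` and `logcpm` are assumptions for illustration; in a real analysis you would filter and normalize first, as described in the manual.

```r
library(edgeR)

# Toy counts (assumed for illustration): 3 genes x 4 samples, including zeros.
counts <- matrix(c(  0,   0, 120, 150,
                    30,  25,  40,  35,
                   500, 450,   0,  10),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 paste0("s", 1:4)))
y <- DGEList(counts = counts)

# Plain log2 of raw counts gives -Inf wherever a count is zero.
raw_log <- log2(counts)

# cpm(log = TRUE) adds a small prior count before taking logs,
# so zero counts map to finite log-CPM values.
logcpm <- cpm(y, log = TRUE)

any(is.infinite(raw_log))
all(is.finite(logcpm))
```

The size of the offset can be adjusted through the `prior.count` argument of `cpm`, so there is no need to drop genes just to keep the log transform finite.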


Thanks for your help. I'll stick to the manual.

