I have samples under two different conditions (3 biological replicates each) that I'm about to analyse for differential gene expression (DGE). I've obtained read counts over all exons of each gene for all replicates. I've always had trouble with filtering out the set of genes which are "not expressed (enough)" before passing the counts to DESeq or edgeR. For example, edgeR users typically apply the function cpm (counts per million) to keep only those genes whose cpm > threshold (usually set to 1 or 2).
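To make the cpm filter concrete, here is a minimal plain-Python sketch of the arithmetic as I understand it. It is not edgeR's code (edgeR is an R package); the function names, the toy counts, and the "at least 3 samples" rule are my own assumptions for illustration.

```python
def cpm(counts):
    """Counts per million for one sample: count * 1e6 / library size."""
    library_size = sum(counts.values())
    return {gene: c * 1e6 / library_size for gene, c in counts.items()}


def cpm_filter(samples, threshold=1.0, min_samples=3):
    """Keep genes whose CPM exceeds `threshold` in at least `min_samples` samples.

    `min_samples` is an assumed parameter, not something fixed by the method.
    """
    per_sample = [cpm(s) for s in samples]
    return [gene for gene in samples[0]
            if sum(values[gene] > threshold for values in per_sample) >= min_samples]


# Hypothetical toy data: 6 replicates with identical counts.
toy = [{"geneA": 1_999_900, "geneB": 99, "geneC": 1} for _ in range(6)]
print(cpm_filter(toy))  # geneC (0.5 CPM) is dropped
```

Note that gene length appears nowhere in this calculation.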
However, I am not convinced by this method, as it doesn't take gene length into account when quantifying expression. That is, when filtering by cpm, we are comparing different genes within the same sample, as opposed to the same gene across different samples as in a typical DGE analysis. The RNA-seq data I work with are ribosomal RNA-depleted libraries, meaning they contain ncRNAs, snRNAs, etc. in addition to mRNAs. If gene expression is measured as just read counts per million, then filtering by cpm will be biased towards genes that are simply longer. As an example, take an ncRNA with 500 reads and a transcript length of 400 base pairs, and an mRNA with 1000 reads and a transcript length of 3000 base pairs: the ncRNA clearly has more coverage per base (1.25 vs about 0.33 reads per base). However, this information would be lost if one were to consider just 500 vs 1000 reads (which is what cpm does).
So, what I did was to compute the transcript length for each gene and calculate the RPKM for each replicate. Then, I retained only those genes whose RPKM was >= 1 in at least 3 (out of 6) replicates. The RPKM is computed just for filtering genes based on "expression" values, not for the subsequent DGE analysis with DESeq or edgeR.
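A rough sketch of this filter in plain Python, using the usual RPKM definition (reads * 1e9 / (library size * transcript length in bp)). The function names, the library size of 10 million reads, and the "other" gene are hypothetical values chosen only so that the ncRNA/mRNA example from above can be worked through.

```python
def rpkm(counts, lengths):
    """RPKM for one sample: reads * 1e9 / (library size * transcript length in bp)."""
    library_size = sum(counts.values())
    return {gene: c * 1e9 / (library_size * lengths[gene])
            for gene, c in counts.items()}


def rpkm_filter(samples, lengths, threshold=1.0, min_replicates=3):
    """Keep genes with RPKM >= `threshold` in at least `min_replicates` replicates."""
    per_sample = [rpkm(s, lengths) for s in samples]
    return [gene for gene in samples[0]
            if sum(values[gene] >= threshold for values in per_sample) >= min_replicates]


# Hypothetical sample: "other" only pads the library to 10 million reads.
counts = {"ncRNA": 500, "mRNA": 1000, "other": 9_998_500}
lengths = {"ncRNA": 400, "mRNA": 3000, "other": 2000}

values = rpkm(counts, lengths)
print(round(values["ncRNA"], 1), round(values["mRNA"], 1))  # 125.0 33.3
```

With these assumed numbers the ncRNA ranks above the mRNA by RPKM (125 vs about 33), whereas by cpm it would rank below (50 vs 100), which is exactly the length effect I'm worried about.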
I insist on doing this because the NB (negative binomial) model tends to give a lot of "significant" results for genes with low read counts, and I want to avoid this as much as possible.
Basically, I'd like to hear the Biostar community's thoughts on this. Is this normal practice? I find it a better approach than, say, cpm-based filtering. Are there better approaches? Are there disadvantages to this mode of filtering that I haven't been able to come up with?