Question

Filtering Genes For Differential Gene Expression Analysis

5

Entering edit mode

12.1 years ago

Arun 2.4k

Hello,

I've samples under two different conditions (3 biological replicates each) that I'm about to analyse for differential gene expression. I've obtained read counts over all exons for each gene for all replicates. Now, I've always had trouble with filtering a set of genes which are "not expressed (enough)" before passing the counts to DESeq or edgeR. For example, edgeR uses the function cpm (counts per million) to discard those genes whose cpm > threshold (usually set to 1 or 2).

However, I am not convinced by this method as it doesn't take gene length into account to perform gene expression quantification. That is, for cpm, we are comparing different genes within the same sample as opposed to same gene against different samples as in a typical DGE analysis. The RNA-seq data I work with are ribosomal RNA depleted libraries, meaning they contain ncRNAs, snRNAs etc... in addition to mRNAs. If gene expression is measured as a factor of just read counts per million, then filtering by cpm will be biased to those genes that are just longer. As an example, between a ncRNA with 500 reads and length of 400 base pairs transcript length and an mRNA 1000 reads and 3000 base pairs transcript length, it's clear that ncRNA has more coverage per base. However, this information would be lost if one were to consider just 500 vs 1000 reads (which is what cpm does).

So, what I did was to compute transcript length for each gene and calculate the RPKM for each replicate. Then, I retained those genes only when RPKM value for each gene over all replicates were >= 1 in at least any 3 (out of 6) replicates. The RPKM is computed just for filtering genes based on "expression" values, not for subsequent DGE analysis with DESeq or edgeR.

I insist on doing this because the NB model tends to give a lot of "significant" results with "low read counts" and I want to avoid this as much as possible.

Basically, I'd like to hear biostar community's thoughts on this. Is this a normal practice? I find it's a better practice than say cpm based filtering? Are there better approaches? Are there disadvantages to this mode of filtering (which I can't seem to come up with)?

Thank you.

deseq edger rpkm • 15k views

ADD COMMENT • link updated 12.1 years ago by Damian Kao 16k • written 12.1 years ago by Arun 2.4k

score 3 · Answer 1 · 2013-06-04

The problem with filtering reads based on raw or normalized tag count is the underlying assumption you are making about these reads. Are you discarding lowly covered transcripts due to technical reason or are you discarding lowly expressed transcripts due to biological condition. If it is the latter case, you probably want to keep it.

One thing you can do before you filter is to generate a rarification graph where you plot increasing mapped subsets of your reads vs number of transcripts with at least X number of tags. If you reach saturation pretty early, then I think the latter case of lowly expressed transcripts is more likely and I would filter based on raw tag count. If you haven't reached saturation, then it is probably due to low coverage and I would filter based on normalized tag count (RPKM, DEseq, TMM, whatever normalization you want to use).