Question: Filtering out low expressed genes in RNA-Seq data
17 months ago by
lessismore570 wrote:


What is the best way to filter out lowly expressed genes (by TPM) from an RNA-seq dataset? I've seen some empirically based methods that did not totally convince me.

What's your opinion? Thanks in advance.

rna-seq tpm preprocessing • 3.3k views
modified 17 months ago by i.sudbery3.8k • written 17 months ago by lessismore570

I have tried both TPM and RPKM. There was a bias with respect to gene size. I wonder if more filters are required.

written 17 months ago by Satyajeet Khare1.3k

Can you be more specific?

written 17 months ago by lessismore570

A threshold of 1 FPKM is a commonly used filter.

written 17 months ago by Pappu1.9k
17 months ago by
Sheffield, UK
i.sudbery3.8k wrote:

It depends on what your downstream analysis is. If your aim is to filter lowly expressed genes to increase power in a differential expression analysis, I recommend reading:

Data-driven hypothesis weighting increases detection power in genome-scale multiple testing

If you want to divide genes into expressed and non-expressed for a biological reason, then there really isn't a good way to do it. A rule of thumb might be:

There are about 200,000 transcript molecules in a cell at any one time (very approximate, an order-of-magnitude estimate). Thus a TPM of 5 represents about 1 transcript per cell on average.
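
The arithmetic behind that rule of thumb can be written out as a minimal sketch. The 200,000 figure is the order-of-magnitude estimate quoted above, not a measured value, and the function names are only for illustration.

```python
# Rough conversion between TPM and average transcript copies per cell.
# ASSUMPTION: ~200,000 transcript molecules per cell (order-of-magnitude
# estimate from the rule of thumb; real cells vary a lot).
TRANSCRIPTS_PER_CELL = 200_000


def tpm_to_copies_per_cell(tpm, transcripts_per_cell=TRANSCRIPTS_PER_CELL):
    """TPM is a per-million fraction, so scale it by the total pool size."""
    return tpm * transcripts_per_cell / 1_000_000


def copies_per_cell_to_tpm(copies, transcripts_per_cell=TRANSCRIPTS_PER_CELL):
    """Inverse conversion: average copies per cell back to TPM."""
    return copies * 1_000_000 / transcripts_per_cell


print(tpm_to_copies_per_cell(5))  # 1.0, i.e. about one copy per cell
print(copies_per_cell_to_tpm(1))  # 5.0
```

Under the same assumption, a 1 TPM cutoff corresponds to roughly one transcript in every five cells, so the choice between 1 and 5 TPM is a choice about how rare a transcript you still want to call "expressed".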

If you are interested in whether read counts are above background noise (e.g. perhaps there are contaminating DNA molecules in your library preps), you could use the method described here.

written 17 months ago by i.sudbery3.8k

Hey, thanks for your suggestions. I am interested in doing this for a co-expression network analysis; in particular, I would be interested only in the positively correlated genes. What do you think?

written 17 months ago by lessismore570

I guess the worry here is that lowly expressed genes have more noise and thus will screw up the correlations. I'm not sure anyone has ever really considered this question, nor can I think of a principled approach.

The key thing with correlations is probably to get rid of the zeros. Too many zeros can cause a real problem. Other than that, you probably want to keep some fairly lowly expressed genes: you can't have a correlation if you keep only highly expressed genes. You could use the simulation and local FDR method I linked to above, but I'm guessing it's not worth the bother. Your results are unlikely to be significantly different from what you would get with a simple 1 TPM type threshold. Remember, the aim of bioinformatics is to extract biologically meaningful results, not to be mathematically 100% correct. If a trend in your data is strong enough to be biologically meaningful, it's probably strong enough to be insensitive to a range of sensible expression thresholds.
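
As a rough sketch of what that could look like in practice: drop genes that are mostly zeros, apply a simple TPM threshold, and then re-run with a few different thresholds to check that the retained set (and anything computed from it) does not change much. The function name, the cutoffs, and the toy numbers below are all made up for illustration; none of them is a recommendation.

```python
def filter_genes(tpm, min_tpm=1.0, min_fraction=0.5, max_zero_fraction=0.2):
    """Keep genes that reach min_tpm in enough samples and have few zeros.

    tpm maps gene name -> list of TPM values, one per sample.
    ASSUMPTION: all three cutoffs are illustrative defaults only.
    """
    kept = {}
    for gene, values in tpm.items():
        n = len(values)
        expressed = sum(v >= min_tpm for v in values) / n
        zeros = sum(v == 0 for v in values) / n
        if expressed >= min_fraction and zeros <= max_zero_fraction:
            kept[gene] = values
    return kept


# Toy data: four genes across five samples (made-up numbers).
tpm = {
    "geneA": [12.0, 15.5, 9.8, 14.1, 11.3],  # clearly expressed
    "geneB": [1.2, 0.8, 2.5, 1.9, 1.1],      # low, but mostly above 1 TPM
    "geneC": [0.0, 0.0, 3.1, 0.0, 0.0],      # mostly zeros
    "geneD": [0.2, 0.4, 0.1, 0.3, 0.2],      # never reaches 1 TPM
}

# Re-run the filter over a range of sensible thresholds and compare
# which genes survive each time.
for threshold in (0.5, 1.0, 2.0):
    kept = filter_genes(tpm, min_tpm=threshold)
    print(threshold, sorted(kept))
```

If the downstream network looks very different between, say, 0.5 and 2 TPM, that is a hint the signal is being driven by the noisy low end rather than by biology.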

written 17 months ago by i.sudbery3.8k