Filtering lowly expressed genes
Entering edit mode
2.6 years ago
Barista ▴ 40

Hello all, I have a quick RNAseq (Quantseq) question for you all!

I am analyzing the Quantseq data for 500 patients and am finding my way through the bioinformatic forest.

Currently I am working on a way to filter out lowly expressed genes, and I am using the Bioconductor package in R to do so. I have thought of a way to do this, but I dont know if I am completely right in doing so and comments are greatly appreciated.

I am planning on filtering lowly expressed genes by CPM, my library sizes range from 1.7M to 7.9M. I want to keep genes that have >10 counts, but I want filter using CPM instead of raw counts as this also corrects for libsize. Can I just use the following formula "raw counts cutoff"/"minimum libsize in millions" = "CPM threshold"? So this would relate to a CPM filter value of 10/1.7 = 5.9?

As I am thinking of this, this seems to me as there is a lot of 'data' wasted, as there a bigger libraries in the dataset, but only the smallest library determines the cutoff for filtering. This means that in some libraries genes are discarded that have a count of >10. Would using another cut-off, for example the mean library size not be a better cutoff?

After this I want to keep genes with a CPM >5.9, in the manual from the EdgeR package they select if a CPM value is available in 2 or more rows as they use 2 biological replicates. As I dont have any biological replicates can I just select the genes with a CPM of 5.9 in any of the samples?

Any guidance through the forest would be greatly appreciated!

Genes Filtering Lowly Expressed RNAseq • 1.7k views
Entering edit mode
2.6 years ago

Since you're doing quantseq you don't have the inflation of read counts due to gene length... so you could probably just use a raw read count cutoff if you wanted (e.g. 10 reads in at least one sample).

The idea of filtering based on any normalized or un-normalized count-based cutoff is essentially arbitrary. You decide where you want to set a threshold. Usually I look at the distribution of CPM values and then decide on something that keeps a good chunk of my genes that I deem actively transcribed.

Either way your plan sounds fine. Although generally I use a whole number for my CPM cutoff based on the quantiles of cpm counts.

Entering edit mode

Benformatics, thanks for your reply! Makes sense just to filter on raw counts as I am using QuantSeq.

Just tried both approaches and it one I keep 14561 genes in the other 11785. As the extra kept genes in the first approach are probably relatively lowly expressed, I think this would not make a really big difference in the DE analysis.

Anyway, many thanks for your help, it is greatly appreciated!

Entering edit mode
22 months ago
Gordon Smyth ★ 6.8k

The edgeR User's Guide actually uses filterByExpr, which solves all the problems that you mention. Is there any reason why you wouldn't use that?

And of course you do have biological replicates. Every patient is a replicate of some sort.


Login before adding your answer.

Traffic: 1528 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6