Question

Filtering lowly expressed genes

0

Entering edit mode

2.7 years ago

Barista ▴ 40

Hello all, I have a quick RNAseq (Quantseq) question for you all!

I am analyzing the Quantseq data for 500 patients and am finding my way through the bioinformatic forest.

Currently I am working on a way to filter out lowly expressed genes, and I am using the Bioconductor package in R to do so. I have thought of a way to do this, but I dont know if I am completely right in doing so and comments are greatly appreciated.

I am planning on filtering lowly expressed genes by CPM, my library sizes range from 1.7M to 7.9M. I want to keep genes that have >10 counts, but I want filter using CPM instead of raw counts as this also corrects for libsize. Can I just use the following formula "raw counts cutoff"/"minimum libsize in millions" = "CPM threshold"? So this would relate to a CPM filter value of 10/1.7 = 5.9?

As I am thinking of this, this seems to me as there is a lot of 'data' wasted, as there a bigger libraries in the dataset, but only the smallest library determines the cutoff for filtering. This means that in some libraries genes are discarded that have a count of >10. Would using another cut-off, for example the mean library size not be a better cutoff?

After this I want to keep genes with a CPM >5.9, in the manual from the EdgeR package they select if a CPM value is available in 2 or more rows as they use 2 biological replicates. As I dont have any biological replicates can I just select the genes with a CPM of 5.9 in any of the samples?

Any guidance through the forest would be greatly appreciated!

Genes Filtering Lowly Expressed RNAseq • 1.9k views

ADD COMMENT • link updated 24 months ago by Gordon Smyth ★ 7.0k • written 2.7 years ago by Barista ▴ 40

score 2 · Answer 1 · 2021-08-13

2

Entering edit mode

2.7 years ago

benformatics 3.9k

Since you're doing quantseq you don't have the inflation of read counts due to gene length... so you could probably just use a raw read count cutoff if you wanted (e.g. 10 reads in at least one sample).

The idea of filtering based on any normalized or un-normalized count-based cutoff is essentially arbitrary. You decide where you want to set a threshold. Usually I look at the distribution of CPM values and then decide on something that keeps a good chunk of my genes that I deem actively transcribed.

Either way your plan sounds fine. Although generally I use a whole number for my CPM cutoff based on the quantiles of cpm counts.

ADD COMMENT • link 2.7 years ago by benformatics 3.9k

0

Entering edit mode

Benformatics, thanks for your reply! Makes sense just to filter on raw counts as I am using QuantSeq.

Just tried both approaches and it one I keep 14561 genes in the other 11785. As the extra kept genes in the first approach are probably relatively lowly expressed, I think this would not make a really big difference in the DE analysis.

Anyway, many thanks for your help, it is greatly appreciated!

ADD REPLY • link 2.7 years ago by Barista ▴ 40

score 1 · Answer 2 · 2022-04-26

1

Entering edit mode

24 months ago

Gordon Smyth ★ 7.0k

The edgeR User's Guide actually uses filterByExpr, which solves all the problems that you mention. Is there any reason why you wouldn't use that?

And of course you do have biological replicates. Every patient is a replicate of some sort.

ADD COMMENT • link 24 months ago by Gordon Smyth ★ 7.0k