Question

What is the best way to clean bulk RNA-seq data?

10 months ago

JACKY ▴ 140

As far as I know, there is no universally agreed-upon threshold or approach for cleaning the data. I want to remove the genes that don't contribute (in other words, the noise genes) BEFORE I normalize the data using CPM, TPM, or any other approach.

I picked the threshold arbitrarily; I tried not to set it too high so that I don't delete important genes that might have informative value. This is my code:

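  # Flag every value above 0.5, per gene and per sample
  thresh <- data > 0.5
  # Keep genes flagged in at least 2 samples (the count is an integer, so >= 1.5 is the same as >= 2)
  keep <- rowSums(thresh) >= 2
  data <- data[keep, ]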

What do you think? Thanks!

normalization TPM r • 987 views

updated 10 months ago by rfran010 ▴ 900 • written 10 months ago by JACKY ▴ 140

score 2 · Answer 1 · 2023-05-29

For the sake of simplicity, and because it has stood the test of time, I always use edgeR::filterByExpr(), which has reasonable defaults. In general, I would try to respect sample size. If you have 100 samples, then a filter like "1 CPM in at least 3 samples" makes little sense; "in at least 10 or 20 samples" would be more appropriate.
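To make the sample-size point concrete, here is a minimal sketch, not a definitive recipe. The count matrix is simulated, and the names `counts`, `group`, `min_samples`, and `keep_manual` are hypothetical. The hand-written rule ("at least 1 CPM in as many samples as the smallest group") is only an illustration of a filter that scales with the design; the edgeR call at the end runs only if edgeR is installed and uses its own defaults.

```r
# Simulated raw counts (hypothetical data): 2000 genes x 12 samples, two groups of 6
set.seed(1)
n_genes <- 2000
group <- factor(rep(c("ctrl", "trt"), each = 6))
mu <- rlnorm(n_genes, meanlog = 4, sdlog = 2.5)   # per-gene mean, many lowly expressed genes
counts <- matrix(rnbinom(n_genes * length(group), mu = mu, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", seq_along(group))))

# Counts per million: divide each column by its library size
cpm <- t(t(counts) / colSums(counts)) * 1e6

# Tie the sample requirement to the design instead of hard-coding it
min_samples <- min(table(group))                  # size of the smallest group
keep_manual <- rowSums(cpm >= 1) >= min_samples
filtered <- counts[keep_manual, , drop = FALSE]
cat("manual rule keeps", sum(keep_manual), "of", n_genes, "genes\n")

# The same step with edgeR, if it is available
if (requireNamespace("edgeR", quietly = TRUE)) {
  y <- edgeR::DGEList(counts = counts, group = group)
  keep_edger <- edgeR::filterByExpr(y, group = group)
  y <- y[keep_edger, , keep.lib.sizes = FALSE]
  cat("filterByExpr keeps", sum(keep_edger), "of", n_genes, "genes\n")
}
```

The two rules will not keep exactly the same genes, since the thresholds differ; the point is only that the number of samples required grows with the experiment rather than staying fixed at a small constant.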

score 0 · Answer 2 · 2023-05-30

10 months ago

swbarnes2 14k

There really is no "best". Just find something reasonable, and document what you chose.

score 0 · Answer 3 · 2023-06-01

Honestly, I am a fan of visualizing to select a threshold, since the right value can depend on the specific experiment. I would even just run a standard DEG analysis after removing only the 0-count genes, then visualize the data to see which FPKM threshold removes the noisy genes. I am far from an expert on this, though. Otherwise, I'd echo the others: use something standard or reasonable and document it.
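One possible way to do the visual check is sketched below, under stated assumptions: the counts and gene lengths are simulated, the names `gene_length_kb`, `candidates`, and `genes_kept` are hypothetical, and a density plot of mean log2 FPKM is just one of several plots that could serve here. It drops only the all-zero genes, computes FPKM, and marks a few candidate cutoffs.

```r
# Simulated raw counts and gene lengths (hypothetical data)
set.seed(2)
n_genes <- 2000
n_samples <- 6
mu <- rlnorm(n_genes, meanlog = 4, sdlog = 2.5)
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 5), nrow = n_genes)
gene_length_kb <- runif(n_genes, min = 0.5, max = 10)

# Remove only the genes with zero counts in every sample
nonzero <- rowSums(counts) > 0
counts <- counts[nonzero, , drop = FALSE]
gene_length_kb <- gene_length_kb[nonzero]

# FPKM = counts / (gene length in kb * library size in millions)
lib_millions <- colSums(counts) / 1e6
fpkm <- t(t(counts / gene_length_kb) / lib_millions)

# Distribution of mean expression per gene, with candidate cutoffs as dashed lines
mean_fpkm <- rowMeans(fpkm)
candidates <- c(0.1, 0.5, 1, 5)
plot(density(log2(mean_fpkm + 0.01)),
     main = "Mean expression per gene",
     xlab = "log2(mean FPKM + 0.01)")
abline(v = log2(candidates + 0.01), lty = 2)

# How many genes each candidate cutoff would keep
genes_kept <- sapply(candidates, function(th) sum(mean_fpkm >= th))
names(genes_kept) <- candidates
print(genes_kept)
```

On real data, a low-expression shoulder or second peak in the density is the usual thing to look for; a cutoff placed between it and the main peak is a reasonable starting point, and the table of retained genes shows what each choice costs.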