What is the best way to clean bulk RNA-seq data?
10 months ago
JACKY ▴ 140

As far as I know, there is no universally agreed-upon threshold or approach for cleaning the data. I want to remove the genes that don't contribute (in other words, the noise genes) BEFORE I normalize the data with CPM, TPM, or any other approach.

I picked the threshold arbitrarily. I tried not to set it too high, so that I don't delete important genes that might have informative value. This is my code:

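  # TRUE where a gene's value in a sample exceeds 0.5
  thresh <- data > 0.5
  # keep genes passing in at least 2 samples (rowSums is an integer count, so ">= 1.5" meant the same)
  keep <- rowSums(thresh) >= 2
  data <- data[keep, ]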

What do you think? thanks!

normalization TPM r • 987 views
10 months ago
ATpoint 81k

For the sake of simplicity, and because it has stood the test of time, I always use edgeR::filterByExpr(), which has reasonable defaults. In general I would try to respect sample size. If you have 100 samples, then a filter like "1 CPM in at least 3 samples" makes little sense. Rather, "in at least 10 or 20 samples" would be fine.
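To make the sample-size point concrete, here is a minimal sketch on simulated counts. The data, the 1 CPM cutoff, and the "smallest group size" rule are assumptions chosen for illustration, not recommended values. The edgeR part only runs if the package is installed and uses filterByExpr() with its defaults.

```r
# Simulated raw counts: 2000 genes x 12 samples (hypothetical data)
set.seed(1)
n_genes <- 2000
n_samples <- 12
mu <- rlnorm(n_genes, meanlog = 4, sdlog = 2)   # per-gene mean expression
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 0.5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", seq_len(n_samples))))
group <- factor(rep(c("ctrl", "trt"), each = n_samples / 2))

# Counts per million, computed per sample from the raw library sizes
cpm <- t(t(counts) / colSums(counts)) * 1e6

# Require the CPM cutoff in at least as many samples as the smallest group,
# so the rule scales with the experiment instead of being fixed at e.g. 3
min_cpm <- 1
min_samples <- min(table(group))
keep_manual <- rowSums(cpm >= min_cpm) >= min_samples
filtered_manual <- counts[keep_manual, , drop = FALSE]

# If edgeR is available, let filterByExpr() choose the genes with its defaults
if (requireNamespace("edgeR", quietly = TRUE)) {
  y <- edgeR::DGEList(counts = counts, group = group)
  keep_edger <- edgeR::filterByExpr(y, group = group)
  y <- y[keep_edger, , keep.lib.sizes = FALSE]
}
```

Note that the filter is computed on CPM only to decide which genes to keep; the raw counts of the retained genes are what go on to normalization and testing.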

10 months ago

There really is no "best". Just find something reasonable, and document what you chose.

10 months ago
rfran010 ▴ 900

Honestly, I am a fan of visualizing the data to select a threshold, since the right value can depend on the specific experiment. I would even just run a standard DEG analysis after removing only the 0-count genes, then visualize the data to see which FPKM threshold removes the noisy genes. I am far from an expert on this though. Otherwise, I'd echo the others: use something standard or reasonable, and document it.
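A rough sketch of that visual approach, on simulated counts. It uses CPM rather than FPKM because gene lengths are not available here, and the candidate cutoffs are hypothetical values to compare, not suggestions.

```r
# Simulated raw counts: 2000 genes x 6 samples (hypothetical data)
set.seed(2)
n_genes <- 2000
n_samples <- 6
mu <- rlnorm(n_genes, meanlog = 3, sdlog = 2)
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 1),
                 nrow = n_genes)

# Drop only the genes with zero counts in every sample
counts <- counts[rowSums(counts) > 0, , drop = FALSE]

# log2(CPM + 1) per gene and sample, then the per-gene mean across samples
log_cpm <- log2(t(t(counts) / colSums(counts)) * 1e6 + 1)
mean_log_cpm <- rowMeans(log_cpm)

# Density of mean expression; low-expression noise usually shows up
# as a separate peak or shoulder on the left
dens <- density(mean_log_cpm)
plot(dens, main = "Mean log2(CPM + 1) per gene",
     xlab = "mean log2(CPM + 1)")

# Mark candidate cutoffs on the plot (on the same log2(x + 1) scale)
cutoffs <- c(0.5, 1, 2, 5)
abline(v = log2(cutoffs + 1), lty = 2)

# Count how many genes each candidate cutoff would retain
kept <- sapply(cutoffs, function(k) sum(mean_log_cpm >= log2(k + 1)))
summary_table <- data.frame(cpm_cutoff = cutoffs, genes_kept = kept)
print(summary_table)
```

Looking at the plot next to the table gives a feel for how sensitive the gene count is to the cutoff, which is worth writing down along with the value finally chosen.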
