Question

Filtering RNA-seq expression matrix?

7.1 years ago

halo22 ▴ 300

Hello,

I am currently analyzing RNA-sequencing data and would love your input, especially regarding filtering the expression matrix. At the moment, I keep only the rows (genes) that have non-zero expression in at least 70% of my samples, which reduces my gene expression matrix to ~20,000 rows.

I would like to know the following: 1) Is it statistically acceptable to reduce the expression matrix from ~60,000 rows to ~20,000? 2) Are there any other filtering techniques that could be applied to this matrix (maybe variance-based)?

Note: For downstream analysis, I am using edgeR for normalization and limma for running a differential expression analysis.

Thanks

RNA-Seq sequencing rNA • 2.3k views

updated 7.1 years ago by theobroma22 ★ 1.2k • written 7.1 years ago by halo22 ▴ 300

score 0 · Answer 1 · 2017-04-11

1) It depends on your downstream analysis and your conclusions, and on the technical and biological reasons for zero expression (such as expression below the limit of detection). In another research field, "filtering" was recently found to have led to wrong key messages in half of all publications in that field (where absent values are not random) ( https://academic.oup.com/pan/article/24/4/414/2276176/How-Multiple-Imputation-Makes-a-Difference )

2) Depending on the specific analysis, the depth of sequencing, and the modeling, you could set all low-expression genes (left peak) to a given value (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3159973/ ), and then either consider only the high-expression genes (right peak) or use a modeling approach that treats the low-expression / high-expression difference as a separate category.
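One possible reading of "set all low-expression genes to a given value" is sketched below in plain Python. This is only an illustration, not the method of the linked paper: the function name `floor_low_expression`, the cutoff of 1.0, and the log2 values are all made up, and in practice the cutoff would have to be chosen from the valley between the two peaks of your own expression distribution.

```python
def floor_low_expression(log_values, cutoff, floor_value=None):
    """Replace every value below `cutoff` with a single floor value.

    `cutoff` is a hypothetical boundary between the low-expression (left)
    and high-expression (right) peaks; by default the floor is the cutoff.
    """
    if floor_value is None:
        floor_value = cutoff
    return [v if v >= cutoff else floor_value for v in log_values]


# Made-up log2 expression values for one sample.
log_expr = [-3.2, -2.8, 0.4, 5.1, 6.7, -3.0, 4.9]

# Option A: keep all genes, but collapse the left peak to one value.
floored = floor_low_expression(log_expr, cutoff=1.0)

# Option B: consider only the high-expression genes (right peak).
high_only = [v for v in log_expr if v >= 1.0]

print(floored)
print(high_only)
```

Option A keeps the matrix dimensions unchanged, while option B shrinks it, so the choice affects any downstream step that depends on the number of genes tested.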

score 0 · Answer 2 · 2017-04-12

1) Is it statistically acceptable to reduce the expression matrix from ~60,000 rows to ~20,000?

I'm not quite sure about your question, but numerically and statistically, what you have is a subset of your data defined by one or more conditions. If I understand correctly, the condition here is that a gene is present (non-zero) in at least 70% of all samples.
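The condition described above can be written down in a few lines. The following is a minimal sketch in plain Python, not the original poster's code: the function names, the dictionary layout (gene ID mapped to a list of per-sample counts), and the toy numbers are assumptions for illustration.

```python
def fraction_nonzero(row):
    """Fraction of samples in which a gene has a non-zero count."""
    return sum(1 for value in row if value > 0) / len(row)


def keep_expressed(counts, min_fraction=0.7):
    """Keep genes with non-zero counts in at least `min_fraction` of samples.

    counts: dict mapping gene ID -> list of per-sample counts
            (a hypothetical layout chosen for this example).
    """
    return {
        gene: row
        for gene, row in counts.items()
        if fraction_nonzero(row) >= min_fraction
    }


# Toy matrix: 3 genes x 10 samples (made-up numbers).
counts = {
    "geneA": [5, 3, 0, 8, 2, 7, 1, 4, 0, 6],  # non-zero in 8/10 samples
    "geneB": [0, 0, 0, 1, 0, 0, 2, 0, 0, 0],  # non-zero in 2/10 samples
    "geneC": [9, 9, 9, 9, 9, 9, 9, 0, 0, 0],  # non-zero in 7/10 samples
}

filtered = keep_expressed(counts)
print(sorted(filtered))
```

Here geneA and geneC pass the 70% condition and geneB is dropped. Note that the rule looks only at zero versus non-zero, so a gene with a count of 1 in most samples passes just like a highly expressed one.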

2) Are there any other filtering techniques that could be applied to this matrix? (Maybe variance-based)

Why not compare this subset to other percentage subsets such as 100%, 50% and 25%? This way you can also make Venn diagrams.
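A comparison across cutoffs could look roughly like this. It is a minimal sketch in plain Python with made-up counts; the function name `genes_passing` and the data layout are assumptions. One thing to keep in mind: when the same "non-zero in at least X% of samples" rule is applied with different cutoffs, every stricter subset is contained in the looser ones, so a Venn diagram of them would be concentric and the subset sizes are probably the more informative summary.

```python
def genes_passing(counts, min_fraction):
    """Set of genes with non-zero counts in at least `min_fraction` of samples."""
    return {
        gene
        for gene, row in counts.items()
        if sum(1 for v in row if v > 0) / len(row) >= min_fraction
    }


# Toy matrix: 5 genes x 4 samples (made-up numbers).
counts = {
    "geneA": [5, 3, 1, 8],  # non-zero in 4/4 samples
    "geneB": [0, 0, 0, 1],  # non-zero in 1/4 samples
    "geneC": [9, 9, 9, 0],  # non-zero in 3/4 samples
    "geneD": [0, 2, 0, 4],  # non-zero in 2/4 samples
    "geneE": [0, 0, 0, 0],  # non-zero in 0/4 samples
}

thresholds = [0.25, 0.5, 0.7, 1.0]
subsets = {t: genes_passing(counts, t) for t in thresholds}

for t in thresholds:
    print(f"{t:.0%}: {len(subsets[t])} genes")

# Check that each stricter subset is contained in the next looser one.
nested = all(
    subsets[strict] <= subsets[loose]
    for loose, strict in zip(thresholds, thresholds[1:])
)
print("nested:", nested)
```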