Question: Filtering RNA-seq expression matrix?
2.1 years ago by
halo22120
Indianapolis, IN
halo22120 wrote:

Hello,

I am currently analyzing RNA-sequencing data and would love your input, especially regarding filtering the expression matrix. Currently, I am keeping only the rows (genes) that have non-zero expression in at least 70% of my samples, which reduces my gene expression matrix to ~20,000 rows.
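For readers who want to see the filter spelled out, here is a minimal sketch of the rule described above, assuming the counts are held as a mapping from gene name to a list of per-sample values. The function name `keep_expressed_genes`, the toy gene names, and the toy counts are hypothetical and only illustrate the idea; this is not the poster's actual code.

```python
def keep_expressed_genes(counts, min_fraction=0.7):
    """Keep genes with non-zero counts in at least min_fraction of samples.

    counts: dict mapping gene name -> list of per-sample counts (assumed layout).
    """
    kept = {}
    for gene, row in counts.items():
        nonzero = sum(1 for value in row if value > 0)
        if nonzero / len(row) >= min_fraction:
            kept[gene] = row
    return kept


# Toy data: 10 samples per gene (made-up values).
counts = {
    "geneA": [5, 0, 12, 7, 3, 9, 1, 4, 0, 6],  # non-zero in 8 of 10 samples
    "geneB": [0, 0, 0, 2, 0, 0, 0, 1, 0, 0],   # non-zero in 2 of 10 samples
    "geneC": [1, 3, 2, 1, 4, 1, 2, 6, 1, 2],   # non-zero in 10 of 10 samples
}

filtered = keep_expressed_genes(counts, min_fraction=0.7)
print(sorted(filtered))
```

Only the genes that reach the 70% prevalence threshold survive, so in this toy example `geneB` is dropped.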

I would like to know the following: 1) Is it statistically acceptable to reduce the expression matrix from ~60,000 rows to ~20,000? 2) Are there any other filtering techniques that could be applied to this matrix (maybe variance-based)?

Note: For downstream analysis, I am using edgeR for normalization and limma for running a differential expression analysis.

Thanks

sequencing rna-seq rna • 950 views
2.1 years ago by
unksci150
unksci150 wrote:

1) It depends on your downstream analysis and your conclusions, and on the technical and biological reasons for zero expression (such as expression below the limit of detection). In another research field, "filtering" was recently found to have led to wrong key messages in half of all publications in that field (where absent values are not random) ( https://academic.oup.com/pan/article/24/4/414/2276176/How-Multiple-Imputation-Makes-a-Difference )

2) Depending on the specific analysis, the sequencing depth, and the modeling, you could set all low-expression genes (the left peak of the expression distribution) to a given value ( see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3159973/ ), and then either consider only the high-expression genes (the right peak) or use a modeling approach that treats the low-expression / high-expression difference as a separate category.
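As a rough illustration of point 2, the sketch below labels each gene as low or high by its mean log2 expression and floors every value below a cutoff to that cutoff. Everything here is an assumption made for illustration: the cutoff of 2.0 (in practice it would be read off the valley between the two peaks of your own histogram), the pseudocount of 1, the function names, and the toy values. It is not the method of the linked paper.

```python
import math


def log2_expr(value, pseudocount=1.0):
    """log2 transform with a pseudocount so zeros stay finite."""
    return math.log2(value + pseudocount)


def classify_genes(expr, cutoff):
    """Label each gene 'high' or 'low' by its mean log2 expression."""
    labels = {}
    for gene, row in expr.items():
        mean_log = sum(log2_expr(v) for v in row) / len(row)
        labels[gene] = "high" if mean_log >= cutoff else "low"
    return labels


def floor_low_values(expr, cutoff):
    """Replace every log2 value below the cutoff with the cutoff itself."""
    return {
        gene: [max(log2_expr(v), cutoff) for v in row]
        for gene, row in expr.items()
    }


# Toy normalized expression values (made up), 3 samples per gene.
expr = {
    "geneA": [120.0, 95.0, 150.0],  # clearly in the right (high) peak
    "geneB": [0.0, 1.0, 0.5],       # clearly in the left (low) peak
}
CUTOFF = 2.0  # hypothetical; choose it from your own data

labels = classify_genes(expr, CUTOFF)
floored = floor_low_values(expr, CUTOFF)
print(labels)
print(floored["geneB"])
```

The `labels` mapping could then be used either to subset to the high-expression genes or as a categorical covariate, depending on which of the two routes above you take.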

2.1 years ago by
theobroma221.1k
theobroma221.1k wrote:

1) Is it statistically acceptable to reduce the expression matrix from ~60,000 rows to ~20,000?

I'm not quite sure what you are asking, but numerically and statistically, what you have is a subset of your data defined by one or more conditions. If I understand correctly, your condition is that a gene is present (non-zero) in at least 70% of all samples.

2) Are there any other filtering techniques that could be applied to this matrix (maybe variance-based)?

Why not compare this subset to subsets at other percentage thresholds, such as 100%, 50%, and 25%? This way you can also make Venn diagrams of the overlaps.
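A quick way to try the comparison suggested above is to compute the set of genes passing each threshold and look at how the sets relate. This is a minimal sketch with made-up counts; the function name `genes_passing` and the data layout (gene name mapped to a list of per-sample counts) are assumptions for illustration.

```python
def genes_passing(counts, min_fraction):
    """Return the set of genes with non-zero counts in >= min_fraction of samples."""
    return {
        gene
        for gene, row in counts.items()
        if sum(1 for v in row if v > 0) >= min_fraction * len(row)
    }


# Toy data: 4 samples per gene (made-up values).
counts = {
    "geneA": [3, 5, 2, 8],  # non-zero in 4 of 4 samples
    "geneB": [0, 5, 2, 8],  # 3 of 4
    "geneC": [0, 0, 2, 8],  # 2 of 4
    "geneD": [0, 0, 0, 8],  # 1 of 4
    "geneE": [0, 0, 0, 0],  # 0 of 4
}

thresholds = [1.0, 0.7, 0.5, 0.25]
subsets = {t: genes_passing(counts, t) for t in thresholds}

for t in thresholds:
    print(f"{t:.0%}: {len(subsets[t])} genes -> {sorted(subsets[t])}")

# Genes gained when relaxing the threshold from 70% to 50%.
print(sorted(subsets[0.5] - subsets[0.7]))
```

One thing to keep in mind: because these are thresholds on the same statistic, each stricter subset is contained in every looser one, so a Venn diagram of them will be nested. The set differences between neighbouring thresholds are usually the more informative thing to inspect.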
