I am looking for help understanding the most logical way (mathematically speaking) to reduce post-normalization RNA-seq counts so that I can fit various regression and tree-based models for phenotype prediction.
- 60 "treatments": A treatment in this case is a particular full-sib family
- 200 biological replicates: each treatment (i.e. each full-sib family) has roughly 3-4 biological replicates
- Minimal pre-filtering was done on the raw count data to remove transcripts with all zeros
- Raw count data was normalized with a linear mixed model to account for lane, index, and familial relationships
-- counts were given an offset of 1 and log2-transformed prior to normalization (i.e. log2(count + 1))
-- the output of the normalization process is log2 counts
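To make the normalization step concrete, it was roughly along these lines (a simplified sketch only; the per-transcript lme4 model and the use of residuals as the normalized values are approximations of the actual pipeline):

```r
library(lme4)

# counts: raw count matrix, transcripts x samples
# meta:   data frame with lane, index, and family for each sample
log_counts <- log2(counts + 1)  # offset of 1, then log2

# per-transcript mixed model; the "normalized" value is the residual after
# removing lane, index, and family effects (simplified sketch)
normalize_one <- function(y, meta) {
  fit <- lmer(y ~ 1 + (1 | lane) + (1 | index) + (1 | family), data = meta)
  resid(fit)
}

# apply over transcripts; the result is samples x transcripts (200 x 70,000)
norm_counts <- apply(log_counts, 1, normalize_one, meta = meta)
```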
The normalized count matrix is now 200 x 70,000, and I would like to filter out transcripts in a way that removes as little biological variation as possible. The objective is to get a smaller subset of roughly 10-20K transcripts that I can use as the input to caret for prediction modeling.
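To illustrate the kind of filter I have in mind, something as simple as ranking transcripts by their variance across samples on the log2 scale and keeping the most variable ones (a sketch only; the 15,000 cutoff is arbitrary):

```r
library(matrixStats)

# norm_counts: 200 samples x 70,000 transcripts, log2 scale
transcript_var <- colVars(norm_counts)   # per-transcript variance across samples
keep <- rank(-transcript_var) <= 15000   # keep the ~15K most variable transcripts
filtered_counts <- norm_counts[, keep]
dim(filtered_counts)                     # 200 x 15,000
```

I am not sure whether this is the most defensible choice, hence the questions below.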
Question 1) Can I filter on these log-transformed counts?
Question 2) If I wanted to estimate the dispersion of my normalized counts, would it make more sense to do this on the log-transformed or on the exponentiated (back-transformed) counts? Does it even make sense to filter on dispersion at all? ("cries for help")
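Just to be explicit about what I mean by the two options (a sketch; both statistics are placeholders for whatever the "right" dispersion estimate would be):

```r
# norm_counts: 200 samples x 70,000 transcripts, log2 scale

# (a) on the log2 scale: per-transcript variance across samples
log_var <- apply(norm_counts, 2, var)

# (b) on the exponentiated (back-transformed) scale: squared coefficient of variation
expo <- 2^norm_counts - 1
cv2  <- apply(expo, 2, function(x) var(x) / mean(x)^2)
```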
Question 3) Generally speaking, what are common practices for filtering RNA-seq data for the purposes of prediction (not necessarily for DGE)?
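For completeness, the downstream use would look roughly like this (a sketch; `phenotype` and the choice of a random forest are just placeholders):

```r
library(caret)

# filtered_counts: 200 x ~15,000 matrix from whichever filter is chosen
# phenotype: vector of length 200 with the trait to predict
fit <- train(
  x = filtered_counts,
  y = phenotype,
  method = "rf",  # example tree-based model
  trControl = trainControl(method = "cv", number = 5)
)
```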