I am analysing an RNA-seq experiment with 8 treatment vs 8 control samples collected from primary tissue.
After running a differential gene expression (DGE) analysis between these groups with the edgeR exact test, we noticed that genes specifically expressed in our cell type of interest (identified in a separate study) appear to be systematically reduced in expression in the treatment group.
We believe this is most likely due to a difference in cell composition between the two groups, despite the care taken during collection. If the treatment samples contain more of the cells we aren't interested in, genes specific to those cells consume a larger share of the reads, leaving fewer reads for genes specific to the cells we are interested in. That reduction would then be falsely called as differential expression.
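To make the concern concrete, here is a toy calculation with made-up numbers (the cell counts, gene names and per-cell read values are all hypothetical, not from our data). Per-cell expression of the marker gene is the same in both groups and only the number of target cells changes, yet the counts-per-million value of the marker still drops.

```python
# Toy illustration only: all numbers and names below are hypothetical.

def cpm(counts):
    """Scale raw counts to counts per million of the sample's total."""
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}

def bulk_counts(target_cells, total_cells=100, reads_per_cell=100):
    """Bulk counts for a tissue sample made of two cell types.

    Each cell type expresses one marker gene at the same per-cell level,
    and that level is NOT changed by treatment in this toy model.
    """
    other_cells = total_cells - target_cells
    return {
        "target_marker": target_cells * reads_per_cell,
        "other_marker": other_cells * reads_per_cell,
    }

control = cpm(bulk_counts(target_cells=50))    # 50% target cells
treatment = cpm(bulk_counts(target_cells=30))  # 30% target cells

# Apparent fold change of the target marker, driven purely by composition.
apparent_fold_change = treatment["target_marker"] / control["target_marker"]
print(control["target_marker"], treatment["target_marker"], apparent_fold_change)
```

In this sketch the marker appears down-regulated even though no cell changed its expression, which is the pattern we suspect in our data.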
I was wondering: is it sound to subset the data to just the genes we are confident are specific to the cell type we want to investigate, normalise by the total counts across this gene set (like a pseudo-library size adjustment), and perform DGE on this gene set alone? Perhaps with a conservative false discovery rate adjustment that uses the total number of expressed genes (not the number in the subset) as the number of tests?
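For clarity, the conservative FDR adjustment I have in mind could be sketched as below. This is a plain-Python Benjamini-Hochberg adjustment where the number of tests can be set larger than the number of p-values supplied, similar in spirit to the `n` argument of R's `p.adjust`. The function name `bh_adjust`, the parameter `n_tests` and the example p-values are my own illustrative choices, not edgeR output.

```python
def bh_adjust(pvalues, n_tests=None):
    """Benjamini-Hochberg adjusted p-values.

    n_tests is the number of hypotheses to correct for. It defaults to
    len(pvalues); passing a larger value (e.g. the total number of
    expressed genes) makes the adjustment more conservative.
    """
    m = len(pvalues)
    n = m if n_tests is None else n_tests
    if n < m:
        raise ValueError("n_tests must be at least the number of p-values")

    # Walk from the largest p-value to the smallest, keeping a running
    # minimum so adjusted values are monotone in the original p-values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values for three cell-type-specific genes.
subset_p = [0.001, 0.01, 0.04]
fdr_subset_only = bh_adjust(subset_p)                # corrects for 3 tests
fdr_all_expressed = bh_adjust(subset_p, n_tests=300) # corrects for 300 tests
print(fdr_subset_only, fdr_all_expressed)
```

With the larger `n_tests`, every adjusted value is at least as large as the subset-only one, so fewer genes would pass a given FDR threshold.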
Any advice would be greatly appreciated!