Hello, I am writing an RNA-seq manuscript with a fellow lab member. I did everything except the bioinformatics. We have seven different tissues and three biological replicates for each, so a total of 21 libraries. We are comparing four of the tissues for a specific analysis. For this analysis, she normalized the counts of those 12 libraries (excluded the 9 others) and then identified DEGs. I suggested that she normalize all the libraries together, and then identify DEGs by doing contrast between all the libraries in the specific analysis (so the 12 libraries in this case).
She did both analyses and found that when comparing for example tissue A to tissue B, 1,221 (79%) genes were identified as upregulated by both normalization methods, 257 (16.6%) were identified by subset-sample-based (i.e., only the 12 libraries) normalization, and 67 (4.3%) were only identified by all-sample-based normalization. And she argues that "by normalizing with the samples/libraires of interest, we were able to identify more DEGs, which covered 94.5% of the DEGs identified by all-sample-based normalization. So, if we report subset-sample-based normalization results, we are missing 5% DEGs that can be identified otherwise; if we report all-sample-based normalization results, we are missing 17% DEGs that would be reported otherwise."
I think that the DEGs identified only by subset-sample-based normalization are likely not real DEGs because they are likely expressed in multiple tissues. The samples we included are not all the tissues in the plant we used, but the more tissues the more complete the picture, correct?