I am doing a RNAseq analysis for DE genes. In my initial analysis, using the arbitrary cutoff of 1 CPM for filtering out lowly expressed genes, I found ~20 DE genes (i.e. genes with FDR = p.adjust(method="BH") < 0.05). I have been working to understand the BH correction, and in my exploration, have generated the following plot from my filtered dataset (retaining 14,000 genes after filtering):
Am I correct in understanding that this histogram implies that I have upwards of 1,000 DE genes? I am making this assumption by considering those counts over 500 for the two bins corresponding to p<=0.05, since the mean counts of the bins with p > 0.50 appears to be ~400 (so, selecting 500 as a conservative upper limit).
I understand that the BH adjustment must lose some true positive discoveries in its aims to bring the false positive rate to the advertised 5%, as well as the inverse dependence of the number of true positives lost with the overall number of comparisons made for large n. As such, I sought to emulate the DEseq2 method of setting the cutoff CPM value used for filtering so as to maximize the number of results with FDR<=0.05. I accomplished this with some ugly but effective 'for' loops, iterating through a range of cutoff CPM values and tracking the number of FDR<=0.05 results for each. When I did this around the range of cutoff CPM values suggested in the edgeR documentation (the recommendation being to filter out genes with less than 10-15 counts per sample, which for my samples corresponds to 0.7-1.1 CPM), I achieved the following results:
I repeated this using a broader range of CPM values (my computer was not pleased to say the least), and was able to find that at Cutoff CPM = 18 I can get ~35 DE gene with FDR<=0.05. I am worried, however, that this cutoff CPM is too high and I may be missing out on some lowly, but differentially, expressed genes (I have biological reasons to believe this after examining these results, also)
In my research on finding appropriate filtering parameters in RNAseq analysis, I stumbled upon this paper on independent hypothesis weighting (IHW): https://www.nature.com/articles/nmeth.3885#MOESM282 . Considering the BH adjustment to possibly be incompatible with my data (if so, why?) I re-did my analysis with the IHW method as detailed in this paper. With no prior filtering I found 95 DE genes with adjusted p value <= 0.05, but I was unsure if filtering was necessary or not before such an IHW analysis (my logic was that the hypothesis weighting would filter out lowly expressed samples as needed).
I now feel as though I may be seeking a method that gives me the results that I want rather than using a predetermined, robust method, which I know to be a dangerous statistical practice (though this being my first such analysis makes this feel at least partially inevitable).
That said, why is the number of FDR<=0.05 genes so significantly less than the p-value histogram would suggest? Is this simply the nature of high-throughput/multiple-comparison testing and I am misunderstanding the math behind it? Or am I doing something wrong in my analysis?
If anybody could help me better understand this process or point me in the direction of papers, books, or any other resources to help my understanding I would be greatly appreciative!