I'm carrying out DE analysis on RNASeq data in DESeq2. I have a question about the histogram plots I used to plot the p-adj values of my results. I'm referring to this post; http://varianceexplained.org/statistics/interpreting-pvalue-histogram/ to best explain my histogram plots!
I have two sets of results; X and Y. Result X was calculated first with no log fold threshold and padj < 0.05. When I plotted the padj values of Result X on a histogram I got an Anti-conservative histogram (similar to A in the website I'm referring to) with a large number of padj values < 0.05.
Result X resulted in a large number of DE genes so I decided to switch to Result Y where a log fold threshold of 1 was applied and then an FDR <0.05. I now have a set of genes that is more managable to do DE analysis on, however my concern with Result Y is that when I plotted the histogram of the padj values it resulted in a conservative plot (similar to D in the website) with a larger number of padj values = 1 than those < 0.05.
My question is; is histogram Y a concern? Going by the post attached is there something wrong with my dataset or as he mentioned is applying a log fold threshold a correction method? As long as histogram X is ok then my data is good to do statistical analysis on? Also I'm trying to understand what's happening the padj values in Result Y when the log fold threshold is applied? Is it the genes excluded from the data because of the threshold and they have no padj value assigned?
I get the same histogram when I plot the p-values of both results.
I'm count data is stored in the "transcript" dataframe under 2 conditions(pre and post) and 3 replicates per group. Design matrix is stored in the "coldata" dataframe.
Here's the code:
dim(transcripts)  4874 6 head(coldata) condition T1_pre pre T1_post post T2_pre pre T2_post post T3_pre pre T3_post post dds <- DESeqDataSetFromMatrix(countData = transcripts, colData = coldata, design = ~ condition) dds <- DESeq(dds) #Result X res <- results(dds) > table(res$padj < 0.05) FALSE TRUE 1095 3773 #Result Y resLFC <- results(dds, lfcThreshold = 1) > table(resLFC$padj < 0.05) FALSE TRUE 3991 877 hist( res$pvalue, breaks=20, col="grey" ) hist( resLFC$pvalue, breaks=20, col="grey" )
Histogram of results no log fold threshold:
Histogram of results log fold threshold < 1: