Histogram plot of padj values
2
0
Entering edit mode
4.7 years ago
cado • 0

Hi,

I'm carrying out DE analysis on RNASeq data in DESeq2. I have a question about the histogram plots I used to plot the p-adj values of my results. I'm referring to this post; http://varianceexplained.org/statistics/interpreting-pvalue-histogram/ to best explain my histogram plots!

I have two sets of results; X and Y. Result X was calculated first with no log fold threshold and padj < 0.05. When I plotted the padj values of Result X on a histogram I got an Anti-conservative histogram (similar to A in the website I'm referring to) with a large number of padj values < 0.05.

Result X resulted in a large number of DE genes so I decided to switch to Result Y where a log fold threshold of 1 was applied and then an FDR <0.05. I now have a set of genes that is more managable to do DE analysis on, however my concern with Result Y is that when I plotted the histogram of the padj values it resulted in a conservative plot (similar to D in the website) with a larger number of padj values = 1 than those < 0.05.

My question is; is histogram Y a concern? Going by the post attached is there something wrong with my dataset or as he mentioned is applying a log fold threshold a correction method? As long as histogram X is ok then my data is good to do statistical analysis on? Also I'm trying to understand what's happening the padj values in Result Y when the log fold threshold is applied? Is it the genes excluded from the data because of the threshold and they have no padj value assigned?

EDIT:

I get the same histogram when I plot the p-values of both results.

I'm count data is stored in the "transcript" dataframe under 2 conditions(pre and post) and 3 replicates per group. Design matrix is stored in the "coldata" dataframe.

Here's the code:

dim(transcripts)
[1] 4874    6

 head(coldata)
         condition
T1_pre        pre
T1_post      post
T2_pre        pre
T2_post      post
T3_pre        pre
T3_post      post


dds <- DESeqDataSetFromMatrix(countData = transcripts,
                              colData = coldata, 
                              design = ~ condition)

dds <- DESeq(dds)    

#Result X
res <- results(dds)

> table(res$padj < 0.05)

FALSE  TRUE 
 1095  3773 

#Result Y
resLFC <- results(dds, lfcThreshold = 1)

> table(resLFC$padj < 0.05)

FALSE  TRUE 
 3991   877 

hist( res$pvalue, breaks=20, col="grey" )
hist( resLFC$pvalue, breaks=20, col="grey" )

Histogram of results no log fold threshold:

Histogram of results log fold threshold < 1:

RNA-Seq statistics histograms • 2.8k views
ADD COMMENT
1
Entering edit mode

You should plot the nominal p-values histogram ; not the adjusted ones. Also could you describe the different commands you used to build your results i.e. the DESeq2 commands and also define the number of samples, number of groups, experimental design, etc...

ADD REPLY
0
Entering edit mode

To quote the linked page:

Make a histogram of your p-values. Do this before you perform multiple hypothesis test correction, false discovery rate control, or any other means of interpreting your many p-values.

ADD REPLY
1
Entering edit mode

Please do not add answers unless you're answering the top level question. You can edit your post and add these details. Your images are also added improperly. I'll clean up your post, but please put some more effort in the future.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

I was checking the website that you provided. This might be helpful: "Don’t apply false discovery rate control to these p-values yet. (Why not? Because some kinds of FDR control are based on the assumption that your p-values near 1 are uniform. If you break this assumption, you’ll get way fewer significant hypotheses. Everyone loses)."

ADD REPLY

Login before adding your answer.

Traffic: 2125 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6