Question

is there a value of ranking genes based on p-value in RNA-seq data

3

6.9 years ago

CrazyB ▴ 280

For my RNA-seq project, my collaborator generally provides a list of DEGs, and the output is ranked by p-value, not fold change. While I understand the significance of p-values in multiple-testing problems, I have always assumed that once a gene passes the p-value cutoff, there is little difference, biologically, between p-value = 1E-40, 1E-10 and 1E-3. As long as the gene passes the threshold, it's the fold change that can explain the biology. Is this correct? Is there value in ranking the genes by p-value?

rna-seq • 4.8k views

updated 6.9 years ago by h.mon 35k • written 6.9 years ago by CrazyB ▴ 280

0

From what I understand, a p-value is not as relative as a fold change. For example, let's say you have gene A and gene B in conditions 1 and 2. Both have a log2 fold change of 1 (a doubling) between the conditions; however, gene A changes from 3 counts to 6, while gene B changes from 30,000 counts to 60,000. Naturally, the p-value of gene B will be much lower than gene A's.
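
To make the gene A versus gene B comparison concrete, here is a rough sketch in Python. It is not what DESeq2 or edgeR do: it assumes a single sample per condition, equal library sizes, and plain Poisson counts with no overdispersion, in which case testing "same expression in both conditions" reduces to an exact two-sided binomial test with success probability 0.5. The function name is made up for illustration, and the result is returned as log10(p) because the p-value for gene B is far too small to store as a float.

```python
import math

def log10_pvalue_equal_rates(count_1, count_2):
    """log10 of a two-sided exact binomial p-value for H0: equal expression.

    Assumption (illustrative only): one sample per condition, equal library
    sizes, Poisson counts. Conditional on the total, count_1 is then
    Binomial(total, 0.5) under the null.
    """
    total = count_1 + count_2
    smaller = min(count_1, count_2)
    # Log-probabilities of observing 0..smaller counts out of `total`.
    log_terms = [
        math.lgamma(total + 1) - math.lgamma(j + 1) - math.lgamma(total - j + 1)
        + total * math.log(0.5)
        for j in range(smaller + 1)
    ]
    # Log-sum-exp keeps the tail sum from underflowing to zero.
    largest = max(log_terms)
    log_tail = largest + math.log(sum(math.exp(t - largest) for t in log_terms))
    # Double the one-sided tail (the null is symmetric) and cap at p = 1.
    log_p = min(0.0, math.log(2.0) + log_tail)
    return log_p / math.log(10)

gene_a = log10_pvalue_equal_rates(3, 6)            # same doubling, few counts
gene_b = log10_pvalue_equal_rates(30_000, 60_000)  # same doubling, many counts
print(f"gene A: log10(p) = {gene_a:.3f}")
print(f"gene B: log10(p) = {gene_b:.1f}")
```

Under these assumptions gene A gives a p-value of about 0.5, while gene B's p-value is astronomically small, even though both genes doubled.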

6.9 years ago by biofalconch ★ 1.1k

0

I think the answer in the linked question, "what is the best p value cuttoff to select differentially expressed genes ?", is reasonable. It's just standard for scientists to look at which genes are most reliably differentially expressed, as judged by p-value. I did see an interesting way of ranking in the following article, however. They ranked each gene by p-value, then by logFC, and then took the mean of those two ranks to get the final rank. That way each gene's rank reflects both parameters.
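
A minimal sketch of that mean-rank idea in Python, as I understand it. The gene names and numbers are invented, and two details are my own assumptions rather than anything taken from the article: genes are ranked by the absolute value of logFC, and ties are not given any special treatment.

```python
def mean_rank(results):
    """Combine a p-value rank and a |log2FC| rank into one mean rank.

    `results` maps gene name -> (p_value, log2_fold_change).
    Rank 1 is best: smallest p-value, largest absolute log2 fold change.
    """
    by_p = sorted(results, key=lambda g: results[g][0])
    by_fc = sorted(results, key=lambda g: -abs(results[g][1]))
    rank_p = {gene: i + 1 for i, gene in enumerate(by_p)}
    rank_fc = {gene: i + 1 for i, gene in enumerate(by_fc)}
    combined = {gene: (rank_p[gene] + rank_fc[gene]) / 2 for gene in results}
    order = sorted(results, key=lambda g: combined[g])
    return order, combined

# Hypothetical DE results: (p-value, log2 fold change)
example = {
    "geneA": (1e-40, 0.3),
    "geneB": (1e-3, 4.0),
    "geneC": (1e-10, 2.5),
    "geneD": (0.04, 0.5),
    "geneE": (1e-20, 3.0),
}

order, combined = mean_rank(example)
for gene in order:
    print(gene, combined[gene])
```

In this toy data geneA has by far the smallest p-value but the smallest fold change, so it drops to the middle of the combined list, while geneE, which is good on both counts, comes out on top.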

6.9 years ago by CMosychuk ▴ 20

0

@biofalconch - thanks for pointing this out. I completely missed that aspect of the difference between fold-change and p-value.
@CMosychuk - thanks for the link to the Cell paper. Perhaps I did not fully understand the answer in the linked question, but it appeared that the OP in that linked question was only asking about the cutoff. Besides the point from biofalconch, I am more interested in knowing whether, AFTER we decide on a cutoff value, which to me means every gene that satisfies the p-value cutoff should be a "reliable" gene with "significant" differential expression (between ctrl and exp), there is additional value in sorting the gene list into "extremely highly reliable", "moderately highly reliable", "highly reliable" and "reliable".

6.9 years ago by CrazyB ▴ 280

score 0 · Answer 1 · 2017-05-25

P-values and the like are somewhat complicated beasts, so I am out of my depth here, and take my answer with every grain of salt you have...

"As long as the gene passes the threshold, it's the fold change that can explain the biology. Is this correct?"

I think this is not true: passing the threshold or not does not guarantee biological significance. A low p-value tells you that your data are unlikely under the null hypothesis, but it does not tell you which is more likely:

the null is true, but your samples were funky or not representative of the population you want to draw conclusions about
the null is false

This is especially true for RNAseq data, as most experiments have a small number of biological replicates. Besides, our thresholds are somewhat arbitrary anyway.

Another layer of complexity is that genes are not sampled equally, as some genes are much more expressed than others. This in turn leads to two potential problems:

1) smaller differences will be called significant for highly expressed genes, while larger differences from lowly or moderately expressed genes will be non-significant.
2) the fold-change for more expressed genes will be more reliably estimated.

Also, interpreting RNAseq log(fold changes) is not straightforward, as they are often moderated: for example, a small number (a pseudocount) may be added to every count to avoid infinite log(fold changes). This means log(fold changes) from genes with zero or low counts will be farther from the "truth" than those from genes with large counts, and this may affect their relative ranking.
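
A small Python sketch of the pseudocount effect, assuming the simplest possible moderation: add a constant (here 1) to both counts before taking the log2 ratio. Real tools use more elaborate shrinkage, so treat the function below as a hypothetical illustration of the direction of the effect, not of any package's actual method.

```python
import math

def moderated_log2fc(count_ctrl, count_exp, pseudocount=1.0):
    """log2 fold change after adding a pseudocount to both counts.

    Assumption: simple additive pseudocount, chosen only for illustration.
    """
    return math.log2((count_exp + pseudocount) / (count_ctrl + pseudocount))

# Every pair except (0, 8) is an exact doubling, i.e. a "true" log2FC of 1.
for ctrl, exp in [(0, 8), (1, 2), (10, 20), (1000, 2000)]:
    print(f"{ctrl:>5} -> {exp:>5}: moderated log2FC = {moderated_log2fc(ctrl, exp):.3f}")
```

The 0 -> 8 gene gets a finite value instead of infinity, and the doubling genes are pulled toward zero more strongly the lower their counts are, so a ranking by moderated log(fold change) will place low-count genes lower than their raw ratios would suggest.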

Finally, I take issue with your "explain the biology" expression, as I think it is rather fuzzy and ill-defined.