Question: Statistical test to compare the number of upregulated genes in a subgroup of genes versus all genes?
0
3.7 years ago by
biostart340
Germany
biostart340 wrote:

Hello,

I have RNA-seq data for two cell conditions. I want to test the hypothesis that significantly upregulated genes are more frequent in a given group (~1000 genes) than among all significantly differentially expressed genes genome-wide (~7000 genes).

Say the subset of interest contains X genes, including X1 upregulated and X2 downregulated. The full set contains Y genes, including Y1 upregulated and Y2 downregulated. How do I calculate the P value? I am looking for a simple equation.

Thank you!

P.S. Previous testing of this hypothesis based on quantitative tests (comparing log2 fold changes) was not very successful, resulting in a statistically significant but quite a small difference of about 0.2-0.3 on the log2 scale (see details in the previous thread here). However, there is a very large difference in the numbers of genes which become upregulated in a given subset versus all genes.

rna-seq • 1.6k views
modified 3.7 years ago by vakul.mohanty240 • written 3.7 years ago by biostart340

I'm not sure I follow. Do you really have 7k differentially expressed genes? It seems like an awful lot. Is it possible that you've got some unaccounted-for bias in your experiment?

I meant all genes which have a statistically significant change (PPDE>0.95). This does not mean they have to have large log2 fold changes.

1

Nonetheless, if half of the genes assayed are 'significant' by some measure, shouldn't you be questioning the validity of that measure?

1

It really depends on the number of genes tested. I don't know where you got the information that 7000 genes = half the genes tested in this case. It could be much more, depending on the organism and whether the OP also includes ncRNA genes in his analysis.

I guess, ideally, the values for all expressed genes should be significant, which is never reached simply because there are not enough replicates, etc. This is different from differentially expressed genes, where you set a threshold on the log2 fold change.

2
3.7 years ago by
vakul.mohanty240
United States
vakul.mohanty240 wrote:

A hypergeometric enrichment test might do the job.
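To make the suggestion concrete, here is a minimal sketch of a one-sided hypergeometric enrichment test in plain Python. It assumes the X subset genes are counted within the Y significant genes, so the question becomes: if X genes were drawn at random from Y genes of which Y1 are upregulated, how likely is it to see X1 or more upregulated ones? The function name and the toy numbers are made up for illustration.

```python
from math import comb

def hypergeom_upper_tail(total, total_up, subset, subset_up):
    """P(K >= subset_up) when `subset` genes are drawn without replacement
    from `total` genes, `total_up` of which are upregulated.

    In the notation of the question: total = Y, total_up = Y1,
    subset = X, subset_up = X1 (assumption: the subset is part of Y).
    """
    k_max = min(subset, total_up)
    # Exact integer arithmetic: sum the numerators, divide once at the end.
    numerator = sum(
        comb(total_up, k) * comb(total - total_up, subset - k)
        for k in range(subset_up, k_max + 1)
    )
    return numerator / comb(total, subset)

# Toy example (invented counts): 20 DE genes, 10 up; subset of 5 with 4 up.
print(hypergeom_upper_tail(20, 10, 5, 4))  # about 0.152
```

A small P value here would suggest that the subset holds more upregulated genes than expected from the overall up/down ratio.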

1
3.7 years ago by
Carlo Yague5.0k
Carlo Yague5.0k wrote:

Just use a Fisher's exact test. The equation is quite simple.

You can do it easily with R or Excel or even online.
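For anyone who wants to see the equation spelled out, below is a rough sketch of a two-sided Fisher's exact test on the 2x2 table (subset vs. rest, up vs. down), using only the Python standard library. With the margins fixed, the probability of seeing k upregulated genes in the subset is C(Y1, k) * C(Y2, X - k) / C(Y, X); the two-sided P value sums this over all k that are no more probable than the observed one. The function name and the example counts are hypothetical, and the sketch assumes the subset genes are included in Y1 and Y2.

```python
from math import comb

def fisher_exact_two_sided(subset_up, subset_down, all_up, all_down):
    """Two-sided Fisher's exact test P value.

    subset_up, subset_down: X1, X2 (counts in the subset of interest)
    all_up, all_down:       Y1, Y2 (counts among all significant genes,
                            assumed to include the subset genes)
    """
    subset = subset_up + subset_down
    total = all_up + all_down

    def weight(k):
        # Unnormalised hypergeometric probability of k upregulated genes.
        return comb(all_up, k) * comb(all_down, subset - k)

    k_min = max(0, subset - all_down)
    k_max = min(subset, all_up)
    observed = weight(subset_up)
    # Sum all tables that are at most as likely as the observed one.
    extreme = sum(
        w for w in (weight(k) for k in range(k_min, k_max + 1))
        if w <= observed
    )
    return extreme / comb(total, subset)

# Toy example (invented counts): subset has 4 up / 1 down,
# all significant genes have 10 up / 10 down.
print(fisher_exact_two_sided(4, 1, 10, 10))  # about 0.303
```

In R the equivalent call would be `fisher.test()` on the same 2x2 matrix of counts.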

Thanks, it looks like it is indeed either Fisher's test or the chi-square test. Do you think Fisher's is better in this case?

1

Yes, in almost all cases Fisher's test is best. It is also used a lot in scientific publications.
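For reference, here is a minimal sketch of the Pearson chi-square alternative on a 2x2 table (subset vs. rest, up vs. down), without continuity correction. The chi-square test is an approximation that is usually considered reasonable only when all expected counts are around 5 or more, so the sketch also returns the smallest expected count. With counts in the hundreds or thousands, as in the question, the approximation would typically land close to the exact test. The function name and toy counts are invented for illustration.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom, no Yates correction).

    Table layout (assumed): a = subset up,  b = subset down,
                            c = rest up,    d = rest down.
    Returns (statistic, p_value, smallest_expected_count).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 d.f.: erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value, min(expected)

# Toy example (invented counts): the small expected counts (2.5) are
# exactly the situation where the exact test is the safer choice.
stat, p_value, smallest = chi_square_2x2(4, 1, 6, 9)
print(stat, p_value, smallest)
```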

0
3.7 years ago by
Whoknows800
Tehran,Iran
Whoknows800 wrote:

Hi,

I don't know how many replicates you have in each condition, but it would be good to have at least 3-4 replicates per condition. You can also tighten your criteria to get fewer significant DE genes, e.g. choose log2(fold-change) >= 1 or >= 2, and try FDR/Q-value < 0.05, < 0.01 or even < 0.001.

Additionally, depending on your RNA-seq pipeline, you can use different tools for finding significant DE genes:

1. TopHat-Cufflinks: it has its own Q-value/FDR for choosing significant DE genes.
2. Bowtie/TopHat - HTSeq: you could use a variety of tools such as DESeq, DESeq2, edgeR or even limma for finding significant DE genes.

However, you could also do it with many other statistical tools or R packages; for that, take a look at this page:

False Discovery Rate Analysis in R
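As an illustration of what an FDR/Q-value cutoff involves, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. This is only one common FDR procedure and not necessarily the one used by the tools listed above; the function name and the example P values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy example (invented P values): keep genes with adjusted P < 0.05.
p_values = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(p_values)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
print(adjusted, significant)
```

In R, `p.adjust(p, method = "BH")` performs the same adjustment.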