Question: Statistical test to compare the number of upregulated genes in a subgroup of genes versus all genes?
biostart (Germany) wrote:

Hello,

I have RNA-seq data for two cell conditions. I want to test the hypothesis that there are more significantly upregulated genes in a given group (~1000 genes) than among all significantly differentially expressed genes genome-wide (~7000 genes).

Say the subset of interest contains X genes, of which X1 are upregulated and X2 downregulated. The set of all genes contains Y genes, of which Y1 are upregulated and Y2 downregulated. How do I calculate the P value? I am looking for a simple equation.

Thank you!

P.S. Previous testing of this hypothesis with quantitative tests (comparing log2 fold changes) was not very successful: it gave a statistically significant but quite small difference of about 0.2-0.3 on the log2 scale (see details in the previous thread here). However, there is a very large difference in the number of genes that become upregulated in the given subset versus all genes.

rna-seq • 1.8k views
Modified 4.2 years ago by vakul.mohanty • written 4.2 years ago by biostart

I'm not sure I follow. Do you really have 7k differentially expressed genes? That seems like an awful lot. Is it possible that you've got some unaccounted-for bias in your experiment?

Reply written 4.2 years ago by russhh

I meant all genes which have a statistically significant change (PPDE>0.95). This does not mean they have to have large log2 fold changes.

Reply written 4.2 years ago by biostart

Nonetheless, if half of the genes assayed are 'significant' by some measure, shouldn't you be questioning the validity of that measure?

Reply written 4.2 years ago by russhh

It really depends on the number of genes tested. I don't know where you got the information that 7000 genes = half the genes tested in this case... It could be much more, depending on the organism and whether the OP also includes ncRNA genes in his analysis.

Reply written 4.2 years ago by Carlo Yague

I guess that, ideally, the values for all expressed genes should be significant, which is never reached simply because there are not enough replicates, etc. This is different from differentially expressed genes, where you set a threshold on the log2 fold change.

Reply written 4.2 years ago by biostart
vakul.mohanty (United States) wrote:

A hypergeometric enrichment test might do the job
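
A minimal sketch of the hypergeometric test for this question, using only the Python standard library. It assumes that the subset of X genes is part of the Y differentially expressed genes, so the P value is the chance of seeing X1 or more upregulated genes among X genes picked at random, without replacement, from Y genes of which Y1 are upregulated. The function name and the counts in the usage lines are hypothetical and not taken from the thread.

```python
from math import comb

def hypergeom_upper_tail(x1, x, y1, y):
    """P(at least x1 upregulated genes among x genes drawn without
    replacement from y genes, y1 of which are upregulated)."""
    if not (0 <= x1 <= x <= y and 0 <= y1 <= y):
        raise ValueError("counts are inconsistent")
    upper = min(x, y1)
    # Exact integer arithmetic: count the favourable draws, divide once.
    favourable = sum(comb(y1, i) * comb(y - y1, x - i)
                     for i in range(x1, upper + 1))
    return favourable / comb(y, x)

# Hypothetical counts, for illustration only.
p_value = hypergeom_upper_tail(x1=600, x=1000, y1=3500, y=7000)
print(f"one-sided P value: {p_value:.3g}")
```

Written out, the sum is P = sum over i from X1 to min(X, Y1) of C(Y1, i) * C(Y - Y1, X - i) / C(Y, X), which may be the kind of "simple equation" asked for in the question.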

Written 4.2 years ago by vakul.mohanty

Carlo Yague (Canada) wrote:

Just use a Fisher's exact test. The equation is quite simple.

You can do it easily with R or Excel or even online.
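
One way to set this up as Fisher's exact test, as a hedged sketch in standard-library Python: build a 2x2 table with genes inside and outside the subset as rows and up/down as columns. This assumes the subset is contained in the full list, so the second row holds Y1 - X1 and Y2 - X2. The two-sided P value sums the probabilities of all tables with the same margins that are no more likely than the observed one. The function name and the counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def weight(k):
        # Number of tables with k in the top-left cell and the same margins.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Integer weights make the "no more likely than observed" check exact.
    extreme = sum(w for w in map(weight, range(k_min, k_max + 1))
                  if w <= observed)
    return extreme / comb(n, col1)

# Hypothetical counts: X1, X2 inside the subset; Y1, Y2 over all DE genes.
X1, X2, Y1, Y2 = 600, 400, 3500, 3500
p_value = fisher_exact_two_sided(X1, X2, Y1 - X1, Y2 - X2)
print(f"two-sided P value: {p_value:.3g}")
```

Library routines such as `fisher.test` in R or `scipy.stats.fisher_exact` in SciPy accept the same kind of 2x2 table and are usually the more convenient choice in practice.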

Modified 4.2 years ago • written 4.2 years ago by Carlo Yague

Thanks, it looks like it is indeed either Fisher's test or chi-square test. Do you think Fisher is better in this case?

Reply written 4.2 years ago by biostart

Yes, in almost all cases Fisher's test is best. It is also used a lot in scientific publications.
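
For comparison, a sketch of the Pearson chi-square test on the same kind of 2x2 table. It relies on a large-sample approximation, whereas Fisher's test is exact for any counts, which is one common argument for preferring Fisher's test when the exact computation is feasible. This version applies no continuity correction, and the function name and counts are hypothetical.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and P value (1 degree of freedom)
    for the table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column needs a non-zero total")
    statistic = n * (a * d - b * c) ** 2 / margins
    # For 1 degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    return statistic, erfc(sqrt(statistic / 2))

# Hypothetical counts: up/down inside the subset, then in the remaining genes.
statistic, p_value = chi_square_2x2(600, 400, 2900, 3100)
print(f"chi-square = {statistic:.2f}, P value = {p_value:.3g}")
```

With counts in the hundreds or thousands, as in the question, the two tests would be expected to lead to the same conclusion; the difference matters mainly when some cells of the table are small.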

Reply written 4.2 years ago by Carlo Yague

Whoknows (Tehran, Iran) wrote:

Hi,

I don't know how many replicates you have in each condition, but it would be great to have at least 3-4 replicates per condition. You can also tighten your criteria to get fewer significant DE genes, e.g. require log2(fold-change) >= 1 or >= 2, and try FDR/Q-value < 0.05, < 0.01 or even < 0.001.
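
A small sketch of this kind of filtering in plain Python. The gene names and values are made up, the q-values are assumed to be already FDR-adjusted by whatever DE tool produced them, and the fold-change threshold is applied to the absolute value so that downregulated genes are kept too (an assumption; the text above only writes log2(fold-change) >= 1).

```python
# Hypothetical DE results: (gene, log2 fold change, FDR-adjusted q-value).
results = [
    ("geneA", 2.4, 0.0004),
    ("geneB", 0.3, 0.0100),
    ("geneC", -1.7, 0.0300),
    ("geneD", 1.2, 0.2000),
    ("geneE", -0.2, 0.0008),
]

def filter_de(results, min_abs_log2fc=1.0, max_q=0.05):
    """Keep genes passing both the effect-size and the FDR threshold."""
    return [(gene, lfc, q) for gene, lfc, q in results
            if abs(lfc) >= min_abs_log2fc and q < max_q]

significant = filter_de(results)
up = [gene for gene, lfc, _ in significant if lfc > 0]
down = [gene for gene, lfc, _ in significant if lfc < 0]
print(f"{len(significant)} DE genes: {len(up)} up, {len(down)} down")
```

The up and down counts obtained this way, for the subset and for all genes, are the X1, X2, Y1 and Y2 that go into the enrichment test.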

Additionally, based on your RNA-SEQ pipeline you can use different tools for finding Sig. DE genes:

  1. Tophat-Cufflinks: it has its own Q-value/FDR for choosing significant DE genes.
  2. Bowtie/Tophat - HTSeq: you can use a variety of tools such as DESeq, DESeq2, edgeR or even limma to find significant DE genes.

However, you could also do this with many other statistical tools or packages in R; for that, take a look at this page:

False Discovery Rate Analysis in R

This paper might help you to learn more about RNA-SEQ statistics:

Normalization, testing, and false discovery rate estimation for RNA-sequencing data

Modified 4.2 years ago • written 4.2 years ago by Whoknows