Question: Significance of the overlap between two sets of DEGs from the same species
17 months ago, sakuraazalea10 wrote:

I am analyzing honeybee RNA-seq data from two different studies.

Study 1 had 15,314 genes total with 118 DEGs. Study 2 had 11,825 genes total with 740 DEGs. There was an overlap of 67 between the two sets of DEGs.

I want to test whether this overlap is significant. One approach I have seen is Fisher's exact test (https://rdrr.io/bioc/GeneOverlap/man/GeneOverlap.html). I am fairly sure I need to set up a 2*2 table but am unclear on the values, especially the first value, Q, below. I believe Q should equal N-(740+118-67), but I am unsure which value of N to use, since there are two different total gene numbers (15,314 and 11,825).

fisher.test(matrix(c(Q, 740-67, 118-67, 67), nrow=2), alternative="greater")

What values should I use in this case? Thank you for any advice.

rna-seq fisher.exact

The link you provided doesn't work. Fisher's exact test is typically set up from a 2*2 contingency table. I would suggest making sure you understand that first, then looking at Fisher's exact test.
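To make the 2*2 layout concrete, here is a minimal sketch in R using the counts from the question. The background size N is the open question in this thread, so the value below is only a placeholder (an assumption, not a recommendation), and the row and column labels are names chosen here for readability.

```r
# Counts reported in the question
n1 <- 118    # DEGs in study 1
n2 <- 740    # DEGs in study 2
k  <- 67     # DEGs found in both studies

# ASSUMPTION: placeholder background size; which N to use is what the thread discusses
N <- 15314

# matrix() fills column by column, so the cell order matches the call in the question:
# Q (DEG in neither), study 2 only, study 1 only, both
tab <- matrix(
  c(N - (n1 + n2 - k), n2 - k,
    n1 - k,            k),
  nrow = 2,
  dimnames = list(study2 = c("not DEG", "DEG"),
                  study1 = c("not DEG", "DEG"))
)

print(tab)

# One-sided test: is the overlap larger than expected by chance?
res <- fisher.test(tab, alternative = "greater")
print(res$p.value)
```

The four cells always sum to N, which is a quick sanity check that Q was computed consistently with the other three values.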

17 months ago, Carlo Yague (Belgium) wrote:

You should first clean up each dataset by removing every gene that is not present in both studies. This can change the number of DEGs identified in each dataset. Then N = the number of genes tested in both studies.
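A rough sketch of this clean-up in R, assuming you have the full list of tested gene IDs and the DEG list for each study as character vectors. The vectors below are toy data with hypothetical names (`genes1`, `genes2`, `deg1`, `deg2`), not the honeybee results. Note that this sketch only filters the existing DEG lists; rerunning the differential expression analysis on the shared genes could change the DEG calls further.

```r
# Toy gene IDs, purely illustrative (NOT the real data)
genes1 <- paste0("g", 1:20)                    # all genes tested in study 1
genes2 <- paste0("g", 6:30)                    # all genes tested in study 2
deg1   <- c("g2", "g7", "g8", "g9", "g10")     # DEGs from study 1
deg2   <- c("g7", "g8", "g9", "g15", "g25")    # DEGs from study 2

# Background: genes tested in both studies
universe <- intersect(genes1, genes2)
N <- length(universe)

# Keep only DEGs that belong to the shared background
deg1_u <- intersect(deg1, universe)
deg2_u <- intersect(deg2, universe)
k <- length(intersect(deg1_u, deg2_u))         # overlap within the background

# Same cell order as in the question: Q, study 2 only, study 1 only, both
tab <- matrix(
  c(N - (length(deg1_u) + length(deg2_u) - k), length(deg2_u) - k,
    length(deg1_u) - k,                        k),
  nrow = 2
)

res <- fisher.test(tab, alternative = "greater")
print(tab)
print(res$p.value)
```

With real data, the counts after filtering may differ from 118, 740 and 67, so all four cells should be recomputed from the filtered lists rather than taken from the original numbers.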

17 months ago, Nicolas Rosewick (Belgium, Brussels) wrote:

You should use the total number of genes in the annotation you used for the gene analysis. Did you rerun both studies through the same analysis workflow with the same annotation, or did you just take the results from the publications? If you reran both, use the total number of genes in your annotation and perform Fisher's exact test as you described in your question.

fisher.test(matrix(c(Q, 740-67, 118-67, 67), nrow=2), alternative="greater")

If you only took the published results, you could perhaps use the union of the 15,314 and 11,825 gene lists. Or, better, rerun the analysis so that both datasets are processed in the same manner, which avoids analysis bias.
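Since the right N is uncertain, one option is to check how sensitive the result is to that choice. The sketch below wraps the test from the question in a small helper (`overlap_test` is a name made up here) and runs it for the two totals mentioned in the question. These are only candidate values: the size of the union cannot be derived from the counts alone, and with the actual gene ID lists it would be `length(union(genes1, genes2))`.

```r
# Hypothetical helper: one-sided Fisher's exact test for a given background size N
overlap_test <- function(N, n1 = 118, n2 = 740, k = 67) {
  tab <- matrix(
    c(N - (n1 + n2 - k), n2 - k,
      n1 - k,            k),
    nrow = 2
  )
  fisher.test(tab, alternative = "greater")$p.value
}

# ASSUMPTION: candidate background sizes taken from the question, for illustration only
candidates <- c(study2_total = 11825, study1_total = 15314)

p_values <- sapply(candidates, overlap_test)
print(p_values)
```

If the conclusion is the same across the plausible range of N, the exact choice matters less; if it changes, that is a reason to rerun both datasets through one workflow before testing.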