Question: Gene list overlap - Null distribution
3.1 years ago by
Wario wrote:

Hello everyone,

This is probably a stupid question but I need help.

I want to calculate the null distribution for the gene overlap between 2 lists.

The first list comes from ChIP-seq data and the second from RNA-seq. The background genome is 20,000 genes. I have this data for 50 samples.

The first sample has a ChIP-seq list of 751 genes and an RNA-seq list of 590 genes.

I tried it in R, but the result looks odd.

ts = replicate(5000,t.test(rnorm(751),rnorm(590))$statistic) 

pts = seq(-3.5, 3.5,length=100)

I formatted your code (using the 101010 button) for readability, but perhaps you should check I did it correctly.

written 3.1 years ago by WouterDeCoster

Thanks, didn't know about that.

written 3.1 years ago by Wario
3.1 years ago by
Sheffield, UK
i.sudbery wrote:

As was mentioned by @Lars Juhl Jensen, the standard null distribution for two gene lists is the hypergeometric distribution. However, this assumes that all genes are independent and equally likely to show up. There are several reasons why this might not be the case:

  • Longer genes are more likely to be called differentially expressed, because they collect more reads and so give you more power to detect a change.
  • You don't say how your ChIP-seq gene list is derived. If it is by overlapping peaks with the gene region, then again, longer genes are more likely to overlap. If you are assigning peaks to genes based on a promoter region or gene territory, are all promoters/territories the same length?
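To give a feel for how much such a bias can matter, here is a rough simulation sketch in R. Everything about the gene lengths is an assumption made up for illustration: they are drawn from a log-normal distribution, and the chance of a gene landing on either list is taken to be proportional to its length. The list sizes and background are the ones from the question. It is not how goseq or GAT work internally, just a way to see that the null shifts.

```r
set.seed(1)

n_genes <- 20000   # background size from the question
n_chip  <- 751     # ChIP-seq list size (first sample)
n_rna   <- 590     # RNA-seq list size (first sample)

# Hypothetical gene lengths, simulated purely for illustration
gene_length <- rlnorm(n_genes, meanlog = 8, sdlog = 1)

# Draw two random gene lists and count the genes they share.
# weights = NULL means every gene is equally likely to be picked.
simulate_overlap <- function(n_sim, n_genes, n_a, n_b, weights = NULL) {
  replicate(n_sim, {
    list_a <- sample.int(n_genes, n_a, prob = weights)
    list_b <- sample.int(n_genes, n_b, prob = weights)
    length(intersect(list_a, list_b))
  })
}

# Null with all genes equally likely (the hypergeometric assumption)
uniform_null <- simulate_overlap(200, n_genes, n_chip, n_rna)

# Null where longer genes are more likely to appear on both lists
length_null <- simulate_overlap(200, n_genes, n_chip, n_rna,
                                weights = gene_length)

mean(uniform_null)
mean(length_null)
```

Under these assumed weights the length-biased null sits at a clearly larger overlap than the uniform one, so an overlap that looks significant against the uniform null may not be once length is taken into account.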

There are a couple of ways around this. First, the package goseq is designed to manage gene length bias in differential expression analysis. While you are not doing GO analysis, the problem is conceptually equivalent.

Alternatively, the program GAT (Genomic Association Tester) tests whether a set of intervals overlaps with another set of intervals more often than you would expect, accounting for length bias, GC content bias, etc.

3.1 years ago by
Copenhagen, Denmark
Lars Juhl Jensen wrote:

You could model this with a simple hypergeometric distribution, if you make the assumption that all genes are equally likely to appear on the two lists.
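As a minimal sketch of what that looks like in R, using the numbers from the question (20,000 background genes, 751 ChIP-seq genes, 590 RNA-seq genes) and base R's dhyper/phyper. The observed overlap is a made-up placeholder, since the question does not give one.

```r
# Null distribution of the overlap between two gene lists under the
# hypergeometric model (all genes equally likely to appear on either list).
n_genes <- 20000   # background genes
n_chip  <- 751     # ChIP-seq gene list size (first sample)
n_rna   <- 590     # RNA-seq gene list size (first sample)

# Probability of every possible overlap size, 0..min(n_chip, n_rna).
# dhyper(x, m, n, k): x successes when drawing k from m "marked" and
# n "unmarked" items; here the ChIP-seq genes are the marked ones.
overlap   <- 0:min(n_chip, n_rna)
null_prob <- dhyper(overlap, n_chip, n_genes - n_chip, n_rna)

# Expected overlap under the null
expected_overlap <- n_rna * n_chip / n_genes

# One-sided p-value: probability of an overlap at least as large as the
# observed one. `observed` is a placeholder, not a value from the question.
observed <- 40
p_value  <- phyper(observed - 1, n_chip, n_genes - n_chip, n_rna,
                   lower.tail = FALSE)

expected_overlap
p_value
```

With 50 samples, the same calculation would be repeated per sample with that sample's list sizes and observed overlap. Plotting null_prob against overlap gives the null distribution itself.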
