Question: Calculate whether the co-occurrence of two TFBSs within 1000 bp of each other is higher than one would expect by chance
JJ430 wrote (10 months ago):

Hi,

How can I calculate whether two TFBSs co-occur more often than one would expect by chance? I would like to test this either in all promoters (1000 bp) or even across the complete genome, with the sites within 1000 bp of one another. I have thought about this statistical problem, but I am not an expert in probability and got stuck.

My first thought was to calculate the random chance that two k-mers of lengths n and m co-occur within 1000 bp. But since I am not an expert in probability, I am not sure how to calculate this. Any suggestions here?
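
For a rough feel of the k-mer version of the question, here is a minimal back-of-the-envelope sketch. It assumes i.i.d. bases with equal frequencies, a single strand, independence between start positions (so self-overlapping k-mers are ignored), and it simplifies "within 1000 bp of each other" to "both in the same 1000 bp window". The function names are made up for illustration.

```python
def p_kmer_in_window(k: int, window: int) -> float:
    """Approximate P(a fixed k-mer occurs at least once in a window).

    Assumes uniform i.i.d. bases (0.25 each) and independent start positions.
    """
    starts = window - k + 1          # number of possible start positions
    if starts <= 0:
        return 0.0                   # the k-mer does not fit in the window
    p_site = 0.25 ** k               # chance of an exact match at one position
    return 1.0 - (1.0 - p_site) ** starts


def p_both_in_window(n: int, m: int, window: int = 1000) -> float:
    """Approximate P(both k-mers occur in the same window), assuming independence."""
    return p_kmer_in_window(n, window) * p_kmer_in_window(m, window)


# Example: an 8-mer and a 10-mer in one 1000 bp window.
print(p_both_in_window(8, 10))
```

Real genomes are far from uniform (GC content, repeats, CpG islands), so this is only a baseline to get an order of magnitude, not a null model to publish with.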

Then I thought that this random chance might be unsuitable, as motifs are not simple k-mers but letter-probability matrices (MEME format), so randomising the matrices might be a better idea. Maybe the best way would be to randomly switch around the values of each row? Would this be an acceptable approach? Is there a tool for something like this?
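
To make the shuffling idea concrete, here is a minimal sketch of two possible randomisations. It assumes the matrix is stored as a list of rows, one row per motif position, with four probabilities in A, C, G, T order as in a MEME letter-probability matrix. The function names and the example motif are hypothetical, and whether either null is appropriate for a given analysis is a separate question.

```python
import random


def shuffle_within_positions(matrix, seed=0):
    """Permute the four base probabilities inside each position (row)."""
    rng = random.Random(seed)
    shuffled = []
    for row in matrix:
        new_row = list(row)          # copy so the input is not modified
        rng.shuffle(new_row)
        shuffled.append(new_row)
    return shuffled


def shuffle_positions(matrix, seed=0):
    """Permute the order of the positions (rows), keeping each row intact."""
    rng = random.Random(seed)
    rows = [list(row) for row in matrix]
    rng.shuffle(rows)
    return rows


# Hypothetical 4-position motif; each row sums to 1.
motif = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.02, 0.03, 0.05, 0.90],
]

print(shuffle_within_positions(motif, seed=1))
print(shuffle_positions(motif, seed=1))
```

Both variants keep the set of probabilities at every position, so the per-position information content (against a uniform background) is unchanged; only the preferred bases or their order move around.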

It gets even more complicated, as for one TF I have three matrices which are quite similar, and I am not sure how to handle this. Any suggestions here?

Any insight is HIGHLY appreciated! Thank you :)

Alex Reynolds wrote (10 months ago):

You might be able to use a hypergeometric test to calculate the odds of seeing co-occurrence of two TFs in promoters, as compared with co-occurrence of those two TFs across the entire genome (perhaps minus non-mappable regions or other regions where you would not expect to have countable TF binding sites).

In R, you could use phyper:

> phyper(a, b, c, d)

I think the following might work:

1. a = number of observations of co-occurrence of two TFs in 1k window over all promoters (number of times you see a pairing over all proximal promoters)
2. b = number of observations of co-occurrence of two TFs over 1k windows over whole genome (number of times you see a pairing over all whole-genome 1k windows)
3. c = number of observations of non-co-occurrence of two TFs over 1k windows over whole genome (number of times you don't see a pairing over all whole-genome 1k windows)
4. d = number of observations of co-occurrence and non co-occurrence of two TFs in 1k window over all promoters (total observations over promoters, i.e. total number of proximal promoters)
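
One detail worth keeping in mind with these four numbers: called as phyper(a, b, c, d), R returns the lower tail P(X <= a) by default, whereas an enrichment p-value is usually the upper tail P(X >= a), which in R would be phyper(a - 1, b, c, d, lower.tail = FALSE). As a cross-check outside R, here is a minimal pure-Python sketch of that upper tail using the same a, b, c, d roles as the list above. The function name and the counts are made up for illustration.

```python
from math import comb


def hypergeom_upper_tail(a: int, b: int, c: int, d: int) -> float:
    """P(X >= a) for d draws without replacement from b 'pair' and c 'no pair' windows."""
    if d > b + c:
        raise ValueError("cannot draw more windows than exist")
    total = comb(b + c, d)                       # all ways to pick d windows
    lo, hi = max(a, 0), min(b, d)
    favourable = sum(comb(b, x) * comb(c, d - x) for x in range(lo, hi + 1))
    return favourable / total


# Hypothetical counts, for illustration only.
a, b, c, d = 3, 5, 5, 4
print(hypergeom_upper_tail(a, b, c, d))
```

The exact binomial coefficients get slow for genome-scale counts, so this is only meant for checking small examples; for real data phyper or an equivalent library routine is the practical choice.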

I might use bedops --chop to make 1k windows over the genome, bedops --difference to excise unmappable regions, bedops --range to make 1k windows upstream of the TSS (proximal promoters), and use of bedmap --echo-map-id with some awk scripting could help with counts of TF hit pairs-of-interest over proximal promoter regions, and counts of TF hit pairs-of-interest over the genome. Etc. Bringing counts into R should provide the p-value.
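
To see what numbers such a pipeline should produce, the counting step can be mocked up in plain Python. This is not the bedops/bedmap workflow itself, only a sketch under simplifying assumptions: one chromosome, fixed non-overlapping 1 kb windows, and each TF hit reduced to a single 0-based start coordinate. All names and coordinates are hypothetical.

```python
def windows_with_hits(positions, window=1000):
    """Return the set of window indices (position // window) containing a hit."""
    return {pos // window for pos in positions}


def cooccurrence_counts(tf1_positions, tf2_positions, n_windows, window=1000):
    """Count windows with both TFs and windows without the pairing."""
    w1 = windows_with_hits(tf1_positions, window)
    w2 = windows_with_hits(tf2_positions, window)
    both = len(w1 & w2)              # windows where both TFs have a hit
    return both, n_windows - both


# Hypothetical hit coordinates on a 10 kb toy chromosome (10 windows).
tf1_hits = [100, 1500, 5200]
tf2_hits = [900, 3100, 5900]
print(cooccurrence_counts(tf1_hits, tf2_hits, n_windows=10))
```

Run once over genome-wide windows and once over promoter windows, this yields the pairing and non-pairing counts that go into the test.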

I'm not sure that comparing or counting kmers directly would work here, because DNA binding is imprecise and TFs will bind even if the region of DNA isn't a perfect match with the consensus sequence. That's why sequence logos have some bases at different heights, or information content levels, because the probability of seeing a particular base at a position in a binding site isn't 100%. If it was, we wouldn't need logos.

I think the hypergeometric distribution is correct. However, I think you want to set up a 2x2 table based on presence/absence of the TFs. Something like: a: number of promoters with both TFs; b: number of promoters containing a TF1 BS; c: number of promoters with TF1 or TF2 present; d: number of promoters containing TF2. See this CV post for a little more detail: https://stats.stackexchange.com/questions/10328/using-rs-phyper-to-get-the-probability-of-list-overlap

You could apply this to promoters and then to the 1kb bins of the entire genome.

Thank you Alex Reynolds, genome vs. promoter region is also an interesting way of looking at it. I will do that too.

And also ejm32 - thank you for the link to the post. Based on this link, I searched some more and I found this explanation, which also describes it very well! Thanks.

The only thing I am still slightly unsure about is the genes without any TFBS in their promoter. Are these a separate class (the hypergeometric test only handles two classes), or are they considered as "not picked"? You wrote c: number of promoters with TF1 or TF2 present, so do I have to exclude the genes without any TFBS?

I would have considered these genes as part of the domain and TF1 & TF2 as the subsets. Hence, I would have put: a: (number of promoters with both TFs -1); b: number of promoters containing TF1 BS, c: total number of genes - number of promoters containing TF1 BS, d: number of promoters containing TF2.

You can put genes without TFBS1 or TFBS2 into the contingency table. My concern is that I feel it makes small overlaps significant. Furthermore, if we go back to the balls-in-an-urn analogy: the white balls = TF1 and the black balls = TF2. My advice: do it both ways. I'll read your linked post and see if it sways me one way or the other.
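
Doing it both ways is cheap to try. Here is a minimal sketch that computes the upper-tail hypergeometric p-value for the overlap of two promoter sets, once with all genes as the population and once with only the genes that carry TF1 or TF2. The gene IDs and set sizes are invented, and the function name is hypothetical.

```python
from math import comb


def overlap_p_value(tf1: set, tf2: set, population: set) -> float:
    """Upper-tail hypergeometric P(overlap >= observed) within `population`."""
    tf1, tf2 = tf1 & population, tf2 & population
    observed = len(tf1 & tf2)
    successes = len(tf1)                         # promoters with a TF1 site
    failures = len(population) - len(tf1)        # promoters without a TF1 site
    draws = len(tf2)                             # promoters with a TF2 site
    total = comb(successes + failures, draws)
    tail = sum(
        comb(successes, x) * comb(failures, draws - x)
        for x in range(observed, min(successes, draws) + 1)
    )
    return tail / total


# Toy data: 20 genes, 6 with TF1, 6 with TF2, 3 with both.
all_genes = {f"g{i}" for i in range(20)}
tf1_genes = {f"g{i}" for i in range(0, 6)}
tf2_genes = {f"g{i}" for i in range(3, 9)}

p_all = overlap_p_value(tf1_genes, tf2_genes, all_genes)
p_bound_only = overlap_p_value(tf1_genes, tf2_genes, tf1_genes | tf2_genes)
print(p_all, p_bound_only)
```

In this toy case the same overlap gets a smaller p-value when the TFBS-free genes are counted in the population than when they are excluded, which is the effect the concern above is about.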

Thank you very much for your input!!!

The idea with comparing genome vs promoter is that, in the ball-urn metaphor, the ball you're interested in is an event of co-occurrence, while the ball you're not interested in is an event of non-co-occurrence. The urn or "background" is the genome, from which you are sampling without replacement (promoters). Comparing frequencies of one TF versus a second TF seems to be asking a different question from co-occurrence, I think, but definitely tune this to what you think is appropriate, of course.

In the promoter vs non-promoter comparison you are not testing whether the TFs co-occur more frequently than expected. You are testing whether the two TFs co-occur more frequently in promoters than in the rest of the genome. To me this assumes that the co-occurrence of TF1/TF2 has already been established, which has not been done yet. I think once an association is found, OP can ask other questions related to where the co-occurrence is most enriched. An even better comparison might be to use DHS peaks/hotspots instead of promoters, or to compare promoters vs all other DHS (enhancers/cis-regulatory elements), or DHS vs non-DHS.

Yes, of course these are two completely different questions, both interesting.

A third question I am interested in is whether co-occurrence within a certain distance is higher than chance, say within 100 bp of one another in all promoters (or even the genome). These would then be a subset of the co-occurring ones. TFBSs close to each other might indicate that the TFs that bind to them are more likely to act synergistically. Any ideas how to handle this statistically? Thanks again for your input!
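
One way to get started on the distance-restricted version is to count, per promoter, whether any TF1 site lies within the chosen distance of a TF2 site, and then feed that count into the same kind of test as for plain co-occurrence. A minimal sketch of the counting part, assuming each site is reduced to one coordinate and distances are measured between those coordinates (names and data are hypothetical):

```python
from bisect import bisect_left


def has_close_pair(tf1_positions, tf2_positions, max_dist=100):
    """True if any TF1 hit lies within max_dist bp of any TF2 hit."""
    sorted_tf2 = sorted(tf2_positions)
    for pos in tf1_positions:
        # First TF2 hit at or after the left edge of the allowed interval.
        i = bisect_left(sorted_tf2, pos - max_dist)
        if i < len(sorted_tf2) and sorted_tf2[i] <= pos + max_dist:
            return True
    return False


def count_promoters_with_close_pair(hits_by_promoter, max_dist=100):
    """hits_by_promoter maps promoter ID -> (TF1 positions, TF2 positions)."""
    return sum(
        has_close_pair(tf1, tf2, max_dist)
        for tf1, tf2 in hits_by_promoter.values()
    )


# Hypothetical promoters with site coordinates relative to the promoter start.
promoters = {
    "geneA": ([100, 900], [450, 980]),   # 900 and 980 are 80 bp apart
    "geneB": ([100], [450]),             # co-occur, but 350 bp apart
    "geneC": ([300], []),                # only TF1 present
}
print(count_promoters_with_close_pair(promoters))
```

What the right background is for this count (all promoters, or only promoters where both TFs occur at all) is exactly the kind of choice discussed in the comments above.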

On another note, would overlapping bins screw up the statistics?

Overlapping bins would complicate things, I think, because they would violate the independent-draws assumption.
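
A small sketch of why that is: with overlapping bins the same site lands in more than one bin, so neighbouring bins share observations and stop being separate draws. The window size, step, and coordinate below are arbitrary illustration values.

```python
def windows_containing(pos, genome_len, size=1000, step=1000):
    """List the (start, end) windows, half-open, that contain position pos."""
    return [
        (start, start + size)
        for start in range(0, genome_len - size + 1, step)
        if start <= pos < start + size
    ]


site = 1200
print(windows_containing(site, genome_len=5000, step=1000))  # non-overlapping bins
print(windows_containing(site, genome_len=5000, step=500))   # bins overlapping by 50%
```

With a step equal to the window size the site is counted once; with a step of half the window size it is counted twice, and any TF pair inside it is counted twice along with it.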