Question

Interpreting TFBS enrichment from random genomic regions

2.6 years ago

Notorious ATG • 0

Hi all,

To make a long story short, my lab has developed a computational method for analyzing WGBS data that results in genomic regions of interest. These are regions defined by sequencing reads with mixed methylation states (i.e. consecutive CpGs having non-matching methylation). My boss wants me to use some kind of computational tool to evaluate whether these regions contain transcription factor binding site motifs, the idea being to link these epigenetic states with some kind of transcription factor/DNA binding protein.

I've looked at the data produced by our lab's method longer than anyone, even the program's creator, and I'm convinced that the output is basically meaningless from a biological perspective (at least in non-cancerous tissue).

Regions generated by this approach are always 100bp in length, and usually number in the mid-thousands across the genome. When I feed these regions into programs like HOMER, anything in the MEME Suite (e.g. SEA, MEME-ChIP, AME) or even oPOSSUM, I get a variety of enriched TFBS motifs within my regions. Some of these methods allow me to upload background sequences (chosen as equally sized regions with equivalent GC% and CpG density), and this makes no real difference in the number of enriched motifs I find.
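For reference, a minimal sketch of what "equivalent GC% and CpG density" could mean when pairing a foreground region with a background candidate. The function names and the 5% GC tolerance are assumptions made for illustration, not the settings of any of the tools above.

```python
def gc_fraction(seq):
    """Fraction of G or C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_count(seq):
    """Number of CpG dinucleotides on the given strand."""
    return seq.upper().count("CG")


def is_matched(fg_seq, bg_seq, gc_tol=0.05):
    """Hypothetical matching rule: same length, same CpG count, GC% within gc_tol."""
    return (
        len(fg_seq) == len(bg_seq)
        and cpg_count(fg_seq) == cpg_count(bg_seq)
        and abs(gc_fraction(fg_seq) - gc_fraction(bg_seq)) <= gc_tol
    )


# Toy sequences, invented for the example
foreground = "ACGTTACGGA"
candidate = "TCGAAGCGTT"
print(is_matched(foreground, candidate))
```

Matching on these summary statistics alone does not guarantee matched k-mer composition, which is one reason two "equivalent" sets can still differ in motif content.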

This would be exciting, except for the fact that I get these kinds of results even if I choose an equal number of totally random genomic regions (with equivalent properties), essentially comparing one background set to another.
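The random-versus-random comparison can be turned into an explicit empirical null. A rough sketch, assuming the enrichment tool has been run on many random foreground sets against the same background and the number of motifs called enriched was recorded each time. All counts below are invented for illustration.

```python
def empirical_p(observed, null_values):
    """Share of null runs at least as extreme as the observed value.

    The +1 in numerator and denominator keeps the estimate above zero
    when no null run reaches the observed value.
    """
    at_least = sum(1 for v in null_values if v >= observed)
    return (at_least + 1) / (len(null_values) + 1)


# Hypothetical: motifs called "enriched" in each of 20 random-vs-random runs
null_counts = [12, 9, 15, 11, 14, 10, 13, 8, 16, 12,
               11, 9, 14, 13, 10, 12, 15, 11, 9, 13]

# Hypothetical: motifs called "enriched" for the real regions
observed_count = 14

p = empirical_p(observed_count, null_counts)
print(f"empirical p = {p:.3f}")
```

If the real regions sit well inside the null distribution, the enriched-motif count by itself gives no evidence that they differ from random regions with the same properties.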

I guess you can probably tell that I'm looking for help in winning an argument, but I don't have anyone in my lab or circle of collaborators who can help me with this, and I'm pretty sure my time is being wasted in this pursuit.

So I'll ask a simple question: can these TFBS motif-finding tools distinguish between biologically meaningful data (e.g. ChIP-seq peaks) and random genomic regions with equivalent properties? Is it inevitable that any set of, say, 5000 100bp regions with >= 2 CpGs will have some TFBS motifs enriched relative to randomly chosen regions with equivalent properties?
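Part of the "inevitable" question is plain multiple testing. The sketch below is a simulation, not output from any of the tools: it assumes a collection of 800 motifs and a well-calibrated test, so that p-values under the null are uniform and independent (real motif tests are correlated, so this is only a rough picture). It compares raw p < 0.05 hits with hits after Benjamini-Hochberg adjustment.

```python
import random


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


random.seed(0)
n_motifs = 800  # assumed size of the motif collection
null_pvals = [random.random() for _ in range(n_motifs)]  # no real signal

raw_hits = sum(p < 0.05 for p in null_pvals)
bh_hits = sum(q < 0.05 for q in benjamini_hochberg(null_pvals))
print(f"raw p < 0.05: {raw_hits} motifs; BH q < 0.05: {bh_hits} motifs")
```

With no signal at all, roughly 5% of the motifs are expected to pass an unadjusted 0.05 cutoff. Hits that survive adjustment on truly random regions would instead point to a mismatch between foreground and background that the test's null model does not capture.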

Thanks for reading to the end, and thanks in advance for any input on this!

opossum meme homer transcription factor tfbs • 522 views

updated 2.6 years ago by ATpoint • written 2.6 years ago by Notorious ATG

Shouldn't HOMER select background regions that are equivalent in properties to the foreground ones? I'm wondering why you get significant results when random regions produce the same.

You could try integrating other data such as ATAC-seq to see whether these motifs fall in open chromatin regions. That would make it more likely that the motifs are meaningful, since they would be accessible to transcription factors. If the motifs are always in closed chromatin, that would probably argue against relevance. What is the distance from the motifs to the closest active TSS: is it something "meaningful" such as (just saying) within 200kb, or are the motifs far away from any active TSS? And if all this is not enough, then do some experiments. I guess you have a hypothesis linking these motifs to something, like gene expression, a phenotype, anything that can be measured? Maybe a CRISPR-based screen with dCas9: you recruit dCas9 to these sites, hence blocking them, and then see whether the measured phenotype changes. That would probably be the most meaningful way to demonstrate relevance for these sites. Just thinking aloud here.
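The two computational checks above (open-chromatin overlap and distance to the nearest active TSS) could be sketched roughly like this. It assumes a single chromosome, half-open coordinates, and ATAC-seq peaks that are sorted and non-overlapping. All coordinates and function names are made up for the example.

```python
from bisect import bisect_left


def overlaps_any(region, peaks, peak_starts):
    """True if half-open (start, end) overlaps any sorted, non-overlapping peak."""
    start, end = region
    i = bisect_left(peak_starts, end)  # peaks[i:] begin at or after the region end
    return i > 0 and peaks[i - 1][1] > start


def nearest_distance(pos, tss_sorted):
    """Distance from pos to the closest TSS in a sorted, non-empty list."""
    i = bisect_left(tss_sorted, pos)
    candidates = tss_sorted[max(i - 1, 0): i + 1]
    return min(abs(pos - t) for t in candidates)


# Hypothetical data on one chromosome
atac_peaks = [(1000, 1500), (5000, 5600), (20000, 20400)]
active_tss = [1200, 30000]
regions = [(1400, 1500), (3000, 3100), (5590, 5690), (25000, 25100)]

peak_starts = [p[0] for p in atac_peaks]
in_open = [overlaps_any(r, atac_peaks, peak_starts) for r in regions]
tss_dist = [nearest_distance((s + e) // 2, active_tss) for s, e in regions]

print(f"fraction in open chromatin: {sum(in_open) / len(regions):.2f}")
print("distance to nearest active TSS:", tss_dist)
```

The same two numbers computed for matched random regions would give the baseline to compare against.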

2.6 years ago by ATpoint