I am doing motif discovery on a set of ChIP-seq peaks using GADEM in R. My question is purely theoretical:

After the motif discovery, I ran an analysis with my discovered motif (which I annotated afterwards with the MotIV package) to see whether it matches a set of gene sequences (promoter regions only). The point is that I have genes with 1 or 2 binding sites and genes with more than 100 sites, so I want to be as stringent as I can with my list. So the question is: how many TF binding sites do I have to consider before I can say that a particular gene is regulated by this TF motif? I have been searching the literature all day and didn't find anything. Could someone point me to where I can read about this?

An extra layer of information that is now available and worth considering is DNase I hypersensitivity and digital footprinting data (generated systematically in a variety of human and mouse cell types by ENCODE). The idea is that, in a particular cell type, regions that are transcriptionally active will be accessible and sensitive to DNase, and at a fine scale, the actual sites bound by TFs will be protected from cleavage.

See these papers:

Thanks Adrian, I guess I have to begin from there

Unfortunately, there's no actual answer to your question. Whether a given binding site is actually used will depend heavily on cell type, developmental context, and the nearby chromatin state. It could be that the genes with 100 binding sites in their promoter regions are never modulated by the transcription factor you're interested in, at least in the context you're looking at. The only way to know would be the usual wet-lab experiments (and some promoter-bashing if you really want to know which site or sites are being used).

This question is impossible to answer with bioinformatics alone. Even ChIP-seq can mislead you. Depending on the transcription factor, you could have a binding event regulating a gene that is far from the nearest promoter. There is often a correlation between TF occupancy and nearby gene expression, but there are certainly important long-range interactions in higher eukaryotes.

You could check whether anyone has used RNAi to knock down your factor in the animal or in any cell line while monitoring occupancy with ChIP-seq and measuring expression. This combination of data could be more informative in deciphering which sites may be occupied in a given context, and which genes the bound factor may be regulating.
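If such a knockdown dataset exists, a first look could be a simple cross-tabulation of "has an occupied site" against "changes expression after knockdown". This is only a minimal sketch: the function name, the gene identifiers, and the three input lists are hypothetical placeholders, and how you call a gene "bound" or "responsive" is an upstream decision that this code does not make for you.

```python
def cross_tabulate(all_genes, bound_genes, responsive_genes):
    """Return a 2x2 table of bound/unbound vs responsive/unresponsive genes.

    all_genes        : every gene considered (the universe)
    bound_genes      : genes with at least one occupied site (assumed input)
    responsive_genes : genes whose expression changes on knockdown (assumed input)
    """
    universe = set(all_genes)
    bound = set(bound_genes) & universe
    responsive = set(responsive_genes) & universe
    return {
        "bound_responsive": len(bound & responsive),
        "bound_unresponsive": len(bound - responsive),
        "unbound_responsive": len(responsive - bound),
        "unbound_unresponsive": len(universe - bound - responsive),
    }


# Toy, made-up gene names purely for illustration.
example_table = cross_tabulate(
    all_genes=["g1", "g2", "g3", "g4", "g5", "g6"],
    bound_genes=["g1", "g2", "g3"],
    responsive_genes=["g2", "g3", "g4"],
)
print(example_table)
```

The genes in the "bound and responsive" cell are candidates only; an indirect effect of the knockdown would land in the same cell, so the table narrows the list rather than proving regulation.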

The genome-wide DNase data could be telling, particularly if hypersensitivity flanks a large number of your motifs on genes you hypothesize may be regulated by your factor, especially in a way that correlates with expression.
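As a rough illustration of using the DNase data as a filter, the sketch below counts, per gene, how many motif hits fall inside a hypersensitive region. It assumes you have already exported your motif hits and the DNase peaks for the relevant cell type as plain tuples with half-open, BED-style coordinates; the function names and the toy coordinates are made up for the example, and the linear scan is only meant for small inputs.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def count_accessible_sites(motif_hits, dhs_regions):
    """Count total and DNase-accessible motif hits per gene.

    motif_hits  : iterable of (gene, chrom, start, end)   (assumed layout)
    dhs_regions : iterable of (chrom, start, end)         (assumed layout)
    Returns {gene: (total_hits, hits_overlapping_a_dhs)}.
    """
    dhs_regions = list(dhs_regions)
    counts = {}
    for gene, chrom, start, end in motif_hits:
        total, accessible = counts.get(gene, (0, 0))
        in_open_chromatin = any(
            chrom == d_chrom and overlaps(start, end, d_start, d_end)
            for d_chrom, d_start, d_end in dhs_regions
        )
        counts[gene] = (total + 1, accessible + int(in_open_chromatin))
    return counts


# Toy coordinates, not real data.
example_hits = [
    ("geneA", "chr1", 100, 112),
    ("geneA", "chr1", 300, 312),
    ("geneB", "chr1", 1000, 1012),
]
example_dhs = [("chr1", 90, 150), ("chr1", 990, 1100)]
example_counts = count_accessible_sites(example_hits, example_dhs)
print(example_counts)
```

A gene with 100 motif hits of which none sit in accessible chromatin is arguably a weaker candidate in that cell type than a gene with 2 hits that are both accessible, which is the point of looking at this layer rather than at raw site counts.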


