Question

How To Calculate Over-Representation Of TFBS Of A Single TF Per Gene

2

11.3 years ago

gozuyasli ▴ 20

I am trying to locate binding sites for a single, specific transcription factor in sequences of over 100 kb each, for ~1000 genes. However good the binding matrix is, and however much I minimize the false positive rate, every matrix has a specific error rate, so in sequences this long a binding site will be found in every gene. I therefore want to find the genes whose regulatory sequence is enriched for that specific binding site.

Which test should I use for such an enrichment analysis, and how?

I can calculate the number of hits per gene in the test genes, and I approximately know the error rate of the binding matrix per kb for a given similarity cut-off (given in the Transfac database).

Thanks for the help.

enrichment transcription statistics genomics prediction • 4.6k views

updated 11.3 years ago by md5sum ▴ 50 • written 11.3 years ago by gozuyasli ▴ 20

0

If I understand correctly, you are looking at ~1000 sequences of 100 kb each. If so, this is probably inadvisable, as sequences of this length are likely to cover the regulatory regions of other genes. The problem is chiefly that a lot of background noise will come from genes other than the gene of interest. Apologies if I misunderstood!

Reply • 11.3 years ago by Ian 6.0k

score 0 · Answer 1 · 2013-01-08

You should have a look at the new oPOSSUM3 tool published recently (http://www.ncbi.nlm.nih.gov/pubmed/22973536).

It is a web-based system for the detection of over-represented conserved transcription factor binding sites and binding site combinations in sets of genes or sequences.

http://opossum.cisreg.ca/oPOSSUM3/

score 0 · Answer 2 · 2013-01-08

GREAT (Genomic Regions Enrichment of Annotations Tool) might also be helpful.

You should also try these:
CisFinder - tool for finding over-represented short DNA motifs.
F-Match - tool for identifying statistically over-represented transcription factor binding sites (TFBS) in a set of sequences compared against a control set.

score 0 · Answer 3 · 2013-01-08

GREAT is actually something different: it associates the genomic regions you are interested in with genes, and then performs an enrichment analysis on those genes for GO terms, pathways and such.

I had a quick look at oPOSSUM, but as far as I understand it finds the transcription factor binding sites that are enriched in a group of genes. So it will tell me whether this group of genes is regulated by this factor or not, but it won't tell me whether each of these genes is enriched for this TF's binding site.

What I wanted was a much simpler version of oPOSSUM. For instance, I found 16 binding sites for TF1 in the regulatory sequence of geneA, and I would expect to find 12 binding sites in any sequence of this size just by chance. So is the regulatory sequence of geneA really enriched for TF1 binding sites, or do I observe this number of binding sites only because of false positives?
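One way to put a number on the example above, offered as a rough sketch and not as something proposed in this thread: treat chance hits as a Poisson process, take the expected count as the false positive rate per kb times the sequence length, and ask how likely it is to see at least the observed count. The Poisson model, the function names, and the 0.12 hits/kb over 100 kb figures are assumptions chosen so that the expectation comes out at 12; only the counts 16 and 12 come from the post.

```python
import math


def expected_chance_hits(fp_rate_per_kb: float, length_kb: float) -> float:
    """Expected number of false positive hits in a sequence (hypothetical helper)."""
    return fp_rate_per_kb * length_kb


def poisson_upper_tail(observed: int, expected: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    # Accumulate P(X = k) for k = 0 .. observed - 1, then take the complement.
    term = math.exp(-expected)  # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k  # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


# Assumed values for illustration: 0.12 false hits per kb, 100 kb of sequence.
expected = expected_chance_hits(0.12, 100.0)
p_value = poisson_upper_tail(16, expected)
print(f"expected by chance: {expected:.1f}, P(X >= 16) = {p_value:.3f}")
```

Under these assumptions the p-value comes out at roughly 0.16, so 16 hits against an expectation of 12 would not look like convincing enrichment on its own.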

In the end I decided to use a hypergeometric test: count the number of hits and non-hits in the test sequence, calculate the number of hits expected by chance from the false positive error rate, and then apply Fisher's exact test. This would give me a p-value for enrichment.
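The procedure described above could be sketched roughly as follows. This assumes a 2x2 table with the observed sequence in one row and a same-sized "chance" sequence in the other, where non-hits are candidate positions without a hit. The 100000 positions, the rounding of the expected count to an integer, and the function name are all assumptions for illustration. The one-sided Fisher's exact p-value is computed as the hypergeometric upper tail.

```python
import math


def fisher_upper_tail(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns the probability of seeing a or more in the top-left cell
    when all row and column totals are held fixed.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = math.comb(row1 + row2, col1)
    k_max = min(row1, col1)
    numer = sum(
        math.comb(row1, k) * math.comb(row2, col1 - k)
        for k in range(a, k_max + 1)
    )
    return numer / denom


# Assumed values for illustration: 100000 candidate positions per sequence.
positions = 100_000
observed_hits = 16
expected_hits = round(12.0)  # the table needs integer counts

p_value = fisher_upper_tail(
    observed_hits, positions - observed_hits,
    expected_hits, positions - expected_hits,
)
print(f"one-sided Fisher p-value: {p_value:.3f}")
```

With the 16 versus 12 example this gives a p-value near 0.29. The table treats the expected count as if it were a second noisy sample, so this setup tends to be more conservative than comparing the observed count directly against a fixed expectation.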

Are there any better approaches to this kind of problem?

score 0 · Answer 4 · 2013-01-09

11.3 years ago

md5sum ▴ 50

MAST, which is part of the MEME suite, tests for enrichment of a single motif/PSSM/TFBS in a single sequence.
