How To Avoid False Negatives And False Positives When Searching For TFBS?
13.2 years ago
Anima Mundi ★ 2.9k

Hello,

I would like to know how to avoid false negatives and false positives when searching for TFBS. I tested some tools for transcription factor binding site prediction on the 300, 500, 1000, 1500 and 2000 bp upstream of the transcription start site of the mouse Nanog gene. I used matrices for Sox2, FoxD3 and Stat3, plus some others as a negative control. I have never obtained a result that truly reflects the known situation at this promoter.
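
For concreteness, here is a minimal sketch of the kind of scan I am running (the count matrix and the promoter string below are toy placeholders, not the real Sox2 model or the Nanog upstream sequence):

```python
import math

counts = {  # hypothetical toy count matrix: one list of per-position counts per base
    'A': [8, 1, 1, 9],
    'C': [1, 1, 8, 0],
    'G': [0, 8, 0, 1],
    'T': [1, 0, 1, 0],
}
n_sites, bg = 10, 0.25            # sequences behind the counts; uniform background

def log_odds(base, pos):
    # probability with a small pseudocount, converted to a log2 odds score
    p = (counts[base][pos] + 0.25) / (n_sites + 1)
    return math.log2(p / bg)

def scan(seq, cutoff):
    # slide the matrix along the sequence and report windows above the cutoff
    width = len(counts['A'])
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(log_odds(seq[i + j], j) for j in range(width))
        if score >= cutoff:
            hits.append((i, round(score, 2)))
    return hits

promoter = "TTAGCAGCAATTAGCATGCA"  # stand-in for the real upstream sequence
print(scan(promoter, cutoff=4.0))  # lowering the cutoff yields more (mostly false) hits
```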

transcription binding mouse conversion • 13k views

Which tools did you try?

13.2 years ago

The short answer is you cannot avoid both types of error.

In order to avoid false negatives (increase the sensitivity of your test), you will have to allow more false positives (decrease the specificity). Conversely, in order to improve the specificity, you will have to take a hit in the sensitivity.

Think of the extremes for an illustration of this. If I were to call every possible site as a true binding site, I would have no false negatives (for I could not possibly have misclassified something as 'negative' if I have no negatives), but I would have many, many false positives. On the other hand, if I called no positives whatsoever, I could guarantee that I have no false positives, but I would have a large number of false negatives.

Clearly, no real test reflects exactly the situation I've described, but every test that you design is a trade-off between the sensitivity and the specificity. I'm afraid that there is no way of maximising both simultaneously.
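
As a tiny numeric illustration of this trade-off (the scores below are invented, not from any real scan): as the cutoff is lowered, sensitivity rises while specificity falls.

```python
true_sites = [9.1, 7.4, 6.2, 5.0]                 # scores at real binding sites
non_sites  = [6.5, 5.8, 4.9, 4.1, 3.3, 2.7, 1.9]  # scores at background positions

for cutoff in (8, 6, 4, 2):
    tp = sum(s >= cutoff for s in true_sites)     # real sites called positive
    fn = len(true_sites) - tp                     # real sites missed
    fp = sum(s >= cutoff for s in non_sites)      # background called positive
    tn = len(non_sites) - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f"cutoff={cutoff}: sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```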


Yes, exactly, this is the problem. However, you can try to improve your specificity by comparative analysis (e.g. only consider predicted binding sites that you also find in rat, and perhaps other species).
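
For example, something along these lines; the hit lists and coordinates are invented, and in practice the positions would first have to be mapped through a mouse-rat alignment:

```python
# Predicted (position, factor) hits in the mouse and rat promoters (invented).
mouse_hits = [(120, 'Sox2'), (340, 'Stat3'), (610, 'FoxD3')]
rat_hits   = [(118, 'Sox2'), (605, 'FoxD3'), (900, 'Stat3')]

def conserved(mouse, rat, tolerance=10):
    # keep a mouse hit only if the same factor is predicted nearby in rat
    keep = []
    for pos, tf in mouse:
        if any(tf == rtf and abs(pos - rpos) <= tolerance for rpos, rtf in rat):
            keep.append((pos, tf))
    return keep

print(conserved(mouse_hits, rat_hits))  # -> [(120, 'Sox2'), (610, 'FoxD3')]
```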

13.2 years ago
Stew ★ 1.4k

This is very difficult, as the computational approach does not take into account factors such as chromatin structure, co-factors, competitive inhibition and so on, so there will always be false positives and false negatives. You can minimise false positives by using a more stringent cutoff, or minimise false negatives by relaxing the cutoff, but you cannot do both. Most methods compare against some background set of sequences and ask whether you have more sites than you would expect by chance, but using just one sequence makes the results very sensitive to the cutoff you use. In short, it is very difficult.
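
As a rough sketch of such a background comparison (the consensus pattern and the sequence are placeholders, not a real motif or promoter): count hits in the real sequence and in shuffled copies of it, then ask how often chance alone does as well.

```python
import random, re

# toy Sox2-like consensus for illustration (both orientations); not a real model
consensus = re.compile("CATTGT|ACAATG")
promoter = "".join(random.choice("ACGT") for _ in range(500))  # stand-in sequence

def count_hits(seq):
    return len(consensus.findall(seq))

observed = count_hits(promoter)

# Background: shuffle the same sequence many times (this preserves base
# composition only) and count hits in each shuffle.
shuffled_counts = []
for _ in range(1000):
    bases = list(promoter)
    random.shuffle(bases)
    shuffled_counts.append(count_hits("".join(bases)))

# Empirical p-value: fraction of shuffles with at least as many hits as observed.
p = sum(c >= observed for c in shuffled_counts) / len(shuffled_counts)
print(observed, p)  # with a single 500 bp sequence this estimate is very noisy
```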

13.2 years ago

As the other answers have suggested, this is a very hard problem with no generic solution. Take a look at "Applied bioinformatics for the identification of regulatory elements", Wasserman W and Sandelin A, Nature Reviews Genetics, 2004, vol. 5 (4), pp. 276-287. They assert that "essentially all predicted transcription-factor (TF) binding sites that are generated with models for the binding of individual TFs will have no functional role." They call this "the futility theorem".

13.1 years ago

The information content of typical binding matrices is only around 4-6 bits, so by chance alone you expect a spurious match (a false positive) roughly every 2^4 = 16 to 2^6 = 64 base pairs per strand. It is impossible to get the right answer just by scanning with binding site matrices. They might give some clues, but not more.
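
As a worked example of that point (the probability matrix below is invented): the total information content in bits determines roughly how often the matrix matches by chance.

```python
import math

pwm = [  # per-column base probabilities of an invented matrix
    {'A': 0.85, 'C': 0.05, 'G': 0.05, 'T': 0.05},
    {'A': 0.05, 'C': 0.85, 'G': 0.05, 'T': 0.05},
    {'A': 0.40, 'C': 0.40, 'G': 0.10, 'T': 0.10},
    {'A': 0.85, 'C': 0.05, 'G': 0.05, 'T': 0.05},
    {'A': 0.05, 'C': 0.05, 'G': 0.05, 'T': 0.85},
    {'A': 0.40, 'C': 0.10, 'G': 0.40, 'T': 0.10},
]

# IC per column = 2 - entropy (in bits); the total sets the chance-match rate.
ic = sum(2 + sum(p * math.log2(p) for p in col.values() if p > 0) for col in pwm)
print(f"IC = {ic:.2f} bits -> roughly one chance match per {2 ** ic:.0f} bp per strand")
```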

I have collected lots of ChIP-seq/ChIP-chip data for embryonic stem cells. Have you checked whether there are published ChIP experiments for Sox2, FoxD3 or Stat3 in the cell population you are interested in?

13.1 years ago
Doo ▴ 240

Ernst et al. developed a score which indicates whether a transcription factor can physically bind the DNA, using a combination of epigenetic data (from GC content to histone modifications to ChIP-seq data, and so on). Unfortunately, this is only available for human as far as I know, but the transcriptional regulation might be conserved?

Ernst et al. Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Res (2010) vol. 20 (4) pp. 526-36

13.2 years ago
Anima Mundi ★ 2.9k

What you say is right; I am particularly interested in reducing false positives. I am currently using shuffling algorithms on the matrices to validate my results, because I would like to avoid any bias introduced by the choice of comparison elements. Should I backtrack a bit? If so, how would you optimise the choice of a background set of sequences?
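
Roughly what I mean by shuffling, with a made-up count matrix rather than one of the real models I am using:

```python
import random

matrix = [  # one column of base counts per motif position (made-up example)
    {'A': 8, 'C': 1, 'G': 0, 'T': 1},
    {'A': 1, 'C': 1, 'G': 8, 'T': 0},
    {'A': 1, 'C': 8, 'G': 0, 'T': 1},
    {'A': 9, 'C': 0, 'G': 1, 'T': 0},
]

def shuffled_control(m):
    # permuting whole columns keeps the information content and base
    # composition but destroys the base ordering, i.e. the motif itself
    control = m[:]
    random.shuffle(control)
    return control

control_matrix = shuffled_control(matrix)
print(control_matrix)
# Scan the promoter with both matrices: hits found only with the original
# matrix are less likely to be compositional artefacts, although that still
# says nothing about whether the site is actually bound in vivo.
```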

13.2 years ago
Dror ▴ 280

It is a fundamental problem, as stated by Simon Cockell. There are some standard ways of finding a balanced cutoff: first calculate or estimate the score distribution of each motif matrix, then derive from that distribution the cutoff corresponding to a given false positive rate. Finally, choose the cutoff score such that log2(fpr) = -IC(M), where fpr is the false positive rate and IC(M) is the information content of matrix M. I also think you need a more controlled dataset to compare against in order to assess the false sites. For example, you may want to look for evolutionarily conserved sites, or use known transcription response sites from databases such as FANTOM3/4 (http://fantom.gsc.riken.jp/4/).
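
A rough sketch of that rule of thumb, assuming the log is taken base 2 and using an invented matrix with a uniform background model:

```python
import math, random

pwm = [  # per-column base probabilities (invented example matrix)
    {'A': 0.80, 'C': 0.07, 'G': 0.06, 'T': 0.07},
    {'A': 0.05, 'C': 0.05, 'G': 0.85, 'T': 0.05},
    {'A': 0.10, 'C': 0.70, 'G': 0.10, 'T': 0.10},
    {'A': 0.80, 'C': 0.07, 'G': 0.06, 'T': 0.07},
]
bg = 0.25  # uniform background

def score(word):
    return sum(math.log2(pwm[i][b] / bg) for i, b in enumerate(word))

# Information content in bits and the target false positive rate 2 ** -IC(M).
ic = sum(2 + sum(p * math.log2(p) for p in col.values()) for col in pwm)
target_fpr = 2 ** -ic

# Estimate the background score distribution from random words, then take the
# cutoff at the quantile corresponding to the target false positive rate.
words = ("".join(random.choice("ACGT") for _ in range(len(pwm))) for _ in range(100000))
scores = sorted((score(w) for w in words), reverse=True)
cutoff = scores[int(target_fpr * len(scores))]
print(f"IC(M) = {ic:.2f} bits, target fpr = {target_fpr:.3f}, cutoff = {cutoff:.2f}")
```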
