Question

background nucleotide probability when finding TF binding sites in GRCh38

19 months ago

chrisclarkson100 ▴ 150

I'm using the R package TFBSTools to predict TF binding sites:

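library(TFBSTools)

# Build the PWM object with a uniform background; `tf`, `pfm` and `seq` are defined earlier
pwm <- PWMatrix(ID="Unknown", name=tf, matrixClass="Unknown", strand="+",
                bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(),
                profileMatrix=as.matrix(pfm))
# Scan the sequence for hits at or above the 80% score threshold, using 10 cores
peaks <- searchSeq(pwm, seq, min.score="80%", mc.cores=10L)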

I am unsure what to use as the background probability for the nucleotides (as you can see, I have simply used 0.25 for all four). I can't find an official reference of this kind for the GRCh38 genome anywhere. In the R package MEET I found a reference probability list: c(A=0.32,T=0.32,G=0.18,C=0.18).

However, I am not certain that this profile is suitable here, given that not all regions of the genome maintain these ratios (e.g. genes are GC-rich while non-coding regions are AT-rich).

Does anyone know whether I should just stick to 0.25 for each of the four nucleotides, or whether a tailored profile is more appropriate?
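For illustration, a tailored profile could be estimated directly from the sequence being scanned rather than taken from a genome-wide table. The following is a minimal base-R sketch under that assumption; `estimate_bg` is a hypothetical helper and `region` is a made-up toy sequence, not GRCh38 data.

```r
# Toy sequence standing in for the region that will be scanned (made up)
region <- "ACGTTGCAATTTAAGCGCTA"

# Hypothetical helper: relative frequency of A, C, G, T in a DNA string
estimate_bg <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  # Only unambiguous bases are counted; N and other codes become NA and are dropped
  counts <- table(factor(bases, levels = c("A", "C", "G", "T")))
  bg <- as.numeric(counts) / sum(counts)
  names(bg) <- c("A", "C", "G", "T")
  bg
}

bg <- estimate_bg(region)
print(round(bg, 3))
```

The resulting named vector has the same shape as the bg argument above. For real genomic sequences, Biostrings provides alphabetFrequency(), which can return proportions and would likely be the more practical choice.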

binding transcription factor • 434 views

updated 19 months ago by Matthias Zepper 4.6k • written 19 months ago by chrisclarkson100 ▴ 150

score 1 · Answer 1 · 2022-09-19

Usually, this kind of information can be found in the vignettes that accompany Bioconductor packages. You can at least see whether they bother to use any particular values.

My gut feeling is that it doesn't matter too much, since this information is just used for the pseudocount calculation. You could also generate two PWMs with different bg values and check whether the number of resulting peaks differs a lot.
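The "try it out" suggestion can be sketched in base R. This assumes the common log-odds conversion, in which the background appears both in the pseudocount term and as the denominator of the ratio; whether TFBSTools applies exactly this formula should be checked against its documentation. The count matrix, the pseudocount of 0.8, and the helpers `to_log_odds` and `score_site` are all hypothetical and chosen only for illustration.

```r
# Toy count matrix (rows A, C, G, T; columns = motif positions); values are made up
pfm <- matrix(c(8, 0, 1,
                1, 9, 0,
                0, 1, 8,
                1, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "C", "G", "T"), NULL))

# Assumed conversion: log2( ((count + bg * pseudo) / (N + pseudo)) / bg )
to_log_odds <- function(counts, bg, pseudo = 0.8) {
  bg <- bg[rownames(counts)]           # align background with the matrix rows
  n <- colSums(counts)                 # number of sequences per position
  probs <- sweep(counts + bg * pseudo, 2, n + pseudo, "/")
  log2(probs / bg)
}

# Sum the log-odds of each base of `site` at its motif position
score_site <- function(pwm, site) {
  bases <- strsplit(site, "")[[1]]
  sum(pwm[cbind(match(bases, rownames(pwm)), seq_along(bases))])
}

bg_uniform <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)
bg_at_rich <- c(A = 0.32, C = 0.18, G = 0.18, T = 0.32)  # values quoted in the question

pwm_uniform <- to_log_odds(pfm, bg_uniform)
pwm_at_rich <- to_log_odds(pfm, bg_at_rich)

site <- "ACG"
cat("uniform bg:", score_site(pwm_uniform, site), "\n")
cat("AT-rich bg:", score_site(pwm_at_rich, site), "\n")
```

Under this formulation the two backgrounds give different scores for the same site, so comparing the hit lists from both PWMs on the real data would show whether the difference matters in practice.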