Question

TF binding site analysis - how to filter hits?

0

Entering edit mode

3.4 years ago

Nico80 ▴ 80

I have a list of genes and I am trying to determine whether they contain specific TF binding sites in their promoter region.

I am retrieving 3kb upstream of TSS using getPromoterSeq and then use TFBSTools and JASPAR2020 to look for TF binding sites.

To simplify things, let's take an example gene / TF pair

transcripts <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")
sequence <- getPromoterSeq(query = transcripts$`19109`, subject = Mmusculus, 
                        upstream = 0, downstream = 3000)
pfm <- getMatrixByID(JASPAR2020, "MA0141.2")
res <- searchSeq(toPWM(pfm), sequence, min.score = "80%")
res.df <- data.frame(writeGFF3(res))
res.df$score <- relScore(res)[[1]]
res.df$pval <- pvalues(res)[[1]]

This returns four hits, as seen below

  seqname source feature start end     score strand frame
1       1   TFBS    TFBS    45  62 0.8484427      +     .
2       2   TFBS    TFBS    42  59 0.8484427      +     .
3       3   TFBS    TFBS    41  58 0.8484427      +     .
4       4   TFBS    TFBS    38  55 0.8484427      +     .
                                                                        attributes         pval
1 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05
2 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05
3 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05
4 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05

If I perform the same analysis on the JASPAR2020 website I only get 1 hit, from position 45 to 62, but with a slightly lower score of 0.81323616527. In general, when I use multiple sequences and matrices, R always returns more hits than the website, and with consistently higher scores.

I have never done this type of analysis before, and I am wondering whether I am doing something wrong. In the simple example above, it is clear that (likely because of some redundancy in the motif) a lot of the hits overlap, and I am wondering whether the JASPAR website gets rid of those? This, however, does not explain some of the missing hits, as they are not overlapping.

This brings me to my next question, which is about extra filtering. It generally looks like I get a lot of hits. The JASPAR website warns that "this type of analysis has a high sensitivity but abysmal selectivity", so I am wondering what the best approach is to further validate hits. They talk about phylogenetic footprinting, but it is not too clear to me how to implement it.

Would a bootstrapping approach using randomly shuffled matrices be good?

TFBStools JASPAR R • 1.5k views

ADD COMMENT • link updated 3.4 years ago by Alex Reynolds 35k • written 3.4 years ago by Nico80 ▴ 80

score 3 · Answer 1 · 2020-11-18

Another option is to use FIMO to discover motifs in your assembly's genomic sequence (mm10).

You provide a MEME-formatted database of transcription factor PWMs (which JASPAR publishes), your assembly sequence, and your discovery threshold (e.g., 1e-4 or 1e-5 — lower returns fewer, but "better" hits).

I did a short write-up here on how to run FIMO for hg19:

https://bioinformatics.stackexchange.com/questions/2467/where-to-download-jaspar-tfbs-motif-bed-file

It's not the only way to run things, but it hopefully might help with literal commands you could tweak on your end for mm10, to get a feel for how you might use it.

When you run the steps listed above, FIMO gives you a set of hits across your genome in a BED-formatted file. You can save this file somewhere and use it over and over to find TFBS, as needed, as your experimental procedure changes.

At this point, for answering your specific question, you can use a set operations toolkit like BEDOPS to filter the whole-genome TFBS set against your candidate regions (i.e., your 3kb windows downstream of the promoter), e.g. look for TFBS hits that overlap promoter windows by three or more bases:

$ bedops --element-of 3 fimo_hits.bed promoter_windows.bed > answer.bed

You could again use set operations to compare the hits you get in R with what comes out of answer.bed. If there is decent concordance, you probably have a good set of hits for further analysis.

Overlapping hits can come from repeat or low-complexity regions, or intervals which are palindromic or near-palindromic, enough to match the sequence pattern.