TF binding site analysis - how to filter hits?
1
0
Entering edit mode
3.4 years ago
Nico80 ▴ 80

I have a list of genes and I am trying to determine whether they contain specific TF binding sites in their promoter region.

I am retrieving 3kb upstream of TSS using getPromoterSeq and then use TFBSTools and JASPAR2020 to look for TF binding sites.

To simplify things, let's take an example gene / TF pair

transcripts <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")
sequence <- getPromoterSeq(query = transcripts$`19109`, subject = Mmusculus, 
                        upstream = 0, downstream = 3000)
pfm <- getMatrixByID(JASPAR2020, "MA0141.2")
res <- searchSeq(toPWM(pfm), sequence, min.score = "80%")
res.df <- data.frame(writeGFF3(res))
res.df$score <- relScore(res)[[1]]
res.df$pval <- pvalues(res)[[1]]

This returns four hits, as seen below

  seqname source feature start end     score strand frame
1       1   TFBS    TFBS    45  62 0.8484427      +     .
2       2   TFBS    TFBS    42  59 0.8484427      +     .
3       3   TFBS    TFBS    41  58 0.8484427      +     .
4       4   TFBS    TFBS    38  55 0.8484427      +     .
                                                                        attributes         pval
1 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05
2 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05
3 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05
4 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05

If I perform the same analysis on the JASPAR2020 website I only get 1 hit, from position 45 to 62, but with a slightly lower score of 0.81323616527. In general, when I use multiple sequences and matrices, R always returns more hits than the website, and with consistently higher scores.

I have never done this type of analysis before, and I am wondering whether I am doing something wrong. In the simple example above, it is clear that (likely because of some redundancy in the motif) a lot of the hits overlap, and I am wondering whether the JASPAR website gets rid of those? This, however, does not explain some of the missing hits, as they are not overlapping.

This brings me to my next question, which is about extra filtering. It generally looks like I get a lot of hits. The JASPAR website warns that "this type of analysis has a high sensitivity but abysmal selectivity", so I am wondering what the best approach is to further validate hits. They talk about phylogenetic footprinting, but it is not too clear to me how to implement it.

Would a bootstrapping approach using randomly shuffled matrices be good?

TFBStools JASPAR R • 1.5k views
ADD COMMENT
3
Entering edit mode
3.4 years ago

Another option is to use FIMO to discover motifs in your assembly's genomic sequence (mm10).

You provide a MEME-formatted database of transcription factor PWMs (which JASPAR publishes), your assembly sequence, and your discovery threshold (e.g., 1e-4 or 1e-5 — lower returns fewer, but "better" hits).

I did a short write-up here on how to run FIMO for hg19:

https://bioinformatics.stackexchange.com/questions/2467/where-to-download-jaspar-tfbs-motif-bed-file

It's not the only way to run things, but it hopefully might help with literal commands you could tweak on your end for mm10, to get a feel for how you might use it.

When you run the steps listed above, FIMO gives you a set of hits across your genome in a BED-formatted file. You can save this file somewhere and use it over and over to find TFBS, as needed, as your experimental procedure changes.

At this point, for answering your specific question, you can use a set operations toolkit like BEDOPS to filter the whole-genome TFBS set against your candidate regions (i.e., your 3kb windows downstream of the promoter), e.g. look for TFBS hits that overlap promoter windows by three or more bases:

$ bedops --element-of 3 fimo_hits.bed promoter_windows.bed > answer.bed

You could again use set operations to compare the hits you get in R with what comes out of answer.bed. If there is decent concordance, you probably have a good set of hits for further analysis.

Overlapping hits can come from repeat or low-complexity regions, or intervals which are palindromic or near-palindromic, enough to match the sequence pattern.

ADD COMMENT
0
Entering edit mode

Thank you very much, Alex, I will read through that and see if I can get my way through it!

ADD REPLY
1
Entering edit mode

On top of that (seconding the fimo suggestion) I would narrow down the promoters to like 500bp, not 3kb to avoid excessive motif matches by chance. By experience with ATAC-seq data the actual open chromatin region upstream of TSS is by far not 3kb, rather like 500bp or so. This eventually helps reducing the multiple-testing burden (if you choose to use the q-values of fimo).

ADD REPLY

Login before adding your answer.

Traffic: 1800 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6