I have a list of genes and I am trying to determine whether they contain specific TF binding sites in their promoter region.
I am retrieving 3kb upstream of TSS using
getPromoterSeq and then use
JASPAR2020 to look for TF binding sites.
To simplify things, let's take an example gene / TF pair
transcripts <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene") sequence <- getPromoterSeq(query = transcripts$`19109`, subject = Mmusculus, upstream = 0, downstream = 3000) pfm <- getMatrixByID(JASPAR2020, "MA0141.2") res <- searchSeq(toPWM(pfm), sequence, min.score = "80%") res.df <- data.frame(writeGFF3(res)) res.df$score <- relScore(res)[] res.df$pval <- pvalues(res)[]
This returns four hits, as seen below
seqname source feature start end score strand frame 1 1 TFBS TFBS 45 62 0.8484427 + . 2 2 TFBS TFBS 42 59 0.8484427 + . 3 3 TFBS TFBS 41 58 0.8484427 + . 4 4 TFBS TFBS 38 55 0.8484427 + . attributes pval 1 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05 2 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05 3 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05 4 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05
If I perform the same analysis on the JASPAR2020 website I only get 1 hit, from position 45 to 62, but with a slightly lower score of 0.81323616527. In general, when I use multiple sequences and matrices, R always returns more hits than the website, and with consistently higher scores.
I have never done this type of analysis before, and I am wondering whether I am doing something wrong. In the simple example above, it is clear that (likely because of some redundancy in the motif) a lot of the hits overlap, and I am wondering whether the JASPAR website gets rid of those? This, however, does not explain some of the missing hits, as they are not overlapping.
This brings me to my next question, which is about extra filtering. It generally looks like I get a lot of hits. The JASPAR website warns that "this type of analysis has a high sensitivity but abysmal selectivity", so I am wondering what the best approach is to further validate hits. They talk about phylogenetic footprinting, but it is not too clear to me how to implement it.
Would a bootstrapping approach using randomly shuffled matrices be good?