17 months ago by
Seattle, WA USA
It's a bit of work, maybe, but perhaps the following set operations could guide some investigation.
Your example TF is MA0528.1, which is a Jaspar identifier.
For your genome of interest, you could run that genome's sequence through FIMO to call binding sites of Jaspar TF models at some threshold, say 1e-4 or 1e-5. Say this file is called
Given a set of whole-genome binding sites, you can then filter that set using the proximal promoters of all genes of interest (
genes.bed). These could be Gencode genes in GFF format, converted to BED via
gff2bed, or by way of similar approaches.
Proximal promoters could be defined as a 1kb region upstream of the gene's TSS:
$ bedops -u <( awk ($6="+") genes.bed | bedops --range -1000:0 - ) <( awk ($6="-") genes.bed | bedops --range 0:1000 - ) > promoters.bed
Then filter the whole-genome TFBS set :
$ bedops --element-of 1 tbfs.jaspar.1e-5.bed promoters.bed > tbfs.jaspar.1e-5.subset.bed
grep this subset for
$ grep MA0528.1 tbfs.jaspar.1e-5.subset.bed > MA0528.1.hits.bed
and map these hits back to the genes:
$ bedmap --range 1000 --echo --skip-unmapped genes.bed MA0528.1.hits.bed > answer.bed
You might add TF-specific ChIP-seq data overlaps as experimental evidence of concordance of gene promoters derived from
answer.bed with TFs of interest actually binding to those regions in real life.