Scanning for TFBSs of known motifs in all human promoters
6.0 years ago
JJ ▴ 670

Hi,

I am hoping for some insight and suggestions :)

I am interested in the TFBSs of 3 TFs in all promoter sequences of the human genome (e.g. 1000bp upstream) - in particular, I am looking for co-occurring TFBSs of these 3 TFs. I have the motifs available through the JASPAR database.

I found the following solutions:

  • FIMO (MEME Suite) and findMotifs.pl or scanMotifGenomeWide.pl (HOMER) offer a way of scanning for TFBSs using known motifs. Each motif is treated separately; I then computed the co-occurring ones in R.

  • MCAST (MEME Suite) scans for clusters of motif hits. However, it reports any cluster rather than specifically co-occurring sites of my 3 TFs, it gives very few results, and it occasionally ends in a segmentation fault, so I dropped it. Is there another tool that scans for co-occurring TFBSs in particular?

  • JASPAR offers a pre-scanned genome based on FIMO and TFBS Perl module - so I subsetted with the predictions of TFs of interest and the regions of interest (promoters).

  • ReMap 2018 offers merged peaks across ChIP-seq experiments for many TFs. Assuming that most binding sites are shared in different cell types / conditions, one could use these - correct? Again, I subsetted with the peaks of TFs of interest and the regions of interest (promoters). Unfortunately, not all TFs I am interested in are in this database. Are any other similar resources available?

I am grateful for any insight: remarks to my solutions, other tools, other solutions, reviews on this problem ... anything :)

Thanks,

genome sequence • 2.9k views
ADD COMMENT

Assuming that most binding sites are shared in different cell types

I strongly disagree. TF binding is highly dynamic and depends on open chromatin regions, which are themselves highly dynamic and cell-type-specific, let alone condition-specific. They are even more dynamic than gene expression.

ADD REPLY

Thank you for your comment. Isn't that more about motif activity than the binding site itself? The binding sites should be fairly well shared, shouldn't they? Whether they are active, however, is a different issue and depends on the system. Otherwise, the merged ReMap peaks would be nonsense, wouldn't they? Thank you for any insight.

ADD REPLY
6.0 years ago

One thing we do locally is maintain a Starch file containing whole-genome FIMO hits (calls at some threshold), formatted as BED intervals, e.g.:

$ unstarch /net/seq/data/projects/motifs/fimo/hg19.jaspar.1e-4/fimo.combined.1e-4.parsed.starch | head
chr1    10003   10023   MA0073.1-RREB1  3.52314e-05     +       CCCTAACCCTAACCCTAACC
chr1    10009   10029   MA0073.1-RREB1  3.52314e-05     +       CCCTAACCCTAACCCTAACC
chr1    10015   10035   MA0073.1-RREB1  3.52314e-05     +       CCCTAACCCTAACCCTAACC
chr1    10021   10041   MA0073.1-RREB1  3.52314e-05     +       CCCTAACCCTAACCCTAACC
...
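A file in this layout could be produced from FIMO's tab-delimited output in several ways. The following is only a minimal sketch, not the pipeline actually used for the archive above. It assumes the 10-column layout written by recent MEME Suite releases (motif_id, motif_alt_id, sequence_name, start, stop, strand, score, p-value, q-value, matched_sequence) with 1-based inclusive coordinates; older releases use a different column order, so the field numbers would need adjusting. The function name fimo_to_bed is hypothetical.

```shell
# fimo_to_bed (hypothetical helper): FIMO tab-delimited hits on stdin,
# BED-like rows on stdout: chrom, start, end, motif ID, p-value, strand, sequence.
# ASSUMPTION: 10-column FIMO layout; start/stop are 1-based and inclusive.
fimo_to_bed() {
  awk -F'\t' -v OFS='\t' '
    # Skip comment lines, the header line, and incomplete rows.
    /^#/ || /^motif_id/ || NF < 10 { next }
    {
      # Join accession and TF name, e.g. "MA0073.1-RREB1".
      id = ($2 == "") ? $1 : ($1 "-" $2)
      # BED starts are 0-based, so shift the start by one.
      print $3, $4 - 1, $5, id, $8, $6, $10
    }'
}
```

The result would still need sorting before compression, for example with something like `fimo_to_bed < fimo.tsv | sort-bed - | starch - > hits.starch`, where the file names are placeholders.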

With this, we can take promoter regions (formatted as a sorted BED file), e.g.:

$ echo -e "chr9\t133252000\t133253000" > /tmp/adHocPromoter.bed

And do some operations with BEDOPS bedmap to look for all TFBS that overlap the promoters by one or more bases:

$ bedmap --chrom chr9 --echo --echo-map-id-uniq --delim '\t' /tmp/adHocPromoter.bed /net/seq/data/projects/motifs/fimo/hg19.jaspar.1e-4/fimo.combined.1e-4.parsed.starch
chr9    133252000       133253000       MA0014.1-Pax5;MA0017.1-NR2F1;MA0018.2-CREB1;MA0019.1-Ddit3::Cebpa;MA0028.1-ELK1;MA0039.2-Klf4;MA0047.1-Foxa2;MA0048.1-NHLH1;MA0061.1-NF-kappaB;MA0062.2-GABPA;MA0065.1-PPARG::RXRA;MA0065.2-PPARG::RXRA;MA0071.1-RORA_1;MA0076.1-ELK4;MA0079.1-SP1;MA0079.2-SP1;MA0088.1-znf143;MA0102.1-Cebpa;MA0111.1-Spz1;MA0112.1-ESR1;MA0112.2-ESR1;MA0116.1-Zfp423;MA0119.1-TLX1::NFIC;MA0137.2-STAT1;MA0138.1-REST;MA0138.2-REST;MA0139.1-CTCF;MA0141.1-Esrrb;MA0144.1-Stat3;MA0145.1-Tcfcp2l1;MA0146.1-Zfx;MA0150.1-NFE2L2;MA0159.1-RXR::RAR_DR5;MA0160.1-NR4A2;MA0163.1-PLAG1;MA0258.1-ESR2

The last column of output is a list of co-occurring TF hits with a p-value of 1e-4 or less, which can be piped into R or other tools for statistical calculations.
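To narrow this output to promoters where a specific set of TFs co-occurs (the three TFs in the original question, for instance), the ID column can be filtered directly. This is a minimal sketch that assumes the four-column bedmap output shown above, with semicolon-separated IDs of the form ACCESSION-NAME. The function name cooccur and the TF names in the usage example are placeholders.

```shell
# cooccur (hypothetical helper): keep bedmap rows whose 4th column lists
# every TF named in the comma-separated first argument.
# ASSUMPTION: IDs look like "MA0079.2-SP1"; the TF name is whatever
# follows the first hyphen (so "MA0061.1-NF-kappaB" gives "NF-kappaB").
cooccur() {
  awk -F'\t' -v want="$1" '
    BEGIN { n = split(want, tf, ",") }
    {
      split("", seen)                      # portable way to clear the array
      m = split($4, ids, ";")
      for (i = 1; i <= m; i++) {
        name = ids[i]
        sub(/^[^-]*-/, "", name)           # strip the accession prefix
        seen[name] = 1
      }
      ok = 1
      for (j = 1; j <= n; j++) if (!(tf[j] in seen)) ok = 0
      if (ok) print
    }'
}
```

For example, `bedmap --echo --echo-map-id-uniq --delim '\t' promoters.bed hits.starch | cooccur SP1,CTCF,REST` would keep only promoters overlapped by hits for all three motifs. Note that this tests co-occurrence within the same promoter window only; it says nothing about the spacing or order of the sites.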

By keeping the Starch archive of whole-genome hits, we can quickly perform repeated scans for FIMO hits of various thresholds (e.g. the more stringent 1e-5) or over windows from promoters, enhancers, intron-exon junctions, etc. We just do the FIMO analysis once, whole-genome, and then do set operations as the experiment parameters change.
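Re-thresholding the stored hits comes down to filtering on the p-value column before the set operation. A minimal sketch, assuming the seven-column layout shown in the unstarch output above (p-value in column 5); the function name filter_hits is hypothetical:

```shell
# filter_hits (hypothetical helper): keep hits whose p-value (column 5)
# is at or below the threshold given as the first argument.
# Adding 0 forces awk to compare numerically, which also handles
# scientific notation such as 3.52314e-05.
filter_hits() {
  awk -F'\t' -v thr="$1" '($5 + 0) <= (thr + 0)'
}
```

Because the filter preserves row order, sorted input stays sorted, so something like `unstarch hits.starch | filter_hits 1e-5 > hits.1e-5.bed` (file names are placeholders) yields a file that can be passed to bedmap in place of the full archive.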

ADD COMMENT

Wow, thank you so much for your comment! This is a very efficient way of dealing with it. I will look into this.

Apart from the efficiency, it's basically the same as overlapping the JASPAR track (which is basically genome-wide FIMO hits) with my selected TFs and the promoter windows. So is this a generally accepted approach?

May I ask how you define the promoters? I have selected the human TSSs from EPDnew and then added 1000bp upstream and 100bp downstream. I wonder what is really sensible here: TFs can bind far upstream, not only within 1000bp, but if I select a larger window I get an incredibly high rate of false positives.

And finally, is there any public resource of human enhancer regions annotated with their target genes, like EPD for TSSs? Thank you!

ADD REPLY

Yes, I believe this is an accepted approach. Promoters are whatever you want to call them, but proximal promoters generally go upstream of the stranded TSS by 500-2500nt.
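Because "upstream" depends on strand, the window has to be extended in opposite directions for plus- and minus-strand genes. A minimal sketch of one way to build such windows, assuming the TSSs are available as a six-column BED file with single-base intervals and the strand in column 6. The function name promoter_windows is hypothetical, and the defaults (1000 upstream, 100 downstream) simply mirror the values mentioned in the question above.

```shell
# promoter_windows (hypothetical helper): strand-aware windows around TSSs.
# ASSUMPTION: input is BED6 with each TSS as a 1-bp interval [start, start+1).
# Arguments: bases upstream (default 1000), bases downstream (default 100).
promoter_windows() {
  awk -F'\t' -v OFS='\t' -v up="${1:-1000}" -v down="${2:-100}" '
    {
      if ($6 == "-") { s = $2 - down; e = $3 + up }   # upstream lies at higher coordinates
      else           { s = $2 - up;   e = $3 + down } # upstream lies at lower coordinates
      if (s < 0) s = 0                                # clamp at the chromosome start
      print $1, s, e, $4, $5, $6
    }'
}
```

The output is not guaranteed to be sorted, so it would normally be piped through `sort-bed -` before being used with bedmap. The sketch does not clamp the end coordinate to the chromosome length.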

While JASPAR may offer its own tracks, you may want to consider the use of FIMO with other published and/or curated TF model databases, including TRANSFAC, UniPROBE, Taipale, etc. JASPAR is only one database. And there are other reference genomes, so depending on what you are working with, being able to make your own set of hits can be useful.

You might investigate 5C datasets for long-range genomic interactions, or basically any chromatin structure changes that are cell-type specific. CRISPR or TALEN knockout experiments may also indicate connections in a regulatory network. You might even look in the proximal promoter regions of TFs, to look for secondary regulation by other TFs. From these you could build a more comprehensive regulatory picture.

For genes of interest, you might run their sequence data through a tool like epilogos MEME, which uses histone modifications to calculate a kind of "sequence logo" of chromatin state information measurements. One such state includes the enhancer state, and the epilogos MEME tool can highlight enhancers proximal to the gene. Perhaps see epilogos for more background and then epilogos MEME to run your dataset of interest.

You might also look at VISTA for experimentally validated enhancers, or look at enhancer-atlas for further resources.

ADD REPLY

Thank you so much for your help and insight. I appreciate it very much.

I work with human hg38.

Thanks for the other databases - I stuck with JASPAR as it has all the TFs I am interested in, but I should check out the others - maybe they have different/better motifs. The motifs of two TFs I am interested in are very short, which I know can be the case, but maybe there are better ones available. Is there a system to pick the "best" motif from all the motifs in the databases? Or do you run them all and merge the results? I picked the latest version number (if more than one was available) from JASPAR. I also saw that HOMER provides motifs, but I got stuck as I could not convert between the different motif formats. Is there a tool that can convert between all of them? That would be very useful. FIMO only takes the MEME format.
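On the format question: the MEME Suite ships conversion scripts for several databases (jaspar2meme and transfac2meme, for example). For HOMER motif files, a conversion can also be sketched by hand, since both formats store one row of A, C, G, T probabilities per motif position. The following is only a minimal sketch under stated assumptions: HOMER header lines start with ">" and carry the motif name in the second tab-separated field, the rows are already probabilities, and a uniform background is acceptable. The nsites value is an arbitrary placeholder, and the function name homer_to_meme is hypothetical.

```shell
# homer_to_meme (hypothetical helper): HOMER motif file on stdin,
# minimal MEME-format text on stdout.
# ASSUMPTIONS: header is ">consensus<TAB>name<TAB>..."; matrix rows hold
# A C G T probabilities; uniform background; nsites= 20 is a placeholder.
homer_to_meme() {
  awk -F'\t' '
    function flush_motif(   i) {
      if (w == 0) return
      printf "MOTIF %s\n", name
      printf "letter-probability matrix: alength= 4 w= %d nsites= 20 E= 0\n", w
      for (i = 1; i <= w; i++) print row[i]
      print ""
      w = 0
    }
    BEGIN {
      print "MEME version 4\n"
      print "ALPHABET= ACGT\n"
      print "strands: + -\n"
      print "Background letter frequencies"
      print "A 0.25 C 0.25 G 0.25 T 0.25\n"
    }
    /^>/ {
      flush_motif()                          # emit the previous motif, if any
      name = $2
      if (name == "") name = substr($1, 2)   # fall back to the consensus string
      gsub(/[ \t]+/, "_", name)              # MEME motif names cannot contain spaces
      next
    }
    NF >= 4 { w++; row[w] = sprintf("%.6f %.6f %.6f %.6f", $1, $2, $3, $4) }
    END { flush_motif() }'
}
```

The converted file should be checked before use, for instance by confirming that each row sums to approximately 1, because this sketch does not renormalize rows or carry over HOMER's detection thresholds.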

Unfortunately, I do not have any system-specific data. I saw that FIMO can take priors, and there are many other tools like IMAGE that can do more complex analyses, but for those I need RNA-seq/ChIP-seq data. These would then give me active motifs / motif activity / causal transcription factors. I would have to check whether something is publicly available, but I suspect that if I find something, it will just be expression data from one of the conditions. If I find expression data, I could either subset the promoters to the genes expressed in one condition / differentially expressed between both conditions, or use something like ISMARA, couldn't I? Then the chance that the binding sites are indeed active and contribute to regulation is higher.

The VISTA database is a great link! Thanks again,

ADD REPLY

Hi Alex. Incidentally, I was facing a similar issue and liked your solution. However, I was wondering how memory-intensive it is to run FIMO on the whole genome. I tried and ran into an error telling me to use "--max-stored-scores to allocate more space for storing motif matches". Do you have any recommendations on running FIMO across the whole genome? Thank you in advance.

ADD REPLY

Great. Thank you Alex for the wonderful writeup.

ADD REPLY
