Question

Finding TF motifs enriched in series of ATAC-seq peaks (using fimo?)

11 months ago

octopuslegs11 ▴ 10

Hi all,

I have a series of peaks in a .txt file (chr / start / end) and would like to know whether there are TF motifs enriched in each of the individual peaks.

For example, I am looking for an output that will eventually look something like this:

chr | start   | end    | tf motif 
1   | 100024  | 100288 | GATA1
1   | 153313  | 155590 | RUNX1
.
.
.

Here each row is a unique peak, and the TF motif is the most significantly enriched one for that peak.

I downloaded the JASPAR2022 core collection to get a set of PWMs for different TFs, which I then concatenated into a single .meme file (following this post: Finding individual motif occurrences with FIMO from the MEME suite), and have started using the FIMO command line tool. However, I can only figure out how to query a single FASTA sequence at a time:

fimo --parse-genomic-coord /path/to/meme/combined.meme input.fa

Is there a way to do this such that I can query all 15,000 peaks at once, instead of doing them individually?

Thanks in advance.

MEME ATAC-Seq TF FIMO motif • 1.6k views

11 months ago

Trivas ★ 1.7k

Look into the Homer suite: http://homer.ucsd.edu/homer/motif/

11 months ago

jared.andrews07 ★ 16k

pymemesuite makes this pretty easy.

score 1 · Accepted Answer · 2023-05-10

11 months ago

Alex Reynolds 35k

A basic scenario for running FIMO: https://bioinformatics.stackexchange.com/questions/2467/where-to-download-jaspar-tfbs-motif-bed-file/2491#2491

This approach lets you generate a BED file containing genomic regions within a given reference genome, which are putative TF binding sites ("FIMO hits").
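
If you already have FIMO's tabular output instead, a conversion along these lines may get you to the same kind of BED file. This is only a sketch: it assumes the ten-column fimo.tsv layout of recent MEME Suite releases (motif_id, motif_alt_id, sequence_name, start, stop, strand, score, p-value, q-value, matched_sequence) and that the reported start is 1-based, so check both against your FIMO version. The sample rows are made up for illustration.

```shell
# Hypothetical two-hit fimo.tsv, standing in for real FIMO output.
printf 'motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence\n' > fimo.tsv
printf 'MA0002.2\tRUNX1\tchr1\t153400\t153410\t-\t11.9\t3.4e-05\t0.02\tTCTGTGGTTTT\n' >> fimo.tsv
printf 'MA0035.4\tGATA1\tchr1\t100030\t100040\t+\t12.5\t1.2e-05\t0.01\tTTCTTATCTCT\n' >> fimo.tsv

# Skip the header, comment lines, and short lines; emit
# chr, start-1, stop, motif name, p-value, strand; then sort for bedmap.
awk 'BEGIN { FS = OFS = "\t" }
     $1 == "motif_id" || /^#/ || NF < 10 { next }
     { name = ($2 != "" ? $2 : $1); print $3, $4 - 1, $5, name, $8, $6 }' fimo.tsv \
  | LC_ALL=C sort -k1,1 -k2,2n -k3,3n > fimoHits.bed
```

The motif name ends up in the fourth column, which is the ID column that bedmap reports. If BEDOPS is installed, sort-bed can replace the plain sort.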

This BED file can be mapped via bedmap or bedops etc. to retrieve TF site calls located within peaks or other genomic regions, e.g.,

bedmap --echo --echo-map-id-uniq --delim '\t' peaks.bed fimoHits.bed > answer.bed
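
To get from a list of overlapping motifs to the one-motif-per-peak table asked about in the question, one option is to keep the hit with the smallest p-value in each peak. The sketch below assumes a hypothetical tab-delimited file, peak_hits.tsv, with one row per peak/hit pair (peak chr, start, end, motif name, hit p-value); that layout and the sample values are illustrative, not something bedmap produces by default.

```shell
# Hypothetical peak/hit pairs: chr, start, end, motif, p-value.
printf 'chr1\t100024\t100288\tGATA1\t1e-6\n'  > peak_hits.tsv
printf 'chr1\t100024\t100288\tRUNX1\t3e-4\n' >> peak_hits.tsv
printf 'chr1\t153313\t155590\tGATA1\t5e-5\n' >> peak_hits.tsv
printf 'chr1\t153313\t155590\tRUNX1\t2e-7\n' >> peak_hits.tsv

# For each peak, remember the motif with the lowest p-value seen so far.
awk 'BEGIN { FS = OFS = "\t" }
     {
       key = $1 FS $2 FS $3
       p = $5 + 0                      # force numeric comparison
       if (!(key in best) || p < best[key]) { best[key] = p; motif[key] = $4 }
     }
     END { for (k in best) print k, motif[k] }' peak_hits.tsv \
  | LC_ALL=C sort -k1,1 -k2,2n > top_motif_per_peak.tsv
```

With the sample rows, the first peak keeps GATA1 and the second keeps RUNX1.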

Make sure the chromosome naming schemes match between the BED files so that set operations work. Your peaks might use Ensembl's scheme (1, 2, etc.), while the FIMO output in my example uses UCSC names (chr1, chr2, etc.). An awk preprocessing step can fix this in either file. It may also help to run sort-bed on your peaks to ensure they are sorted before any operations.
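
A minimal sketch of that preprocessing, assuming a tab-delimited peaks.txt with Ensembl-style names. The sample rows are made up, and plain sort stands in for sort-bed so the example runs without BEDOPS. Special cases such as Ensembl's MT versus UCSC's chrM are not handled here.

```shell
# Hypothetical peaks with Ensembl-style chromosome names, unsorted.
printf '2\t500\t900\n1\t153313\t155590\n1\t100024\t100288\n' > peaks.txt

# Add the "chr" prefix only where it is missing, then sort by
# chromosome (lexicographic), start, and end.
awk 'BEGIN { FS = OFS = "\t" }
     { if ($1 !~ /^chr/) $1 = "chr" $1; print }' peaks.txt \
  | LC_ALL=C sort -k1,1 -k2,2n -k3,3n > peaks.bed
```

Going the other way (UCSC to Ensembl) would be a matter of stripping the prefix instead, e.g. with sub(/^chr/, "", $1) in the awk body.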

Thanks - this is exactly what I was looking for! I appreciate your help.

Reply • 11 months ago by octopuslegs11 ▴ 10

I corrected my answer. The usage scenario of FIMO I documented will generate a BED file of potential TF sites, not consume it.

Reply • 11 months ago by Alex Reynolds 35k

Ok, this makes more sense. Thank you!

Out of curiosity, is it "better" to do it this way (i.e., scanning the entire genome for TF motifs, then intersecting the hits with my peaks) as opposed to generating a FASTA sequence for each of the 15,000 peaks and then scanning those for TF motifs? I guess another issue would be deciding what to use as the background model if I were to follow this latter route.

Reply • 11 months ago by octopuslegs11 ▴ 10

If those are the regions you're interested in, then use those regions and create the background from them.
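
A rough sketch of that peak-only route. It also covers scanning all peaks in one run, since FIMO accepts a multi-sequence FASTA file. The file names (genome.fa, peaks.bed, combined.meme, and the output names) are placeholders, and the zero-order background is an assumption for illustration, not a recommendation. The scan only runs when the three tools and the input files are present.

```shell
# Placeholder file names; replace with your own.
GENOME=genome.fa
PEAKS=peaks.bed
MOTIFS=combined.meme

# Check that the required tools are on the PATH before running anything.
MISSING=""
for tool in bedtools fasta-get-markov fimo; do
  command -v "$tool" >/dev/null 2>&1 || MISSING="$MISSING $tool"
done

if [ -z "$MISSING" ] && [ -f "$GENOME" ] && [ -f "$PEAKS" ] && [ -f "$MOTIFS" ]; then
  # One FASTA record per peak; headers look like >chr1:100024-100288,
  # the chr:start-end form that --parse-genomic-coord reads.
  bedtools getfasta -fi "$GENOME" -bed "$PEAKS" -fo peaks.fa

  # Zero-order background estimated from the peak sequences themselves.
  fasta-get-markov -m 0 peaks.fa peaks.bg

  # Scan every peak in a single run; results land in fimo_out/fimo.tsv.
  fimo --oc fimo_out --bgfile peaks.bg --parse-genomic-coord "$MOTIFS" peaks.fa
else
  echo "skipping scan; missing tools or input files:$MISSING" >&2
fi
```

The peaks file needs chromosome names that match the genome FASTA, otherwise bedtools will not find the sequences.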

Reply • 11 months ago by jared.andrews07 ★ 16k

I would generally do a whole-genome FIMO scan, using the reference genome (with UCSC blacklisted regions removed, say) as the background. This takes slightly longer but creates a set of FIMO hits I can bedmap against any number of regions/peaks/whatever, whenever I need to. But it may depend on what you're trying to do. If you're just doing a one-off query, then the above advice is probably fine. If you might do this again on other sets of peaks, then a whole-genome set of hits may be a useful resource.