Finding TF motifs enriched in series of ATAC-seq peaks (using fimo?)
11 months ago

Hi all,

I have a set of peaks in a .txt file (chr / start / end) and would like to know which TF motifs are enriched in each individual peak.

For example, I am looking for output that would eventually look something like this:

chr | start   | end    | TF motif
1   | 100024  | 100288 | GATA1
1   | 153313  | 155590 | RUNX1
.
.
.

Each row is a unique peak, and the TF motif shown is the most significantly enriched one for that peak.

I downloaded the JASPAR2022 core collection to get a set of PWMs for different TFs, which I then concatenated into a single .meme file (following this post: Finding individual motif occurrences with FIMO from the MEME suite), and I have started using the FIMO command line tool. However, I can only figure out how to query a single FASTA sequence at a time:

fimo --parse-genomic-coord /path/to/meme/combined.meme input.fa

Is there a way to do this such that I can query all 15,000 peaks at once, instead of doing them individually?

Thanks in advance.

MEME ATAC-Seq TF FIMO motif • 1.6k views
11 months ago

A basic scenario for running FIMO: https://bioinformatics.stackexchange.com/questions/2467/where-to-download-jaspar-tfbs-motif-bed-file/2491#2491

This approach lets you generate a BED file containing genomic regions within a given reference genome, which are putative TF binding sites ("FIMO hits").

This BED file can then be mapped against your peaks with bedmap (part of BEDOPS) or a similar tool to retrieve TF site calls located within peaks or other genomic regions, e.g.,

bedmap --echo --echo-map-id-uniq --delim '\t' peaks.bed fimoHits.bed > answer.bed
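
The bedmap call above reports every motif ID that overlaps a peak. If only one motif per peak is wanted, as in the example table in the question, one option is to keep the hit with the lowest FIMO p-value for each peak. Here is a minimal sketch, assuming a tab-separated overlap table has already been built. The file name overlaps.tsv, its column layout (peak chr, start, end, motif name, p-value), and the values are all made up for illustration, and "lowest p-value" is only one possible reading of "most significantly enriched".

```shell
# Hypothetical overlap table: peak chr, start, end, motif name, FIMO p-value.
printf 'chr1\t100024\t100288\tGATA1\t1e-6\n'  > overlaps.tsv
printf 'chr1\t100024\t100288\tRUNX1\t3e-4\n' >> overlaps.tsv
printf 'chr1\t153313\t155590\tRUNX1\t2e-5\n' >> overlaps.tsv
printf 'chr1\t153313\t155590\tGATA1\t9e-4\n' >> overlaps.tsv

# For each peak (columns 1-3), remember the motif with the smallest p-value,
# then print one row per peak and sort the result by position.
awk 'BEGIN { FS = OFS = "\t" }
     {
       key = $1 FS $2 FS $3
       if (!(key in best) || $5 + 0 < best[key]) {
         best[key]  = $5 + 0   # force numeric comparison (handles 1e-6 notation)
         motif[key] = $4
       }
     }
     END { for (key in best) print key, motif[key] }' overlaps.tsv \
  | LC_ALL=C sort -k1,1 -k2,2n > top_motif.tsv

cat top_motif.tsv
```

Ties and multiple-testing correction are ignored here, so treat the result as a starting point rather than a final call.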

Make sure the chromosome naming schemes match between the BED files before doing set operations. Your peaks might be using Ensembl's scheme (1, 2, etc.), while the FIMO output in my example will use UCSC names (chr1, chr2, etc.). An awk preprocessing step can fix this in either file. It is also worth running sort-bed on your peaks to ensure they are sorted before any operations.
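
As a rough illustration of that awk preprocessing step, the sketch below converts Ensembl-style names to UCSC-style names and sorts the result. It assumes a hypothetical headerless, tab-separated peaks.txt, and the MT to chrM mapping is an assumption about how your reference names the mitochondrial genome. Plain sort stands in for sort-bed here so the example runs without BEDOPS installed; with BEDOPS available, sort-bed is the natural choice.

```shell
# Hypothetical input: three unsorted peaks with Ensembl-style names (no "chr" prefix).
printf '1\t153313\t155590\n1\t100024\t100288\nMT\t10\t50\n' > peaks.txt

# Add a "chr" prefix where it is missing (MT becomes chrM), then sort by
# chromosome name and numerically by start and end coordinates.
awk 'BEGIN { FS = OFS = "\t" }
     $1 !~ /^chr/ { $1 = ($1 == "MT") ? "chrM" : ("chr" $1) }
     { print }' peaks.txt \
  | LC_ALL=C sort -k1,1 -k2,2n -k3,3n > peaks.bed

cat peaks.bed
```

Going the other way (stripping "chr" from the FIMO hits) works just as well; what matters is that both files end up using the same scheme.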


Thanks - this is exactly what I was looking for! I appreciate your help.


I corrected my answer. The FIMO usage scenario I documented generates a BED file of potential TF sites; it does not take one as input.


Ok, this makes more sense. Thank you!

Out of curiosity, is it "better" to do it this way (i.e., scanning the entire genome for TF motifs and then intersecting the hits with my peaks) as opposed to generating a FASTA sequence for each of the 15,000 peaks and scanning those for TF motifs? I guess another issue with the latter route would be deciding what to use as the background model.


If those are the regions you're interested in, then use those regions and create the background from them.
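
To make "create the background from them" a bit more concrete: a zero-order background is just the nucleotide frequencies of the sequences being scanned. The sketch below computes those frequencies from a tiny, hypothetical peaks.fa and writes them one letter and frequency per line, which is roughly the layout MEME background files use. The file names and sequences are made up, only the given strand is counted, and in practice the MEME suite's fasta-get-markov tool is the intended way to build this file.

```shell
# Hypothetical two-record FASTA standing in for the extracted peak sequences.
printf '>chr1:100024-100288\nACGTAC\n>chr1:153313-155590\nGGCA\n' > peaks.fa

# Count A/C/G/T across all sequence lines (headers skipped, other symbols
# such as N ignored) and print each base with its relative frequency.
awk '!/^>/ {
       seq = toupper($0)
       for (i = 1; i <= length(seq); i++) {
         b = substr(seq, i, 1)
         if (b ~ /[ACGT]/) { n[b]++; total++ }
       }
     }
     END {
       split("A C G T", bases, " ")
       for (j = 1; j <= 4; j++) printf "%s %.3f\n", bases[j], n[bases[j]] / total
     }' peaks.fa > peaks.bg

cat peaks.bg
```

A peak-derived background like this reflects the base composition of open chromatin rather than the whole genome, which is the point of the suggestion above.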


Generally, I would do a whole-genome FIMO scan, using the reference genome (with UCSC blacklisted regions removed, say) as the background. This takes slightly longer, but it creates a set of FIMO hits that I can bedmap against any number of regions, peaks, or other intervals whenever I need to. It may depend on what you're trying to do, though. If you're just doing a one-off query, then the above advice is probably fine. If you might do this again on other sets of peaks, then a whole-genome set of hits may be a useful resource.


Got it, this makes sense. Thanks again for your advice!

11 months ago
Trivas ★ 1.7k

Look into the Homer suite: http://homer.ucsd.edu/homer/motif/

11 months ago

pymemesuite makes this pretty easy.
