Question

How to find peaks that contain a specific TF motif sequence

2

4.7 years ago

km1986 ▴ 20

I am currently running motif analysis with HOMER.

I want to extract the peaks that contain a specific transcription factor motif from my peak file, but I could not figure out how to do this from the tutorials.

For example, if I run findMotifsGenome.pl on a peak file and the PU.1 motif comes out as enriched, I want to know which peaks in that file contain the PU.1 motif.

I would appreciate any advice.

Thanks in advance.

ChIP-Seq HOMER Motif analysis • 6.8k views

ADD COMMENT • link updated 4.7 years ago by ATpoint 81k • written 4.7 years ago by km1986 ▴ 20

0

Hi, have you figured it out? I am considering doing the same thing with HOMER.

ADD REPLY • link 4.5 years ago by JC ▴ 30

0

Is something wrong with the answer below?

ADD REPLY • link 4.5 years ago by ATpoint 81k

3

No, thanks for the suggestions about FIMO.

I am using HOMER for the motif analysis and got good results, so I want to get the locations of the enriched motifs.

The motif locations can be found with HOMER, as described in its documentation: http://homer.ucsd.edu/homer/ngs/peakMotifs.html

Finding Instance of Specific Motifs

By default, HOMER does not return the locations of each motif found in the motif discovery process. To recover the motif locations, you must first select the motifs you're interested in by getting the "motif file" output by HOMER. You can combine multiple motifs in single file if you like to form a "motif library". To identify motif locations, you have two options:

1. Run findMotifsGenome.pl with the "-find <motif file>" option. This will output a tab-delimited text file with each line containing an instance of the motif in the target peaks. The output is sent to stdout.

For example: findMotifsGenome.pl ERalpha.peaks hg18 MotifOutputDirectory/ -find motif1.motif > outputfile.txt

2. Run annotatePeaks.pl with the "-m <motif file>" option (see the annotation section for more info). Chuck prefers doing it this way. This will output a tab-delimited text file with each line containing a peak/region and a column containing instances of each motif separated by commas to stdout.

For example: annotatePeaks.pl ERalpha.peaks hg18 -m motif1.motif > outputfile.txt
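
To get from the "-find" output to a plain list of peaks that contain the motif, something like the following may work. It is only a sketch: the function name peaks_with_motif is made up for illustration, and it assumes the output file is tab-delimited, starts with a single header line, and has the peak ID in the first column, so check that against your own file first.

```shell
# Hypothetical helper: list each peak ID that has at least one motif instance.
# Assumption: the "-find" output is tab-delimited, starts with one header line,
# and carries the peak ID in column 1 (verify this on your own output).
peaks_with_motif() {
    # $1 = file written by "findMotifsGenome.pl ... -find motif1.motif"
    # Drop the header, keep column 1, and collapse repeated peak IDs.
    tail -n +2 "$1" | cut -f1 | sort -u
}
```

For instance, peaks_with_motif outputfile.txt > peaks_with_motif.txt would write one peak ID per line, which can then be used to subset the original peak file.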

ADD REPLY • link 4.5 years ago by JC ▴ 30

0

Cool, did not know Homer has an option to return specific motifs. Thanks, learned something new :)

ADD REPLY • link 4.5 years ago by ATpoint 81k

score 12 · Answer 1 · 2019-07-28

I use Find Individual Motif Occurrences (FIMO) from the MEME suite for this kind of analysis. It takes two inputs: a fasta file with your sequences (e.g. use bedtools getfasta to convert your peaks to fasta format) and a position frequency matrix for the TF of interest (e.g. downloaded from JASPAR or HOCOMOCO in MEME format). It then scans the sequences for significant similarity to the provided motif and returns the regions that match it.

In this example, let's scan a stretch of DNA around the first exon of the human BCL6 gene for motif occurrences against all motifs in the JASPAR vertebrate core collection. In your case, provide a fasta file with all the sequences you are interested in.

Coordinates of the query sequence (hg38): chr3:187744307-187746589

## Get JASPAR motifs (vertebrate non-redundant core collection) in meme format:
wget http://jaspar.genereg.net/download/CORE/JASPAR2018_CORE_vertebrates_non-redundant_pfms_meme.zip

## Unzip:
unzip JASPAR2018_CORE_vertebrates_non-redundant_pfms_meme.zip

## Install fimo (part of MEME):
conda install -c bioconda meme

## If fimo complains about libiconv libraries, also install that manually:
conda install -c conda-forge libiconv

## Run fimo, providing the .meme file matching your TF:
fimo --parse-genomic-coord yourTF.meme input.fa

The input.fa here looks like:

>chr3:187744307-187746589
(sequence...)

If the fasta header gives the genomic coordinates of the sequence in the form chr:start-end (1-based coordinates) and fimo is run with the --parse-genomic-coord option, the resulting GFF file reports the exact genomic coordinates of each motif match.

Check the output in GFF format, which contains the matches that pass fimo's p-value threshold (1e-4 by default):

head fimo_out/fimo.gff


##gff-version 3
chr3    fimo    nucleotide_motif    187745593   187745603   43.9    -   .   Name=MA0002.2_chr3-;Alias=RUNX1;ID=MA0002.2-RUNX1-1-chr3;pvalue=4.11e-05;qvalue= 0.177;sequence=TCTTGTGGCTT;
chr3    fimo    nucleotide_motif    187746233   187746243   40.4    +   .   Name=MA0002.2_chr3+;Alias=RUNX1;ID=MA0002.2-RUNX1-2-chr3;pvalue=9.11e-05;qvalue= 0.196;sequence=GTTTGTGGTGT;
chr3    fimo    nucleotide_motif    187744975   187744985   41.1    +   .   Name=MA0003.3_chr3+;Alias=TFAP2A;ID=MA0003.3-TFAP2A-1-chr3;pvalue=7.81e-05;qvalue= 0.323;sequence=CCCCCCAAGCA;
chr3    fimo    nucleotide_motif    187745763   187745774   41.9    +   .   Name=MA0018.3_chr3+;Alias=CREB1;ID=MA0018.3-CREB1-1-chr3;pvalue=6.41e-05;qvalue= 0.146;sequence=TGTGACGTCGGC;
chr3    fimo    nucleotide_motif    187745763   187745774   41.9    -   .   Name=MA0018.3_chr3-;Alias=CREB1;ID=MA0018.3-CREB1-2-chr3;pvalue=6.41e-05;qvalue= 0.146;sequence=GCCGACGTCACA;
chr3    fimo    nucleotide_motif    187746240   187746250   50.7    -   .   Name=MA0025.1_chr3-;Alias=NFIL3;ID=MA0025.1-NFIL3-1-chr3;pvalue=8.51e-06;qvalue= 0.0387;sequence=TTACGTAACAC;
chr3    fimo    nucleotide_motif    187746378   187746388   40.5    +   .   Name=MA0025.1_chr3+;Alias=NFIL3;ID=MA0025.1-NFIL3-2-chr3;pvalue=8.97e-05;qvalue= 0.204;sequence=ATATGTAACAA;
chr3    fimo    nucleotide_motif    187745661   187745670   40.4    -   .   Name=MA0028.2_chr3-;Alias=ELK1;ID=MA0028.2-ELK1-1-chr3;pvalue=9.09e-05;qvalue= 0.412;sequence=ACCGGAACCT;
chr3    fimo    nucleotide_motif    187745215   187745225   47.4    +   .   Name=MA0032.2_chr3+;Alias=FOXC1;ID=MA0032.2-FOXC1-1-chr3;pvalue=1.81e-05;qvalue= 0.0779;sequence=TAAATAAATAT;
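
To relate these matches back to peaks, one option is to pull out the matches for a single TF and convert them to BED. The sketch below does that with awk. The function name fimo_gff_to_bed is made up for illustration, and it assumes a real tab-separated GFF file whose attribute column contains "Alias=<TF name>;" as in the output above.

```shell
# Hypothetical helper: convert fimo GFF matches for one TF into BED intervals.
# Assumption: the GFF is tab-separated and column 9 contains ";Alias=<TF>;".
fimo_gff_to_bed() {
    # $1 = fimo.gff, $2 = TF alias as it appears in the GFF (e.g. RUNX1)
    awk -v tf="$2" 'BEGIN { FS = "\t"; OFS = "\t" }
        /^#/ { next }                        # skip GFF header/comment lines
        index($9, ";Alias=" tf ";") > 0 {    # keep matches for the requested TF
            # GFF is 1-based inclusive, BED is 0-based half-open,
            # so only the start coordinate is shifted by one.
            print $1, $4 - 1, $5, tf, $6, $7
        }' "$1"
}
```

For example, fimo_gff_to_bed fimo_out/fimo.gff RUNX1 > RUNX1_motifs.bed writes the RUNX1 matches as BED, and bedtools intersect -u -a peaks.bed -b RUNX1_motifs.bed should then report each peak that overlaps at least one match (peaks.bed stands for your own peak file).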