Best Practices For Using FIMO For Motif Scanning
1
5
7.8 years ago
UnivStudent ▴ 430

Hi everyone,

I'm having a bit of trouble working out how to use FIMO to discover TFBS. The questions I had were:

1. How do you decide on good thresholds for PWM matching (--thresh)?
2. How much do the backgrounds matter? Should these be calculated from the sequences you're submitting?
3. How does FIMO handle masked FASTA files? Is it better to submit hard- or soft-masked files?

Any other tips on good usage for this program would be much appreciated as I'm finding the documentation quite vague.

meme • 5.5k views
3
7.8 years ago
Ying W ★ 4.1k

My response is probably not that great but since nobody else is answering this question:

1. You probably want to set a q-value cutoff of something like <0.05 or <0.01, since those cutoffs are pretty standard. (--thresh applies to p-values by default; adding --qv-thresh makes it apply to q-values instead.)
2. Backgrounds matter a lot. They should be calculated from the sequence that you are looking for motifs in.
3. How are you masking the files? With upper and lower case, or with X/Ns? You should be submitting the sequence that you want to look for motifs in. If you are uninterested in motifs in repetitive regions, then you should mask them out (with Ns).
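To make the q-value cutoff in point 1 concrete, here is a minimal sketch of filtering FIMO's tab-separated output after the run. It assumes the output has an uncommented header row with a column named `q-value` (as in recent `fimo.tsv` files; older versions may differ, so check your own header). The function name `filter_hits` and the example rows are made up for illustration.

```python
import csv

def filter_hits(tsv_text, max_q=0.05, q_column="q-value"):
    """Return the rows (as dicts) whose q-value is below max_q."""
    # Drop blank lines and '#' comment lines so only the header and data rows remain.
    lines = [ln for ln in tsv_text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    kept = []
    for row in reader:
        q = row.get(q_column, "")
        if q and float(q) < max_q:
            kept.append(row)
    return kept

# Made-up example rows, not real FIMO results.
example = (
    "motif_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence\n"
    "M1\tchr1\t10\t17\t+\t12.3\t1.2e-05\t0.004\tTGACTCAT\n"
    "M1\tchr1\t90\t97\t-\t8.1\t9.0e-04\t0.31\tTGAGTAAT\n"
)
hits = filter_hits(example, max_q=0.05)
for hit in hits:
    print(hit["sequence_name"], hit["start"], hit["stop"], hit["q-value"])
```

Filtering afterwards like this lets you run FIMO once with a permissive threshold and then compare a few cutoffs without rescanning.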

Lastly, have you tried using HOMER? The documentation on it is quite extensive and you might find it easier to use

0
1. Should you be using genome-wide averages? And does it calculate the background automatically from the sequence files?
2. Currently I have soft-masked (lowercase) files, but maybe I should consider hard masking.
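If you do go with hard masking, one possible way to convert a soft-masked FASTA is to replace every lowercase base with N while leaving header lines alone. This is only a sketch; the function name `hard_mask` is hypothetical, and it assumes the file is small enough to hold in memory as a string.

```python
def hard_mask(fasta_text):
    """Replace lowercase (soft-masked) bases with N; keep '>' header lines as-is."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)  # header: do not touch lowercase text in descriptions
        else:
            out.append("".join("N" if ch.islower() else ch for ch in line))
    return "\n".join(out)

soft = ">seq1 example\nACGTacgtAC\nggTT"
print(hard_mask(soft))
```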

Also I'll look into HOMER, the main reason I started trying to use FIMO is because it seems to be the status quo in the literature for this type of thing.

0

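For 1) I am assuming you are asking about background. You should not be using genome-wide averages and should instead use a background from the sequence file. The FIMO page describes the background file option: --bgfile takes a background file, and --bgfile motif-file uses the background frequencies stored in the motif file. To generate a background from your own sequence file, the MEME Suite's fasta-get-markov can produce one to pass to --bgfile.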
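For anyone wondering what "background from the sequence file" amounts to, here is a rough sketch that counts single-nucleotide (zero-order) frequencies in a FASTA string and prints them one letter per line. The function names are hypothetical, and the "letter frequency" layout is my understanding of a zero-order MEME background file, so check it against the MEME Suite documentation before relying on it. In practice fasta-get-markov is the safer route.

```python
from collections import Counter

def zero_order_background(fasta_text, count_soft_masked=True):
    """Return A/C/G/T frequencies over all sequence lines, ignoring N and other symbols."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # skip header lines
        seq = line.strip()
        if count_soft_masked:
            seq = seq.upper()  # treat lowercase bases like uppercase ones
        counts.update(ch for ch in seq if ch in "ACGT")
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        raise ValueError("no A/C/G/T characters found")
    return {b: counts[b] / total for b in "ACGT"}

def to_background_lines(freqs):
    """Format the frequencies as one 'letter frequency' pair per line."""
    return "\n".join(f"{b} {freqs[b]:.4f}" for b in "ACGT")

# Toy input: a GC-rich sequence plus a partly soft-masked one.
fasta = ">s1\nGGCCGC\n>s2\nNNATgc"
freqs = zero_order_background(fasta)
print(to_background_lines(freqs))
```

Setting `count_soft_masked=False` leaves lowercase bases out of the counts, which matters if you plan to hard-mask them before scanning.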

The MEME Suite might have been the status quo a couple of years ago, but I believe more labs are now using HOMER. Each tool has its niche, though.

0

But wouldn't the sequence file contain subsequences that match the motif you are searching for? This would result in background frequency estimates that are skewed towards sequences containing motifs, which would increase the false discovery rate of motif searches.

I didn't quite follow what is wrong with using the whole genome to give you the background. If you are searching for motifs in non-coding regions in, say, the human genome, > 97% is non-coding. I doubt the 3% that is coding would skew your background estimates very much.

Even if one did not want to use the whole genome, perhaps looking at non-coding regions near your regions of interest would provide a less biased background estimate than using the FASTA file containing the sequences of interest.

Am I missing something here?

2

What I meant by that comment is that sequence context will often differ between the regions that transcription factors bind versus the whole genome. See the autonormalization point here: http://homer.salk.edu/homer/motif/index.html
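A small numeric illustration of why the sequence context matters: the same site gets a different log-odds score depending on which background it is compared against. This is a generic log2(motif probability / background probability) score, not FIMO's exact implementation, and the motif probabilities and background values below are invented for the example.

```python
import math

def log_odds_score(site, pwm, background):
    """Sum of log2(motif probability / background probability) over the site."""
    return sum(
        math.log2(pwm[i][base] / background[base])
        for i, base in enumerate(site)
    )

# Invented GC-rich motif with four positions (each column sums to 1).
pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

uniform = {b: 0.25 for b in "ACGT"}                   # flat background
gc_rich = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}    # e.g. GC-rich bound regions

site = "GCGC"
score_uniform = log_odds_score(site, pwm, uniform)
score_gc_rich = log_odds_score(site, pwm, gc_rich)
print(f"uniform background: {score_uniform:.2f}")
print(f"GC-rich background: {score_gc_rich:.2f}")
```

Against the GC-rich background the GC-rich site looks less surprising, so it scores lower. That is the sense in which a background that does not match the local sequence composition can inflate matches.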

0

Ah, now I see what you mean. Thanks :)