When performing motif identification with HOMER, background sequences are generated automatically and matched to the GC content of the provided sequences, unless specified otherwise. If an identical findMotifsGenome.pl command is run multiple times, a different set of background sequences is generated each time. This results in notable differences in the motifs identified, as well as in the TFs to which the motifs are assigned. Is there a way to extract the background sequences that HOMER generated, so that they can be supplied as the background for further motif analysis? I would like to confirm that the variation I am observing is in fact due to variation in the generated background and not some other issue.
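To illustrate why I suspect the background, here is a toy Python sketch of GC-matched random sampling. This is not HOMER's actual code: the bin count, the helper names (gc_bin, sample_background, make_toy_pool) and the toy sequences are all assumptions made up for illustration. It only shows that the same selection procedure returns a different background whenever the random draw changes.

```python
import random
from collections import defaultdict

N_BINS = 10  # assumed number of GC bins; not a HOMER parameter


def gc_bin(seq, n_bins=N_BINS):
    """Return the index of the GC-content bin that seq falls into."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return min(int(gc * n_bins), n_bins - 1)


def sample_background(targets, pool, seed):
    """Pick one random pool sequence from the same GC bin as each target."""
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for seq in pool:
        by_bin[gc_bin(seq)].append(seq)
    picked = []
    for target in targets:
        candidates = by_bin.get(gc_bin(target), [])
        if candidates:  # skip targets with no GC-matched candidate
            picked.append(rng.choice(candidates))
    return picked


def make_toy_pool(length=10, variants=32):
    """Build a hypothetical candidate pool covering every GC count."""
    pool = []
    for g in range(length + 1):
        for i in range(variants):
            # The first g positions are G/C, the rest A/T; bits of i pick the base.
            pool.append("".join(
                ("GC" if j < g else "AT")[(i >> j) & 1] for j in range(length)
            ))
    return pool


# Hypothetical targets; two seeds stand in for two repeat runs.
targets = ["GGCATATATA", "GCGCGATATA", "GCGCGCGATA"]
pool = make_toy_pool()
print(sample_background(targets, pool, seed=1))
print(sample_background(targets, pool, seed=2))
```

Every pick has the same GC bin as its target, yet the sequences themselves depend on the seed. If something similar happens inside HOMER, reusing one fixed background across runs should remove this source of variation.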
Arguably more complicated question:
Repeatedly running findMotifsGenome.pl on the same set of peaks produces different lists of motifs. While the motifs themselves (sequences / motif matrices) are largely preserved between runs, the TFs to which they are assigned differ. My understanding is that this is expected, since there is some random variation in how HOMER selects background sequences, which leads to slight differences in the motif matrices and in the associated TFs. While some of the TFs remain consistent, other subsets differ between runs or are entirely novel in each run. For someone interested in deriving some degree of meaning from the TFs associated with the identified motifs, this presents an issue.
Below is an example of what I am talking about:
All three of these outputs were generated from the same command using the same files. While they each contain some consistently identified TFs (CDX1, NR4A1, Rbpj1, and Znf410), many TFs were not identified in every run. However, each motif, as opposed to each TF, appears to have been identified in every run. For example, the motif for Znf740 is practically identical to that for Plag11, the motif for Sox3 is practically identical to that for Gata4, FoxH1 is very similar to Sox14, and so on.
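To make the comparison concrete, here is a minimal Python sketch of how the TF names could be tallied across repeat runs. The function name tally_tf_calls is made up, and the example lists are placeholders: the four consistent TFs are the ones named above, but which run each of the other TFs appeared in is an arbitrary arrangement for illustration, not my actual output.

```python
from collections import Counter


def tally_tf_calls(runs):
    """Split TF names into those seen in every run and those seen in only some.

    runs is a list of lists of TF names, one inner list per repeat run.
    Names are upper-cased so that e.g. "Sox3" and "SOX3" count as the same TF.
    """
    counts = Counter()
    for tf_names in runs:
        counts.update({name.upper() for name in tf_names})
    n_runs = len(runs)
    consistent = sorted(tf for tf, c in counts.items() if c == n_runs)
    variable = sorted(tf for tf, c in counts.items() if c < n_runs)
    return consistent, variable


# Placeholder assignment of TFs to runs (illustrative only).
runs = [
    ["CDX1", "NR4A1", "Rbpj1", "Znf410", "Znf740", "Sox3", "FoxH1"],
    ["CDX1", "NR4A1", "Rbpj1", "Znf410", "Plag11", "Gata4", "Sox14"],
    ["CDX1", "NR4A1", "Rbpj1", "Znf410", "Znf740", "Gata4", "Sox14"],
]
consistent, variable = tally_tf_calls(runs)
print("In every run:", consistent)
print("In only some runs:", variable)
```

A tally like this only compares names. It does not capture the fact that two differently named TFs can be assigned to nearly identical motifs, which is the part I am unsure how to handle.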
Given this, can any information reliably be drawn from the named transcription factors? Knowing the motif sequences alone doesn't really carry independent biological meaning, since functionality is attributed to the TFs, not to the motif sequences (this is a somewhat loose statement, but I think it gets at what I am trying to communicate). Is there a best practice for handling something like this? What is the most appropriate approach to take moving forward with variation like this?
As far as I can determine, although many questions have been asked here on Biostars about this issue (for example, "HOMER motif discovery gives inconsistent results on repeated runs"), there has not been much guidance on the best way to proceed with a data analysis pipeline or on how best to draw meaning from the HOMER results.
Any and all helpful input is greatly appreciated!
All the best.