Question

Deleted:Comparing motif occurrence with TF binding sites

2.8 years ago

Hughie ▴ 30

Hi, biostars,
I recently received a comment on my analysis concerning "motif occurrence". I have struggled with it for two weeks but still don't understand what it means. I'm new to this kind of analysis, so I'm posting this question to ask for suggestions. Thank you in advance for any comments!

This analysis aims to answer the question: "To what extent can the motif explain the binding specificity of transcription factor x (TFx)?"

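[Figure, image not shown. Panel A: TFx motif found in each species. Panel B: percentage of binding sites inside/outside predicted motif sites. Panel C: percentage of predicted motif sites bound by TFx.]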

Suppose we have ChIP-seq data for TFx in human, mouse, zebrafish, and fly. Notably, TFx is highly conserved across these species. After peak calling, I used MEME to search for the TFx motif in each species (Figure A). The motif was corrected for the genomic GC content of each species.

meme -mod zoops -pal -revcomp binding_sites.fasta

Next, I want to know how many binding sites contain a predicted TFx motif occurrence. To this end, I used the position weight matrix (PWM) returned by MEME for each species to predict motif sites across that species' genome with FIMO.

fimo --thresh 0.001 TFx_PWM_human.meme human_genome.fasta

Suppose I get a list of 10,000 predicted TFx motif sites in the human genome, and I have 2,000 TFx binding sites. I calculated the overlap between all TFx binding sites and the predicted motif sites, which gives the percentage of binding sites that fall inside/outside predicted motif sites (19%/81%) (Figure B). Then I wanted to know how many predicted motif sites are actually bound by TFx. From the same overlap, I found that 34% of predicted motif sites were bound by TFx (Figure C). A similar analysis was done separately for each species.
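For illustration, the two overlap percentages described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the exact procedure used in the analysis: intervals are assumed to be `(chrom, start, end)` tuples in 0-based half-open coordinates, any overlap of at least 1 bp counts, and the function names `build_index` and `fraction_overlapping` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(intervals):
    """Group intervals by chromosome; keep sorted starts and a running max of ends."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        starts = [s for s, _ in ivs]
        max_ends, running = [], -1
        for _, e in ivs:
            running = max(running, e)
            max_ends.append(running)
        index[chrom] = (starts, max_ends)
    return index


def fraction_overlapping(queries, targets):
    """Fraction of query intervals that overlap at least one target interval."""
    if not queries:
        return 0.0
    index = build_index(targets)
    hits = 0
    for chrom, start, end in queries:
        if chrom not in index:
            continue
        starts, max_ends = index[chrom]
        # Number of targets that start before the query ends.
        i = bisect_left(starts, end)
        # Among those, at least one must end after the query starts.
        if i > 0 and max_ends[i - 1] > start:
            hits += 1
    return hits / len(queries)


# Toy data (made up for the example only).
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
motifs = [("chr1", 150, 160), ("chr1", 700, 710), ("chr2", 150, 160)]

frac_peaks_with_motif = fraction_overlapping(peaks, motifs)   # as in Figure B
frac_motifs_in_peaks = fraction_overlapping(motifs, peaks)    # as in Figure C
```

Note that the two fractions answer different questions and use different denominators, so both depend directly on how many motif sites the scanning threshold lets through.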

This analysis led me to conclude that the motif alone is not sufficient to explain TFx's binding specificity, and that other factors may be involved in regulating its binding. I received the following expert comment on this analysis:

I think there lacks an explanation for the statistical model underlying the motif occurrence searching. I understand that some pre-existing software was used for this, but that does not mean the built-in assumptions are appropriate for the questions in this analysis. Given a motif (represented as a position-frequency matrix, and possibly corrected for genomic nucleotide content), there are numerous ways that this motif can be transformed into a position-specific scoring matrix, and then many different ways that the corresponding scores can be assessed for statistical significance. Because there is no universally appropriate way to do these things, the precise technique used should be tailored to the statistical question asked. Or at least clearly justified. In this case, you are comparing TFx binding sites with the locations of motif occurrences. Those occurrences may be more or less abundant in certain species, for artifactual reasons. In comparisons such as those described in your analysis, I would want criteria that puts all the species on a level playing field, and most ways to do this would result in almost the same frequency of sites being identified (at a particular statistical cutoff) in the different species. I am almost of the opinion that the identical number of sites should be found in every species, but would need to think more deeply about that. These cutoffs are also important when comparing two motifs: if the information content in two motifs differs, then the expected number of occurrences using some common models might be very different. The information content seems to be different for the motifs in Figure A.
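To make the terms in this comment concrete, here is a minimal sketch of one common way to turn a position-frequency matrix into a log-odds position-specific scoring matrix, and to compute the motif's information content relative to a background. It is only one of the many possible conventions the comment alludes to, and it is not claimed to be what MEME or FIMO do internally. The pseudocount scheme, the toy counts, the background frequencies, and the function names are all assumptions made for the example.

```python
import math


def column_frequencies(col, background, pseudocount=1.0):
    """Smoothed base frequencies for one motif column (pseudocount split by background)."""
    total = sum(col.values()) + pseudocount
    return {b: (col.get(b, 0) + pseudocount * bg) / total
            for b, bg in background.items()}


def pfm_to_pssm(counts, background, pseudocount=1.0):
    """Log2-odds score of each base at each position versus the background."""
    pssm = []
    for col in counts:
        freqs = column_frequencies(col, background, pseudocount)
        pssm.append({b: math.log2(freqs[b] / background[b]) for b in background})
    return pssm


def information_content(counts, background, pseudocount=1.0):
    """Relative entropy (bits) of the motif versus the background, summed over positions."""
    ic = 0.0
    for col in counts:
        freqs = column_frequencies(col, background, pseudocount)
        ic += sum(f * math.log2(f / background[b]) for b, f in freqs.items())
    return ic


# Toy one-column motif and two hypothetical backgrounds.
toy_counts = [{"A": 10, "C": 0, "G": 0, "T": 0}]
uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich_bg = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}

ic_uniform = information_content(toy_counts, uniform_bg)
ic_at_rich = information_content(toy_counts, at_rich_bg)
```

In this toy case the same counts give a lower information content against the AT-rich background than against the uniform one, which illustrates why the choice of background model changes both the scores and how often a match is expected by chance.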

I have broken this comment down into several points:

Why do the motifs look different in each species? In the ideal situation, I think TFx should have exactly the same binding specificity in all species because it is conserved. However, each motif was derived a posteriori from many input sequences (binding sites), and other factors (such as cofactors) may affect TFx binding. This would explain why the four motifs differ in information content.
Since these four species have different genome sequences, it seems intuitive that using the species-specific motif for FIMO prediction will yield different numbers of predicted motif sites. I also think that even using the human TFx motif to search all four species would yield different numbers of motif sites. So it is really difficult for me to understand why the comment says "almost the same frequency of sites being identified (at a particular statistical cutoff) in the different species. I am almost of the opinion that the identical number of sites should be found in every species".
I currently have no idea what "criteria that puts all the species on a level playing field" would be.
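One possible reading of "the same frequency of sites at a particular statistical cutoff" can be sketched with simple arithmetic. If the null model behind the p-value is accurate, a p-value cutoff fixes the per-position rate of chance matches, so the expected number of chance matches scales only with the number of positions scanned. The sketch below assumes both strands are scanned; the genome lengths are rough round figures and the motif width of 10 is hypothetical, both chosen only for illustration.

```python
def expected_chance_hits(genome_length, motif_width, p_value, both_strands=True):
    """Expected number of matches passing a p-value cutoff under the null model."""
    positions = max(genome_length - motif_width + 1, 0)
    if both_strands:
        positions *= 2
    return p_value * positions


# Rough, rounded genome sizes in bp (assumed for illustration only).
approx_genome_bp = {
    "human": 3_100_000_000,
    "mouse": 2_700_000_000,
    "zebrafish": 1_400_000_000,
    "fly": 140_000_000,
}

for species, length in approx_genome_bp.items():
    hits = expected_chance_hits(length, motif_width=10, p_value=0.001)
    # Chance hits per position are the same in every species at a fixed cutoff.
    print(f"{species}: about {hits:,.0f} chance hits, {hits / length:.4f} per bp")
```

Under this reading, the hits per base pair come out nearly identical across species while the absolute counts differ with genome size, so comparing raw motif counts between species mixes genome size and cutoff effects with any real biological difference.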

Many thanks for taking the time to read this question. Any comments, references, or criticisms are appreciated!

FIMO motif TF MEME • 574 views
