I have a list of mutations of interest from the coding region in an experiment that I am performing. As input data I have the mutation position, base substitution type (C>T, A>G, etc.), chromosome, and gene name. I was curious to explore the sequences surrounding those mutation positions. To do that, I extracted the raw nucleotide sequences 10 bases upstream and downstream of each mutation position and plotted a sequence logo for them. This is the image.

  • One thing to note from this image is that C and G nucleotides are highly conserved in the majority of the locations. How do I build a background model for this and argue that whatever I am noticing here is not by chance and is significant?
  • I was also thinking about extracting motifs from the flanking nucleotides and seeing whether certain sequence motifs are overrepresented around the mutations. Given that I am new to this field, is there a systematic way to do that?
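One possible way to frame the background-model question is sketched below in Python. It assumes the mutation-centred windows are compared against a set of background windows drawn from comparable coding positions. How those background windows are chosen is an assumption here and is not defined anywhere in this thread. For each column, the sketch counts C and G and computes an exact binomial upper-tail probability against the pooled background C+G frequency. The function names are hypothetical, and with 21 columns tested the p-values would still need a multiple-testing correction.

```python
from collections import Counter
from math import comb


def position_counts(seqs):
    """Count each base at each column of a list of equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return [Counter(s[i] for s in seqs) for i in range(length)]


def background_freqs(seqs):
    """Pooled base frequencies over all background windows (assumed input)."""
    counts = Counter("".join(seqs))
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def gc_enrichment(observed, background):
    """Per-column C+G count and upper-tail p-value against the background."""
    bg = background_freqs(background)
    p_gc = bg["C"] + bg["G"]
    n = len(observed)
    results = []
    for pos, counts in enumerate(position_counts(observed)):
        k = counts["C"] + counts["G"]
        results.append((pos, k, binom_upper_tail(k, n, p_gc)))
    return results


if __name__ == "__main__":
    # Toy, made-up windows purely for illustration (not real data).
    observed = ["CGA", "CGT", "CCA", "GGA"]
    background = ["ACGT", "TTAA", "GCAT", "ATCG"]
    for pos, k, pval in gc_enrichment(observed, background):
        print(pos, k, pval)
```

The same counting idea could be extended to short k-mers in the flanks for the motif question, again compared against the same kind of background set.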


Your seqlogo image shows the same proportion of each nucleotide at every location. If you want to get a 'conservation' score out of your region, you need to give it other species' sequences for context, and that's going to be tricky to define. Why not just download a public conservation track for the region?

I understand your point. Can you tell me a bit more about downloading a "public conservation track for the region"? Where can I find this?

I don't know your experimental design or species of interest, but you'll see that this exon is missing in chimpanzee.

