I have a list of mutations of interest from the coding region in an experiment that I am performing. For each mutation, I have the position, the base substitution type (C>T, A>G, etc.), the chromosome, and the gene name as input data. I wanted to explore the sequences surrounding these mutation positions. To do that, I extracted the raw nucleotide sequences 10 bases upstream and downstream of each mutation position and plotted a sequence logo from them. This is the resulting image.
- One thing to note from this image is that C and G nucleotides are highly conserved at the majority of positions. How do I build a background model for this and argue that the pattern I am seeing is statistically significant rather than due to chance?
- I was also thinking about extracting motifs from the flanking nucleotides and checking whether certain sequence motifs are overrepresented around the mutations. Given that I am new to this field, is there a systematic way to do that?
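
For reference, the extraction step described above (10 bases on either side of each mutation, then per-position base counts as the input to a logo) can be sketched roughly as follows. This is only a minimal illustration, not my exact pipeline: the function names `extract_flank` and `position_counts` are hypothetical, the toy sequence and positions are made up, positions are assumed to be 1-based, the chromosome sequence is assumed to be already loaded as a plain string, and strand is ignored.

```python
from collections import Counter


def extract_flank(chrom_seq, pos, flank=10):
    """Return the mutated base plus `flank` bases on each side.

    `pos` is assumed to be 1-based. Returns None when the window
    would run past either end of the sequence.
    """
    start = pos - 1 - flank  # 0-based start of the window
    end = pos + flank        # 0-based exclusive end of the window
    if start < 0 or end > len(chrom_seq):
        return None
    return chrom_seq[start:end].upper()


def position_counts(windows):
    """Count A/C/G/T in each column of a list of equal-length windows."""
    length = len(windows[0])
    counts = []
    for i in range(length):
        column = Counter(w[i] for w in windows)
        counts.append({base: column.get(base, 0) for base in "ACGT"})
    return counts


# Toy data (made up for illustration only).
toy_chrom = "ACGT" * 8          # 32 bases
toy_positions = [12, 16, 3]     # position 3 is too close to the start

windows = []
for p in toy_positions:
    w = extract_flank(toy_chrom, p, flank=10)
    if w is not None:           # skip windows that fall off the sequence
        windows.append(w)

counts = position_counts(windows)
```

Each window is 21 bases long with the mutated base at index 10, so `counts[10]` holds the base counts at the mutation position itself and the other entries hold the counts for the flanking positions.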