Question: Significance of motifs around mutations
14 months ago by
banerjeeshayantan160 wrote:

I have a set of disease-causing mutations and their flanking neighborhoods (4 bp on each side). I want to determine whether these flanking neighborhoods differ from those of non-disease mutations, for example:
ATTG M TTGA (M=Mutation, disease-causing)
TTAG M GAGG (M=Mutation, non-disease causing)
How do I estimate the background distribution of all such neighborhoods (4 bp on each side) across the entire genome? Can you help me formulate a statistical test to differentiate between the two?

R assembly • 405 views
14 months ago by
geek_y10k wrote:

Do you have a specific motif around the disease-causing mutations? If you have one specific motif, you can check how often you observe the same motif around non-disease-causing SNPs and perform a simple Fisher's exact test.
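To make the Fisher's test suggestion concrete, here is a minimal sketch in plain Python. It assumes each flank is stored as a string such as "ATTG_TTGA" (upstream, an underscore at the mutated base, downstream), and the function names `motif_table` and `fisher_exact_two_sided` are made up for illustration. In practice an existing implementation such as R's `fisher.test` would do the same job.

```python
from math import comb


def motif_table(disease_flanks, control_flanks, motif):
    """Build the 2x2 counts: flanks with / without the motif in each group."""
    a = sum(motif in f for f in disease_flanks)   # disease, motif present
    c = sum(motif in f for f in control_flanks)   # non-disease, motif present
    return a, len(disease_flanks) - a, c, len(control_flanks) - c


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x motif hits falling in the disease row
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    # (small tolerance guards against floating-point ties)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Toy flanks only, not real data
disease = ["ATTG_TTGA", "AATG_CCGA"]
control = ["TTAG_GAGG"]
table = motif_table(disease, control, "ATTG")
p_value = fisher_exact_two_sided(*table)
```

The underscore keeps a motif from matching across the mutated position; whether that is what you want depends on how you define the neighborhood.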

If you don't have a prior motif/k-mer, you can take all your disease-causing mutations and perform a k-mer enrichment analysis, or a typical motif analysis, to find enriched patterns.

Do the same for the non-disease-causing SNPs; if the signal is specific to disease, you may not find similar k-mers or motifs around the non-disease-causing mutations.
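A rough sketch of what such a k-mer comparison could look like, again in plain Python. It assumes flanks are stored as (upstream, downstream) tuples, and the function names and the 0.5 pseudocount are illustrative choices, not part of any particular tool. It only ranks k-mers by log odds ratio; for real data you would still want a per-k-mer test and a multiple-testing correction on top of this.

```python
from collections import Counter
from math import log2


def kmer_presence_counts(flanks, k):
    """For each k-mer, count how many flanks contain it at least once."""
    counts = Counter()
    for upstream, downstream in flanks:
        seen = set()
        for side in (upstream, downstream):   # never span the mutated base
            for i in range(len(side) - k + 1):
                seen.add(side[i:i + k])
        counts.update(seen)
    return counts


def kmer_log_odds(disease_flanks, control_flanks, k=3, pseudo=0.5):
    """Rank k-mers by log2 odds ratio (disease vs. non-disease flanks)."""
    dis = kmer_presence_counts(disease_flanks, k)
    ctl = kmer_presence_counts(control_flanks, k)
    n_dis, n_ctl = len(disease_flanks), len(control_flanks)
    scores = {}
    for kmer in set(dis) | set(ctl):
        a, c = dis[kmer], ctl[kmer]
        # Pseudocount avoids division by zero for k-mers absent from one group
        odds_dis = (a + pseudo) / (n_dis - a + pseudo)
        odds_ctl = (c + pseudo) / (n_ctl - c + pseudo)
        scores[kmer] = log2(odds_dis / odds_ctl)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy flanks only, not real data
disease = [("ATTG", "TTGA"), ("AATG", "TTGC")]
control = [("TTAG", "GAGG"), ("CCAG", "GAGT")]
ranked = kmer_log_odds(disease, control, k=3)
```

K-mers at the top of `ranked` are more common around the disease mutations and those at the bottom are more common around the non-disease ones.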

PS: I'm not a geneticist.


Thanks for your reply. I have tried something very similar to your first suggestion: I looked for motifs that are over-represented around disease-causing mutations compared with non-disease mutations, and I have a list of such motifs with their respective significance values. The problem is that I am unable to attach any biological significance to them. Say AATG is over-represented around a particular mutation type (C>G): how do I go back and check whether there is any biological basis for that? Any help is greatly appreciated!

written 14 months ago by banerjeeshayantan160

That needs more information and biological context, such as whether they are coding, non-coding, or splice-site variants, etc.

written 14 months ago by geek_y10k
14 months ago by
CIMMYT, Mexico
pltbiotech_tkarthi180 wrote:

If you have your SNP sequences in aligned FASTA format (or you can create aligned SNPs using snp-sites, which is Linux-based), you can import them into Tassel or other online software to create a HapMap file. Separately, run your reference sequence through the ENSEMBL blastn server to map the positions of your original reference sequence (not the SNP file). Then compare the SNP positions from your SNP data with the ENSEMBL result to place them accurately, and create a VCF file from the HapMap file, again within Tassel. Once you have the VCF file, use VEP at ENSEMBL to detect variant forms of the sequences at the nucleotide and protein level around your region of interest.


This has nothing to do with what the OP wants.

written 14 months ago by geek_y10k