Question: Significance of motifs around mutations
4 weeks ago by
banerjeeshayantan80 wrote:

I have a set of disease-causing mutations and their flanking neighborhoods (4 bp on each side). I want to determine whether these flanking neighborhoods differ from those of non-disease mutations, for example:
ATTG M TTGA (M=Mutation, disease-causing)
TTAG M GAGG (M=Mutation, non-disease causing)
How do I estimate the background distribution of all such 4bp on each side neighborhoods for the entire genome? Can you help me formulate a statistical test to differentiate between the two?

R assembly • 183 views
modified 22 days ago by geek_y9.1k • written 4 weeks ago by banerjeeshayantan80
22 days ago by
geek_y9.1k wrote:

Do you have a specific motif around the disease-causing mutations? If you have one specific motif, you can check how often you observe the same motif around non-disease-causing SNPs and perform a simple Fisher's exact test.
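To make the single-motif comparison concrete, here is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table, written with the Python standard library only. The function name `fisher_exact_two_sided` and the counts in the usage lines are made up for illustration; they do not come from the original post.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: disease mutations with the motif      b: disease mutations without it
    c: non-disease mutations with the motif  d: non-disease mutations without it
    """
    row1, row2 = a + b, c + d      # number of disease / non-disease mutations
    col1 = a + c                   # total number of motif hits
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x motif hits in the disease set,
        # with all table margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more probable than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)

# Hypothetical counts: motif seen at 30 of 200 disease mutations
# and at 12 of 250 non-disease mutations.
p_value = fisher_exact_two_sided(30, 170, 12, 238)
print(f"two-sided p = {p_value:.3g}")
```

If you test many motifs this way, the p-values need a multiple-testing correction (for example Benjamini-Hochberg) before you interpret them.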

If you don't have a prior motif/k-mer, you can take all your disease-causing mutations and perform a k-mer enrichment analysis or a typical motif analysis to find enriched patterns.

Do the same for the non-disease-causing SNPs; you may not find similar k-mers or motifs around the non-disease-causing mutations.
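A rough sketch of what such a k-mer comparison could look like, assuming the flanks are available as plain strings (one flank per mutation). The function names, the pseudocount and the toy sequences are all illustrative assumptions, and the log2 ratio is only a ranking score, not a significance test.

```python
from collections import Counter
from math import log2

def kmer_counts(flanks, k):
    """Count how many flank sequences contain each k-mer at least once."""
    counts = Counter()
    for seq in flanks:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def kmer_enrichment(disease, control, k=3, pseudo=0.5):
    """Rank k-mers by log2(frequency in disease flanks / frequency in control flanks)."""
    dis, con = kmer_counts(disease, k), kmer_counts(control, k)
    rows = []
    for kmer in set(dis) | set(con):
        # The pseudocount keeps the ratio finite when a k-mer is absent from one set.
        f_dis = (dis[kmer] + pseudo) / (len(disease) + 2 * pseudo)
        f_con = (con[kmer] + pseudo) / (len(control) + 2 * pseudo)
        rows.append((kmer, dis[kmer], con[kmer], log2(f_dis / f_con)))
    return sorted(rows, key=lambda r: r[3], reverse=True)

# Toy flanks, invented for illustration only.
disease_flanks = ["ATTG", "ATTC", "GATT", "CCGA"]
control_flanks = ["TTAG", "GAGG", "CCGA", "ACGT"]

ranked = kmer_enrichment(disease_flanks, control_flanks, k=3)
for kmer, n_dis, n_con, score in ranked[:3]:
    print(kmer, n_dis, n_con, round(score, 2))
```

In practice you would probably keep the left and right flanks separate (or encode the position relative to the mutation), and follow the ranking with a per-k-mer test plus multiple-testing correction.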

PS: I'm not a geneticist.

modified 22 days ago • written 22 days ago by geek_y9.1k

Thanks for your reply. I have tried something very similar to your first suggestion: I looked for motifs that are over-represented around disease-causing mutations compared to non-disease ones, and I have a list of such motifs with their respective significance values. The problem is that I am unable to attach any biological significance to them. Say I have AATG over-represented around a particular mutation type (C>G): how do I go back and check whether there is any biological basis for such behaviour? Any help is greatly appreciated!

written 22 days ago by banerjeeshayantan80

That needs more information and biological context, such as whether they are coding, non-coding or splice-site variants, etc.

written 22 days ago by geek_y9.1k
23 days ago by
CIMMYT, Mexico
pltbiotech_tkarthi70 wrote:

If you have your SNP sequences in aligned FASTA format (or you can create aligned SNPs using the Linux-based snp-sites tool), then you can import them into Tassel or other online software to create a HapMap file. Separately, use your reference sequence in the Ensembl blastn server to map the positions of your original reference sequence (not the SNP file). Subsequently, compare the SNP positions from your SNP data with the Ensembl result to pin the positions down accurately, and create a VCF file from the HapMap file, again in Tassel. Once you have the VCF file, use VEP at Ensembl to detect variant forms of the sequences at the nucleotide and protein level around your region of interest.

written 23 days ago by pltbiotech_tkarthi70

This has nothing to do with what the OP wants.

written 22 days ago by geek_y9.1k