Question

Probability Of Finding A Tri-Nucleotide Motif In A Sequence Of Known Composition

1

Entering edit mode

10.1 years ago

Adam Taranto ▴ 40

I am trying to identify genomic windows in which the sequence is enriched in CpHpG or CpHpH motifs. Where H is any base that is not a G [ATC].

I have done this already for the di-nucleotide motif CpG by setting a threshold value for Obs/Exp as per EMBOSS's cpgplot, wherein:

Observed = Count of CpG di-nucleotides in a window
Expected = (number of C's * number of G's) / window length

It is easy enough to count the number of CpHpG motifs in a given window, as these cannot overlap. However, I'm not sure how to best calculate the number of CpHpG's that would be observed by chance in a sequence of known length where we know the proportion of all bases.

Any suggestions appreciated!

motif sequence • 3.0k views

ADD COMMENT • link updated 9.9 years ago by dariober 14k • written 10.1 years ago by Adam Taranto ▴ 40

1

Entering edit mode

9.9 years ago

Asaf 10k

You're facing another problem here which is the bias of di-nucleotides. If for instance your sequence is is enriched with CpH and HpG you will observe more CHG than expected when counting single nucleotides (C's * H's * G's) but it might not be very surprising taking into account the occurences of CH's and HG's. I think than the best way to determine enrichment here would be to generate random sequences using the di-nucleotide frequencies and for each such random sequence count the number of CHG's and compute an empirical p-value (number of shuffles for which the CHG was >= real sequence +1/ number of shuffles+1).

You can use this answer (of brentp) to generate shuffled sequences that sample from the same distribution of di-nucleotides.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Asaf 10k

Ram · Accepted Answer · 2014-05-24

Hi Adam,

A while ago I wrote a java program to test whether a regular expression appears in a sequence more often than expected by chance given the nucleotide composition of the sequence. P-value for enrichment is obtained by shuffling the sequence either by simple random sampling with replacement or by using a Markov model of given order.

I put it here, source is here.

It might suite the task is addressing.

Show the help:

java -jar RegexEnrichment.jar -h

A simple test fasta sequence:

echo ">seq1" > test.fa
echo "AAAAAAAAAAAACTGAAAAAAAAACAGAAAAAAAAAAAAACAGAAAAAAAAAAAACTGAAAAAAAAACAGAAAAAAAAAAAAACAG" >> test.fa

Example:

Get the enrichment of the CHG pattern in windows of 40bp. Shuffle sequences using a first order Markov model. By default, it returns a line in bed format for each pattern found. If you are interested only in windows, pipe the output to awk & uniq:

java -jar RegexEnrichment.jar -w 40 --norevcomp -omc 1 -p 'C[ACT]G' -f test.fa \
| awk 'BEGIN{OFS="\t"}{print $1, $7, $8, $5}'\
| uniq

seq1    0     40    0.3750
seq1    40    80    0.0800
seq1    80    86    0.2200

Last column is an empirical p-value for the pattern being enriched by chance in each tested window. (Windows without pattern are not outputted).

Any feedback welcome...