Probability Of Finding A Tri-Nucleotide Motif In A Sequence Of Known Composition
2
1
Entering edit mode
10.1 years ago
Adam Taranto ▴ 40

I am trying to identify genomic windows in which the sequence is enriched in CpHpG or CpHpH motifs. Where H is any base that is not a G [ATC].

I have done this already for the di-nucleotide motif CpG by setting a threshold value for Obs/Exp as per EMBOSS's cpgplot, wherein:

Observed = Count of CpG di-nucleotides in a window
Expected = (number of C's * number of G's) / window length​

It is easy enough to count the number of CpHpG motifs in a given window, as these cannot overlap. However, I'm not sure how to best calculate the number of CpHpG's that would be observed by chance in a sequence of known length where we know the proportion of all bases.

Any suggestions appreciated!

motif sequence • 3.0k views
ADD COMMENT
1
Entering edit mode
9.9 years ago

Hi Adam,

A while ago I wrote a java program to test whether a regular expression appears in a sequence more often than expected by chance given the nucleotide composition of the sequence. P-value for enrichment is obtained by shuffling the sequence either by simple random sampling with replacement or by using a Markov model of given order.

I put it here, source is here.

It might suite the task is addressing.

Show the help:

java -jar RegexEnrichment.jar -h

A simple test fasta sequence:

echo ">seq1" > test.fa
echo "AAAAAAAAAAAACTGAAAAAAAAACAGAAAAAAAAAAAAACAGAAAAAAAAAAAACTGAAAAAAAAACAGAAAAAAAAAAAAACAG" >> test.fa

Example:

Get the enrichment of the CHG pattern in windows of 40bp. Shuffle sequences using a first order Markov model. By default, it returns a line in bed format for each pattern found. If you are interested only in windows, pipe the output to awk & uniq:

java -jar RegexEnrichment.jar -w 40 --norevcomp -omc 1 -p 'C[ACT]G' -f test.fa \
| awk 'BEGIN{OFS="\t"}{print $1, $7, $8, $5}'\
| uniq

seq1    0     40    0.3750
seq1    40    80    0.0800
seq1    80    86    0.2200

Last column is an empirical p-value for the pattern being enriched by chance in each tested window. (Windows without pattern are not outputted).

Any feedback welcome...

ADD COMMENT
1
Entering edit mode
9.9 years ago
Asaf 10k

You're facing another problem here which is the bias of di-nucleotides. If for instance your sequence is is enriched with CpH and HpG you will observe more CHG than expected when counting single nucleotides (C's * H's * G's) but it might not be very surprising taking into account the occurences of CH's and HG's. I think than the best way to determine enrichment here would be to generate random sequences using the di-nucleotide frequencies and for each such random sequence count the number of CHG's and compute an empirical p-value (number of shuffles for which the CHG was >= real sequence +1/ number of shuffles+1).

You can use this answer (of brentp) to generate shuffled sequences that sample from the same distribution of di-nucleotides.

ADD COMMENT

Login before adding your answer.

Traffic: 2035 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6