Question: How can I shuffle a set of sequences while preserving the dinucleotide distribution?
gravatar for glocke01
5.6 years ago by
United States
glocke01180 wrote:

I have a set of 500 100-base sequences.  I want to make new sequences by randomizing this set into 500 new sequences such that the new set has the same dinucleotide distribution.  So, for each sequence in my original set, I'm cutting the sequence up into mononucleotides, putting them in a hat, and drawing them out one by one, but the randomized result should have the same dinucleotide distribution.  How do I do this?

The best idea I can come up with at present is to simply repeat a purely random shuffling until I find a result that is sufficiently close to the original.  (closeness may be assessed with a paired t-test or, more simply, by taking the difference between the original distribution and the random one.)  Is there a better way?  I could slice the original sequences by twos to jump-start the process?


background: I'm trying to reproduce a procedure described in in Weirauch et al. Nature 2012, (  In trying to evaluate a TF binding prediction algorithm's success at predicting in vivo binding sites, they start with some ChIP-Seq or ChIP-exo data.  From these measurements, they pick 500 loci where the TF is credibly believed to bind, and then test the algorithm's ability to distinguish the true sites from 500 control sites.  This is one of the methods they use to create control sites, and they describe it thusly: "500 randomly shuffled positive sequences, where dinucleotide frequencies were maintained." This is the whole description in the methods section, leaving a few things unclear.

  1. do i shuffle the sequences individually or as a set?  I mean, do I (a) slice all the sequences in the set and put them in a hat, and then test the dinuc distribution of the new set or (b) shuffle each sequence individually, as described above?
  2. How precisely is the dinucleotide distribution maintained?
validation random control dna • 6.9k views
ADD COMMENTlink modified 5.5 years ago by UnivStudent380 • written 5.6 years ago by glocke01180
gravatar for Paul Harrison
5.5 years ago by
Paul Harrison30 wrote:

This is a non-trivial problem involving Eulerian walks. uShuffle claims to provide a solution:

ADD COMMENTlink written 5.5 years ago by Paul Harrison30

And here's the reference for uShuffle: Jiang et al. BMC Bioinformatics 2008, 9:192  

ADD REPLYlink written 5.5 years ago by glocke01180

The updated web:
and a Cython wrapper:

ADD REPLYlink modified 16 months ago • written 16 months ago by samzhao10
gravatar for UnivStudent
5.5 years ago by
UnivStudent380 wrote:

Most people use this script ( to shuffle sequences while maintaining dinucleotide frequency (It's even code that's present in the MEME suite). written originally by P. Clote,  or a derivative of it. I found this copy online in the Bias-Away package by the Wasserman lab. It works by shuffling each sequence to make a permuted sequence with the original dinucleotide content. So the entire procedure is to shuffle each peak sequence and compare TF binding scores on the original sequences versus all scrambled sequences.

ADD COMMENTlink written 5.5 years ago by UnivStudent380

i got a hold of this same script.  I haven't taken the time to read it and figure out what it's actually doing, but I can tell you that the eponymous Altschul and Erickson published their algorithm in Mol Biol Evol (1985) 2 (6):526-538. 

ADD REPLYlink written 5.5 years ago by glocke01180
gravatar for Michael Dondrup
5.6 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

You need to reduce the problem to a single shuffling step such that for each such permutation generated, the di-nucleotide frequencies are the same. An example, imagine you want to shuffle the first C, position


Look for all locations where there is a AT, choose one randomly and insert C.



 Each step trivially preserves the di-nucleotide frequency.


ADD COMMENTlink written 5.6 years ago by Michael Dondrup47k

Nice! Mine algorithm cant handle such cases, while yours cant handle cases such as AGGGCATTTTC -> ATTTTCAGGGC. So I suggest applying the algorithms one after another will give a good result :)

ADD REPLYlink written 5.6 years ago by mikhail.shugay3.4k
gravatar for matted
5.6 years ago by
Boston, United States
matted7.3k wrote:

You can solve this with a hidden Markov model over dinucleotides.  This is a standard approach in areas like motif identification, where people generate background models with fixed dinucleotide (or trinucleotide, or higher) frequencies.

You generate the transition matrix by observing which dinucleotides lead to which other ones in your training set, where the dinucleotide constraint means that you have to transition to a sequence that starts with the base that you ended with.  Then you simulate a random draw from the HMM by picking an initial dinucleotide (according to the marginal distribution) and then drawing from the (observed / target) transition distribution to add another base.  You do this until you read your desired simulated sequence length.

This will preserve the dinucleotide frequencies in an approximate way, and the approximation will be increasingly good as you generate more and longer sequences.

Chapter 3 of the classic "Biological Sequence Analysis" goes into this and the details of HMMs.

Edit:  Here's some code that does this, with selectable k-mer size.  It might crash if your final k-mer is unique (so it has nothing to transition to if it draws that one), but this is rare if you have long sequences.


ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by matted7.3k
gravatar for mikhail.shugay
5.6 years ago by
Czech Republic, Brno, CEITEC
mikhail.shugay3.4k wrote:

A similar question: Creating Reads With Same Nucleotides And Dinucleotides- I.E. Scrambling Reads

Looks like quite a complex question.

I can suggest this algorithm, at least it passes test for equal dinucleotide frequency (runs as Groovy script):

The basic idea is that if there exists some position i, which separates the sequence into two subsequences with the same starting and ending base, those subsequences can be swapped. This idea is then applied recursively.

ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by mikhail.shugay3.4k
gravatar for thackl
5.6 years ago by
thackl2.8k wrote:

A simple way, but not necessarily elegant way:

1. determine your composition, e.g.

# A:10%, T=20%, G=30%, C=40%

2. generate your random nucleotides

r = rand between 0 and almost 1 # e.g. perl: rand(1)
if $r < 0.1            => A
if $r >= 0.1 or < 0.3  => T
if $r >= 0.3 or < 0.6  => G
if $r >= 0.6 or < 1    => C
ADD COMMENTlink written 5.6 years ago by thackl2.8k

op needs di-nucleotide frequencies, like AA: 0.1, AC: 0.2, AT: 0.05, ....

In fact this is more difficult than it seems at first glance.

ADD REPLYlink written 5.6 years ago by Michael Dondrup47k

Thoroughly reading helps ... Sorry about nonsense answer..


ADD REPLYlink written 5.6 years ago by thackl2.8k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 880 users visited in the last hour