Question: How can I shuffle a set of sequences while preserving the dinucleotide distribution?
5
4.5 years ago by
glocke01160
United States
glocke01160 wrote:

I have a set of 500 100-base sequences.  I want to make new sequences by randomizing this set into 500 new sequences such that the new set has the same dinucleotide distribution.  So, for each sequence in my original set, I'm cutting the sequence up into mononucleotides, putting them in a hat, and drawing them out one by one, but the randomized result should have the same dinucleotide distribution.  How do I do this?

The best idea I can come up with at present is to simply repeat a purely random shuffling until I find a result that is sufficiently close to the original.  (closeness may be assessed with a paired t-test or, more simply, by taking the difference between the original distribution and the random one.)  Is there a better way?  I could slice the original sequences by twos to jump-start the process?

background: I'm trying to reproduce a procedure described in in Weirauch et al. Nature 2012, (http://www.nature.com/nbt/journal/v31/n2/full/nbt.2486.html?WT.ec_id=NBT-201302).  In trying to evaluate a TF binding prediction algorithm's success at predicting in vivo binding sites, they start with some ChIP-Seq or ChIP-exo data.  From these measurements, they pick 500 loci where the TF is credibly believed to bind, and then test the algorithm's ability to distinguish the true sites from 500 control sites.  This is one of the methods they use to create control sites, and they describe it thusly: "500 randomly shuffled positive sequences, where dinucleotide frequencies were maintained." This is the whole description in the methods section, leaving a few things unclear.

1. do i shuffle the sequences individually or as a set?  I mean, do I (a) slice all the sequences in the set and put them in a hat, and then test the dinuc distribution of the new set or (b) shuffle each sequence individually, as described above?
2. How precisely is the dinucleotide distribution maintained?
validation random control dna • 5.7k views
modified 4.4 years ago by UnivStudent380 • written 4.5 years ago by glocke01160
3
4.4 years ago by
Australia
Paul Harrison30 wrote:

This is a non-trivial problem involving Eulerian walks. uShuffle claims to provide a solution:

http://digital.cs.usu.edu/~mjiang/ushuffle/

And here's the reference for uShuffle: Jiang et al. BMC Bioinformatics 2008, 9:192 http://www.biomedcentral.com/1471-2105/9/192

1

The updated web: https://cs.usu.edu/people/MinghuiJiang/ushuffle/
and a Cython wrapper: https://github.com/guma44/ushuffle

2
4.4 years ago by
UnivStudent380
UnivStudent380 wrote:

Most people use this script (altschulEriksonDinuclShuffle.py) to shuffle sequences while maintaining dinucleotide frequency (It's even code that's present in the MEME suite). written originally by P. Clote,  or a derivative of it. I found this copy online in the Bias-Away package by the Wasserman lab. It works by shuffling each sequence to make a permuted sequence with the original dinucleotide content. So the entire procedure is to shuffle each peak sequence and compare TF binding scores on the original sequences versus all scrambled sequences.

1

i got a hold of this same script.  I haven't taken the time to read it and figure out what it's actually doing, but I can tell you that the eponymous Altschul and Erickson published their algorithm in Mol Biol Evol (1985) 2 (6):526-538.

3
4.5 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

You need to reduce the problem to a single shuffling step such that for each such permutation generated, the di-nucleotide frequencies are the same. An example, imagine you want to shuffle the first C, position

```ACTGGGGATG.....
```

Look for all locations where there is a AT, choose one randomly and insert C.

`ACTGGGGACTG`

Repeat.

Each step trivially preserves the di-nucleotide frequency.

Nice! Mine algorithm cant handle such cases, while yours cant handle cases such as AGGGCATTTTC -> ATTTTCAGGGC. So I suggest applying the algorithms one after another will give a good result :)

3
4.5 years ago by
matted7.1k
Boston, United States
matted7.1k wrote:

You can solve this with a hidden Markov model over dinucleotides.  This is a standard approach in areas like motif identification, where people generate background models with fixed dinucleotide (or trinucleotide, or higher) frequencies.

You generate the transition matrix by observing which dinucleotides lead to which other ones in your training set, where the dinucleotide constraint means that you have to transition to a sequence that starts with the base that you ended with.  Then you simulate a random draw from the HMM by picking an initial dinucleotide (according to the marginal distribution) and then drawing from the (observed / target) transition distribution to add another base.  You do this until you read your desired simulated sequence length.

This will preserve the dinucleotide frequencies in an approximate way, and the approximation will be increasingly good as you generate more and longer sequences.

Chapter 3 of the classic "Biological Sequence Analysis" goes into this and the details of HMMs.

Edit:  Here's some code that does this, with selectable k-mer size.  It might crash if your final k-mer is unique (so it has nothing to transition to if it draws that one), but this is rare if you have long sequences.

2
4.5 years ago by
mikhail.shugay3.3k
Czech Republic, Brno, CEITEC
mikhail.shugay3.3k wrote:

Looks like quite a complex question.

I can suggest this algorithm, at least it passes test for equal dinucleotide frequency (runs as Groovy script):

The basic idea is that if there exists some position i, which separates the sequence into two subsequences with the same starting and ending base, those subsequences can be swapped. This idea is then applied recursively.

0
4.5 years ago by
thackl2.7k
MIT
thackl2.7k wrote:

A simple way, but not necessarily elegant way:

`# A:10%, T=20%, G=30%, C=40%`

```r = rand between 0 and almost 1 # e.g. perl: rand(1)
if \$r < 0.1            => A
if \$r >= 0.1 or < 0.3  => T
if \$r >= 0.3 or < 0.6  => G
if \$r >= 0.6 or < 1    => C```

op needs di-nucleotide frequencies, like AA: 0.1, AC: 0.2, AT: 0.05, ....

In fact this is more difficult than it seems at first glance.