How can I shuffle a set of sequences while preserving the dinucleotide distribution?
7
6
Entering edit mode
9.3 years ago
glocke01 ▴ 190

I have a set of 500 100-base sequences. I want to make new sequences by randomizing this set into 500 new sequences such that the new set has the same dinucleotide distribution. So, for each sequence in my original set, I'm cutting the sequence up into mononucleotides, putting them in a hat, and drawing them out one by one, but the randomized result should have the same dinucleotide distribution. How do I do this?

The best idea I can come up with at present is to simply repeat a purely random shuffling until I find a result that is sufficiently close to the original. (closeness may be assessed with a paired t-test or, more simply, by taking the difference between the original distribution and the random one.) Is there a better way? I could slice the original sequences by twos to jump-start the process?

Background: I'm trying to reproduce a procedure described in in Weirauch et al. Nature 2012, (http://www.nature.com/nbt/journal/v31/n2/full/nbt.2486.html?WT.ec_id=NBT-201302). In trying to evaluate a TF binding prediction algorithm's success at predicting in vivo binding sites, they start with some ChIP-Seq or ChIP-exo data. From these measurements, they pick 500 loci where the TF is credibly believed to bind, and then test the algorithm's ability to distinguish the true sites from 500 control sites. This is one of the methods they use to create control sites, and they describe it thusly: "500 randomly shuffled positive sequences, where dinucleotide frequencies were maintained." This is the whole description in the methods section, leaving a few things unclear.

1. Do I shuffle the sequences individually or as a set? I mean, do I (a) slice all the sequences in the set and put them in a hat, and then test the dinuc distribution of the new set or (b) shuffle each sequence individually, as described above?
2. How precisely is the dinucleotide distribution maintained?
validation dna control random • 11k views
3
Entering edit mode
9.2 years ago

This is a non-trivial problem involving Eulerian walks. uShuffle claims to provide a solution:

http://digital.cs.usu.edu/~mjiang/ushuffle/

0
Entering edit mode

And here's the reference for uShuffle: Jiang et al. BMC Bioinformatics 2008, 9:192

1
Entering edit mode
2
Entering edit mode
9.2 years ago
UnivStudent ▴ 430

Most people use this script (altschulEriksonDinuclShuffle.py) to shuffle sequences while maintaining dinucleotide frequency (It's even code that's present in the MEME suite). written originally by P. Clote, or a derivative of it. I found this copy online in the Bias-Away package by the Wasserman lab. It works by shuffling each sequence to make a permuted sequence with the original dinucleotide content. So the entire procedure is to shuffle each peak sequence and compare TF binding scores on the original sequences versus all scrambled sequences.

1
Entering edit mode

I got a hold of this same script. I haven't taken the time to read it and figure out what it's actually doing, but I can tell you that the eponymous Altschul and Erickson published their algorithm in Mol Biol Evol (1985) 2 (6):526-538.

3
Entering edit mode
9.3 years ago
Michael 54k

You need to reduce the problem to a single shuffling step such that for each such permutation generated, the di-nucleotide frequencies are the same. An example, imagine you want to shuffle the first C, position

ACTGGGGATG.....
^


Look for all locations where there is a AT, choose one randomly and insert C.

ACTGGGGACTG
_^_


Repeat.

Each step trivially preserves the di-nucleotide frequency.

0
Entering edit mode

Nice! Mine algorithm cant handle such cases, while yours cant handle cases such as AGGGCATTTTC -> ATTTTCAGGGC. So I suggest applying the algorithms one after another will give a good result :)

3
Entering edit mode
9.3 years ago
matted 7.8k

You can solve this with a hidden Markov model over dinucleotides. This is a standard approach in areas like motif identification, where people generate background models with fixed dinucleotide (or trinucleotide, or higher) frequencies.

You generate the transition matrix by observing which dinucleotides lead to which other ones in your training set, where the dinucleotide constraint means that you have to transition to a sequence that starts with the base that you ended with. Then you simulate a random draw from the HMM by picking an initial dinucleotide (according to the marginal distribution) and then drawing from the (observed / target) transition distribution to add another base. You do this until you read your desired simulated sequence length.

This will preserve the dinucleotide frequencies in an approximate way, and the approximation will be increasingly good as you generate more and longer sequences.

Chapter 3 of the classic "Biological Sequence Analysis" goes into this and the details of HMMs.

Edit: Here is some code that does this, with selectable k-mer size. It might crash if your final k-mer is unique (so it has nothing to transition to if it draws that one), but this is rare if you have long sequences.

2
Entering edit mode
9.3 years ago

Looks like quite a complex question.

I can suggest this algorithm, at least it passes test for equal dinucleotide frequency (runs as Groovy script):

The basic idea is that if there exists some position i, which separates the sequence into two subsequences with the same starting and ending base, those subsequences can be swapped. This idea is then applied recursively.

0
Entering edit mode
9.3 years ago
thackl ★ 3.0k

A simple way, but not necessarily elegant way:

# A:10%, T=20%, G=30%, C=40%

r = rand between 0 and almost 1 # e.g. perl: rand(1)
if $r < 0.1 => A if$r >= 0.1 or < 0.3  => T
if $r >= 0.3 or < 0.6 => G if$r >= 0.6 or < 1    => C
0
Entering edit mode

op needs di-nucleotide frequencies, like AA: 0.1, AC: 0.2, AT: 0.05, ....

In fact this is more difficult than it seems at first glance.

0
Entering edit mode