I have a set of 500 100-base sequences. I want to make new sequences by randomizing this set into 500 new sequences such that the new set has the same dinucleotide distribution. So, for each sequence in my original set, I'm cutting the sequence up into mononucleotides, putting them in a hat, and drawing them out one by one, but the randomized result should have the same dinucleotide distribution. How do I do this?
The best idea I can come up with at present is to simply repeat a purely random shuffling until I find a result that is sufficiently close to the original (closeness might be assessed with a paired t-test or, more simply, by taking the total difference between the original dinucleotide distribution and the shuffled one). Is there a better way? Could I jump-start the process by slicing the original sequences into dinucleotides instead of mononucleotides?
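To make the rejection idea concrete, here is a minimal sketch of what I have in mind (the tolerance threshold, function names, and the total-absolute-difference distance are my own choices, not from the paper):

```python
import random
from collections import Counter

def dinuc_counts(seq):
    """Count overlapping dinucleotides in a sequence."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))

def distance(c1, c2):
    """Total absolute difference between two dinucleotide count tables."""
    return sum(abs(c1[k] - c2[k]) for k in set(c1) | set(c2))

def shuffle_until_close(seq, tol=10, max_tries=100000, rng=random):
    """Rejection-sample mononucleotide shuffles until the dinucleotide
    counts are within `tol` of the original; if the tolerance is never
    met, return the best shuffle found along with its distance."""
    target = dinuc_counts(seq)
    chars = list(seq)
    best, best_d = None, float("inf")
    for _ in range(max_tries):
        rng.shuffle(chars)           # mononucleotides out of the hat
        cand = "".join(chars)
        d = distance(target, dinuc_counts(cand))
        if d < best_d:
            best, best_d = cand, d
        if d <= tol:
            break
    return best, best_d
```

This preserves the mononucleotide composition exactly (it is a permutation of the input) but only approximates the dinucleotide counts, and I suspect the number of tries needed to hit a tight tolerance blows up quickly for 100-base sequences, which is why I am asking whether there is a smarter approach.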
Background: I'm trying to reproduce a procedure described in Weirauch et al., Nature Biotechnology 2013 (http://www.nature.com/nbt/journal/v31/n2/full/nbt.2486.html?WT.ec_id=NBT-201302). To evaluate a TF binding prediction algorithm's success at predicting in vivo binding sites, they start with ChIP-seq or ChIP-exo data. From these measurements, they pick 500 loci where the TF is credibly believed to bind, and then test the algorithm's ability to distinguish the true sites from 500 control sites. This is one of the methods they use to create control sites, and they describe it as follows: "500 randomly shuffled positive sequences, where dinucleotide frequencies were maintained." That is the whole description in the methods section, which leaves a few things unclear:
- Do I shuffle the sequences individually or as a set? That is, do I (a) slice all 500 sequences into mononucleotides, pool them in one hat, draw out 500 new 100-base sequences, and then test the dinucleotide distribution of the new set as a whole, or (b) shuffle each sequence individually, as described above?
- How strictly is the dinucleotide distribution maintained: exactly, or only approximately (and if approximately, to what tolerance)?