4
3
Entering edit mode
10.7 years ago

Let's say I have the string

AACAAAACAAAAAGTTGGGTAGTTTGAGAAC

and I want to scramble it for statistical testing purposes, but keep the same nucleotides and dinucleotides. Is there a script/library that does this?

Thanks.

3
Entering edit mode

Keeping the exact di-nucleotide frequency is interesting. It can be reduced to a kind of graph traversal problem. In the directed graph, the vertices are A/C/T/G with an arc between each of them. Each arc is associated with a count c(a,b), the count of corresponding di-nucleotide (a,b) in the original sequence. The goal is to find a path using arc (a,b) exactly c(a,b) times. I do not know if there is an efficient algorithm to enumerate all the solutions. A naive NP algorithm is to do a simple backtracking. I think all the answers so far do not guarantee to keep the exact frequency.

PS: Another reduction: break the original sequence into 2-mer "read" and the goal is to assemble a new "contig" using each "read" exactly once. A NP-hard Hamilton problem. I think there are no polynomial solutions anyway.

1
Entering edit mode

Yeah you are right. The listed solutions will not take into account all the di-nucleotides since spliting the sequence by every 2 nucleotides misses the di-nucleotides "between" each splitted di-nucleotide. IE. AGTCGA => AG, TC, GA. But we are missing GT, CG di-nucleotides.

1
Entering edit mode

I like your graph formulation but maybe we can modify it to allow an efficient algorithm. Instead of making a complete graph with edge counts we may create a multigraph with c directed edges between (a,b) instead. Solutions correspond to eulerian trails of this graph. The graph must have at least one such trail.

0
Entering edit mode

Yeah, you are of course right. My brain was dysfunctioning... Then the question is then how to modify Hierholzer's algorithm to randomly sample one Eulerian path in linear time. Enumerating all solutions is still non-polynomial.

5
Entering edit mode
10.7 years ago

Python's random module has a shuffle method that allows you to shuffle elements of a list. You can do something like this:

import random
#shuffle works on lists, you have to convert your string into a list object
seq = list('AACAAAACAAAAAGTTGGGTAGTTTGAGAAC')
random.shuffle(seq)
print ''.join(seq)


Probably not the most efficient way to do it if you have a very long sequence.

To do it by di-nucleotides, just break your string up into a di-nucleotide list:

seq = 'AACAAAACAAAAAGTTGGGTAGTTTGAGAAC'
seq = [seq[x:x+2] for x in range(0,len(seq),2)]
random.shuffle(seq)
print ''.join(seq)


Note that for odd length sequences, the last element will be a single nucleotide. If you want to remove that, check for odd length and pop off the last element of the list.

edit**

As brent and hengli commented, this actually doesn't preserve all the di-nucleotide frequency. A more sophisticated method will be needed to do this.

I guess you can treat this problem similar to a sequence assembly. Generate k-mers with k=2 for the sequence and make a redundant debruijn graph. All possible complete traversals of this graph would be the possible shuffled sequences. Then choose a random one out of the possible traversals.

Getting complete traversals becomes a traveling salesman problem....

Brentp's solution further down is probably the best solution you can achieve without doing crazy algorithms.

0
Entering edit mode

How does this preserve dinucleotide frequency? I guess I could split the list into pairs and shuffle those.

0
Entering edit mode

I've added how to do it by di-nucleotide

0
Entering edit mode

This doesn't preserve the dinucleotide frequency does it? You're just changing the unit of shuffling to 2 bases.

4
Entering edit mode
10.7 years ago
brentp 24k

EDIT: this will not have the exact di-nulceotide frequency, but the emission probability is based on the input sequence so it will be close

If you actually want to maintain the di-nucleotide frequency this will do it in python:

from collections import defaultdict, Counter
import random

def difreq(seq):
counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(seq, seq[1:]):
counts[a][b] += 1
return dict((k, dict(v)) for k,v in counts.iteritems())

# <http://stackoverflow.com/questions/3679694/a-weighted-version-of-random-choice>
weighted_choice = lambda s : random.choice(sum(([v]*wt for v,wt in s),[]))

def shuffle_difreq(seq):
freqs = difreq(seq)

# get the first base by the total frequency across the sequence
shuff_seq = [weighted_choice(Counter(seq).items())]

for i in range(1, len(seq)):
# each following base is based of the frequency of the previous base
# and their co-occurence in the original sequence.
shuff_seq.append(weighted_choice(freqs[shuff_seq[-1]].items()))

return "".join(shuff_seq)

if __name__ == "__main__":
seq = 'AACAAAACAAAAAGTTGGGTAGTTTGAGAAC' * 20
print difreq(seq)
shuff_seq = shuffle_difreq(seq)
print difreq(shuff_seq)


the "emission probability" of any given base depends on its frequency with the previous base in the original sequence. This example will print out the coocurrence frequencies so that you can verify that is the case.

1
Entering edit mode

This algorithm has a little hiccup. Try running

shuffle_difreq("TTTTTTAATTATGTTGTATTTTAGGTTTC") a couple of times and you will get an error. It seems to happen when a nucleotide which is only included once in the string is in the first or last position. But it does not happen all the time.

0
Entering edit mode

If the last base in the new sequence doesnt have a transition probability (like 'C') in your example, then it will cause this error. You can get around it with this function:

def shuffle_difreq(seq):
freqs = difreq(seq)

# get the first base by the total frequency across the sequence
shuff_seq = [None]
while not shuff_seq in freqs:
shuff_seq = [weighted_choice(Counter(seq).items())]

while len(shuff_seq) < len(seq):
# each following base is based of the frequency of the previous base
# and their co-occurence in the original sequence.
try:
shuff_seq.append(weighted_choice(freqs[shuff_seq[-1]].items()))
except KeyError:
shuff_seq.pop()
assert len(shuff_seq) == len(seq)
return "".join(shuff_seq)

0
Entering edit mode

This won't give you a shuffle sequence with the exact same di-nucleotide frequency right? It will give a shuffle sequence with something very close the di-nucleotide frequency though. But if the sequence is short, the shuffled and original sequence will diverge more relatively.

0
Entering edit mode

I agree. The difficult part is to generate a sequence with the exact di-nucleotide count. Sometimes, there is only one solution (e.g. "ACGT"). Sometimes, there are many (e.g. "AAAAAACAAACAAAACAAAACAAAA").

0
Entering edit mode

no, definitely not exact. but seems to address the problem more closely than literally shuffling pairs of bases, no?

0
Entering edit mode

Thanks, this is the correct answer. It seems to have been a more knotty problem than I had thought.

Mathematically speaking, is it possible to say how "bad" this algorithm will be for reads with 26-31 nucleotides in length? I see that you in your example use a string 20 times longer than mine.

3
Entering edit mode
10.7 years ago
SES 8.6k

If this is the type of sequence you want to shuffle, a one-liner will do:

$perl -MList::Util -le 'print(join "", List::Util::shuffle(split //, "AACAAAACAAAAAGTTGGGTAGTTTGAGAAC"))' ATGGCAAAAGAGGATGCTTTACAGAATAAAA  You can put that in a script to do whatever calculations you need. If you are working with longer sequences, I'd suggest using Pierre's solution or an existing code library (IIRC, genometools has some sequence shuffling programs and a python API). Note this just shuffles the sequence irrespective of the dinucleotides. This keeps the same dinucleotides: $ perl -MList::Util -le 'print(join "", List::Util::shuffle(unpack("(A2)*", "AACAAAACAAAAAGTTGGGTAGTTTGAGAAC")))'
AATTAAAGTGCGTAACATTAGACAGGGAAAA


ETA: This solution will be as fast as any C/C++ solution for shuffling (shuffle<> is written in C), but join will use memory and likely cause the whole program to be slower than one written in a compiled language. For sequences of this length though, I don't think it will make a noticeable difference (in terms of speed) how you decide to solve this problem.

ETA2: After following the discussion above, it appears that the second approach will not do what you want in terms of dinucleotide frequencies. So, a fast solution with the wrong answer is still no help ;-). It will be interesting to see how someone works this out.

1
Entering edit mode
10.7 years ago

in c++. I used the poly-XXX instead of the di-nucleotides. Feel free to break the 'j' loop when j-i>2.

#include <iostream>
#include <cstdlib>
#include <string>
#include <vector>
#include <ctime>
#include <algorithm>
using namespace std;

int main(int argc,char** argv)
{
if(argc!=2) return -1;
srand(time(0));
size_t i=0;
vector<string*> nucl;
while(argv[i]!=0)
{
size_t j=i+1;
while(argv[j]!=0 && argv[i]==argv[j])
{
++j;
}
nucl.push_back(new string(&argv[i],j-i));
i=j;
}
for(;;)
{
random_shuffle(nucl.begin(),nucl.end());
for(size_t i=0;i< nucl.size();++i) cout << *nucl[i];
cout << endl;
}
return 0;
}

\$ g++ -Wall -O3 code.cpp && ./a.out AACAAAACAAAAAGTTGGGTAGTTTGAGAAC | head
CAAAAGTTTAAAAAAAGCCAAGTTTGGGGAA
GGAAAAAGCAAAAACTTTTTTCAAAGGGGAA
GGGGAAAACAATTGTTTAATAAAAAAACCGG
AGGGACTTTAAAAATTGAAAAGGCGAAAACT
AATTTGTAGTTACAAAAAAGGGGGAAAAACC
CGACGTTATTTTAACGAAAAGGGGAAAAAAA
AAAAACGTTTGGAACTTTGGGAAAAAAACAG
CAGAAAAGGGAGCAAAAAGTTTAAAATTTGC
AATATTTCTTAGGAAGCGGGGAAAAACAAAA
GGGCAGTTCAAAAAGTTTCAAAAGAAAATGA