Generate Random DNA Sequence Data With Equal Base Frequencies
12.3 years ago
User 4000 ▴ 50

Hi all. Does anybody know how to generate random DNA sequences (about 20 of them) with equal base frequencies? I want to use this data for a test.

random sequence • 16k views
12.3 years ago

A solution using Python:

import random

def random_dna_sequence(length):
    # Pick each base independently and uniformly from the four nucleotides.
    return ''.join(random.choice('ACTG') for _ in range(length))


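You want a DNA string in which every base is equally probable, so the expected frequency of each base is 0.25. Because each position is drawn independently, the observed frequencies in a short sequence will scatter around that value. With the following function you can check how far each DNA string deviates from the expected frequency: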

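def base_frequency(dna):
    # Fraction of the sequence made up by each base.
    d = {}
    for base in 'ATCG':
        d[base] = dna.count(base)/float(len(dna))
    return d

# 20 sequences of 50 bases each, matching the sample output.
for _ in range(20):
    dna = random_dna_sequence(50)
    print(dna, base_frequency(dna))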


This generates a result like the following (the sequences differ on every run):

AAGTGACGCCCGGTGCGAAAAACACGCGCCTCTCCGTAGTCATTCAGACT {'A': 0.26, 'C': 0.32, 'T': 0.18, 'G': 0.24}
AAGGATCTACTACCTCGTCTATTTGAACTACTGTAGTGCTACTAACTCAT {'A': 0.28, 'C': 0.24, 'T': 0.34, 'G': 0.14}
TCCACTTCTTGGTCCTGAACACCTGCAATCACCTCTTACATCGTGCGACG {'A': 0.2, 'C': 0.36, 'T': 0.28, 'G': 0.16}
AATCTCCGGTGTGTCCGCTACGGAGGTTAGGGCACTCCGTGGGAAAGCTC {'A': 0.18, 'C': 0.26, 'T': 0.22, 'G': 0.34}
GCGTAGTTCGCATTGATTAACATAGTGGCGACCATAGACTTCTATTATCG {'A': 0.26, 'C': 0.2, 'T': 0.32, 'G': 0.22}
AAGTGAACCTGGACTGGGTGGATCGTCTCCCTCGTCCGGTCCTTGGTAGC {'A': 0.14, 'C': 0.28, 'T': 0.26, 'G': 0.32}
ATGACGATGACGATCATCGTCAACGCGCGTCGCGCACACTGCATATCCAA {'A': 0.28, 'C': 0.32, 'T': 0.18, 'G': 0.22}
GTGCATACCGGTGCGCGCGTGCGCTAGGTATTGGAATGCTACGCTTAACC {'A': 0.18, 'C': 0.26, 'T': 0.24, 'G': 0.32}
GCCCGCGTGCCGCCAAGGGATGGGGAGAGTATTTTCGCCCCCTAAGTGCC {'A': 0.16, 'C': 0.32, 'T': 0.18, 'G': 0.34}
TCAAGATTCTCCTAAATATATAATGATCATCCGTTGTCATTCTGCGGACT {'A': 0.28, 'C': 0.22, 'T': 0.36, 'G': 0.14}
TGTTTTAGCCCTGTAGCCGGACTACGAAGTTTTAGGCGCCCAGATTAAGG {'A': 0.22, 'C': 0.22, 'T': 0.28, 'G': 0.28}
AGACGAGCTTTCAAGTTCTTGAATCACTACCTTTGACGTCGAGTGTAAGG {'A': 0.26, 'C': 0.2, 'T': 0.3, 'G': 0.24}
TCGCATTGTAAATAGGAACCTGAAACCTGCCAAGGAGATACAGTCTAAAT {'A': 0.38, 'C': 0.2, 'T': 0.22, 'G': 0.2}
CATCCGTGTGGTAACAGTTAATGCCGGGCTCACCCTCAGGTGTGAAGGAT {'A': 0.22, 'C': 0.24, 'T': 0.24, 'G': 0.3}
ACCAAGACATACCTTAAGGCCCACGCGTACAAGTCACGCTCTCAATACGG {'A': 0.32, 'C': 0.34, 'T': 0.16, 'G': 0.18}
CGTCGTTGGTATTCAGAAAACGCTAGCACATATGGTGCCCAGTCAAAGGA {'A': 0.3, 'C': 0.22, 'T': 0.22, 'G': 0.26}
CGTCATTGCACCAAGTGTGGTACTTTGGGGACGTGAGGTAACAATCCCTG {'A': 0.22, 'C': 0.22, 'T': 0.26, 'G': 0.3}
TGGTCCCTGTTTCTCCATTCCGCGTCCATCGTGCGTTCGTCCTTTAAAGT {'A': 0.1, 'C': 0.32, 'T': 0.38, 'G': 0.2}
AATTCACTCTTTTAACGATGGAAACGGGCGTTTGTAGTGTGCCACTAACC {'A': 0.26, 'C': 0.22, 'T': 0.3, 'G': 0.22}
CCTTGTATACCCCACATGAAGAATGGGCCTGACATCAATAATCTTTAGAT {'A': 0.32, 'C': 0.24, 'T': 0.28, 'G': 0.16}
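
If "equal base frequencies" is meant literally, so that every sequence contains exactly the same number of A, C, G and T, independent sampling is not enough, as the frequencies above show. Here is a minimal sketch of one alternative: build a pool with equal counts of each base and shuffle it. The function name `balanced_dna_sequence` is made up for this illustration, and the sketch assumes the requested length is a multiple of 4.

```python
import random

def balanced_dna_sequence(length, rng=random):
    """Return a random DNA string with exactly length/4 of each base.

    Hypothetical helper for illustration; assumes length is a multiple of 4.
    """
    if length % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pool = list('ACGT' * (length // 4))  # equal counts of every base
    rng.shuffle(pool)                    # random order, counts unchanged
    return ''.join(pool)

# Print 20 sequences of 20 bases, each with 5 A, 5 C, 5 G and 5 T.
for _ in range(20):
    print(balanced_dna_sequence(20))
```

The trade-off is that positions are no longer independent of each other: once the pool runs out of one base, it cannot appear again. Whether that matters depends on what the test data is for.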

12.3 years ago

A simple way using R:

# define which bases will make up your sequences
bases <- c(rep('A', 5), rep('C',5), rep('G',5), rep('T',5))
# set how many sequences you want to produce
numOfSeqs <- 10
# initialize empty object
seqs <- rep(NA, numOfSeqs)
# populate the object by shuffling and joining your bases
for (i in 1:numOfSeqs){
  seqs[i] <- paste(sample(bases, length(bases)), collapse = '')
}


Then you can do what you want with the object seqs.

Clearly, if you need to produce a very large number of sequences, you should write them to a file as you go; otherwise they will fill up the memory.
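
One way to keep memory use flat is to write each sequence as soon as it is generated instead of collecting them all first. The following is a rough sketch in Python rather than R. The function name `write_random_fasta`, the output file name `random_seqs.fasta` and the `random_seq_N` headers are all assumptions chosen for illustration, and FASTA is just one convenient output format.

```python
import random

def write_random_fasta(path, n_seqs, length):
    """Write n_seqs random DNA sequences to path in FASTA format, one at a time."""
    with open(path, 'w') as handle:
        for i in range(1, n_seqs + 1):
            # Only the current sequence is held in memory.
            seq = ''.join(random.choice('ACGT') for _ in range(length))
            handle.write('>random_seq_%d\n%s\n' % (i, seq))

# Hypothetical usage: 20 sequences of 50 bases each.
write_random_fasta('random_seqs.fasta', 20, 50)
```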


Thanks a lot. Anyway, is there any software that can do this without my having to write code for it?

12.3 years ago
Ahdf-Lell-Kocks ★ 1.6k

One option is to use PhyloSim.
