Generate Max Number of DNA Sequences separated by a given Hamming distance
0
0
Entering edit mode
4 weeks ago
rtrende ▴ 80

I would like to make a pool of 7nt DNA sequences that are all separated from one another by a Hamming distance of 3. Is there a way to generate a list with the maximum number of such sequences? I have written algorithms to do this with an element of randomness (e.g. iteratively generate sequences and add them to a list if they are at least 3nts away from all other sequences in a list), but the number of sequences I get with this approach is variable, so I would like a way of analytically computing a list of sequences so I can "fit" as many sequences as possible in my list

Update 7/31/24:

I believe the best way to do this is that this article https://link.springer.com/article/10.1023/A:1011275112159 describes a way to generate 8192 9-letter quaternary sequences. With this list, I could:

1. Identify at least 8192/4=2048 sequences whose nth letter match

2. Select these sequences and trim the nth letter. Because they were all separated by a Hamming distance of 3 and the only letter I trimmed was identical in all sequences, this will yield 2048 8-letter sequences

3. Repeat to trim to 512 7-letter sequences

However, I do not really understand how the above article generated 8192 9-letter sequences, and an algorithm they use (wclique) is, as far as I can tell, not readily available to download. Any guidance from someone more mathematically literate than I on how to implement the algorithm described in the article above to generate 8192 9-letter sequences would be much appreciated!

hamming-distance Barcode variants DNA • 434 views
0
Entering edit mode

rtrende, hmm seems like you should either end up with exactly 4 or exactly 9 oligos in your list, depending on the seed oligo you provide, no?

1
Entering edit mode

oh. sorry. you said at least three. the first sentence of your post says by a distance of 3, it made me think you were interested in the exact case. What did you get as your max? im getting around 315 or so.

heres one i got 310

310 GGCCACT,TGATGGC,CTAGTGT,TTCGCTT,TTAATTC,ACTTAAT,ATTGTGG,CGGAGTA,GAACGCG,ATCATAT,CTGCACA,CCGGTTA,TCGGGAC,TTTAGAA,GGTCGCC,ACATCTG,CAATAAA,ACCGACA,CTAAGCC,ACGTTCT,GAGTGTT,CCCAAGA,GCAAGTA,TTAGAAG,TTCTCGG,TCTCAGT,CGCACCG,AGACTTA,AGGAACT,CATCTTC,AATAGTT,GTGCTTG,ATGACCG,AAGACGT,TTATACA,TTGACAC,GTTGCTA,GCAATAC,ACGCGGA,CTTTGGA,CTATTAC,TTTCTAT,GCCTGAT,TGGCAGA,TGGATGC,GGTTCTG,GTTTCGT,AAAGAGG,CAGCCCG,AGGTAGC,GCCCCTC,GTCCAGC,GGGACCC,TTTAAGG,CCTTGTG,GTAGTCG,CAGATCA,AGTGTCT,GAATCAC,TGGAATG,TCAACAG,GTCCTAA,ATATGAG,CTTTATC,ATAACGC,GTAAGGG,TATCCGG,TTGCGCG,AGCCGCA,CACTGAG,AATTGAA,AGTTGTC,ATGAATA,AGGAGAG,ATCCTTC,AAACCCC,CACCACC,GCTCTGG,AGTGCAC,GACAATA,CAGGTAC,TATACTC,GCGAAGC,TCCCGTA,CTGGCTG,TCCTTTT,CCACAAT,TGAGTGG,TACTTGC,TACAGCA,GGAGACC,GCAGCGA,CACTCCT,AGAGCGT,TCTTCTA,ACCACAA,TAAAGTG,CAAGGCT,CGATCTT,TCGAAAA,CGTATTT,CGAGATA,CCTACAC,CCCGTCC,AACATGA,CGGTGAC,CAAGCAG,TGCTAAT,ACAAATC,GGATAAG,GGCTATC,GCTATCT,CGCCCGC,CAGTCGA,CGTCTCG,ACGATTG,AAGTGGG,CCTGCCA,GGAATGA,GTCCCCG,TATGGTA,GGTCAAA,TTAGCCC,TAGTCCC,CTACGAA,AACAGAC,ACTCATG,TGTACCT,TCTGTTG,TGACGAG,CGATTCA,AGCTTGT,GATGAAC,TCGCCCA,GTCTGTA,ACAGTAG,CGCGGAA,GAGAACG,GGCGTGC,AACTTCG,AAGGTTT,TCCCTAG,GTTCGTT,CCCGCGG,AACCAAT,GCGGGCA,TAGGGGT,TTCGGGC,TGGGTAT,CTTTTCT,GATACAA,TCACTCC,TGTGGCG,AAACTGT,CTCCGGT,TATTACT,CTCATTG,GAAGGAA,CCGTCAG,AATATCC,CCGCGTT,GTGTACT,CAAAAGT,TCCGGCT,CTCGACG,GACCTTT,TAGCAAG,CTGCTGC,CGAATAG,GCTTTAA,TAACATA,TCGTTGA,AAAGTCA,GTGGCGC,TACGCGA,CCGAGGG,TGGTGCA,ATGTCTT,CTGAAAT,TACACAT,GCCGATT,AGTTACG,CCTGGAT,AGTAAGA,ACCAGCG,GGAGGTG,AGGCCAT,GGCAGGT,TGACCTC,CACCTGG,GCATAGT,GTCATCC,GAGTTAG,TTGGATC,CCATCGC,ATGGTCC,GTAACAT,GGAAATT,GAGCCTA,CGTGGGC,GGCGCCA,AATGCTG,AGGCTGG,GCTAGAG,ATGCGAC,CACGGTC,TGCTGTG,CATGCGT,ACGCACC,TTGTGAT,ACATGCC,ATTTCCC,TTCCATG,ACCTAGG,CTTCCAG,TTCAACT,CCCCGAC,GGGGAGG,AGCTCAG,GCGGTGT,AGCGTTG,AACTCTC,GAGGCAT,TCCTCAC,GATAGGC,AGGGAAA,TAAAAAC,ATTGGCA,TTACTGA,CTATCCG,GCTGACG,ACTTTGC,GTGGGAG,ATCTAAA,AACCGTG,ATCTGCT,GATATTG,GTATTTT,CGCCTAT,GTGAGTC,TGGTCGT,GTCAAAG,AAAACTA,CGCGAGT,ATTCACT,GATCCCT,CATCGCA,TAATTAT,TGCATTA,GATTAGA,CTGTAGG,AAGCTAA,GACACGG,ATCCCGA,GTTAACA,TCTAGCC,GTTTGCG,GCGTATA,TGGGCTA,ACTCTCA,AAGCGCT,ATAGGTC,GGTGTAG,GAAGTTC,GAGCAGT,GAGCTCC,GTGTCAA,AATAAAG,CTTGAAA,TATATGT,GCGACTT,CGACAGG,ATGGAGT,CCGGACT,AGACAAC,CAGAATC,CACTTTA,TGCCCAA,GACCGGA,CGGTTTG,GACTGCC,ATTTTTA,ATAGCAA,TGTGATT,TCAAGGT,TCAGAGC,CTCGCAC,TCATGAA,CCAACCT,AATCAGC,AGTCGGT,CCACTTG,TATCGAC,AAGTACA,AAATATT,AGCACTT,TGTTTAC,CGTAACC,ACTACGG,TAGGTCG,ACGGCTC,CGTTCAA,GCACGGC,CCCTGCA,GCACACA,GGATGCT,TTCGTCA,GCTGGTC,TCGTACG,ACCCCCT

0
Entering edit mode

I've also typically gotten high 200s to low 300s using code that randomly generates sequences and adds them to a list if they're >=3nts from all other sequences in the list

However, for my application, I would like as many sequences as possible. The best way I have found to estimate how many I could get is the Hamming bound (https://en.wikipedia.org/wiki/Hamming_bound), which implies I should be able to get <=744 sequences that are 7nt long separated by a Hamming distance of 3. This makes me think the random approach is inefficient, so I am wondering if there is some algorithm that efficiently fills sequence space to fit as many 7nt sequences separated by 3nts as possible

0
Entering edit mode

Following up on above, this link https://www.win.tue.nl/~aeb/codes/quaternary-1.html suggests that it should be possible to generate 512-596 7nt sequences separated by a Hamming distance of 3, suggesting that a smarter algorithm may be able to generate a lot more BC sequences than the random algorithms that we wrote. I just have no idea how to write such an algorithm haha and so would appreciate if anyone has guidance on this front

0
Entering edit mode

yeah thats the theoretical max but you wont approach it. after hundreds of billions of iterations you might get like 340. you could heuristically guide the sequence order to leave as much of the sequence space unpreturbed for as long as possible (rationally guided approach) but you are going to do that much better than the best brute force result