Assessing diversity of random oligonucleotides
9.1 years ago

Hi all, I'm looking for an algorithm or statistic to estimate the randomness of a pool of oligonucleotides.

We have synthesized adapters containing a stretch of 15 N (i.e. random A, C, T, or G). This 15N stretch will be part of the sequenced reads (Illumina sequencing, so on the order of millions of reads). Ideally, each nucleotide would have the same chance of appearing at any position in the 15N, regardless of the nucleotide (A, C, T, or G) or the position in the string (1 to 15). In practice, some biases are inevitable: for example, some nucleotides are preferentially incorporated.

So, is there a simple way of summarizing the randomness of this pool of oligonucleotides? I think some ideas are here (Estimating the entropy of DNA sequences) and in sequence logo creation. Any suggestions?

Dario

sequence kmer random oligonucleotides • 1.9k views

How did you pick these sequences?


Hi, the 15N stretches are random sequences (or are supposed to be).

8.9 years ago
LauferVA 4.3k

Hi Dario,

There is already a substantial literature on this, as you seem to have noticed. A Google search for "calculating information entropy of DNA sequences" returns an abundance of papers, some of which seem to provide answers.

Of these, the most helpful source I found was: Shannon entropy of a DNA motif? Check the first answer and the links provided in it, and see if they help. If not, let me know and I'll keep looking.

Lastly, I can tell you that much DNA sequence is highly non-random, so depending on where you are looking there may already be a strong expectation of more or less entropy (e.g. exons tend to have higher information entropy than introns: http://bioinformatics.oxfordjournals.org/content/early/2011/02/10/bioinformatics.btr077.full.pdf). That's an aside, but it makes me curious about where and why you are looking.

Hope it helps.


Thanks, yes, the Shannon entropy seems suitable. As for where and why: I'm looking at supposedly random sequences synthesized by a company; they do not come from genomic regions.
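For reference, a minimal sketch of the per-position Shannon entropy mentioned here, assuming the 15N stretches have already been extracted from the reads as equal-length strings. The function name `positional_entropy` and the input format are illustrative assumptions, not something from the thread or the linked posts.

```python
import math
from collections import Counter

def positional_entropy(seqs, alphabet="ACGT"):
    """Shannon entropy in bits at each position of equal-length sequences.

    Assumption: `seqs` is a list of strings of identical length (e.g. the
    extracted 15N stretches). Characters outside `alphabet` (such as an
    uncalled base 'N') are ignored at that position.
    """
    if not seqs:
        raise ValueError("no sequences given")
    length = len(seqs[0])
    entropies = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs if s[i] in alphabet)
        total = sum(counts.values())
        if total == 0:
            # No usable base calls at this position.
            entropies.append(float("nan"))
            continue
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

if __name__ == "__main__":
    toy = ["ACGT", "CCGA", "GCTT", "TCAA"]  # made-up toy input
    for pos, h in enumerate(positional_entropy(toy), start=1):
        print(pos, round(h, 3))
```

With four equally frequent nucleotides the maximum is 2 bits per position, so values close to 2 at all 15 positions would indicate a near-uniform mix, and lower values flag positions with a composition bias. This looks at each position independently and does not capture dependence between neighbouring positions.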


Ah I see. Is it accurate to say that you are determining whether or not something that was advertised as random actually is random?


Yes. More precisely, I'd like a measure of how random the oligo mix is. As far as I know, companies use equimolar amounts of the four nucleotides when asked for "N" in the oligo. But since the four nucleotides have different probabilities of being incorporated, on top of various additional biases, the question is not so much whether the mix is random, but rather how badly it deviates from the random expectation.
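One way to put a number on "how badly it deviates" is to compare the observed base counts at each position with the equal counts expected under a uniform mix. The sketch below computes the observed frequencies and a Pearson chi-square statistic for one position; the function name `chi_square_vs_uniform` and the input format (a list of extracted 15N strings) are assumptions made for illustration.

```python
from collections import Counter

def chi_square_vs_uniform(seqs, position, alphabet="ACGT"):
    """Base frequencies and chi-square statistic vs. a uniform expectation.

    Assumption: `seqs` is a list of strings that all have an index
    `position` (0-based). Characters outside `alphabet` are ignored.
    Returns (statistic, {base: observed frequency}).
    """
    counts = Counter(s[position] for s in seqs)
    observed = [counts.get(base, 0) for base in alphabet]
    total = sum(observed)
    if total == 0:
        raise ValueError("no usable base calls at this position")
    expected = total / len(alphabet)  # equal share for each base
    stat = sum((o - expected) ** 2 / expected for o in observed)
    freqs = {base: o / total for base, o in zip(alphabet, observed)}
    return stat, freqs

if __name__ == "__main__":
    toy = ["A", "A", "C", "G"]  # made-up toy input, one position only
    stat, freqs = chi_square_vs_uniform(toy, 0)
    print(stat, freqs)
```

With millions of reads even a tiny bias will produce a very large statistic, so for judging the size of the deviation the observed frequencies themselves (how far each is from 0.25) are probably more informative than a significance test.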


Interesting. I vaguely remember reading that in the Sanger paper a really long time ago. I think you could probably ballpark this quickly without using an algorithm from the information entropy literature, but since there is a packaged solution for everything these days, I'd just go ahead and use one. Was the link I provided helpful enough, or are you going to keep looking?