Question: Assessing diversity of random oligonucleotides
4.1 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

Hi all, I'm looking for an algorithm or statistic to estimate the randomness of a pool of oligonucleotides.

We have synthesized adapters containing a stretch of 15 Ns (i.e. random A, C, T, or G). This 15N stretch will be part of the sequenced reads (Illumina sequencing, so on the order of millions of reads). Ideally, each nucleotide will have the same chance of being present at any position in the 15N, regardless of the nucleotide (A, C, T, or G) or the position in the string (1 to 15). In practice, some biases are inevitable: for example, some nucleotides are preferentially incorporated.

So, is there any simple way of summarizing the randomness of this pool of oligonucleotides? I think some ideas are here ("Estimating the entropy of DNA sequences") and in sequence logo creation. Any suggestions?

Dario

modified 4.0 years ago by Vincent Laufer1.1k • written 4.1 years ago by dariober10k

How did you pick these sequences?

Hi, the 15N stretches are random sequences (or are supposed to be).

4.0 years ago by
Vincent Laufer1.1k
United States
Vincent Laufer1.1k wrote:

Hi Dario,

There is a substantial amount of scholarship available on this already, as you seem to have noticed. A Google search for "calculating information entropy of DNA sequences" returns an abundance of papers, some of which seem to provide answers.

Of these, the most helpful source I located was the thread "Shannon entropy of a DNA motif?". Check the first answer and the links provided in it, and see if they help. If not, let me know and I'll keep looking.

Lastly, I can tell you that much DNA sequence is highly non-random, so depending on where you are looking there might already be a strong expectation of more or less entropy (e.g. exons tend to have higher information entropy than introns: http://bioinformatics.oxfordjournals.org/content/early/2011/02/10/bioinformatics.btr077.full.pdf). That's an aside, but it makes me curious as to where and why you are looking.

Hope it helps.

Thanks. Yes, the Shannon entropy seems to be suitable. As for where and why: I'm looking at supposedly random sequences synthesized by a company; they do not come from genomic regions.
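
For anyone landing here later, a minimal sketch of the per-position Shannon entropy idea discussed above. It assumes the 15N stretches have already been extracted from the reads into a list of equal-length strings; the function name `positional_entropy` and the toy input are made up for illustration. With four equally likely bases the maximum is 2 bits per position, so values below 2 indicate bias at that position.

```python
import math
from collections import Counter

def positional_entropy(seqs, alphabet="ACGT"):
    """Shannon entropy (in bits) of the base frequencies at each position.

    seqs: iterable of equal-length strings (hypothetical input, e.g. the
    extracted 15N stretches). Characters outside `alphabet` (such as N
    base calls) are ignored.
    """
    entropies = []
    # zip(*seqs) yields one tuple of bases per position (column).
    for column in zip(*seqs):
        counts = Counter(base for base in column if base in alphabet)
        total = sum(counts.values())
        if total == 0:
            # No usable base calls at this position.
            entropies.append(float("nan"))
            continue
        # H = sum over bases of p * log2(1 / p)
        h = sum((c / total) * math.log2(total / c) for c in counts.values())
        entropies.append(h)
    return entropies

# Toy example: every base appears once at every position.
toy = ["ACGT", "CGTA", "GTAC", "TACG"]
print(positional_entropy(toy))  # [2.0, 2.0, 2.0, 2.0]
```

Note that this only looks at each position independently; it would not detect correlations between neighbouring positions.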

Ah I see. Is it accurate to say that you are determining whether or not something that was advertised as random actually is random?

Yes. More precisely, I'd like to have a measure of how random the oligo mix is. As far as I know, companies use equimolar amounts of the four nucleotides when asked for "N" in the oligo. But since the four nucleotides have different probabilities of being incorporated, plus various additional biases, the question is not so much whether the mix is random, but rather how far it deviates from the random expectation.
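
One possible way to put a number on "how far it deviates" is the Kullback-Leibler divergence of the observed base frequencies from the uniform expectation (0.25 per base), computed per position. This is only a sketch under assumptions: the function name `deviation_from_uniform` is hypothetical, the input is again a list of equal-length extracted 15N strings, and positions are treated as independent. A value of 0 bits means the position matches the uniform expectation; 2 bits means a single base is always present.

```python
import math
from collections import Counter

def deviation_from_uniform(seqs, alphabet="ACGT"):
    """Per-position KL divergence (bits) of observed base frequencies
    from a uniform distribution over `alphabet`.

    seqs: iterable of equal-length strings (hypothetical input).
    Characters outside `alphabet` are ignored.
    """
    expected = 1 / len(alphabet)  # 0.25 for A, C, G, T
    divergences = []
    for column in zip(*seqs):
        counts = Counter(base for base in column if base in alphabet)
        total = sum(counts.values())
        if total == 0:
            # No usable base calls at this position.
            divergences.append(float("nan"))
            continue
        # D = sum over observed bases of p * log2(p / expected);
        # bases with zero count contribute nothing to the sum.
        d = sum(
            (c / total) * math.log2((c / total) / expected)
            for c in counts.values()
        )
        divergences.append(d)
    return divergences

print(deviation_from_uniform(["ACGT", "CGTA", "GTAC", "TACG"]))  # [0.0, 0.0, 0.0, 0.0]
print(deviation_from_uniform(["AAAA", "AAAA"]))                  # [2.0, 2.0, 2.0, 2.0]
```

For a uniform reference this is simply 2 minus the Shannon entropy at that position, so it carries the same information but reads directly as "distance from ideal".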