Question: Complex Genomic Regions
8.2 years ago
learnerforever wrote:

Is there a wiki or white paper that explains what exactly the "complex regions" of the genome are? I have come across the term several times in sequencing studies, as "regions with high/low complexity" or "difficult to sequence", and could not figure out what it means.


8.2 years ago
Casey Bergman wrote:

DNA sequence complexity in modern terms has changed from its original meaning, which is part of the reason you are having trouble finding a clear definition.

The original usage of complexity comes from studies that used reassociation kinetics (Cot curve analysis) to measure how repetitive or unique a genome was. In short, these studies measured how fast denatured DNA reassociated, with faster reassociation implying increased repetitiveness. The "complexity" of a genome was then measured by the Cot value (initial DNA concentration multiplied by time) at which half of the DNA had reassociated.
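As a rough illustration of the Cot idea, here is a minimal sketch assuming ideal second-order reassociation kinetics, where the fraction of DNA still single-stranded is 1 / (1 + k * C0t). The function names and the rate constants are made up for the example and are not taken from any particular study.

```python
def fraction_single_stranded(cot: float, k: float) -> float:
    """Fraction of DNA still single-stranded at a given Cot value.

    Assumes ideal second-order kinetics: C/C0 = 1 / (1 + k * C0t).
    """
    return 1.0 / (1.0 + k * cot)


def cot_half(k: float) -> float:
    """Cot value at which half of the DNA has reassociated (C/C0 = 0.5)."""
    return 1.0 / k


# Hypothetical rate constants: repetitive DNA reassociates faster (larger k),
# so it reaches the half-reassociation point at a lower Cot value.
for label, k in [("repetitive fraction", 100.0), ("unique fraction", 0.01)]:
    half = cot_half(k)
    print(label, half, fraction_single_stranded(half, k))
```

Under this model a lower Cot1/2 corresponds to a more repetitive (less complex) DNA fraction, and a higher Cot1/2 to a more unique one.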

The more modern usage takes this original term and tries to apply it (loosely) to actual DNA sequences, with the same notion that more complex sequences have a higher degree of uniqueness, and vice versa. I am not sure that there is a widely accepted definition of this modern usage applied to the "complexity" of DNA sequences. My suspicion is that it is operationally defined with respect to the algorithm used (simple sequence repeat detection, compression, etc.). A little googling found this definition, which seems reasonable (but no reference is provided):

"The complexity of a sequence is defined as the longest non-repetitive sequence that can be derived from a sequence", e.g.
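To make the "operationally defined" point concrete, here is a minimal sketch of one of the approaches mentioned above, compression. It is not an implementation of the quoted definition. It assumes that the zlib compression ratio is an acceptable stand-in for complexity; the function name and the test sequences are invented for the example.

```python
import random
import zlib


def compression_ratio(seq: str) -> float:
    """Compressed size divided by original size (lower = more repetitive)."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    data = seq.encode("ascii")
    return len(zlib.compress(data, 9)) / len(data)


random.seed(0)  # fixed seed so the example is reproducible
repetitive = "AC" * 500  # simple dinucleotide repeat, 1000 bases
random_like = "".join(random.choice("ACGT") for _ in range(1000))

# The repeat compresses far better than the random-looking sequence.
print(compression_ratio(repetitive), compression_ratio(random_like))
```

The absolute numbers depend on the compressor and on sequence length, so a score like this is only meaningful when comparing sequences of similar size measured the same way.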

HTH, Casey

8.2 years ago
Chris Evelo wrote:

There is probably a lot more to it. But what I remember from Sanger sequencing is:

  • Areas that are not complex at all but very long (e.g. 800 G's) are hard, since the exact length cannot be determined once the run is longer than a read.
  • Areas that share a lot of sequence with other areas are hard as well, since you don't know which read comes from which area.
  • Since Sanger sequencing uses clones to amplify the sequence, the insert must actually allow the bacteria to grow. Sometimes that is not the case, because the inserted DNA produces proteins that are toxic to the bacteria used. These last regions would of course just be hard to sequence, while they would probably be rather complex.
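The first two cases in the list can be sketched in a few lines. This is a toy illustration only: the reference string, the read, the read length, and the function names are all invented, and real read mappers use inexact matching rather than exact string search.

```python
import re


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)


def match_positions(read: str, reference: str) -> list:
    """All 0-based start positions where the read matches exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # allow overlapping matches
    return positions


# Toy reference: a unique-looking segment, a 12-base G run, then the same
# segment again further along.
reference = "TTACGTAC" + "G" * 12 + "CA" + "TTACGTAC" + "AA"
read_length = 8  # hypothetical read length

# Case 1: the run is longer than a read, so no single read spans it and
# its exact length cannot be read off directly.
print(longest_homopolymer(reference) > read_length)

# Case 2: the read matches in two places, so its origin is ambiguous.
print(match_positions("TTACGTAC", reference))
```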
8.2 years ago
Pierre Lindenbaum wrote:

As far as I remember, NCBI BLAST uses Shannon entropy as a measure of sequence complexity when it filters low-complexity regions.
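For reference, here is a minimal sketch of Shannon entropy computed from base frequencies, both for a whole sequence and in sliding windows. It illustrates the measure only and is not BLAST's actual filtering code; the function names, window size, and step are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per symbol, from observed symbol frequencies."""
    if not seq:
        return 0.0
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def windowed_entropy(seq: str, window: int = 16, step: int = 8) -> list:
    """(start, entropy) pairs for each full window along the sequence."""
    return [
        (i, shannon_entropy(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


print(shannon_entropy("AAAAAAAA"))  # single base: minimum entropy
print(shannon_entropy("ACGTACGT"))  # four equally frequent bases: 2 bits
print(windowed_entropy("ACGTTGCAGTCAGTCA" + "A" * 16))
```

Note that entropy computed this way depends only on base composition, not on order, so a simple repeat such as "ACACACAC" scores the same as any other sequence with equal amounts of A and C.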

