Question: Complex Genomic Regions
8.2 years ago by
learnerforever520 wrote:

Is there any wiki/white paper available on what exactly "complex regions" in the genome are? I have come across this term several times in sequencing studies, as in "regions with high/low complexity" or "difficult to sequence", and just could not figure out what it means.

Thanks!

sequence • 2.7k views
8.2 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

DNA sequence complexity in modern terms has changed from its original meaning, which is part of the reason you are having trouble finding a clear definition.

The original usage of complexity relates to studies using reassociation kinetics to measure how repetitive/unique a genome was, through Cot curve analysis. In short, these studies measured how fast denatured DNA reassociated, with faster reassociation implying increased repetitiveness. The "complexity" of a genome was then measured by the Cot value (initial DNA concentration multiplied by time) at which half of the DNA had reassociated.

The more modern usage takes this original term and tries to apply it (loosely) to actual DNA sequences, with the same notion that more complex sequences have a higher degree of uniqueness, and vice versa. I am not sure that there is a widely accepted definition of this modern usage applied to the "complexity" of DNA sequences. My suspicion is that it is operationally defined with respect to the algorithm used (simple sequence repeat detection, compression, etc.). A little googling found this definition, which seems reasonable (but no reference is provided):

"The complexity of a sequence is defined as the longest non-repetitive sequence that can be derived from a sequence", e.g.

sequence        complexity
TTTTTTTTTT      1
TATATATATA      2
TACTACTAC       3
TACGTACG        4
TACGGTACGG      5
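
The quoted definition is informal, but the values in the table match the length of the shortest unit that, repeated end to end, reproduces the sequence. A minimal Python sketch under that assumption (the function name `repeat_unit_length` is made up for illustration, and this is only one possible reading of the definition):

```python
def repeat_unit_length(seq: str) -> int:
    """Length of the shortest unit whose repetition reproduces seq.

    Assumption: "complexity" is read here as the smallest period of the
    sequence; the final copy of the unit may be truncated.
    """
    n = len(seq)
    for period in range(1, n + 1):
        # A period p fits if every base equals the base p positions earlier.
        if all(seq[i] == seq[i - period] for i in range(period, n)):
            return period
    return 0  # only reached for an empty sequence


if __name__ == "__main__":
    for s in ["TTTTTTTTTT", "TATATATATA", "TACTACTAC", "TACGTACG", "TACGGTACGG"]:
        print(f"{s:<14}{repeat_unit_length(s)}")
```

A sequence with no internal repeat structure gets a value equal to its full length under this reading, so the measure only makes sense when comparing sequences or windows of similar length.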

HTH, Casey

8.2 years ago by
Chris Evelo10.0k
Maastricht, The Netherlands
Chris Evelo10.0k wrote:

There is probably a lot more to it. But what I remember from Sanger sequencing is:

  • Areas that are not complex at all but very long (e.g. 800 G's) are hard, since the exact length cannot be determined once the run is longer than a read.
  • Areas that share a lot of sequence with other areas are hard as well, since you don't know which read comes from which area.
  • Since Sanger sequencing uses clones to multiply the sequence, the sequence must actually allow the bacteria to grow. Sometimes that is not the case, because the inserted DNA produces proteins that are toxic for the bacteria used. Regions of this last kind would simply be hard to sequence, even though they are probably rather complex.
8.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

As far as I remember, NCBI BLAST uses a measure based on Shannon entropy to determine the complexity of a sequence.
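
To make the idea concrete, here is a minimal Python sketch of Shannon entropy computed over the base composition of a sequence. It is an illustration of the measure only, not a reproduction of whatever BLAST does internally; the function name `shannon_entropy` is an assumption for this example.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per base) of the base composition of seq.

    Illustrative only: this looks at single-base frequencies and ignores
    the order of the bases.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq.upper())
    # H = sum over symbols of p * log2(1 / p), with p = count / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


if __name__ == "__main__":
    for s in ["TTTTTTTTTT", "TATATATATA", "TACGTACG"]:
        print(s, round(shannon_entropy(s), 3))
```

A homopolymer scores 0 bits and an even mix of all four bases scores 2 bits. Because only composition is counted, a perfect TA repeat scores the same as any other sequence that is half T and half A, so practical filters usually apply such measures to short words (k-mers) within a sliding window rather than to single bases.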
