The DJS algorithm was first proposed by Cohen et al as a method to identify large regions of distinct nucleotide composition (isochores) within eukaryotic genomes. This algorithm performed well in relation to alternative isochore prediction methods in subsequent comparisons.
I'm implementing the DJS algorithm from the description provided by Cohen et al (second subsection under Methods).
In this procedure, the chromosomes are recursively segmented by maximizing the difference in GC content between adjacent subsequences. The process of segmentation is terminated when the difference in GC content between two neighboring segments is no longer statistically significant.
Briefly, a chromosome of length L, GC content FGC, and AT content FAT = 1 − FGC is divided into two contiguous segments (i = 1, 2) of length li, GC content fiGC and AT content fiAT. These segments are chosen to maximize the Jensen-Shannon entropic divergence measure, DJS, defined as the difference between the overall Shannon entropy Htot and the sum of segment Shannon entropies Hi:
where Hi = -fiGC log2 fiGC - fiAT log2 fiAT and Htot = -FGC log2 FGC -FAT log2 FAT. The segmentation is then repeated recursively for each segment until a halting criterion DJS ≥ DC is met for all segments.
I fear I might have misinterpreted the intent of this section in my initial implementation. On each recursive call to the segmentation procedure, I am recalculating Htot for that particular segment, and then selecting two subsegments to maximize DJS within the scope of that segment. When I first read "halting criterion DJS ≥ DC is met for all segments", this is what I envisioned. However, perhaps "for all segments" means that a single DJS value is calculated across all segments, rather than a distinct DJS value being calculate for each segment separately. The use of summation notation definitely supports this idea.
Should I be calculating Htot once for the entire sequence and then calculating a single DJS value after each round of segmentation?
- Cohen N, Dagan T, Graur D. 2005. GC Composition of the Human Genome: In Search of Isochores. Molecular Biology and Evolution, 22(5), 1260-1272, doi:10.1093/molbev/msi115.
- Elhaik E, Graur D, Josić K. 2010. Comparative Testing of DNA Segmentation Algorithms Using Benchmark Simulations. Molecular Biology and Evolution, 27(5): 1015-1024, doi:10.1093/molbev/msp307.