Optimizing DNA Sequence Subset Selection for Balanced Base Distribution: Seeking Solutions for Small N Values

Introduction:

Greetings,

I am working on an automation task for a research project at my institute. It is a problem that regularly occurs in DNA sequencing. My code works, but fails to find solutions once N gets small. The smallest N I have found a solution for so far is N=11, but I would ideally like to get down to N=6.

Any help would be greatly appreciated.

Code and data can be found here: https://github.com/Kodemannen/index-calculator

Problem statement:

I have a dataset of DNA sequences, each with a length of 10 bases. My goal is to select a subset of N sequences (where N >= 6) such that all four bases (A, C, G and T) are fairly balanced across all 10 base positions.

Moreover, the dataset actually has two columns of sequences, which we can call column A and B, and we want to select a subset such that the criterion is satisfied for both columns separately.

Data representation:

The dataset is structured as follows:

index_name:     col_A:          col_B:
SI-TS-A1        AATTTCGGGT      GAGGAGAGAG
SI-TS-A2        CTACTGAATT      AGCGCTAGCG
...             ...             ...
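For illustration, this layout can be read into plain Python lists along these lines (a minimal sketch: the inline sample stands in for the actual data file in the repository, and the tab-delimited format is an assumption):

```python
import csv
import io

# Inline sample mirroring the table above; in practice, open the actual
# data file from the repository instead of this StringIO.
raw = (
    "index_name\tcol_A\tcol_B\n"
    "SI-TS-A1\tAATTTCGGGT\tGAGGAGAGAG\n"
    "SI-TS-A2\tCTACTGAATT\tAGCGCTAGCG\n"
)
rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))

# One list of sequences per column, indexed by dataset row.
col_a = [r["col_A"] for r in rows]
col_b = [r["col_B"] for r in rows]
```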

Evaluating a solution:

For example, a solution candidate (here for N=6) might look like this (index_name column omitted):

col_A:          col_B:
ACCGACTGAT      GAACCCAGGA
CAGGACTCTA      CCATCCTACG
CAAGCGTTAG      CGCATTCGCG
GTTAAACCGC      CAAGCCGGCT
TATGTTGTGC      ATTAAGCCAT
TACATATGCT      TCACGATTAC

To evaluate, we look at column A and B as two distinct sets that each need to satisfy the base distribution criterion.

Let's take column A and walk through the evaluation. First we count the number of bases of each type at each position, as shown in the following table:

Position |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  |  10
=====================================================================
Base A   |  1  |  4  |  1  |  2  |  3  |  2  |  0  |  0  |  2  |  1
Base C   |  2  |  1  |  2  |  0  |  1  |  2  |  1  |  2  |  1  |  2
Base G   |  1  |  0  |  1  |  4  |  0  |  1  |  1  |  2  |  2  |  1
Base T   |  2  |  1  |  2  |  0  |  2  |  1  |  4  |  2  |  1  |  2

By dividing all the values by N, we get the fractional representations, as displayed below:

Position |  1   |   2  |   3  |   4  |   5  |   6  |   7  |   8  |   9  |  10
===============================================================================
Base A   | 0.17 | 0.67 | 0.17 | 0.33 | 0.50 | 0.33 | 0.00 | 0.00 | 0.33 | 0.17
Base C   | 0.33 | 0.17 | 0.33 | 0.00 | 0.17 | 0.33 | 0.17 | 0.33 | 0.17 | 0.33
Base G   | 0.17 | 0.00 | 0.17 | 0.67 | 0.00 | 0.17 | 0.17 | 0.33 | 0.33 | 0.17
Base T   | 0.33 | 0.17 | 0.33 | 0.00 | 0.33 | 0.17 | 0.67 | 0.33 | 0.17 | 0.33

The distribution criterion is then the following:

For each of the 10 positions, we want at least 12.5% and no more than 40% representation for each base. We also require that the median across all positions (for each base) should be in the same interval.

The same is then repeated for column B. If both columns pass, then the solution is valid.

In other words: All numbers in the above table must be between 0.125 and 0.4 for a solution candidate to pass the test.

We can see that the above candidate does not pass the test, as several elements are outside the desired interval.
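The check above can be sketched in Python as follows — a minimal sketch of the criterion as stated (the function names are mine, not the repository's):

```python
from statistics import median

BASES = "ACGT"
LOW, HIGH = 0.125, 0.4  # the 12.5% / 40% bounds

def base_fractions(seqs):
    """For each base, the fraction of sequences carrying it at each position.

    Returns {base: [fraction at position 0, ..., position L-1]}.
    """
    n, length = len(seqs), len(seqs[0])
    return {
        b: [sum(s[i] == b for s in seqs) / n for i in range(length)]
        for b in BASES
    }

def column_passes(seqs):
    """True iff every per-position fraction and every per-base median
    lies in [LOW, HIGH]."""
    for row in base_fractions(seqs).values():
        if not all(LOW <= f <= HIGH for f in row):
            return False
        if not LOW <= median(row) <= HIGH:
            return False
    return True

def candidate_passes(col_a_seqs, col_b_seqs):
    """A candidate is valid only if both columns pass separately."""
    return column_passes(col_a_seqs) and column_passes(col_b_seqs)
```

Running `column_passes` on the example column A above returns False, matching the table (e.g. base A at position 2 is at 0.67, above the 0.4 bound).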

What I've tried:

I've attempted various approaches:

  • Random search works fine for large N, but starts taking a long time around N = 25.

  • Evolutionary algorithm. This works well until N gets down to about 12, where it starts struggling.

  • Evolutionary algorithm with steps of greedy search between each generation. This works a little better than the evolutionary algorithm alone; it found a solution for N=11, which is the lowest N I have found a solution for thus far.

  • I also tried initializing individuals by maximizing the Levenshtein distance between their component sequences, but it did not help much.

I am not sure how to improve this further (except that I could probably speed it up by using more built-in functions from the DEAP library). It could be that I just need better hyperparameters, but I don't have much experience with these types of algorithms, so I am kind of stuck.

More details:

  • An individual in the population is one solution candidate, represented by a set of N unique indices, each index corresponding to one row in the dataset (meaning a pair of sequences).
  • The fitness function is defined as the distance to the desired interval.

    • I did this in the following way:
    • For each base at each position (ref. the above table), if the value lies outside the desired interval, its distance to the interval is calculated; these distances are summed over all positions and bases into a final scalar value.
    • To also account for the median criterion in the fitness function, the same is done for the medians (taking each median's distance to its desired interval), and this is added (with a custom weight parameter) to the previously discussed fitness value.
    • The fitness is thus defined such that we are minimizing it.
  • Parents are selected probabilistically, where we weigh the probability of selection using the fitnesses.

  • Offspring are created using a single-point crossover recombination function, and mutation is then applied.

  • Elitism is implemented by always letting the top individual survive to the next generation.

  • Greedy search is run every generation by iterating over all individuals in the population: for each individual, we select a random index (gene) and replace it with a new random one. If the fitness of the new individual is better, we keep the change, else we discard it. This is repeated a set number of times per individual (controlled by the improvement_steps variable).

  • If nothing else works, we can loosen the criteria by ignoring the last two positions. This is implemented using the ignore_last_k_positions parameter.
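Putting the fitness definition and the greedy step together, a minimal sketch might look like this (MEDIAN_WEIGHT and the function names are stand-ins, not the repository's actual code):

```python
import random
from statistics import median

BASES = "ACGT"
LOW, HIGH = 0.125, 0.4
MEDIAN_WEIGHT = 1.0  # stand-in for the custom weight parameter

def interval_distance(x):
    """Distance from x to the interval [LOW, HIGH]; zero when x is inside."""
    return max(LOW - x, 0.0) + max(x - HIGH, 0.0)

def column_fitness(seqs):
    """Summed interval distances for one column: per-position fractions
    plus the weighted per-base medians."""
    n, length = len(seqs), len(seqs[0])
    total = 0.0
    for b in BASES:
        row = [sum(s[i] == b for s in seqs) / n for i in range(length)]
        total += sum(interval_distance(f) for f in row)
        total += MEDIAN_WEIGHT * interval_distance(median(row))
    return total

def fitness(indices, col_a, col_b):
    """Lower is better; 0.0 means the candidate satisfies every criterion."""
    return sum(
        column_fitness([col[i] for i in indices]) for col in (col_a, col_b)
    )

def greedy_steps(indices, col_a, col_b, n_rows, improvement_steps=10):
    """Per-individual local search: swap one random gene for a random
    unused row index, keeping the change only if fitness improves."""
    current = list(indices)
    best = fitness(current, col_a, col_b)
    for _ in range(improvement_steps):
        pos = random.randrange(len(current))
        new_idx = random.randrange(n_rows)
        if new_idx in current:  # keep the N indices unique
            continue
        trial = current.copy()
        trial[pos] = new_idx
        f = fitness(trial, col_a, col_b)
        if f < best:
            current, best = trial, f
    return current, best
```

Since only improving swaps are accepted, the fitness returned by `greedy_steps` is never worse than the individual it started from.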

Running the code:

The code is written in Python, using the DEAP library for evolutionary algorithms.

Probably best to install the dependencies with Anaconda, as that's what I did.

You can create an environment called envname from the .yml file with the following command:

conda env create --name envname --file environment.yml

Then activate the environment by

conda activate envname

and finally run the code with

python3 main.py