Question

Off topic:Analyzing pair by pair DNA sequences with for loops in python : could it be faster?

0

Entering edit mode

7.4 years ago

nlehmann ▴ 140

Hi !

Here is a code I would need to make far more faster: it is made to analyse more than 3000 DNA sequences of more than 100 000 characters each.

The matter is I have to compare it by pair, so that would be around 3000*2999/2 = 4498500 calculation and each takes 1 to 2 seconds...

Could you think of another way of doing it ? I've seen that there are algorithms that are used to go faster (like Boyer-Moore: https://en.wikipedia.org/wiki/String_searching_algorithm), but could we apply it here ? Or replace the for loops with something faster ? Any idea would be more than welcome.

import numpy as np
import time
import random



def calculate_distance(genom1, genom2): 
    # in : two strings (sequences)
    # out : score between those 2 strings

    # example : 'AATTGCAT' compared to 'AAAACGTC'
    # => 'AA' and 'AA' = 2
    # => 'TT' and 'AA' = 0
    # => 'GC' and 'CG' = 2
    # => 'AT' and 'TC' = 1

    score = 0   
    for i in range(0, len(genom1), 2):
        if (genom1[i] in genom2[i:i+2]) or (genom1[i+1] in genom2[i:i+2]):
            if sorted(genom1[i:i+2]) == sorted(genom2[i:i+2]):
                score += 2
            else :
                score += 1
    return score


def build_matrix(sequences, N):
    # in : list of lists 
    # out : matrix of scores between each pair of sequences    
    matrix = np.zeros((N, N))
    for i in range(N):
        for j in range(i, N):
            matrix[i][j] = calculate_distance(sequences[i], sequences[j])       
    return matrix


def test(len_seq, N):
    sequences = []
    for i in range(N):
        sequences.append(''.join(random.choice(['0','A', 'T', 'C', 'G']) for x in range(len_seq)))

    start = time.clock()
    matrix = build_matrix(sequences, N)
    elapsed = time.clock() - start
    print('Run time for ' + str(N) + ' sequences of ' + str(len_seq) + ' characters : computed in ' + str(elapsed) + ' seconds')
    return matrix

test(10**6, 2)

Returns :

> Run time for 2 sequences of 1000000 characters : computed in 2.742817
> seconds

python sequence • 1.6k views

ADD COMMENT • link 7.4 years ago by nlehmann ▴ 140