Off topic:Analyzing pair by pair DNA sequences with for loops in python : could it be faster?
0
0
Entering edit mode
7.4 years ago
nlehmann ▴ 140

Hi !

Here is a code I would need to make far more faster: it is made to analyse more than 3000 DNA sequences of more than 100 000 characters each.

The matter is I have to compare it by pair, so that would be around 3000*2999/2 = 4498500 calculation and each takes 1 to 2 seconds...

Could you think of another way of doing it ? I've seen that there are algorithms that are used to go faster (like Boyer-Moore: https://en.wikipedia.org/wiki/String_searching_algorithm), but could we apply it here ? Or replace the for loops with something faster ? Any idea would be more than welcome.

import numpy as np
import time
import random



def calculate_distance(genom1, genom2): 
    # in : two strings (sequences)
    # out : score between those 2 strings

    # example : 'AATTGCAT' compared to 'AAAACGTC'
    # => 'AA' and 'AA' = 2
    # => 'TT' and 'AA' = 0
    # => 'GC' and 'CG' = 2
    # => 'AT' and 'TC' = 1

    score = 0   
    for i in range(0, len(genom1), 2):
        if (genom1[i] in genom2[i:i+2]) or (genom1[i+1] in genom2[i:i+2]):
            if sorted(genom1[i:i+2]) == sorted(genom2[i:i+2]):
                score += 2
            else :
                score += 1
    return score


def build_matrix(sequences, N):
    # in : list of lists 
    # out : matrix of scores between each pair of sequences    
    matrix = np.zeros((N, N))
    for i in range(N):
        for j in range(i, N):
            matrix[i][j] = calculate_distance(sequences[i], sequences[j])       
    return matrix


def test(len_seq, N):
    sequences = []
    for i in range(N):
        sequences.append(''.join(random.choice(['0','A', 'T', 'C', 'G']) for x in range(len_seq)))

    start = time.clock()
    matrix = build_matrix(sequences, N)
    elapsed = time.clock() - start
    print('Run time for ' + str(N) + ' sequences of ' + str(len_seq) + ' characters : computed in ' + str(elapsed) + ' seconds')
    return matrix

test(10**6, 2)

Returns :

> Run time for 2 sequences of 1000000 characters : computed in 2.742817
> seconds
python sequence • 1.6k views
ADD COMMENT
This thread is not open. No new answers may be added
Traffic: 1886 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6