Question: (Closed) Analyzing pair by pair DNA sequences with for loops in python : could it be faster?
gravatar for nlehmann
4.2 years ago by
nlehmann110 wrote:

Hi !

Here is a code I would need to make far more faster: it is made to analyse more than 3000 DNA sequences of more than 100 000 characters each.

The matter is I have to compare it by pair, so that would be around 3000*2999/2 = 4498500 calculation and each takes 1 to 2 seconds...

Could you think of another way of doing it ? I've seen that there are algorithms that are used to go faster (like Boyer-Moore:, but could we apply it here ? Or replace the for loops with something faster ? Any idea would be more than welcome.

import numpy as np
import time
import random

def calculate_distance(genom1, genom2): 
    # in : two strings (sequences)
    # out : score between those 2 strings

    # example : 'AATTGCAT' compared to 'AAAACGTC'
    # => 'AA' and 'AA' = 2
    # => 'TT' and 'AA' = 0
    # => 'GC' and 'CG' = 2
    # => 'AT' and 'TC' = 1

    score = 0   
    for i in range(0, len(genom1), 2):
        if (genom1[i] in genom2[i:i+2]) or (genom1[i+1] in genom2[i:i+2]):
            if sorted(genom1[i:i+2]) == sorted(genom2[i:i+2]):
                score += 2
            else :
                score += 1
    return score

def build_matrix(sequences, N):
    # in : list of lists 
    # out : matrix of scores between each pair of sequences    
    matrix = np.zeros((N, N))
    for i in range(N):
        for j in range(i, N):
            matrix[i][j] = calculate_distance(sequences[i], sequences[j])       
    return matrix

def test(len_seq, N):
    sequences = []
    for i in range(N):
        sequences.append(''.join(random.choice(['0','A', 'T', 'C', 'G']) for x in range(len_seq)))

    start = time.clock()
    matrix = build_matrix(sequences, N)
    elapsed = time.clock() - start
    print('Run time for ' + str(N) + ' sequences of ' + str(len_seq) + ' characters : computed in ' + str(elapsed) + ' seconds')
    return matrix

test(10**6, 2)

Returns :

> Run time for 2 sequences of 1000000 characters : computed in 2.742817
> seconds
sequence python • 1.1k views
ADD COMMENTlink written 4.2 years ago by nlehmann110

This is not strictly a bioinformatics question, and you would be better served by asking it on a dedicated programming forum. However, this will run in pypy so you might as well just run it there for a speedup.

Via python:

Run time for 2 sequences of 1000000 characters : computed in 3.887798 seconds

Via pypy:

Run time for 2 sequences of 1000000 characters : computed in 0.757249 seconds
ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by John12k
Please log in to add an answer.
The thread is closed. No new answers may be added.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2249 users visited in the last hour