Here is a code I would need to make far more faster: it is made to analyse more than 3000 DNA sequences of more than 100 000 characters each.
The matter is I have to compare it by pair, so that would be around 3000*2999/2 = 4498500 calculation and each takes 1 to 2 seconds...
Could you think of another way of doing it ? I've seen that there are algorithms that are used to go faster (like Boyer-Moore: https://en.wikipedia.org/wiki/String_searching_algorithm), but could we apply it here ? Or replace the for loops with something faster ? Any idea would be more than welcome.
import numpy as np import time import random def calculate_distance(genom1, genom2): # in : two strings (sequences) # out : score between those 2 strings # example : 'AATTGCAT' compared to 'AAAACGTC' # => 'AA' and 'AA' = 2 # => 'TT' and 'AA' = 0 # => 'GC' and 'CG' = 2 # => 'AT' and 'TC' = 1 score = 0 for i in range(0, len(genom1), 2): if (genom1[i] in genom2[i:i+2]) or (genom1[i+1] in genom2[i:i+2]): if sorted(genom1[i:i+2]) == sorted(genom2[i:i+2]): score += 2 else : score += 1 return score def build_matrix(sequences, N): # in : list of lists # out : matrix of scores between each pair of sequences matrix = np.zeros((N, N)) for i in range(N): for j in range(i, N): matrix[i][j] = calculate_distance(sequences[i], sequences[j]) return matrix def test(len_seq, N): sequences =  for i in range(N): sequences.append(''.join(random.choice(['0','A', 'T', 'C', 'G']) for x in range(len_seq))) start = time.clock() matrix = build_matrix(sequences, N) elapsed = time.clock() - start print('Run time for ' + str(N) + ' sequences of ' + str(len_seq) + ' characters : computed in ' + str(elapsed) + ' seconds') return matrix test(10**6, 2)
> Run time for 2 sequences of 1000000 characters : computed in 2.742817 > seconds