Question: (Closed) Analyzing pair by pair DNA sequences with for loops in python : could it be faster?

0

nlehmann •

**110**wrote:Hi !

Here is a code I would need to make far more faster: it is made to analyse more than 3000 DNA sequences of more than 100 000 characters each.

The matter is I have to compare it by pair, so that would be around 3000*2999/2 = 4498500 calculation and each takes 1 to 2 seconds...

Could you think of another way of doing it ? I've seen that there are algorithms that are used to go faster (like Boyer-Moore: https://en.wikipedia.org/wiki/String_searching_algorithm), but could we apply it here ? Or replace the for loops with something faster ? Any idea would be more than welcome.

```
import numpy as np
import time
import random
def calculate_distance(genom1, genom2):
# in : two strings (sequences)
# out : score between those 2 strings
# example : 'AATTGCAT' compared to 'AAAACGTC'
# => 'AA' and 'AA' = 2
# => 'TT' and 'AA' = 0
# => 'GC' and 'CG' = 2
# => 'AT' and 'TC' = 1
score = 0
for i in range(0, len(genom1), 2):
if (genom1[i] in genom2[i:i+2]) or (genom1[i+1] in genom2[i:i+2]):
if sorted(genom1[i:i+2]) == sorted(genom2[i:i+2]):
score += 2
else :
score += 1
return score
def build_matrix(sequences, N):
# in : list of lists
# out : matrix of scores between each pair of sequences
matrix = np.zeros((N, N))
for i in range(N):
for j in range(i, N):
matrix[i][j] = calculate_distance(sequences[i], sequences[j])
return matrix
def test(len_seq, N):
sequences = []
for i in range(N):
sequences.append(''.join(random.choice(['0','A', 'T', 'C', 'G']) for x in range(len_seq)))
start = time.clock()
matrix = build_matrix(sequences, N)
elapsed = time.clock() - start
print('Run time for ' + str(N) + ' sequences of ' + str(len_seq) + ' characters : computed in ' + str(elapsed) + ' seconds')
return matrix
test(10**6, 2)
```

Returns :

```
> Run time for 2 sequences of 1000000 characters : computed in 2.742817
> seconds
```

This is not strictly a bioinformatics question, and you would be better served by asking it on a dedicated programming forum. However, this will run in pypy so you might as well just run it there for a speedup.

Via python:

Via pypy:

12k