I am asked to write a python code to get the best two alignments. respectively. How can I do that?
I am asked to write a python code to get the best two alignments. respectively. How can I do that?
A closed-form solution is offered in An exact formula for the number of alignments between two DNA sequences by Torres, Cabada, and Nieto. If the recurrence formula described in this paper seems reasonable, you could calculate this directly:
#!/usr/bin/env python
import sys
import scipy.special
def alignments(m, n):
if m < 0 or n < 0:
raise ValueError('m and n should be non-negative')
s = 0
b = min(m, n) + 1
for k in xrange(0, b):
s += (2 ** k) * scipy.special.binom(m, k) * scipy.special.binom(n, k)
return s
def main():
m = 4
n = 2
a = alignments(m, n)
assert (m == 4 and n == 2 and a == 41), "something went wrong!"
sys.stdout.write("alignments(%d, %d) = %d\n" % (m, n, a))
if __name__ == "__main__":
main()
The scipy.special.binom
function approximates its calculation of the binomial coefficient, which is probably going to be an issue for large m
and n
. If you're working with large values, you may want to look into using the Python long
type and consider the accuracy of the answer. The Stack Overflow link offers alternatives to the scipy.special
library that generate an exact calculation of the binomial coefficient, at the cost of a longer runtime.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I think this is not using edit distance scoring function, right?
The closed-form solution counts all gapped alignments, and it is derived from recurrence relations for alignments of sequences of lengths m and n. Gaps in an alignment would represent insertions or deletions relative to one or the other sequence. An edit distance function usually scores gaps and substitutions. The score shouldn’t necessarily matter in terms of number of total alignments, unless you need to count only some subset of all alignments under some cost or score threshold.