Question

Question on number of possible alignments

3

Entering edit mode

9.4 years ago

freezedry ▴ 30

Hello all. For two sequences with length m, it is known that there are >3^m possible alignments. I am aware that it is due to aligning using match/mismatch or indels.

But is there a way to know how many possible alignments are there for 3 sequences of length m?

Thank you all for your time.

alignment • 9.5k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 9.4 years ago by freezedry ▴ 30

Ram · Answer 1 · 2016-02-26

3

Entering edit mode

9.4 years ago

Andrzej Zielezinski 11k

NUMBER OF POSSIBLE PAIRWISE ALIGNMENTS: The number of pairwise alignments increases factorially with the lengths of the sequences. If you consider Non-Boring Alignments(NBA) (where gaps are always paired with nucleotides) the number of all possible alignments of two sequences of lengths m and n is (m+n)!/m!*n!, so the formular for two sequences of equal length is so (2n)!/(n!)^2 which is often approximated (for practical reasons) to 2^2n/(π*n)^-0.5 [Lange (2002), Eddy, Nature Biotechnology (2004)].

For some small sequences' lengths, the number of possible alignments is already too big; for example fot two sequences of 11 nucleotides, there are 705,432 possible alignments, and for two sequences of 100 nucleotides, the number is astronomical (> 10^60).

NUMBER OF POSSIBLE MULTIPLE SEQUENCE ALIGNMENTS:

As for the number of possible alignments for more than two sequences, you can find the formula in Slowinski, Mol. Phylogenet. Evol (1998). Accordingly, there are 16,081 different alignments for 3 sequences of 3 characters each and 1.05*10^18 different alignments for 5 sequences of length of 5 nucleotides.

As a result, the calculation of an exact alignment becomes infeasible for relatively small sets of comparatively short sequences. For example, the dynamic programming algorithm implemented in the MSA program (Lipman et al., 1989) allows optimal alignment of up to roughly 10 protein sequences of small lengths of 200-300 residues.

ADD COMMENT • link updated 6.8 years ago by Ram 45k • written 9.4 years ago by Andrzej Zielezinski 11k

1

Entering edit mode

That's wrong. According to your formula, two length-2 strings have 6 possible alignments. Actually, they could align in any of these ways (where M=match/mismatch, C=clip, D=deletion, I=insertion):

MM MID MDI IDM DIM IIDD IDDI DDII IDID DIDI DIID CMC CDIC CIDC

This assumes the sequences overlap and no operations are allowed for characters not present in some input string; otherwise, there would be infinite possible alignments.

ADD REPLY • link updated 6.8 years ago by Ram 45k • written 9.4 years ago by Brian Bushnell 20k

1

Entering edit mode

The number of possible pairwise alignments is a Delannoy number and is equivalent to traversing a grid of size m by n from the bottom-left corner to the top-right corner where horizontal, vertical, and diagonal steps are allowed (the traceback in the Needleman-Wunsch algorithm).

ADD REPLY • link 3.4 years ago by gannondu ▴ 10

0

Entering edit mode

How do you define clip?

ADD REPLY • link 5.9 years ago by sgupt46 • 0

Ram · Answer 2 · 2016-02-25

There are far more than 3^m alignments for 2 strings. They can generate cigar strings (of match/mismatch, insertion, deletion, clip) symbols of length m to 2m (or 2m-1 if you require overlap), with some restrictions (for example, clips must be consecutive and on the end). 3^(2m) sounds realistic. It's definitely less than 4^(2m), and probably greater than 2^(2m). Let's suppose a constant Cwhich is greater than 2 and less than 4; the formula is C^(nm), where n is the number of strings and m is the length... I think. That would simplify to C^(sum of lengths) as long as you have at least two sequences.

You can also think of it like this:

You have N strings, and thus N specific alignments to a reference. That reference can only contain bases present in the strings you are aligning; thus, it can be at most N*(M-1)+1 bases long and at minimum 1 base long (if we assume these sequences overlap; otherwise it would be N*M and 0). But they can be arbitrarily interleaved. Interleavings which result in no two strings ever sharing a base are the worst-case scenario, which I expect to dominate the state-space.