Question

New To Bioinformatics, What Is A 6-Mer When It Comes To Alignment Algorithms (And When Do I Choose Which Ones To Use)?

3

10.7 years ago

Matt ▴ 30

I searched a bit first but couldn't find a quick answer. In the Biopython documentation I came across this:

#Distance is calculated based on the number of shared 6mers.

This is the default alignment algorithm for MAFFT. What is a 6mer?

Also when should I choose the alternatives over the 6mer pairwise alignment? Those alternatives are:

Needleman-Wunsch (global pairwise)
Smith-Waterman (local pairwise)
Local pairwise with generalized affine gap cost (Altschul 1998)

sequence alignment python • 8.7k views

Updated 2.4 years ago by Ram 43k • written 10.7 years ago by Matt ▴ 30

4

You need to consult the MAFFT documentation for details; the BioPython documentation is just a basic description of their interface to MAFFT.

10.7 years ago by Neilfws 49k

0

Thanks, that page does spell all this out much more clearly. Don't know why I didn't think of that first. Guess I had too much on my mind at the moment.

10.7 years ago by Matt ▴ 30

0

It's OK. The reasons behind this might not be clear, or they might be explicit in the manual. I'm always surprised how many accepted conventions are based on random choices or arbitrary values.

10.7 years ago by Josh Herr 5.8k

Ram · Accepted Answer · 2014-07-26

A 6-mer is a contiguous sub-sequence (a substring) of 6 letters (nucleotides or amino acids).

For example, a length-10 DNA sequence contains 5 overlapping 6-mers: the ones using bases 1..6, 2..7, 3..8, 4..9, and 5..10. In general, a sequence of length L contains L - k + 1 overlapping k-mers. Think of a sliding window of length 6 moving across the sequence one letter at a time.
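The sliding window can be sketched in a few lines of Python. This is only an illustration: the function name `kmers` and the example sequence are made up here, not taken from MAFFT or Biopython.

```python
def kmers(seq, k=6):
    """Return every length-k window of seq, moving one letter at a time."""
    # A sequence of length L yields L - k + 1 windows (none if L < k).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seq = "ACGTACGTAC"  # made-up length-10 DNA sequence
for start, kmer in enumerate(kmers(seq), start=1):
    # Report 1-based positions to match the 1..6, 2..7, ... notation.
    print(f"{start}..{start + 5}: {kmer}")
```

Note that the windows overlap, and the same 6-mer can appear more than once (here `ACGTAC` occurs at positions 1..6 and 5..10).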

Aligning two sequences with the alternatives you list gives the optimal alignment under the chosen scoring scheme, but it takes time proportional to the product of the lengths of the two sequences, i.e. O(m × n) for lengths m and n. The distance can then be derived from the alignment score.
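To see where the length-times-length cost comes from, here is a minimal sketch of the Needleman-Wunsch global alignment score. It assumes a simple scoring scheme picked for illustration (match +1, mismatch -1, linear gap penalty -1); real tools use substitution matrices and affine gap costs, and `nw_score` is a hypothetical name.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b (illustrative scoring values)."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):          # one pass per letter of a ...
        curr = [i * gap]
        for j in range(1, len(b) + 1):      # ... times one step per letter of b
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(nw_score("GATTACA", "GATTACA"))
print(nw_score("ACGT", "ACT"))
```

The nested loops visit every pair of positions once, which is the m × n work mentioned above. Only the score is computed here; recovering the alignment itself would need the full table or a traceback.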

To speed this up when the sequences are reasonably similar, another way is to count the number of k-mers the two sequences share. Going by the documentation you quote, MAFFT uses k=6 for this. Counting shared k-mers is much quicker than computing a full alignment.
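A rough sketch of the idea in Python follows. This is not MAFFT's actual formula or implementation: the normalisation (one minus the shared fraction) and the names `kmer_counts`, `shared_kmers` and `kmer_distance` are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count how often each k-mer occurs in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a, b, k=6):
    """Number of k-mers common to a and b, counting repeats."""
    # Counter & Counter keeps the minimum count of each shared k-mer.
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())


def kmer_distance(a, b, k=6):
    """Assumed normalisation: 0.0 = all k-mers shared, 1.0 = none shared."""
    n = min(len(a), len(b)) - k + 1  # k-mers in the shorter sequence
    if n <= 0:
        raise ValueError("both sequences must be at least k letters long")
    return 1.0 - shared_kmers(a, b, k) / n


print(kmer_distance("ACGTACGTAC", "ACGTACGTAC"))
print(kmer_distance("ACGTACGTAC", "ACGTAGGTAC"))  # one substitution at position 6
```

Each sequence is scanned once, so the cost grows with the sum of the lengths rather than their product. The second call also shows the weakness: a single substitution in the middle of a length-10 sequence changes every one of its five 6-mers, so no 6-mers are shared at all.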

The decision is a trade-off between computation time and sensitivity. k=6 might be OK for protein but may fail for DNA if the sequences have diverged substantially.
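A back-of-the-envelope calculation shows why divergence hurts. Under the simplifying assumption that substitutions hit each position independently (and ignoring indels and chance matches), a k-mer survives unchanged only if all k of its letters do, so the expected fraction of intact k-mers is the per-site identity raised to the power k. The function name `kmer_survival` is made up for this sketch.

```python
def kmer_survival(identity, k=6):
    """Expected fraction of k-mers left unchanged at a given per-site identity.

    Assumes independent substitutions at each position; no indels.
    """
    return identity ** k


for identity in (0.9, 0.8, 0.7):
    frac = kmer_survival(identity)
    print(f"{identity:.0%} identity: {frac:.1%} of 6-mers expected intact")
```

Under this assumption, roughly 53% of 6-mers remain intact at 90% identity, 26% at 80%, and 12% at 70%, so the shared-6-mer signal fades quickly as sequences diverge.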