Question: New To Bioinformatics, What Is A 6-Mer When It Comes To Alignment Algorithms (And When Do I Choose Which Ones To Use)?
3
7.1 years ago by
Matt30
United States
Matt30 wrote:

I've tried to search a bit first, but have been unable to find a quick answer. In the Biopython documentation I came across this:

```
# Distance is calculated based on the number of shared 6mers.
```

This is the default alignment algorithm for MAFFT. What is a 6mer?

Also when should I choose the alternatives over the 6mer pairwise alignment? Those alternatives are:

• Needleman-Wunsch (global pairwise)
• Smith-Waterman (local pairwise)
• Local pairwise with generalized affine gap cost (Altschul 1998)
modified 6.0 years ago by Biostar ♦♦ 20 • written 7.1 years ago by Matt30
4

You need to consult the MAFFT documentation for details; the BioPython documentation is just a basic description of their interface to MAFFT.

Thanks, that page does seem to spell all this out much more clearly. Don't know why I didn't think of that first. Guess I had too much on my mind at the moment.

It's ok, the reasons behind this might not be clear or they might be explicit in the manual. I'm always surprised how many accepted common conventions are based on random choices or arbitrary values.

6
6.2 years ago by
Torst960
Australia
Torst960 wrote:

A 6-mer is a subsequence of 6 letters (nucleotides or amino-acids).

For example, a length 10 DNA sequence has 5 possible 6-mers: the sequences using bases 1..6, 2..7, 3..8, 4..9, and 5..10.  Think of a sliding window of length 6 moving across the sequence one letter at a time.
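The sliding window can be sketched in a few lines of Python. The function name `kmers` and the example sequence are just illustrations, not anything taken from MAFFT or Biopython:

```python
def kmers(seq, k=6):
    """Return every overlapping k-mer of seq, left to right."""
    # A sequence of length n has n - k + 1 windows of length k
    # (and none at all if it is shorter than k).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


windows = kmers("ACGTACGTAC")   # made-up sequence of length 10
print(len(windows))             # 10 - 6 + 1 = 5
print(windows)
```

Note that the same 6-mer can appear more than once in a sequence, so a k-mer counter usually has to decide whether repeats count.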

Aligning two sequences with any of the alternatives you list is the "optimal" way, but it takes time proportional to the product of the lengths of the two sequences. The distance can then be derived from the alignment score.
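To see where the product of the lengths comes from, here is a minimal sketch of the Needleman-Wunsch score calculation. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration; real tools use substitution matrices and affine gap costs:

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b with a linear gap penalty."""
    # prev holds the previous row of the dynamic-programming table.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):          # len(a) rows ...
        curr = [i * gap]
        for j in range(1, len(b) + 1):      # ... times len(b) columns
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(nw_score("ACGT", "ACGT"))
print(nw_score("ACGT", "AGT"))
```

The two nested loops visit every cell of a `len(a)` by `len(b)` table once, which is why the cost grows with the product of the two lengths.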

To speed this up when the sequences are reasonably similar, another approach is to count the number of shared k-mers; judging by the comment you quote, MAFFT uses k=6. This can be computed much more quickly than a full alignment.
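A rough sketch of the shared k-mer idea follows. This is not MAFFT's actual formula (check the MAFFT documentation for that); the normalisation in `kmer_distance` is a made-up one, used only to turn the shared count into a number between 0 and 1:

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count how often each k-mer occurs in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a, b, k=6):
    """Number of k-mers common to a and b, counting repeats."""
    # Counter & Counter keeps the minimum count of each k-mer.
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())


def kmer_distance(a, b, k=6):
    """Hypothetical distance: 1 - shared / (k-mers in the shorter sequence)."""
    n = min(len(a), len(b)) - k + 1
    if n <= 0:
        raise ValueError("both sequences must be at least k letters long")
    return 1.0 - shared_kmers(a, b, k) / n


print(kmer_distance("ACGTACGTAC", "ACGTACGTAC"))  # identical sequences
print(kmer_distance("ACGTACGTAC", "TTTTTTTTTT"))  # nothing in common
```

Each sequence is scanned once to build its counts, so the work grows with the sum of the lengths rather than their product.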

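The decision is a trade-off between computation time and sensitivity. k=6 might be OK for protein, but may fail for DNA if the sequences have diverged substantially: a single substitution changes up to six overlapping 6-mers, so the shared count drops quickly.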