Question: New To Bioinformatics, What Is A 6-Mer When It Comes To Alignment Algorithms (And When Do I Choose Which Ones To Use)?
7.5 years ago, Matt (30), United States, wrote:

I've searched a bit first, but have been unable to find a quick answer. In the Biopython documentation I came across this:

#Distance is calculated based on the number of shared 6mers.

This is the default alignment algorithm for MAFFT. What is a 6mer?

Also, when should I choose one of the alternatives over the 6mer pairwise method? Those alternatives are:

  • Needleman-Wunsch (global pairwise)
  • Smith-Waterman (local pairwise)
  • Local pairwise with generalized affine gap cost (Altschul 1998)
modified 6.4 years ago by Biostar

You need to consult the MAFFT documentation for details; the Biopython documentation is just a basic description of its interface to MAFFT.

modified 7.5 years ago • written 7.5 years ago by Neilfws (49k)

Thanks, that page does seem to spell all this out much more clearly. Don't know why I didn't think of that first. Guess I had too much on my mind at the moment.

written 7.5 years ago by Matt (30)

It's ok, the reasons behind this might not be clear or they might be explicit in the manual. I'm always surprised how many accepted common conventions are based on random choices or arbitrary values.

written 7.5 years ago by Josh Herr (5.7k)

6.6 years ago, Torst (960) wrote:

A 6-mer is a contiguous subsequence of 6 letters (nucleotides or amino acids).

For example, a DNA sequence of length 10 contains 5 overlapping 6-mers: the subsequences covering bases 1..6, 2..7, 3..8, 4..9, and 5..10. Think of a sliding window of length 6 moving across the sequence one letter at a time.
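
To make the sliding window concrete, here is a minimal Python sketch. The function name `kmers` and the example sequence are made up for illustration; they are not part of MAFFT or Biopython.

```python
def kmers(seq, k=6):
    """Return every length-k window of seq, moving one letter at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seq = "ACGTACGTAC"  # hypothetical length-10 DNA sequence
for start, word in enumerate(kmers(seq), start=1):
    # Report 1-based coordinates, as in the text above (1..6, 2..7, ...)
    print(f"{start}..{start + len(word) - 1}: {word}")
```

In general a sequence of length n contains n - k + 1 k-mers, so anything shorter than k letters yields none.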

Aligning two sequences with the alternatives you list is the "optimal" way, but it takes time proportional to the product of the lengths of the two sequences. The distance can then be derived from the alignment score.
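
To show where the product of the lengths comes from, here is a rough sketch of the Needleman-Wunsch score calculation. It assumes a simple linear gap penalty and made-up scores (match +1, mismatch -1, gap -1); MAFFT's own scoring is more elaborate, so treat this as an illustration only.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch, linear gap penalty)."""
    # prev holds the previous row of the dynamic-programming table
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two letters, gap in b, gap in a
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(nw_score("ACGT", "AGT"))
```

The two nested loops fill one table cell per pair of positions, which is len(a) * len(b) cells in total.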

To speed this up for the case where the sequences are reasonably similar, another way is to count the number of shared k-mers. By default MAFFT uses k=6. This can be done much more quickly than a full alignment.
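
A small sketch of the idea of counting shared k-mers follows. The normalisation used in `kmer_distance` is an assumption chosen for simplicity; it is not the formula MAFFT uses, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count how often each k-mer occurs in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a, b, k=6):
    """Number of k-mers common to both sequences, counting multiplicity."""
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())


def kmer_distance(a, b, k=6):
    """1 - shared / (k-mers in the shorter sequence): 0 = all shared, 1 = none."""
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must be at least k letters long")
    return 1 - shared_kmers(a, b, k) / denom


print(kmer_distance("ACGTACGTAC", "ACGTACGTAC"))
print(kmer_distance("ACGTACGTAC", "ACGTTCGTAC"))  # one substitution at base 5
```

Each sequence is scanned once to build its counts, so the work grows with the sum of the lengths rather than their product. The second call also shows the cost: a single substitution that sits inside every window removes every shared 6-mer.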

The decision is a trade-off between computation time and sensitivity. k=6 might be OK for protein but may fail for DNA if the sequences have diverged much, because fewer intact 6-mers are left to be shared.
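
Some back-of-the-envelope arithmetic behind that trade-off, as a sketch. It assumes substitutions hit each position independently, which real sequences do not strictly obey, and the function names are made up for illustration.

```python
def possible_kmers(alphabet_size, k=6):
    """How many distinct k-letter words an alphabet allows."""
    return alphabet_size ** k


def window_survival(identity, k=6):
    """Chance that a k-letter window has no substitution, assuming
    independent sites with the given per-site identity."""
    return identity ** k


print(possible_kmers(4))   # DNA alphabet
print(possible_kmers(20))  # amino acid alphabet
for identity in (0.95, 0.8, 0.6):
    print(identity, round(window_survival(identity), 3))
```

Under this assumption, at 80% identity only about a quarter of 6-letter windows survive unchanged, so the shared 6-mer count drops quickly as sequences diverge.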

