Question: Conflicting consensus sequences from different tools (bug in hmmemit?)
0
gravatar for jrj.healey
21 days ago by
jrj.healey2.4k
United Kingdom
jrj.healey2.4k wrote:

I'm in the process of creating concensus sequences, and my specific aim is to remove any ambiguous characters (for reasons that are unimportant to the question).

I couldn't find a simple tool that would do what I want satisfactorily, so I started writing my own (until someone reminded me about HMMs but more on that is a second).

I have the following MSA as a AlignIO object:

>>> print(msa)
SingleLetterAlphabet() alignment with 16 rows and 149 columns
MSTTPEQIAVEYPIPTYRFVVSLGDEQIPFNSVSGLDISHDVIE...QAA PAU_02775
MSTTPEQIAVEYPIPTYRFVVSIGDEQIPFNSVSGLDISHDVIE...QAA PLT_01696
MSTTPEQIAVEYPIPTYRFVVSIGDEQVPFNSVSGLDISHDVIE...QAA PAK_02606
MSTTPEQIAVEYPIPTYRFVVSIGDEKVPFNSVSGLDISHDVIE...QAA PLT_01736
MTTTT----VDYPIPAYRFVVSVGDEQIPFNNVSGLDITYDVIE...QAA PAK_01896
MATTT----VDYPIPAYRFVVSVGDEQIPFNSVSGLDITYDVIE...QAA PAU_02074
MSVTTEQIAVDYPIPTYRFVVSVGDEQIPFNNVSGLDITYDVIE...QAA PLT_02424
MTITPEQIAVDYPIPAYRFVVSVGDEKIPFNNVSGLDVHYDVIE...QAP PLT_01716
MAITPEQIAVEYPIPTYRFVVSVGDEQIPFNNVSGLDVHYDVIE...QAA PLT_01758
MSTSTSQIAVEYPIPVYRFIVSIGDDQIPFNSVSGLDINYDTIE...QAV PAK_03203
MSTSTSQIAVEYPIPVYRFIVSVGDEKIPFNSVSGLDISYDTIE...QAV PAU_03392
MSITQEQIAAEYPIPSYRFMVSIGDVQVPFNSVSGLDRKYEVIE...QVP PAK_02014
MSITQEQIAAEYPIPSYRFMVSIGDVQVPFNSVSGLDRKYEVIE...QVP PAU_02206
MSTTADQIAVQYPIPTYRFVVTIGDEQMCFQSVSGLDISYDTIE...EFH PAK_01787
MSTTADQIAVQYPIPTYRFVVTIGDEQMCFQSVSGLDISYDTIE...EFH PAU_01961
MSTTVDQIAVQYPIPTYRFVVTVGDEQMSFQSVSGLDISYDTIE...EFH PLT_02568
                      *

(Note the asterisk I've added for the moment).

If I build a concensus with hmmemit -c ("majority-rule consensus sequence"), I get the following sequence:

>hmmemit-consensus    *
MSTTAEQIAVEYPIPTYRFVVSVGDEQIPFNSVSGLDISYDVIEYKDGVGNYYKMPGQRQ
AINITLRKGVFSGDTKLFDWINSIQLNQVEKKDISISLTNEAGTEILLTWSVANAFPTSL
TSPSFDATSNEVAVQEISLTADRVTIQAA

At column 23 (22 if zero based) (the asterisk), hmmemit places a Valine (V). My own tool however, places an Isoleucine (L). Part of my process is to use collections.Counter to score the string, and in that column, I occurs 8 times, and V 7 times. Why is hmmemit choosing a lower frequency amino acid?

Basic process of my script (relevant parts only):

def enumerate_string(string):
    """Returns the most common characters of a string. Multiple characters are returned if there are equally frequent characters."""

    from collections import Counter
    counts = Counter(string)

    keys = []
    for key, value in counts.iteritems():
        if value == max(counts.values()):
            keys.append(key)

    return keys

>>> msa = AlignIO.read('myalignment.fasta', 'fasta')
>>> msa_summary = AlignInfo.SummaryInfo(msa)

 >>> Counter(msa_summary.get_column(22))
Counter({'I': 8, 'V': 7, 'L': 1})
>>> enumerate_string(msa_summary.get_column(22))
['I']

This clearly shows that Isoleucine is the most common, so why doesn't hmmemit choose it?

bipython hmm python consensus • 220 views
ADD COMMENTlink modified 21 days ago by cryptogenomicon70 • written 21 days ago by jrj.healey2.4k
3
gravatar for cryptogenomicon
21 days ago by
United States
cryptogenomicon70 wrote:

Because hmmemit is emitting the maximum likelihood sequence, according to the profile HMM's parameterization.

("majority-rule" in hmmemit means w.r.t. the profile HMM, not w.r.t. the MSA that the HMM was built from. hmmemit takes the HMM as input, not the MSA.)

ADD COMMENTlink written 21 days ago by cryptogenomicon70

Thanks Sean, that makes a lot of sense. I figured it was probably taking in to consideration positions in the alignment other than just the column in question, but evidently misunderstood what the majority rule was exactly doing.

Just to add a small follow up question, it's important for what I'm doing that there are no ambiguous positions in the final consensus sequence. Does hmmemit always call an amino acid/base or will it also output ambiguous positions?

ADD REPLYlink written 21 days ago by jrj.healey2.4k
1

hmmemit will only output canonical residues (20 aa or 4 nt).

ADD REPLYlink written 18 days ago by cryptogenomicon70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1419 users visited in the last hour