Question

Long terminal gaps in MAFFT alignment

0

Entering edit mode

6.3 years ago

Ghoti ▴ 90

I've got a peculiar problem where MAFFT amino acid pairwise alignments will produce long terminal gaps, separating a few (usually less than 5) residues with dozens of gaps when a more straightforward/logical alignment would entirely lack gaps. I do not believe it to be a problem with alignment scoring (I've tweeked gap penalties to ridiculous degrees and it will still produce terminal gaps). Instead, it appears to be a quirk of the code, resulting when two sequences differ greatly in length. I can amend the problem using regular expressions but otherwise I doubt I'll be able to avoid occasional/random terminal gaps. I'm curious though if anybody else can reproduce the problem I'm having (perhaps different versions of MAFFT lack this quirk). I'm currently using MAFFT v7.310 and here's my code:

from Bio.Align.Applications import MafftCommandline

def PW_MAFFT_align(sample_seq_AA, consensus_seq_AA):
    #create fasta file to be read by MAFFT
    sample_tList = [['>sample_seq_AA', sample_seq_AA], ['>consensus_seq_AA', consensus_seq_AA]]
    ex_file = open("/Users...fasta", "w")

    for items in sample_tList:
        ex_file.write(items[0] + "\n")
        ex_file.write(items[1] + "\n")

    ex_file.close()

    in_file = "/Users...fasta"
    mafft_exe = "/usr/local/bin/mafft"

    mafft_cline = MafftCommandline(mafft_exe, input=in_file, localpair=True, lexp=-1.5, lop=0.5) 

    stdout, stderr = mafft_cline()
    test_align = AlignIO.read(io.StringIO(stdout), "fasta")

    #delete temporary file
    os.remove("/Users...fasta")

    test_align_str = [str(test_align[0].seq), str(test_align[1].seq)]

    return(test_align_str)

sample_tList = [['>Frame 1', 'LSLNQLATEMKWGLCRASLTKLVNFLWMLSRNFWCPLLISSYFWPFCLASPSPAGWWSFASDWFAPRYSVRALPFTLSNYRRSYEAFLSQCQVDIPTWGVKHPLGILWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISNLDVVAHFQHLAAIEAETYKYLASRLPMLHNLRMTGSNVTIVYNSTLNQVFAIFPTSGSRPRLHDSQQWLIAVHSSIFSSVVASCTLFVVLWLRIPMLRSVFGFRWLGAIFLLNSRITRCVRLASPGRPLLRSMNPVGLFGAGGMTDAVRTTMTNGSWFRLASAKATPVFTPGWRSCHSATRPSSIPRYLGGTVKFMLTSRTNSFAPSTTGRTPPCLAMTTFQPYFRPTTNIRSTAVIGFTNGCAPSFPLGWFMFRGFSGVRLQAMFQFKSFRHQDQHYRSIRLCCPPGHQLPVWRLAPSDGSQELSVPHGDRDTRVHHHHSQCHRELFTFFSPHAFLLPFLCFDEKGIQSGIWQCVRHRGCVCLYQLRPTCQGVHPTLLGSRSCATASFHDTDHEVGNRFSLSFCHPTGNLNVQVCWGNAPRAVTRNCFLCGVSCRSVLLCSSTPAATAALIFSFITRYVSMAQIGWQKDLTGQWRLLSFFLCLTLFPMEHSPPAIFLTRLVSLCPPPGSITGGMSVVSMRSVLWLRFASSLGLRRTACPGATLVLDTPTSFWTLRADSIVGGRPLLRKGVRLKSRVTSTSKELCLMVPWQPLPEFQRNNGVVSRRLLPHGSTKGAFGVFHYLYASDDICSKGKSRPTARASAPFDLPELCFYLRVHDIRALSEHKGRAHYGGSSCTSLGGVLSHRNLEIHHLQMPFVLARPQVHSGPCPPRRKCRGLSSDCGKPRICRPASRLHYGRHIGARVEKPRVGWQKSCTGSGKPCQICQITTASSKRERRGTASQSISCARCWVRSSPNKTSPEARDRGRKIIREARRSPIFLRLKKMSGTTSPLVSGNCVCRRSRLPLTRAPGHVPCQIQGGVTLWSLVCRRIILCASASQHHPQHDELAFFGHLGVMIGRMCGEWHLTLCLVTYSIRATVWGSLIGENHAAAIKKKKKKKK'], ['>ORF2_GP2', 'MKWGLCKASLTKLANFLWMLSRSFWCPLLISSYFWPFCLASQSPVGWWSFASDWFAPRYSVRALPFTLSNYRRSYEAFLSQCQVDIPTWGVKHPLGVLWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISGLDVVAHFQHLAAIEAETCKYLASRLPMLHNLRLTGSNVTIVYNSTLDQVFAIFPTPGSRPKLHDFQQWLIAVHSSIFSSVAASCTLFVVLWLRIPMLRSVFGFRWLGATFLLNSW']]

test_PW_MAFFT_align_corr = PW_MAFFT_align(sample_tList[0][1], sample_tList[1][1])
for i in test_PW_MAFFT_align_corr:
    print(i)

My current output:

LSLNQLATEMKWGLCRASLTKLVNFLWMLSRNFWCPLLISSYFWPFCLASPSPAGWWSFASDWFAPRYSVRALPFTLSNY RRSYEAFLSQCQVDIPTWGVKHPLGILWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISNLDVVAHFQHL AAIEAETYKYLASRLPMLHNLRMTGSNVTIVYNSTLNQVFAIFPTSGSRPRLHDSQQWLIAVHSSIFSSVVASCTLFVVL WLRIPMLRSVFGFRWLGAIFLLNSRITRCVRLASPGRPLLRSMNPVGLFGAGGMTDAVRTTMTNGSWFRLASAKATPVFT PGWRSCHSATRPSSIPRYLGGTVKFMLTSRTNSFAPSTTGRTPPCLAMTTFQPYFRPTTNIRSTAVIGFTNGCAPSFPLG WFMFRGFSGVRLQAMFQFKSFRHQDQHYRSIRLCCPPGHQLPVWRLAPSDGSQELSVPHGDRDTRVHHHHSQCHRELFTF FSPHAFLLPFLCFDEKGIQSGIWQCVRHRGCVCLYQLRPTCQGVHPTLLGSRSCATASFHDTDHEVGNRFSLSFCHPTGN LNVQVCWGNAPRAVTRNCFLCGVSCRSVLLCSSTPAATAALIFSFITRYVSMAQIGWQKDLTGQWRLLSFFLCLTLFPME HSPPAIFLTRLVSLCPPPGSITGGMSVVSMRSVLWLRFASSLGLRRTACPGATLVLDTPTSFWTLRADSIVGGRPLLRKG VRLKSRVTSTSKELCLMVPWQPLPEFQRNNGVVSRRLLPHGSTKGAFGVFHYLYASDDICSKGKSRPTARASAPFDLPEL CFYLRVHDIRALSEHKGRAHYGGSSCTSLGGVLSHRNLEIHHLQMPFVLARPQVHSGPCPPRRKCRGLSSDCGKPRICRP ASRLHYGRHIGARVEKPRVGWQKSCTGSGKPCQICQITTASSKRERRGTASQSISCARCWVRSSPNKTSPEARDRGRKII REARRSPIFLRLKKMSGTTSPLVSGNCVCRRSRLPLTRAPGHVPCQIQGGVTLWSLVCRRIILCASASQHHPQHDELAFF GHLGVMIGRMCGEWHLTLCLVTYSIRATVWGSLIGENHAAAIKKKKKKKK

---------MKWGLCKASLTKLANFLWMLSRSFWCPLLISSYFWPFCLASQSPVGWWSFASDWFAPRYSVRALPFTLSNY RRSYEAFLSQCQVDIPTWGVKHPLGVLWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISGLDVVAHFQHL AAIEAETCKYLASRLPMLHNLRLTGSNVTIVYNSTLDQVFAIFPTPGSRPKLHDFQQWLIAVHSSIFSSVAASCTLFVVL WLRIPMLRSVFGFRWLGATFLLNS--(805 total gaps)--W--------------------

Note: the output was weird (involuntarily bolded)

Edit: I forgot to mention that by changing the last residue (Tryptophan) to near any other amino acid, the resulting alignment will not have a terminal gap.

python mafft alignment • 3.8k views

ADD COMMENT • link 6.3 years ago by Ghoti ▴ 90

0

Entering edit mode

I’ve seen this too with the MAFFT binary, but what the solution is, I have no idea...

ADD REPLY • link 6.3 years ago by Joe 21k

0

Entering edit mode

Thanks @jrj.healey for verifying my troubles.

Part of my problem is that I'm trying to align viral sequences in which recombination is reasonably likely. Although I think it's less likely to occur (and result in a functional polypeptide) at terminal ends, I haven't observed translate ends to have high conservation. This leaves me with a slight dilemma in possible treatments.

I can assume that any long terminal gaps perhaps numbering 10-20 fold greater than the remaining terminal residues can be assumed in error. This creates an oversimplification of the alignment scoring process, but is likely to be accurate barring recombination. Alternatively, I can narrow the scope of the alignment and hope the problem resolves itself upon realignment. Perhaps a combination of the two treatments would benefit me.

ADD REPLY • link 6.3 years ago by Ghoti ▴ 90

1

Entering edit mode

I’ve recently been exploring alignment of sequences viral in origin/ancestry (though not actually viruses), via progressiveMauve. It has an option for —colinear and segmented alignments which means that you can ‘allow’ it to detect recombined chunks. It may work better for what you want? The output is not a straightforward MSA or multifasta though (it outputs aligned segments), so it may depend what you need downstream..

ADD REPLY • link 6.3 years ago by Joe 21k

0

Entering edit mode

Thanks for the recommendation. I realize I've been a unnecessarily vague with my work. I'm attempting to create a completely automatic method for annotating translates (open reading frames and non-structural proteins) in single stranded positive sense viral swine pathogens (namely PRRSV). My current script is rather inelegant and can be improved substantially. I suspect it is currently only 80-90% accurate (based on comparison with given GenBank annotations). I've only really seriously considered using MAFFT so far since it's command line implementation is relatively straightforward and the analysis time is reasonable. I'm currently running my annotation script entirely on a laptop, but I wouldn't mind trying to multithread on a server (which should allow me to use other alignment programs with slower run times). I look forward to trying out porgressiveMauve, although I'll have to consider how/if I can process the segmented output.

ADD REPLY • link 6.3 years ago by Ghoti ▴ 90

0

Entering edit mode

What do you need the alignment for when calling ORFs? I may be missing the logic...

Incidentally, I would report your findings above to the mafft developers. I’d be interested in seeing what they say, particularly the point about tryptophan.

ADD REPLY • link 6.3 years ago by Joe 21k

0

Entering edit mode

I'm using the alignment to compare the consensus of an ORF to a raw PRRSV sequence. Alignment allows me to annotate the location of the the ORF in the raw sequence. From my understanding, alignments are the best way to find viral coding features assuming the products are known, not predicted.

I wouldn't mind contacting the MAFFT developers about this issue. I think I'll try to find additional examples first though. I can't recall if issues were limited to certain amino acids (Tryptophan in this instance). I think it happening at near random with a variety of residues causing errors depending on situation. Will report back if I hear anything compelling from the MAFFT developers.

ADD REPLY • link 6.3 years ago by Ghoti ▴ 90

1

Entering edit mode

Ah yeah, I see your logic. Posted late at night so clearly wasn't very awake!

If you're aligning many shorter ORFs, I've had good results with MUSCLE in the past (for large numbers of smaller alignment lengths), so that may be another option.

ADD REPLY • link 6.3 years ago by Joe 21k

score 1 · Answer 1 · 2018-01-29

Contacted the creator of MAFFT (Dr. Katoh), who suggested using a lep value of -0.5. This solved my terminal gap problem by "penalizing long internal gaps more strongly than terminal gaps. As a result, a terminal gap (W-----$) is inserted instead of the internal gap (-----W$) while the W/W match is lost." By default, MAFFT has a low internal gap penalty which makes the alignment more robust. "The default parameter of mafft is designed to allow long internal and terminal gaps, in order not to overlook conserved regions surrounded by long insertions/deletions."