Question: Needle Crashing on Specific Sequences
1
gravatar for juanfelipe2005
2.1 years ago by
juanfelipe200510 wrote:

Given two fasta files, attempting alignment:

==> seq1.fasta <==
>seq1
DVARNTGAYTS
==> seq2.fasta <==
>seq2
EVACNTGAYTS

Both needle and stretcher crash:

$/usr/local/bin/needle seq1.fasta seq2.fasta needle.out
Needleman-Wunsch global alignment of two sequences
Error: Sequence is not nucleic

Error: Unable to read sequence 'seq2.fasta'
Died: needle terminated: Bad value for '-bsequence' and no prompt

Error happens even when specifying

/usr/local/bin/needle seq1.fasta seq2.fasta needle.out -datafile EBLOSUM62 -auto Y -awidth3 999999999

Can narrow it down to this confusing example:

THIS WORKS

==> seq1.fasta <==
>seq1
SVAS
==> seq2.fasta <==
>seq2
SATA

THIS CRASHES NEEDLE

==> seq1.fasta <==
>seq1
SVAS
==> seq2.fasta <==
>seq2
SLTA

I can't find any reason why.

ADD COMMENTlink modified 2.1 years ago by 5heikki7.5k • written 2.1 years ago by juanfelipe200510
2

I got the same error (EMBOSS 6.3.1). What's interesting is that needle works fine when you change the order of your input sequences (providing seq2.fasta as the first one and then seq1.fasta).

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by a.zielezinski8.4k

... I thought this was a weird error, the order mattering makes it twice as much.

ADD REPLYlink written 2.1 years ago by juanfelipe200510
3
gravatar for 5heikki
2.1 years ago by
5heikki7.5k
Finland
5heikki7.5k wrote:

S,V, T and A are all valid IUPAC letters for nucleotides whereas L is not. If needle guesses nuc/prot based on seq1 (nevermind what substitution matrix you specify), then it's easy to explain a.zielezinski's observation since if you reverse the order (seq2 first) it can only be prot, whereas with seq1 it could be either: DVARNTGAYTS.. all these letters are valid IUPAC nucleotide code. However, the E in EVACNTGAYTS is not valid nucleotide code. You even get the error - "Error: Sequence is not nucleic"

tldr: Unless specified, needle probably guesses nuc/prot based on IUPAC letters and if there's not a single prot specific letter in seq1 then the guess is nuc.

Basically the program goes: Hmm, I sure wish user would have told me if these were nuc or prot seqs. Oh well, let's guess based on the first seq. First letter could be both, second letter could be both, .., all could be both. Hmm, I guess it's then likely that this is nuc seq. I sure wish the dev would have included A+T+G+C vs the rest frequency comparison. Oh well. Let's read seq 2. Oh what the hell, this is not nuc. Let's throw an error..

The fix? Use:

/usr/local/bin/needle seq1.fasta seq2.fasta needle.out -datafile EBLOSUM62 -auto Y -awidth3 999999999 -sprotein1 Y -sprotein2 Y
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by 5heikki7.5k
1

I always use "-sprotein" flag to indicate I am aligning protein sequences when using needle. This should indicate both are proteins and save a few keystrokes. e.g. "needle asequence=seq1.fasta bsequence=seq2.fasta gapopen=3 gapextend=1 outfile=needle.out -sprotein"

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Alastair Kerr5.2k

I just use muscle always..

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by 5heikki7.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1168 users visited in the last hour