Question: Perl program: The sequence does not appear to be FASTA format (lacks a descriptor line '>')
dago (2.5k, Germany) wrote 4.6 years ago:

I get the following error when running a Perl program:

Use of uninitialized value $Bio::DB::NCBIHelper::HOSTBASE in concatenation (.) or string at /usr/share/perl5/Bio/DB/Query/GenBank.pm line 103.
Use of uninitialized value $Bio::DB::NCBIHelper::HOSTBASE in concatenation (.) or string at /usr/share/perl5/Bio/DB/Query/GenBank.pm line 104.
outDir: Test1/

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>')
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
STACK: Bio::SeqIO::fasta::next_seq /usr/share/perl5/Bio/SeqIO/fasta.pm:136
STACK: Guidance::name2codeFastaFrom1 /usr/local/lib/guidance.v1.5/www/Guidance/Guidance.pm:1220
STACK: /usr/local/lib/guidance.v1.5/www/Guidance/guidance.pl:445


However, I am quite sure that all my sequences are in FASTA format. Here is an example:

cat 2312_Ad2_02358.faa -A
>Ad2_02358 Chaperone protein ClpB$
MDFEKYTERARGFIQSAQTYALGQGHQQFTPAHILKVLLDDSEGMSAGLIERAGGRAQDVRLQIETDLAALPKVSGGNGQLYLSPEIARLFEQAEKIAEKAGDSYVTVERLLLALALDKGSQAGKALAQGGVTPSGLNEAINGLRKGRTADSASAENQYDALKKFAQDLTQAARDGKLDPVIGRDEEIRRAIQVLSRRTKNNPVLIGEPGVGKTAIAEGL

 

What am I missing here?

 

EDIT

Here is the file with the seqs

modified 4.6 years ago by jairly (0) • written 4.6 years ago by dago (2.5k)

Difficult to say what you are missing without seeing the complete file - the file itself, not a copy/paste here.

However, clearly you are missing something :)  You may be "quite sure" but the fasta parser is equally sure that at least one sequence is invalid - and in my experience, the parser is generally correct. Convincing yourself that you know better than the error message is a common mistake and it will not lead to solutions.

written 4.6 years ago by Neilfws (48k)

Agree with you. I added a link to the file containing the seqs, maybe I am missing something there.

modified 4.6 years ago • written 4.6 years ago by dago (2.5k)

If your file comes from a Windows machine, you might use dos2unix on your file to strip any extraneous Windows carriage return characters, which can interfere with parsing on UNIX platforms.
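
A quick way to test for this before converting anything, as a minimal sketch. The function names `has_crlf` and `strip_cr` are made up for illustration, and `strip_cr` is only a stand-in for when dos2unix is not installed:

```shell
# has_crlf FILE: succeed (exit status 0) if FILE contains at least one
# carriage return character.
has_crlf() {
    # printf '\r' produces a literal carriage return in any POSIX shell
    grep -q "$(printf '\r')" "$1"
}

# strip_cr FILE: write FILE to stdout with every carriage return removed.
strip_cr() {
    tr -d '\r' < "$1"
}
```

For example, `has_crlf 2312_Ad2_02358.faa && echo "carriage returns found"`. With `cat -A`, a Windows line ending shows up as `^M$` rather than a plain `$`; the header line pasted in the question ends in a plain `$`.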

written 4.6 years ago by Alex Reynolds (28k)

If all your headers contain numbers, you can check for a missing ">" in the header by running:

perl -lne 'if(/\d+/){$t++;print "$t\t$_" unless />/}' inputFile


or a cmd pipe equivalent, but without the actual code and an example it's hard to say.
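
A check that does not rely on the headers containing digits might look like the sketch below. It assumes the parser complains about the first thing it reads, so it only tests whether the first non-blank line starts with ">". The function names `first_record_ok` and `count_headers` are hypothetical helpers, not part of BioPerl or guidance:

```shell
# first_record_ok FILE: succeed if the first non-blank line of FILE starts
# with '>'. An empty or all-blank file counts as a failure.
first_record_ok() {
    awk 'NF { found = 1; ok = ($0 ~ /^>/); exit }
         END { exit !(found && ok) }' "$1"
}

# count_headers FILE: print the number of lines that start with '>'.
count_headers() {
    grep -c '^>' "$1"
}
```

Running `for f in *.faa; do first_record_ok "$f" || echo "$f"; done` would list every file that fails the check, and `count_headers` gives a number to compare against the expected sequence count.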

written 4.6 years ago by mxs (530)

@Alex Reynolds thanks, but all my files come from Unix. @mxs The file reported above contains only 6 sequences, and I checked them manually. There is always a `>` at the start of each sequence.

written 4.6 years ago by dago (2.5k)

Have you tried removing (replacing with underscore) blanks from the header? Otherwise I see no obvious "mistake".

written 4.6 years ago by mxs (530)

Thanks! I tried, but I get the same problem. The program I am using creates a folder for the results. If the folder name conflicts with an existing one (e.g. the same outDir name), the program crashes.

written 4.6 years ago by dago (2.5k)

Could you maybe explain this dirname conflict a bit please?

written 4.6 years ago by RamRS (24k)

Sure. I use guidance.pl and it asks me for an outDir name. If the directory name is the same as an existing one, I get the error; if not, it runs correctly.

written 4.6 years ago by dago (2.5k)

Guidance looks like a really complicated script+package. I'll run a local check on next_seq with your file. If it works, there's something wrong with either how guidance passes parameters or how you're using the tool. In the meantime, could you also update the question with the exact command you're running please? Thank you!

EDIT: I ran a simple Bio::Seq script on it and it works fine. We're probably looking at an error in usage or an untested anomaly in the guidance package.

modified 4.6 years ago • written 4.6 years ago by RamRS (24k)

This is the code that finally worked

for i in *.faa; do guidance.pl  --seqFile $i --msaProgram MUSCLE --seqType aa --outDir TEST/$i --muscle /usr/bin/muscle --proc_num 20 --datasets $i ; done

However, if I run the following, it processes the first sequence file and then gives me the error:

for i in *.faa; do guidance.pl  --seqFile $i --msaProgram MUSCLE --seqType aa --outDir test1_$i --muscle /usr/bin/muscle --proc_num 20 --datasets $i ; done

modified 4.6 years ago • written 4.6 years ago by dago (2.5k)

There's nothing in $1, $i is the loop variable.

written 4.6 years ago by RamRS (24k)

Sorry, there was a typo.

Also,

guidance.pl  --seqFile 2746_Ad2_02800.faa --msaProgram MUSCLE --seqType aa --outDir Gui --muscle /usr/bin/muscle --proc_num 20

It works, but if I try to run it again with the outDir Gui already there, it gives me the error above.
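
Until the cause is understood, one workaround is to never reuse an output directory. The sketch below only picks a name that does not exist yet; `fresh_outdir` is a made-up helper, and the guidance.pl call in the comment uses the same flags as the command above, nothing more:

```shell
# fresh_outdir BASE: print BASE if nothing exists at that path, otherwise the
# first of BASE_1, BASE_2, ... that does not exist yet.
fresh_outdir() {
    candidate=$1
    n=0
    while [ -e "$candidate" ]; do
        n=$((n + 1))
        candidate="${1}_${n}"
    done
    printf '%s\n' "$candidate"
}

# Hypothetical usage with the command from this thread:
# out=$(fresh_outdir Gui)
# guidance.pl --seqFile 2746_Ad2_02800.faa --msaProgram MUSCLE --seqType aa \
#     --outDir "$out" --muscle /usr/bin/muscle --proc_num 20
```

This does not explain why an existing directory triggers the FASTA error; it only avoids the situation. The name is not reserved, so two runs started at the same moment could still pick the same directory.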

modified 4.6 years ago • written 4.6 years ago by dago (2.5k)

That's strange, especially since guidance appears to tolerate existing folders, judging from the brief glance I gave the code. What is the output of:

set | grep "noclobber"

written 4.6 years ago by RamRS (24k)

I agree that guidance is probably the issue. I also tried a simple BioPerl script and got no errors with your file.

#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

my $seqio = Bio::SeqIO->new(-file => "2312_Ad2_02358.faa", -format => "fasta");

while(my $seq = $seqio->next_seq) {
    print $seq->display_id, "\n";
}

written 4.6 years ago by Neilfws (48k)

Someone once told me it's better to use `use warnings;` instead of `perl -w`.

Ref: C: How to copy all fasta-seqs from fasta-files with the seq-lengths between minlen

modified 4.6 years ago • written 4.6 years ago by RamRS (24k)

Some use `use warnings FATAL => 'all';` to make the script die on warnings. Seems like a good defensive approach.

written 4.6 years ago by Alex Reynolds (28k)

The script shouts KMN if it gets a papercut :)

modified 4.6 years ago • written 4.6 years ago by RamRS (24k)

jairly (0, United Kingdom) wrote 4.6 years ago:

Hi,

Maybe it is coming across other non-standard characters in the >fasta_header. I recently found the pipe "|" character in FASTA files, and it was causing me problems.

Do you have an example of the Perl script working on another FASTA file?

written 4.6 years ago by jairly (0)

Hi, thanks for the suggestion. I do not have "|" in my headers. I do not know why, but as I wrote in a few comments above, there was a problem with the directory.

written 4.6 years ago by dago (2.5k)