Question

Error When Reading TIGR Format XML With BioPerl

1

10.5 years ago

Daniel ★ 4.0k

I am trying to read in the latest Arabidopsis genome so I can search it for DNA motifs. It comes in TIGR XML format, but the FTP site's README.txt says:

1) We do not have anticodon data available in all cases.
2) We have added element <BAC>.
3) We have changed the definition of non-coding RNAs to include exons and splice variants
Thus, validation against the TIGR DTD can fail.
Please use the "tairxml.dtd" file for validation.

According to the BioPerl SeqIO documentation for tigrxml, the format should work, but I'm getting this error:

 ./motif_hit ch2.xml

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: [1]Unknown or Invalid process directive:<?xml version="1.0" standalone="yes"?>
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::SeqIO::tigr::throw /usr/share/perl5/Bio/SeqIO/tigr.pm:1352
STACK: Bio::SeqIO::tigr::_process /usr/share/perl5/Bio/SeqIO/tigr.pm:205
STACK: Bio::SeqIO::tigr::_initialize /usr/share/perl5/Bio/SeqIO/tigr.pm:97
STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:358
STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:397
STACK: ./motif_hit:24
-----------------------------------------------------------

Is the tairxml.dtd causing the error and if so, how? I don't understand how to validate as it says.

My tester code:

#!/usr/bin/perl
use strict;
use warnings;

use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::IUPAC_v2;

my $usage = "USAGE:\t./motif_hit INFILE.xml <motif>";

my $infile = $ARGV[0] or die $usage . "\n";

my $in = Bio::SeqIO->new(-file => "$infile", -format => 'tigr');

while ( my $seq = $in->next_seq() ){
        print $seq->display_id;
}

bioperl xml • 3.8k views

ADD COMMENT • link updated 10.5 years ago by Istvan Albert 100k • written 10.5 years ago by Daniel ★ 4.0k

0

Try running xmllint ch2.xml and see whether it reports an error.

ADD REPLY • link 10.5 years ago by Pierre Lindenbaum 161k

0

I just got: "ch2.xml:1232431: error: xmlSAX2Characters: huge text node: out of memory". I think that means it's reading it right.

ADD REPLY • link 10.5 years ago by Daniel ★ 4.0k

0

No, that does not mean that the file is correct. SAX parsing operates on streams and will stop at an error before it has finished reading the file.

ADD REPLY • link 10.5 years ago by Istvan Albert 100k

0

So does that mean the download is corrupted? Sorry for the basic questions. EDIT: It appears the whole genome sequence sits inside the final XML tags, which I assume is why it gives the "huge text node: out of memory" message. Is this not accommodated, though?

ADD REPLY • link 10.5 years ago by Daniel ★ 4.0k

0

That makes sense; that certainly qualifies as a huge node.

ADD REPLY • link 10.5 years ago by Istvan Albert 100k
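
The "huge text node" message is consistent with libxml2's built-in cap on the size of a single text node, rather than with a damaged download. A minimal sketch, assuming an xmllint build that supports the --huge option (which relaxes those hardcoded parser limits); the function name check_wellformed is only illustrative:

```shell
# Check that an XML file is well-formed, with libxml2's hardcoded
# parser limits (including the text-node size cap) relaxed.
# --noout suppresses printing of the parsed document.
check_wellformed() {
    xmllint --noout --huge "$1"
}

# Example call (file name taken from the thread):
#   check_wellformed ch2.xml && echo "well-formed"
```

xmllint still builds the whole document in memory in this mode, so a chromosome-sized file needs a correspondingly large amount of RAM.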

score 0 · Answer 1 · 2013-10-24

0

10.5 years ago

Istvan Albert 100k

There may be two different things going on. The error that you get seems to complain about a process directive, which is usually the first line of an XML file.

But the xmllint program also seems to raise an error, though that is a different error altogether.

Long story short, I think your file is not in the format that the program expects, and it is also not complete.

If the file is not overly large, you can try opening it in a browser; that should give you nicely formatted output and may even indicate the exact location of the error.
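
For a file too large to open comfortably in a browser, the declaration quoted in the BioPerl exception can also be inspected from the shell. A small sketch using only standard tools; show_xml_header is a hypothetical helper name, and the 20-line window is an arbitrary assumption:

```shell
# Print the XML declaration and DOCTYPE lines found among the first
# 20 lines of a file, without reading the rest of it.
show_xml_header() {
    head -n 20 "$1" | grep -E '^[[:space:]]*<(\?xml|!DOCTYPE)'
}

# Example call (file name taken from the thread):
#   show_xml_header ch2.xml
```

Comparing the printed line with the one in the exception message shows whether the parser stopped on the very first line of the file.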

ADD COMMENT • link 10.5 years ago by Istvan Albert 100k

0

From opening it in a browser and tailing the file, it looks complete in that all the tags close. I think the format is just not what is expected, but as the link in my original post shows, tigr is listed as an accepted format. I think the "Please use the "tairxml.dtd" file for validation." step is the issue, but I don't know where or how to validate. Thanks for the help.

ADD REPLY • link 10.5 years ago by Daniel ★ 4.0k
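
On the "where/how to validate" question: DTD validation is a separate step done with an XML tool, outside the BioPerl script. A hedged sketch using xmllint's --dtdvalid option, assuming tairxml.dtd has been downloaded alongside the XML file; validate_with_dtd is an illustrative name, and --huge is included only because of the "huge text node" message reported earlier in the thread:

```shell
# Validate an XML file against an external DTD file.
#   $1 = path to the DTD, $2 = path to the XML file
# --noout suppresses the document dump; --huge relaxes libxml2's
# hardcoded parser limits so one very large text node is tolerated.
validate_with_dtd() {
    dtd="$1"
    xml="$2"
    xmllint --noout --huge --dtdvalid "$dtd" "$xml"
}

# Example call (file names taken from the thread):
#   validate_with_dtd tairxml.dtd ch2.xml
```

A zero exit status means the file validates against that DTD; it does not by itself show that the Bio::SeqIO tigr parser will accept the file.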