Error When Reading Tigr Format Xml With Bioperl
10.5 years ago
Daniel ★ 4.0k

I am trying to read in the latest Arabidopsis genome so that I can search it for DNA motifs. It comes in TIGR XML format, but the README.txt on the FTP site says:

1) We do not have anticodon data available in all cases.
2) We have added element <BAC>.
3) We have changed the definition of non-coding RNAs to include exons and splice variants
Thus, validation against the TIGR DTD can fail.
Please use the "tairxml.dtd" file for validation.

According to the BioPerl SeqIO documentation for tigrxml, the format should work, but I'm getting this error:

 ./motif_hit ch2.xml

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: [1]Unknown or Invalid process directive:<?xml version="1.0" standalone="yes"?>
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::SeqIO::tigr::throw /usr/share/perl5/Bio/SeqIO/tigr.pm:1352
STACK: Bio::SeqIO::tigr::_process /usr/share/perl5/Bio/SeqIO/tigr.pm:205
STACK: Bio::SeqIO::tigr::_initialize /usr/share/perl5/Bio/SeqIO/tigr.pm:97
STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:358
STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:397
STACK: ./motif_hit:24
-----------------------------------------------------------

Is the tairxml.dtd causing the error, and if so, how? I don't understand how to validate as the README says.

My tester code:

#!/usr/bin/perl
use strict;
use warnings;

use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::IUPAC_v2;

my $usage = "USAGE:\t./motif_hit INFILE.xml <motif>";

my $infile = $ARGV[0] or die $usage . "\n";

my $in = Bio::SeqIO->new(-file => "$infile", -format => 'tigr');

while ( my $seq = $in->next_seq() ){
        print $seq->display_id, "\n";
}
bioperl xml • 3.8k views

Try running xmllint ch2.xml and see if there is an error.

I just got: "ch2.xml:1232431: error: xmlSAX2Characters: huge text node: out of memory". I think that means it's reading it right.

No, that does not mean that the file is correct. SAX parsing operates on streams and will stop on an error before it finishes reading the file.

So does that mean the download was corrupted? Sorry for the basic questions. EDIT: It appears the whole genome is in the final XML tags, which I assume must be the reason it's giving the "huge text node: out of memory" message. Is this not accommodated, though?

That makes sense; that certainly qualifies as a huge node.
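
For anyone hitting the same "huge text node" message: libxml2 enforces built-in size limits on text nodes unless told otherwise, and xmllint has a --huge option that lifts them. Below is a minimal sketch of wrapping that call in Perl. The sub name xmllint_command is made up for illustration, the script assumes xmllint is on the PATH, and it has not been run against the TAIR chromosome files. Passing a DTD as the second argument adds --dtdvalid, which is one way to do the README's "use tairxml.dtd for validation" step.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the xmllint argument list (hypothetical helper).
#   --noout     do not echo the parsed document
#   --huge      lift libxml2's built-in parser size limits
#   --dtdvalid  validate against the named DTD file
sub xmllint_command {
    my ($xml, $dtd) = @_;
    my @cmd = ('xmllint', '--huge', '--noout');
    push @cmd, '--dtdvalid', $dtd if defined $dtd;
    return (@cmd, $xml);
}

# Only run xmllint when a file name is given on the command line.
if (@ARGV) {
    my @cmd = xmllint_command(@ARGV);
    print "Running: @cmd\n";
    # List-form system() bypasses the shell, so file names need no quoting.
    system(@cmd) == 0 or die "xmllint reported a problem (status $?)\n";
    print "No errors reported\n";
}
```

Called with just the XML file it only checks well-formedness; called with the XML file and tairxml.dtd it also validates. xmllint builds the whole document in memory by default, so a full chromosome may need a lot of RAM.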

10.5 years ago

There may be two different things going on. The error that you get seems to complain about a process directive (the XML declaration) that is usually the first line of the XML file.

But the xmllint program also seems to raise an error, though that is a different error altogether.

Long story short, I think your file is not in the format that the program expects, and it is also not complete.

If the file is not overly large, you can try opening it in a browser; that should give you nicely formatted output and may even indicate the exact location of the error.
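
The exception quotes the declaration itself, <?xml version="1.0" standalone="yes"?>, so one guess worth checking is whether the parser only accepts a declaration that carries nothing but a version attribute. That is an assumption, not something confirmed from the tigr.pm source. A small sketch for inspecting the first line of a file follows; classify_xml_declaration is a hypothetical helper and uses only core Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify the first line of an XML file (hypothetical helper).
# Returns 'bare' for a declaration with only a version attribute,
# 'extended' for a declaration with further attributes
# (for example encoding or standalone), and 'none' otherwise.
sub classify_xml_declaration {
    my ($line) = @_;
    return 'none' unless defined $line;
    return 'bare'     if $line =~ /^\s*<\?xml\s+version\s*=\s*["'][\d.]+["']\s*\?>/;
    return 'extended' if $line =~ /^\s*<\?xml\b[^>]*\?>/;
    return 'none';
}

# Only touch the filesystem when a file name is given on the command line.
if (@ARGV) {
    my $file = $ARGV[0];
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    my $first = <$fh>;
    close $fh;
    print "$file: ", classify_xml_declaration($first), " XML declaration\n";
}
```

If the file reports 'extended', comparing it against a copy whose first line is trimmed to the bare form would show whether the declaration alone is what trips the reader.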

From opening it in a browser and tailing the file, it looks complete, in that all the tags close. I think the format is just not what is expected, but as the link in my original post shows, tigr is listed as an accepted format. I think the 'Please use the "tairxml.dtd" file for validation.' step is the issue, but I don't know where or how to validate. Thanks for the help.
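
One detail in the question may matter here: the stack trace goes through Bio/SeqIO/tigr.pm and the tester code asks for -format => 'tigr', while the documentation mentioned is for tigrxml. BioPerl ships these as two separate readers, so a cheap experiment is to request the other one. The sketch below assumes Bio::SeqIO::tigrxml is installed with your BioPerl; open_tigrxml is a made-up helper name, and whether this reader copes with TAIR's modified files is untested here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Open a file with the 'tigrxml' reader rather than 'tigr' (hypothetical helper).
# Bio::SeqIO is loaded on demand, so the script still compiles without BioPerl.
sub open_tigrxml {
    my ($file) = @_;
    require Bio::SeqIO;
    return Bio::SeqIO->new(-file => $file, -format => 'tigrxml');
}

# Only try to parse when a file name is given on the command line.
if (@ARGV) {
    my $in = open_tigrxml($ARGV[0]);
    while (my $seq = $in->next_seq) {
        my $id = $seq->display_id;
        print defined $id ? $id : '(no id)', "\n";
    }
}
```

If this reader fails too, the error text should at least differ from the "process directive" message, which helps separate a reader problem from a problem with the file.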
