I am trying to read in the latest Arabidopsis genome so I can search it for DNA motifs. It comes in TIGR XML format, but the FTP site's README.txt says:
1) We do not have anticodon data available in all cases.
2) We have added element <BAC>.
3) We have changed the definition of non-coding RNAs to include exons and splice variants
Thus, validation against the TIGR DTD can fail.
Please use the "tairxml.dtd" file for validation.
According to the BioPerl SeqIO documentation for tigrxml, the format should work, but I'm getting this error:
./motif_hit ch2.xml
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: [1]Unknown or Invalid process directive:<?xml version="1.0" standalone="yes"?>
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::SeqIO::tigr::throw /usr/share/perl5/Bio/SeqIO/tigr.pm:1352
STACK: Bio::SeqIO::tigr::_process /usr/share/perl5/Bio/SeqIO/tigr.pm:205
STACK: Bio::SeqIO::tigr::_initialize /usr/share/perl5/Bio/SeqIO/tigr.pm:97
STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:358
STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:397
STACK: ./motif_hit:24
-----------------------------------------------------------
Is the tairxml.dtd causing the error and if so, how? I don't understand how to validate as it says.
My tester code:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::IUPAC_v2;
my $usage = "USAGE:\t./motif_hit INFILE.xml <motif>";
my $infile = $ARGV[0] or die $usage . "\n";
my $in = Bio::SeqIO->new(-file => "$infile", -format => 'tigr');
# Print the ID of each sequence record, one per line
while ( my $seq = $in->next_seq() ) {
    print $seq->display_id, "\n";
}
Try running
xmllint ch2.xml
and see if there is an error.
I just got: "ch2.xml:1232431: error: xmlSAX2Characters: huge text node: out of memory". I think that means it's reading it right.
No, that does not mean the file is correct. SAX parsing operates on streams and stops at the first error, before it has finished reading the file.
So does that mean the download was corrupted? Sorry for the basic questions. EDIT: It appears the whole genome is inside the final XML tags, which I assume is why it gives the "huge text node: out of memory" message. Is this not accommodated, though?
That makes sense; that certainly qualifies as a huge node.
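For anyone hitting the same message, here is a rough sketch of how the well-formedness check and the DTD validation mentioned in the README might be run with xmllint. The function name check_xml is made up for illustration. The flags are standard xmllint options: --noout suppresses the echoed document, --dtdvalid validates against a named DTD, and --huge relaxes libxml2's built-in parser limits, which is what the "huge text node" error appears to be hitting. Whether ch2.xml then validates cleanly against tairxml.dtd is not something this thread confirms.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative, not part of any tool):
# check well-formedness first, then optionally validate against a DTD.
# Usage: check_xml FILE [DTD]
check_xml() {
    file=$1
    dtd=${2:-}

    # Step 1: well-formedness only.
    # --noout : do not print the parsed document back out
    # --huge  : lift libxml2's hard-coded limits, including the cap on
    #           the size of a single text node
    xmllint --noout --huge "$file" || return 1

    # Step 2: validate against the given DTD, if one was supplied.
    # --dtdvalid uses the named DTD instead of any DTD referenced
    # from inside the document itself.
    if [ -n "$dtd" ]; then
        xmllint --noout --huge --dtdvalid "$dtd" "$file" || return 2
    fi

    echo "OK: $file"
}

# Example call with the file names from this thread (assumes both files
# are in the current directory):
# check_xml ch2.xml tairxml.dtd
```

A return status of 1 would point at a malformed or truncated file, while 2 would mean the file parses but does not match the DTD. Note that even with --huge, xmllint still loads the whole document into memory, so a machine with little RAM may still struggle with a full chromosome.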