I am struggling with some problems reading Genebank files (.gbff) through Bioperl. I am trying to extract CDS and translation sequences using $feat->spliced_seq->seq and $feat->get_tag_values("translation")). My problem is that many of the genebank files are incomplete or are not matching the "correct" (example) format:
LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999 DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds. ... JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New Haven, CT, USA FEATURES Location/Qualifiers source 1..5028 /organism="Saccharomyces cerevisiae" /db_xref="taxon:4932" /chromosome="IX" /map="9" CDS <1..206 /codon_start=3 /product="TCP1-beta" /protein_id="AAA98665.1" /db_xref="GI:1293614" /translation="SSIYN...RTANRQHM" gene 687..3158 /gene="AXL2" CDS 687..3158 /gene="AXL2" /note="plasma membrane glycoprotein" /codon_start=1 /function="required for axial budding pattern of S. cerevisiae" /product="Axl2p" /protein_id="AAA98666.1" /db_xref="GI:1293615" /translation="MTQL...GRIPEML" gene complement(3300..4037) /gene="REV7" CDS complement(3300..4037) /gene="REV7" /codon_start=1 /product="Rev7p" /protein_id="AAA98667.1" /db_xref="GI:1293616" /translation="MNRWVEK...IFGSLF" ORIGIN 1 gatcctccat ata...tgatc //
If all files were in the correct format, it would be relatively straightforward to extract FASTA files with each gene or protein in the format
>FEAT | FEAT | FEAT | FEAT agcgtgacgattvcatgt...catgcatgcag
>FEAT | FEAT | FEAT | FEAT AYTKRCL...MRTCKYS
Many of the files that I have either do not have the "Origin" field at the bottom (example), or have multiple "Origin" fields (example), each just after a "CDS" field, resulting in warnings and die errors that prevent me from doing what I need to do. Most of the warnings indicate that Bioperl hasn't been able to infer the sequence (because they are lacking that "ORIGIN" field).
So my questions are the following:
1. How would you circumvent this kind of problem? What would you do?
2. Could you give me any tips on how I can find which of the files have this incorrect file format? I am figuring that a if($feat->spliced_seq->seq) fails, push those filenames to a list and manually download them again :( But I haven't been able to test this correctly yet, and maybe there is something already in Bioperl for these cases?
3. How can I prevent the automatic die everytime a warning comes out, so that I can find the whole list of files that is not designed as it should? Curiously, through the ~1000 files that I am running, the script runs for a few hundreds, outputing those errors but quits at some point. I must say that I have use autodie;in the preamble, but I think the die command is being given by Bioperl.
Note: this post was posted in Perlmonks a few days ago in here, but I'm trying to find a more efficient solution to what was posted there.