Forum: Review of the EVAL package (generating stats for GTF files)
liruiradiant (United States) wrote, 3.7 years ago:

In my recent research, I tried to use EVAL to generate statistics for a genome annotation file produced by a Cufflinks -> TransDecoder pipeline. However, I found the package disappointing.


My high expectations:

The EVAL package has very detailed documentation and promises to generate statistics on transcript length, CDS length, UTR length, and genome coverage. My goal was to evaluate my new assembly, especially in terms of the increase in 3UTR length. After reading the instructions for the tool, I thought I had found a great solution.

My disappointments:

  1. The EVAL package only supports the GTF2 format and requires start_codon/stop_codon entries:

    This is reasonable. So I used gffread -T to convert the GFF3 file I got from TransDecoder, then spent half a day writing scripts to add start/stop codon entries to the GTF file. I thought I was close to reaping the rewards.

  2. EVAL generates wrong 3UTR length information:

    Using the GTF2 file I generated, I got very complete 3UTR statistics. I was excited... until I checked the results. Here is a list of the annoying facts:

    a. It generates 3UTR information for sequences without a 3UTR.
    b. It warns about an 'overlapping exon region between stop codon and 3UTR'.
    c. The transcript length reported by EVAL includes intron length, which makes the result completely useless for me.
    d. validate_gtf.pl -f is supposed to fix formatting issues in any GTF2 file and infer UTR lines from 'exon' and 'CDS' information; however, it adds UTR information even to GTF2 files that provide no CDS information.
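
For point c, the number I actually wanted is the spliced transcript length, i.e. the sum of exon lengths with introns left out. A minimal sketch of that calculation in plain Python follows. It is not part of EVAL; the function names and the two example records are made up for illustration, and it assumes tab-separated GTF lines whose attribute column contains a quoted transcript_id.

```python
from collections import defaultdict
import re


def parse_gtf_lines(lines):
    """Yield (feature, start, end, strand, transcript_id) from GTF text lines.

    Hypothetical helper: skips comments, short lines, and lines
    without a transcript_id attribute.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        yield fields[2], int(fields[3]), int(fields[4]), fields[6], match.group(1)


def spliced_lengths(lines):
    """Sum exon lengths per transcript.

    GTF coordinates are 1-based and inclusive, so an exon spanning
    start..end contributes end - start + 1 bases.
    """
    totals = defaultdict(int)
    for feature, start, end, _strand, tid in parse_gtf_lines(lines):
        if feature == "exon":
            totals[tid] += end - start + 1
    return dict(totals)


# Made-up example: two exons of 100 and 150 bases separated by a 100-base intron.
example = [
    'chr1\tCufflinks\texon\t100\t199\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tCufflinks\texon\t300\t449\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]
print(spliced_lengths(example))  # {'t1': 250}
```

The genomic span of this example transcript is 350 bases (100 to 449), so a tool that reports 350 instead of 250 is counting the intron.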


My painful lesson learned: the EVAL package has serious bugs, even though the documentation looks so detailed and polished.

(EVAL package: Keibler, E., & Brent, M. R. (2003). Eval: a software package for analysis of genome annotations. BMC Bioinformatics, 4(1), 50. doi:10.1186/1471-2105-4-50)

modified 7 hours ago by liu98278850 • written 3.7 years ago by liruiradiant10

I don't find it very surprising that 13-year-old software does not support GFF3. I think the format was defined after EVAL was published, though it is a little hard to find a source for exactly when it was conceived. Most of the statistics you mention could potentially be trivial to implement with today's Bio* parsers or with Bioconductor. If you think it is still relevant for your project to calculate statistics for a genome annotation, you could provide a list of requirements, and we could check them against existing libraries and questions on Biostars.
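
As a rough illustration of how small such a calculation can be, here is a sketch of per-transcript 3UTR length derived from 'exon', 'CDS', and 'stop_codon' lines. It uses only the Python standard library rather than a Bio* parser or Bioconductor, the function name and example records are invented for illustration, and it rests on two assumptions: the stop codon is counted as coding sequence rather than UTR, and transcripts without any CDS line get no UTR value at all.

```python
from collections import defaultdict
import re


def three_prime_utr_lengths(lines):
    """Return {transcript_id: 3' UTR length} for transcripts that have a CDS.

    Assumptions (not taken from EVAL): coordinates are 1-based inclusive,
    the stop codon counts as coding, and transcripts with no CDS or
    stop_codon line are omitted instead of being given an inferred UTR.
    """
    exons = defaultdict(list)
    coding = defaultdict(list)
    strands = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        tid = match.group(1)
        feature, start, end = fields[2], int(fields[3]), int(fields[4])
        strands[tid] = fields[6]
        if feature == "exon":
            exons[tid].append((start, end))
        elif feature in ("CDS", "stop_codon"):
            coding[tid].append((start, end))

    result = {}
    for tid, intervals in coding.items():
        strand = strands[tid]
        if strand == "+":
            # Count exon bases strictly downstream of the last coding base.
            boundary = max(end for _, end in intervals)
            result[tid] = sum(
                max(0, end - max(start, boundary + 1) + 1)
                for start, end in exons[tid]
            )
        elif strand == "-":
            # On the minus strand the 3' end lies at lower coordinates.
            boundary = min(start for start, _ in intervals)
            result[tid] = sum(
                max(0, min(end, boundary - 1) - start + 1)
                for start, end in exons[tid]
            )
    return result


def _rec(feature, start, end, strand, tid):
    """Build one made-up GTF line for the example below."""
    return f'chr1\ttest\t{feature}\t{start}\t{end}\t.\t{strand}\t.\ttranscript_id "{tid}";'


records = [
    _rec("exon", 100, 199, "+", "t1"), _rec("exon", 300, 449, "+", "t1"),
    _rec("CDS", 150, 199, "+", "t1"), _rec("CDS", 300, 350, "+", "t1"),
    _rec("stop_codon", 351, 353, "+", "t1"),
    _rec("exon", 500, 599, "+", "t2"),  # no CDS: should get no UTR value
    _rec("exon", 1000, 1099, "-", "t3"), _rec("exon", 1200, 1299, "-", "t3"),
    _rec("CDS", 1050, 1099, "-", "t3"), _rec("CDS", 1200, 1250, "-", "t3"),
]
print(three_prime_utr_lengths(records))  # {'t1': 96, 't3': 50}
```

Transcript t2 has exons but no CDS, so it is left out of the result, which is the behaviour the original poster expected in point d.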

written 3.7 years ago by Michael Dondrup

Useful! Now I know not to waste time on it.

written 3.5 years ago by Hranjeev