In my recent research, I tried to use Eval to generate statistics for a genome annotation file produced by a Cufflinks -> TransDecoder pipeline. However, I found the package disappointing.
My high expectations:
The Eval package has very detailed documentation and promises to generate statistics on transcript length, CDS length, UTR length, and genome coverage. My goal was to evaluate my new assembly, especially in terms of the increase in 3' UTR length. After reading the instructions for this tool, I thought I had found a great solution.
The Eval package only supports the GTF2 format and requires start_codon/stop_codon entries:
This is reasonable. So I used gffread -T to convert the GFF3 file I got from TransDecoder to GTF, then spent half a day writing scripts to add start/stop codon entries to the GTF file. I thought I was close to reaping the rewards.
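For readers facing the same conversion step, here is a minimal sketch of the core of such a script. It is not the script I actually used. The function name codon_intervals is made up for illustration, coordinates are 1-based and inclusive as in GTF, and it assumes the CDS intervals of a complete ORF still include the stop codon (the usual GFF3 convention), so for strict GTF2 the CDS would also have to be trimmed by those three bases.

```python
def codon_intervals(cds, strand, which):
    """Return the genomic intervals covering the first ("start") or last
    ("stop") three coding bases of a transcript.

    cds    -- list of (start, end) CDS intervals, 1-based inclusive
    strand -- "+" or "-"
    which  -- "start" or "stop"

    A codon split across two exons yields more than one interval.
    """
    ordered = sorted(cds)
    # The start codon sits at the low-coordinate end on "+" and at the
    # high-coordinate end on "-"; the stop codon is the other way round.
    from_left = (which == "start") == (strand == "+")
    if not from_left:
        ordered = ordered[::-1]

    need = 3
    out = []
    for start, end in ordered:
        take = min(need, end - start + 1)
        if from_left:
            out.append((start, start + take - 1))
        else:
            out.append((end - take + 1, end))
        need -= take
        if need == 0:
            break
    return sorted(out)
```

Each returned interval would then be written out as its own start_codon or stop_codon line, reusing the seqname, strand and attribute columns of the transcript. Partial ORFs (which TransDecoder can report) have no real start or stop codon and should be skipped rather than passed to this function.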
Eval generates wrong 3' UTR length information:
Using the GTF2 file I generated, I got very complete 3' UTR statistics. I was excited... until I checked the results. Here is a list of the annoying facts:
a. It generates 3' UTR information for sequences that have no 3' UTR.
b. It warns about an 'overlapping exon region between stop codon and 3UTR'.
c. The transcript length reported by Eval includes intron length, which makes the result completely useless for me.
d. validate_gtf.pl -f is supposed to fix formatting issues in any GTF2 file and infer UTR lines from the 'Exon' and 'CDS' information. However, it adds UTR information even to GTF2 files that contain no CDS information.
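As a cross-check on points a and c, the numbers I actually wanted can be computed directly from the exon and CDS lines. The following is only a sketch under stated assumptions, not part of Eval: the function name spliced_lengths is made up, the input is tab-separated GTF with a transcript_id attribute, exons of a transcript do not overlap, and stop_codon lines (if present) are counted as coding so that the 3' UTR begins after the stop codon.

```python
import re
from collections import defaultdict

def spliced_lengths(gtf_lines):
    """Map transcript_id -> (spliced length, 3' UTR length or None).

    The spliced length is the sum of exon lengths, so introns are excluded.
    The 3' UTR length is None when the transcript has no CDS lines.
    """
    exons = defaultdict(list)
    coding = defaultdict(list)
    strand = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if not match:
            continue
        tid = match.group(1)
        feature, start, end = fields[2], int(fields[3]), int(fields[4])
        if feature == "exon":
            exons[tid].append((start, end))
            strand[tid] = fields[6]
        elif feature in ("CDS", "stop_codon"):
            coding[tid].append((start, end))

    result = {}
    for tid, ex in exons.items():
        length = sum(e - s + 1 for s, e in ex)
        utr3 = None
        cds = coding.get(tid)
        if cds:
            if strand[tid] == "+":
                # exonic bases to the right of the last coding base
                last = max(e for _, e in cds)
                utr3 = sum(max(0, e - max(s, last + 1) + 1) for s, e in ex)
            else:
                # on the minus strand the 3' UTR lies at lower coordinates
                first = min(s for s, _ in cds)
                utr3 = sum(max(0, min(e, first - 1) - s + 1) for s, e in ex)
        result[tid] = (length, utr3)
    return result
```

It can be called as spliced_lengths(open("annotation.gtf")), where annotation.gtf is a placeholder file name. A transcript with no CDS comes back with None instead of an invented UTR length.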
My painful lesson learned: the Eval package has serious bugs, even though its documentation looks so detailed and polished.
(Eval package: Keibler, E., & Brent, M. R. (2003). Eval: a software package for analysis of genome annotations. BMC Bioinformatics, 4(1), 50. doi:10.1186/1471-2105-4-50)