Shamelessly sharing an older tool, ORFanage, for ORF annotation. It might be useful to others who work with transcriptome assemblies and genome annotation.
While longest-ORF, most-upstream-ORF, or de novo prediction approaches often work well, they can sometimes miss biologically relevant isoforms, introduce errors, or be inefficient for larger datasets. Our method addresses these issues by selecting the most biologically consistent ORF for each transcript based on similarity to reference proteins, using an efficient interval-based algorithm.
In short, ORFanage:
- Finds the most likely ORF for each transcript in a GTF/GFF file based on maximizing similarity to proteins in one or more reference annotations.
- Quantifies frame shifts and other changes relative to the reference. It can also be used to perform exhaustive comparisons of annotated proteins between annotations.
- Scales efficiently to very large datasets using an interval-based pseudo-alignment algorithm that avoids costly sequence comparisons in most cases.
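
To give a feel for what an interval-based comparison looks like, here is a minimal sketch in Python. It is not ORFanage's actual code: the function names (`intersect`, `frame_at`, `in_frame_overlap`, `best_orf`) are made up for illustration, it assumes plus-strand CDS chains given as sorted, inclusive genomic intervals (as in GTF), and it scores a candidate simply by the number of bases it shares with a reference CDS in the same reading frame. The point is that the score comes from coordinates alone, with no sequence alignment.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive genomic coordinates, plus strand assumed
Chain = List[Interval]      # sorted, non-overlapping CDS segments

def intersect(a: Chain, b: Chain) -> Chain:
    """Intersect two sorted, non-overlapping interval chains."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def frame_at(chain: Chain, pos: int) -> int:
    """Codon position (0, 1 or 2) of genomic position `pos` within a CDS chain."""
    offset = 0  # coding bases seen in earlier segments
    for start, end in chain:
        if start <= pos <= end:
            return (offset + pos - start) % 3
        offset += end - start + 1
    raise ValueError(f"position {pos} is not inside the chain")

def in_frame_overlap(candidate: Chain, reference: Chain) -> int:
    """Count bases shared by both chains that are read in the same frame."""
    total = 0
    for start, end in intersect(candidate, reference):
        # Each shared block lies inside one segment of each chain, so the
        # frame relationship is constant across the block.
        if frame_at(candidate, start) == frame_at(reference, start):
            total += end - start + 1
    return total

def best_orf(candidates: List[Chain], reference: Chain) -> Chain:
    """Pick the candidate ORF with the largest in-frame overlap."""
    return max(candidates, key=lambda c: in_frame_overlap(c, reference))

# Hypothetical coordinates, for illustration only.
reference = [(100, 189), (300, 389)]
same_frame = [(130, 189), (300, 389)]   # later start, same frame
shifted = [(101, 189), (300, 390)]      # start shifted by one base
chosen = best_orf([shifted, same_frame], reference)
```

In this toy example `same_frame` is chosen because all of its shared bases are in frame with the reference, while `shifted` overlaps almost the whole reference but in a different frame. A real implementation would also need to handle the minus strand, multiple reference proteins per locus, and tie-breaking.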
Additionally, we have recently added a small utility, ORFcompare, to perform all-vs-all comparisons of CDS records between multiple annotation sources.
When applied to large RNA-seq assemblies, ORFanage can help identify relevant transcripts and novel proteins, filter out noise, and take raw assemblies several steps closer to complete annotations. It can also highlight inconsistencies or possible corrections in reference annotations, something we observed when applying it to the RefSeq and GENCODE human datasets.
ORFanage and ORFcompare are both available on GitHub: https://github.com/alevar/ORFanage
You can also read more in the published study: https://pmc.ncbi.nlm.nih.gov/articles/PMC10718564/
Hope the methods are useful and easy to use!