Some time ago I asked here whether there is an existing tool for extending the 3' terminus of genes by a given length. I received no answers, apparently because there isn't one. My team ran into this need during a 3' RNA-seq project against a poorly annotated genome: as you can imagine, it lacked a curated 3' annotation, so our reads frequently mapped outside the annotated gene regions.
This script extends the 3' terminus of each gene by a given value, but only if the extension does not overlap another gene on the same strand. The extension is applied to:
- 1st transcript
- 1st exon
- 1st CDS
- 3' UTR
When no explicit 3' UTR is present, one is added. The script is written in Python 3 (it needs a version higher than 3.4) and uses no external libraries or modules.
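To make the overlap rule concrete, here is a minimal sketch of the core idea, not the script itself. It assumes 1-based inclusive coordinates as in GFF/GTF, and the `Gene` record and the `extend_three_prime` function are hypothetical names used only for illustration. In this sketch a gene is left untouched when the full extension would hit another gene on the same strand, and the extension is not clipped to the sequence length; both are simplifying assumptions.

```python
from collections import namedtuple

# Hypothetical minimal gene record (not the script's real data structure).
# Coordinates are 1-based and inclusive, as in GFF/GTF.
Gene = namedtuple("Gene", ["gene_id", "seqid", "strand", "start", "end"])


def extend_three_prime(genes, length):
    """Return a new list of genes with the 3' end extended by `length` bp.

    A gene is returned unchanged if the added region would overlap
    another gene on the same sequence and the same strand.
    """
    extended = []
    for g in genes:
        # The 3' end is the `end` coordinate on "+" and the `start` on "-".
        if g.strand == "+":
            region_start, region_end = g.end + 1, g.end + length
        else:
            region_start, region_end = max(1, g.start - length), g.start - 1

        # Only the newly added region is tested for overlaps.
        blocked = any(
            o is not g
            and o.seqid == g.seqid
            and o.strand == g.strand
            and o.start <= region_end
            and o.end >= region_start
            for o in genes
        )

        if blocked:
            extended.append(g)
        elif g.strand == "+":
            extended.append(g._replace(end=region_end))
        else:
            extended.append(g._replace(start=region_start))
    return extended


genes = [
    Gene("a", "chr1", "+", 100, 200),
    Gene("b", "chr1", "+", 250, 400),
    Gene("c", "chr1", "-", 500, 600),
    Gene("d", "chr1", "-", 150, 220),
]
for g in extend_three_prime(genes, 100):
    print(g.gene_id, g.strand, g.start, g.end)
```

With these toy genes, `a` stays as it is because a 100 bp extension would run into `b` on the same strand, while `b` is extended even though `c` lies right after it, since `c` is on the opposite strand. The all-against-all comparison is quadratic; for a real annotation you would probably sort the genes per sequence and strand and check only the nearest downstream neighbour.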
For now this is just an "exercise in style", because what is really needed is an algorithm that can do this with a data-based approach. I've posted it here mainly to get suggestions on how to increase its accuracy and to start thinking about a data-based approach to redesigning the 3' annotation of a genome.