Hello guys.
Some time ago I asked here whether there is an existing approach designed to extend the 3' terminus of genes by a provided length. I received no answers, apparently because there isn't one.
My team ran into this need during a 3' RNA-seq project against a poorly annotated genome: as you can imagine, it lacked a curated 3' annotation, so our reads frequently mapped outside gene regions.
This script extends the 3' terminus of each gene by a given value, but only if the extension does not overlap another gene on the same strand. The extension is applied to:
- Gene
- 1st transcript
- 1st exon
- 1st CDS
- 3' UTR
When no explicit 3' UTR is present, one is added. The script is written in Python 3 (it needs a Python version higher than 3.4) and uses no external libraries or modules.
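To make the rule above concrete, here is a minimal sketch of the gene-level step only. It is not the script itself: the `Gene` record, the function name `extend_three_prime` and the example coordinates are all made up for illustration, and I'm assuming 1-based inclusive coordinates as in GFF3. It also assumes that a gene is simply left unchanged when the extension would hit another gene on the same strand, and it does not check the end of the sequence on the plus strand because no sequence lengths are available here.

```python
from collections import namedtuple

# Hypothetical minimal gene record (not the script's real data model).
# Coordinates are assumed to be 1-based and inclusive, as in GFF3.
Gene = namedtuple("Gene", "seqid start end strand name")


def extend_three_prime(genes, length):
    """Return gene records with the 3' end extended by `length` bases.

    A gene is returned unchanged when the added region would overlap
    another gene on the same sequence and strand, or would start
    before position 1 (minus strand only).
    """
    extended = []
    for g in genes:
        if g.strand == "+":
            # On the plus strand the 3' end is the `end` coordinate.
            new_start, new_end = g.start, g.end + length
            added = (g.end + 1, new_end)
        else:
            # On the minus strand the 3' end is the `start` coordinate.
            new_start, new_end = g.start - length, g.end
            added = (new_start, g.start - 1)

        # Same-strand overlap test against every other gene.
        clash = new_start < 1 or any(
            o is not g
            and o.seqid == g.seqid
            and o.strand == g.strand
            and o.start <= added[1]
            and o.end >= added[0]
            for o in genes
        )
        extended.append(g if clash else g._replace(start=new_start, end=new_end))
    return extended


genes = [
    Gene("chr1", 100, 500, "+", "geneA"),
    Gene("chr1", 600, 900, "+", "geneB"),
    Gene("chr1", 550, 800, "-", "geneC"),
]
result = extend_three_prime(genes, 200)
for g in result:
    print(g.name, g.start, g.end, g.strand)
```

With these toy genes, geneA stays at 100-500 because a 200 bp extension would run into geneB on the same strand, geneB becomes 600-1100, and geneC (minus strand) becomes 350-800. The comparison against every other gene is quadratic, which is fine for a sketch; sorting genes per strand and looking only at the nearest neighbour would scale better on a full annotation.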
For now this is just an "exercise in style": what is really needed is an algorithm that does this with a data-driven approach. I've posted it here mainly to get suggestions on how to increase its accuracy and to start thinking about a data-driven way to redesign the 3' annotation of a genome.
Would it not be better to use RNA-seq, if available, together with Maker and/or Gmap to properly improve the annotations in the genome?
Hi, as explained, there's a need for a data-driven approach (I'm working on it). The problem here is 3' RNA-seq, which produces reads differently from a standard TruSeq library: annotation tools rely on reads spanning whole genes, so they're not designed to work with this sequencing strategy. Another problem is that most sequencing-by-synthesis strategies lack accuracy at the 3' terminus: these regions are frequently trimmed off during read QC.
So, in my experience, no existing tool is capable of improving the annotation of 3' termini with a 3' RNA-seq dataset. This script is nothing more than a temporary solution to add missing UTR features and to extend them without overlapping existing genes.
If you know of anything that fits our needs, please let me know.
Nope, I have not heard of anything like this, so your approach seems reasonable. I only know of Portcullis which deals with correcting fake alternative splice sites from RNA-seq.
I think there are a couple of typos in your description; fixing them might improve understanding. Cheers for posting this script for others.
"3' terminus of a genome by a provided lenght" --> 3' terminus of a transcript by a provided length.
I just want to say thank you. Your post is very helpful.