Question

Tool: Extend 3' UTR of a GTF file

2

3.5 years ago

Shred ★ 1.4k

Hello guys.

Some time ago I asked here whether an existing tool could extend the 3' terminus of genes by a given length. I received no answers, apparently because no such tool exists.

My team ran into this need during a 3' RNA-seq project on a poorly annotated genome. As you can imagine, it lacked a curated 3' annotation, so our reads frequently mapped outside gene regions.

This script extends the 3' terminus of each gene by a given value, but only if the extension does not overlap another gene on the same strand. The extension is applied to:

Gene
1st transcript
1st exon
1st CDS
3' UTR

When no explicit 3' UTR is present, one is added. The script is written in Python 3 (it needs a version higher than 3.4) and uses no external libraries or modules.
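To make the overlap rule concrete, here is a minimal sketch of the idea, not the linked script itself. It assumes 1-based inclusive GTF coordinates, works on gene records only, and skips the extension entirely when any same-strand gene on the same chromosome would be hit. The `Gene` tuple and the `extend_three_prime` function are hypothetical names used only for illustration, and chromosome lengths are not checked.

```python
from collections import namedtuple

# Hypothetical minimal gene record (1-based, inclusive coordinates as in GTF).
Gene = namedtuple("Gene", "gene_id chrom start end strand")


def extend_three_prime(genes, length):
    """Return a new list where each gene's 3' end is extended by `length` bp,
    unless the added region overlaps another gene on the same strand."""
    extended = []
    for g in genes:
        if g.strand == "+":
            # On the plus strand the 3' end is the `end` coordinate.
            new_start, new_end = g.start, g.end + length
            ext_lo, ext_hi = g.end + 1, new_end
        else:
            # On the minus strand the 3' end is the `start` coordinate.
            new_start, new_end = max(1, g.start - length), g.end
            ext_lo, ext_hi = new_start, g.start - 1

        # Assumption: any same-strand overlap blocks the whole extension.
        blocked = any(
            o is not g
            and o.chrom == g.chrom
            and o.strand == g.strand
            and o.start <= ext_hi
            and o.end >= ext_lo
            for o in genes
        )
        if blocked or ext_lo > ext_hi:
            extended.append(g)  # leave the gene untouched
        else:
            extended.append(g._replace(start=new_start, end=new_end))
    return extended
```

In a full tool the same coordinate shift would also have to be propagated to the transcript, exon, CDS and 3' UTR records of each gene; this sketch only covers the gene-level decision.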

For now this is just an "exercise in style", because what is really needed is an algorithm that does this with a data-driven approach. I've posted it here mainly to get suggestions on how to increase its accuracy and to start thinking about a data-driven approach to redesigning the 3' annotation of a genome.

Script

annotation python RNA-Seq GTF • 2.7k views

updated 10 months ago by Ram 43k • written 3.5 years ago by Shred ★ 1.4k

0

Would it not be better to use RNA-seq data, if available, together with MAKER and/or GMAP to properly improve the annotations in the genome?

3.5 years ago by colindaven 6.3k

0

Hi, as explained, there's a need for a data-driven approach (I'm working on it). The problem here is that 3' RNA-seq produces reads differently from a standard TruSeq library: annotation tools rely on reads spanning whole genes, so they're not designed to work with this sequencing strategy. Another problem is that most sequencing-by-synthesis strategies lack accuracy at the 3' terminus: these regions are frequently trimmed off during read QC.
So, in my experience, no existing tool has been able to improve 3' termini annotation with a 3' RNA-seq dataset. This is nothing more than a temporary solution to add missing UTR features and to extend them without overlapping existing genes.

If you know of something that fits our needs, please let me know.

3.5 years ago by Shred ★ 1.4k

1

Nope, I have not heard of anything like this, so your approach seems reasonable. I only know of Portcullis which deals with correcting fake alternative splice sites from RNA-seq.

I think there are a couple of typos in your description; fixing them might improve understanding. Cheers for posting this script for others.

"3' terminus of a genome by a provided lenght" --> 3' terminus of a transcript by a provided length.

3.5 years ago by colindaven 6.3k

0

I just want to say thank you. Your post is very helpful.

2.8 years ago by realismsy • 0

score 0 · Answer 1 · 2021-10-24

Hi, I believe the ESAT tool is designed to do something similar to what you are looking for. The authors applied the tool to a scRNA-seq dataset, but indicate that it can also be used on bulk experiments. Reference below.

Derr, Alan, et al. "End Sequence Analysis Toolkit (ESAT) expands the extractable information from single-cell RNA-seq data." Genome Research 26.10 (2016): 1397-1410.

https://github.com/garber-lab/ESAT