Question

How Can We Find The Info For 3'Utr And 5'Utr In Gencode Gtf File?

5

Entering edit mode

10.8 years ago

J.F.Jiang ▴ 920

Hi all,

I want to extract the genomic region of 3'UTR and 5'UTR from the GENCODE gtf annotation file to make a bed file for my analysis.

However, it seems that the gtf file did not seperate the 3'UTR and 5'UTR, rather they report UTR instead.

Although the UCSC table browser offers us these info, it is from v14, not the latest. And I really want to know how can they obtain such information?

If anyone knows the process pipeline, could you kindly tell me the flowchart?

Thanks.

utr • 14k views

ADD COMMENT • link updated 5 days ago by cmdcolin ★ 3.8k • written 10.8 years ago by J.F.Jiang ▴ 920

1

Entering edit mode

Could you please provide a link that someone can download such a GTF file to experiment with? You'll get a response much faster.

ADD REPLY • link 10.7 years ago by polarise ▴ 380

score 8 · Answer 1 · 2013-09-02

8

Entering edit mode

10.7 years ago

dario.garvan ▴ 520

It's not that easy. The GTF only has a feature category called UTR. The end user has to figure out if it is a 5' or 3' UTR. Here is a reproducible example in R of categorising them. Note that some protein-coding transcripts have a 5' UTR and no 3' UTR, a 3' UTR and no 5' UTR, or have neither UTR. It is important to remember the GENCODE annotation is a collection of transcript *fragments*, not full-length models.

library(GenomicRanges) # From Bioconductor.

genes <- read.table("gencode.v17.annotation.gtf", sep = '\t', skip = 5, stringsAsFactors = FALSE)
whichCodingTranscripts <- genes[, 3] == "transcript" & grepl("transcript_type protein_coding", genes[, 9], fixed = TRUE)
proteinTranscripts <- genes[whichCodingTranscripts, ]
strands <- proteinTranscripts[, 7]
allFeaturesTranscripts <- gsub("transcript_id ", '', sapply(strsplit(genes[, 9], "; "), '[', 2))
proteinTranscriptsNames <- allFeaturesTranscripts[whichCodingTranscripts]
whichCDS <- genes[, 3] == "CDS" & allFeaturesTranscripts %in% proteinTranscriptsNames
transcriptsCDS <- genes[whichCDS, ]
transcriptsCDS <- split(GRanges(transcriptsCDS[, 1], IRanges(transcriptsCDS[, 4], transcriptsCDS[, 5]), transcriptsCDS[, 7]),
                    factor(allFeaturesTranscripts[whichCDS], levels = proteinTranscriptsNames))
firstCDS <- mapply(function(CDS, strand) {if(strand == '+') {CDS[1]} else {CDS[length(CDS)]}}, transcriptsCDS, strands)
lastCDS <-  mapply(function(CDS, strand) {if(strand == '+') {CDS[length(CDS)]} else {CDS[1]}}, transcriptsCDS, strands)
whichUTR <- genes[, 3] == "UTR" & allFeaturesTranscripts %in% proteinTranscriptsNames
transcriptsUTR <- genes[whichUTR, ]
transcriptsUTR <- split(GRanges(transcriptsUTR[, 1], IRanges(transcriptsUTR[, 4], transcriptsUTR[, 5]), transcriptsUTR[, 7]),
                    factor(allFeaturesTranscripts[whichUTR], levels = names(firstCDS)))

transcriptsUTR5 <- mapply(function(UTR, CDS, strand)
               {        
                 if(strand == '+') UTR[UTR < CDS[1]] else UTR[UTR > CDS[length(CDS)]]
               }, transcriptsUTR, firstCDS, as.list(strands), SIMPLIFY = FALSE)

transcriptsUTR3 <- mapply(function(UTR, CDS, strand)
               {        
                 if(strand == '+') UTR[UTR > CDS[length(CDS)]] else UTR[UTR < CDS[1]]
               }, transcriptsUTR, firstCDS, as.list(strands), SIMPLIFY = FALSE)

ADD COMMENT • link 10.7 years ago by dario.garvan ▴ 520

0

Entering edit mode

Thanks, I have not updated my post, I finally used perl to extract the UTR information, and made a similar formula to get 3 and 5.

But very thanks for the R script, will compare the result with my result

ADD REPLY • link 10.7 years ago by J.F.Jiang ▴ 920

0

Entering edit mode

And now we can also get this data from UCSC table browser, since it already has these info for V17

ADD REPLY • link 10.7 years ago by J.F.Jiang ▴ 920

0

Entering edit mode

Not all organisms are annotated with the UTRs. Some has only xenoRefSeq tracks whereby they arbitrarily annotated the UTRs as 200bp upstream/downstream of the start/end codon.

ADD REPLY • link 9.0 years ago by wanziyi89 ▴ 60

0

Entering edit mode

the gencode link is broken now but here is a back up of that blogpost on archive https://web.archive.org/web/20130618221342/http://gencodegenes.wordpress.com:80/2012/11/20/gencode-why-are-there-partial-transcripts/

full text just for posterity, i'm not sure how much of it still applies to the latest gencode work though

"If you are just getting acquainted with GENCODE you may be wondering how our genebuild differs from others that are publically available.

Firstly, GENCODE annotation is based on the reference genome sequence and transcript/protein alignments, unlike the Refseq set which is cDNA driven. A major advantage of genomic annotation is that it does not depend on the availability of full length cDNAs; we capture a significant number of alternatively spliced variants from the use of partial transcripts e.g. ESTs. Ultimately, our goal is to capture the entire human transcriptome. However, the quality of the underlying genome sequence is crucial to our strategy; currently there are only three vertebrate genomes (human, mouse, and zebrafish) completed to a standard that warrants manual annotation.

For example, the ST7 “suppression of tumorigenicity 7” locus is represented by 19 different protein-coding transcripts in the GENCODE geneset (A), compared with just 2 in RefSeq (indicated by arrows in B). However, 8 of the GENCODE transcripts are partial length (indicated by *), having been generated from ESTs. It is our policy not to extend transcript models beyond the 5’ and 3’ limits of the evidence on which they are based. This is because we consider the true structure of the full length transcripts to be unpredictable, given our knowledge of alternative splicing. For this reason, GENCODE also contains partial coding sequences. In a later post, we will discuss why we feel it is important to capture transcriptional complexity in GENCODE."

ADD REPLY • link 5 days ago by cmdcolin ★ 3.8k

score 3 · Answer 2 · 2013-08-31

3

Entering edit mode

10.7 years ago

Michael 54k

That shouldn't be too difficult to figure, given you have UTRs annotated.

5'-UTRs are those UTRs at the 5' end of a transcript
3'-UTRS are those at the 3' end of a transcript

:)

ADD COMMENT • link 10.7 years ago by Michael 54k

0

Entering edit mode

I guess what OP has to do:

"transcript 3' coordinate" - "CDS 3' coordinate" = 3' UTR

ADD REPLY • link 10.7 years ago by PoGibas 5.1k

score 2 · Answer 3 · 2022-02-04

2

Entering edit mode

2.2 years ago

M. Hasan Celik ▴ 20

You can use gencode_utr_fix to update your gtf. It replaces UTR features with five_prime_utr and three_prime_utr features. For example:

gencode_utr_fix --input_gtf gencode.v29.annotation.gtf --output_gtf gencode.v29.annotation_utr.gtf

ADD COMMENT • link 2.2 years ago by M. Hasan Celik ▴ 20

score 1 · Answer 4 · 2024-04-26

1

Entering edit mode

6 days ago

Jalil Sharif ▴ 80

The updated code in R using rtracklayer for reading a gtf file.

https://github.com/jalilsharif/gtf_utr_fix_r/blob/main/gencode_utr_fix_datatable.R

ADD COMMENT • link 6 days ago by Jalil Sharif ▴ 80

score 1 · Answer 5 · 2024-04-26

This is way easier to do solely in R now 10 years later.

library(GenomicFeatures)
library(rtracklayer)

## import GTF
gtf <- makeTxDbFromGFF('gencode.v45.annotation.gtf')

##drop non-standard chromosomes
gtf <- keepStandardChromosomes(gtf,pruning='coarse')

## get 3' utrs
utr3 <- threeUTRsByTranscript(gtf)

## get 5' utrs
utr5 <- fiveUTRsByTranscript(gtf)

## export example
export.bed(utr5,con='gencode_5prime_UTRs.bed')