Question: How Can We Find The Info For 3'Utr And 5'Utr In Gencode Gtf File?
2
gravatar for J.F.Jiang
5.8 years ago by
J.F.Jiang750
China
J.F.Jiang750 wrote:

Hi all,

I want to extract the genomic region of 3'UTR and 5'UTR from the GENCODE gtf annotation file to make a bed file for my analysis.

However, it seems that the gtf file did not seperate the 3'UTR and 5'UTR, rather they report UTR instead.

Although the UCSC table browser offers us these info, it is from v14, not the latest. And I really want to know how can they obtain such information?

If anyone knows the process pipeline, could you kindly tell me the flowchart?

Thanks.

utr • 7.0k views
ADD COMMENTlink modified 5.6 years ago by dario.garvan440 • written 5.8 years ago by J.F.Jiang750
1

Could you please provide a link that someone can download such a GTF file to experiment with? You'll get a response much faster.

ADD REPLYlink written 5.6 years ago by polarise380
4
gravatar for dario.garvan
5.6 years ago by
dario.garvan440
Australia
dario.garvan440 wrote:

It's not that easy. The GTF only has a feature category called UTR. The end user has to figure out if it is a 5' or 3' UTR. Here is a reproducible example in R of categorising them. Note that some protein-coding transcripts have a 5' UTR and no 3' UTR, a 3' UTR and no 5' UTR, or have neither UTR. It is important to remember the GENCODE annotation is a collection of transcript *fragments*, not full-length models.

library(GenomicRanges) # From Bioconductor.

genes <- read.table("gencode.v17.annotation.gtf", sep = '\t', skip = 5, stringsAsFactors = FALSE)
whichCodingTranscripts <- genes[, 3] == "transcript" & grepl("transcript_type protein_coding", genes[, 9], fixed = TRUE)
proteinTranscripts <- genes[whichCodingTranscripts, ]
strands <- proteinTranscripts[, 7]
allFeaturesTranscripts <- gsub("transcript_id ", '', sapply(strsplit(genes[, 9], "; "), '[', 2))
proteinTranscriptsNames <- allFeaturesTranscripts[whichCodingTranscripts]
whichCDS <- genes[, 3] == "CDS" & allFeaturesTranscripts %in% proteinTranscriptsNames
transcriptsCDS <- genes[whichCDS, ]
transcriptsCDS <- split(GRanges(transcriptsCDS[, 1], IRanges(transcriptsCDS[, 4], transcriptsCDS[, 5]), transcriptsCDS[, 7]),
                    factor(allFeaturesTranscripts[whichCDS], levels = proteinTranscriptsNames))
firstCDS <- mapply(function(CDS, strand) {if(strand == '+') {CDS[1]} else {CDS[length(CDS)]}}, transcriptsCDS, strands)
lastCDS <-  mapply(function(CDS, strand) {if(strand == '+') {CDS[length(CDS)]} else {CDS[1]}}, transcriptsCDS, strands)
whichUTR <- genes[, 3] == "UTR" & allFeaturesTranscripts %in% proteinTranscriptsNames
transcriptsUTR <- genes[whichUTR, ]
transcriptsUTR <- split(GRanges(transcriptsUTR[, 1], IRanges(transcriptsUTR[, 4], transcriptsUTR[, 5]), transcriptsUTR[, 7]),
                    factor(allFeaturesTranscripts[whichUTR], levels = names(firstCDS)))

transcriptsUTR5 <- mapply(function(UTR, CDS, strand)
               {        
                 if(strand == '+') UTR[UTR < CDS[1]] else UTR[UTR > CDS[length(CDS)]]
               }, transcriptsUTR, firstCDS, as.list(strands), SIMPLIFY = FALSE)

transcriptsUTR3 <- mapply(function(UTR, CDS, strand)
               {        
                 if(strand == '+') UTR[UTR > CDS[length(CDS)]] else UTR[UTR < CDS[1]]
               }, transcriptsUTR, firstCDS, as.list(strands), SIMPLIFY = FALSE)
ADD COMMENTlink written 5.6 years ago by dario.garvan440

Thanks, I have not updated my post, I finally used perl to extract the UTR information, and made a similar formula to get 3 and 5.

But very thanks for the R script, will compare the result with my result

ADD REPLYlink written 5.6 years ago by J.F.Jiang750

And now we can also get this data from UCSC table browser, since it already has these info for V17

ADD REPLYlink written 5.6 years ago by J.F.Jiang750

Not all organisms are annotated with the UTRs. Some has only xenoRefSeq tracks whereby they arbitrarily annotated the UTRs as 200bp upstream/downstream of the start/end codon.

ADD REPLYlink written 4.0 years ago by wanziyi8950
3
gravatar for Michael Dondrup
5.6 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

That shouldn't be too difficult to figure, given you have UTRs annotated.

  • 5'-UTRs are those UTRs at the 5' end of a transcript
  • 3'-UTRS are those at the 3' end of a transcript

:)

ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by Michael Dondrup46k

I guess what OP has to do:

"transcript 3' coordinate" - "CDS 3' coordinate" = 3' UTR
ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by PoGibas4.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1546 users visited in the last hour