Question

Retrieve 5'UTR sequences for ensembl_transcript_id's with unique start/end positions

3.4 years ago

omit3333 • 0

Hello everyone,

I am working with biomaRt to access Ensembl annotation (see the "accessing_ensembl" vignette; this link points to my local R help server: http://127.0.0.1:29459/library/biomaRt/doc/accessing_ensembl.html). I am trying to retrieve 5'UTR sequences for a filtered set of "ensembl_transcript_id" values, together with their "5_utr_start" and "5_utr_end" positions.

Example code (RStudio):

query <- getBM(
  attributes = c("ensembl_gene_id", "ensembl_transcript_id", "5_utr_start", "5_utr_end", "5utr"),
  filters = c("transcript_biotype", "chromosome_name"),
  values = list(c("protein_coding"), c(1)),
  mart = ensembl  # Mart object created beforehand, e.g. with useEnsembl()
)

For some "ensembl_transcript_id" entries this query returns multiple "5_utr_start" and "5_utr_end" positions (separated by semicolons), but only a single 5'UTR sequence ("5utr") per "ensembl_transcript_id". This means I don't know which "5_utr_start" and "5_utr_end" positions actually correspond to the displayed 5'UTR sequence ("5utr"). This is a problem because I need the exact start and end positions of the displayed UTR sequence for subsequent analysis.

Thank you for your help!

biomart ensembl R sequence • 1.6k views

score 2 · Answer 1 · 2020-11-23

3.4 years ago

Ben_Ensembl ★ 2.4k

Hi omit3333,

This is caused by the UTR spanning several exons, e.g.: https://www.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000000938;r=1:27612064-27635185;t=ENST00000374005

You'll want to take the most upstream and most downstream coordinates to get the overall start/end of the UTR.
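A minimal sketch of that in R, assuming the "5_utr_start" and "5_utr_end" columns come back as semicolon-separated strings as described in the question. The helper name `utr_span` and the coordinates are made up for illustration.

```r
# Hypothetical helper: collapse "a;b;c" coordinate strings into one overall span.
utr_span <- function(starts, ends) {
  s <- as.integer(strsplit(as.character(starts), ";", fixed = TRUE)[[1]])
  e <- as.integer(strsplit(as.character(ends), ";", fixed = TRUE)[[1]])
  # Smallest start and largest end give the genomic span on either strand
  c(start = min(s), end = max(e))
}

# Made-up coordinates for a 5'UTR split across two exons
utr_span("1000;1500", "1100;1520")
```

Note that this span includes any introns between the UTR exons, so it will be longer than the spliced "5utr" sequence whenever more than one start/end pair is returned.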

Hi Ben,

Thank you for your reply.

Do you know of any workaround to "mutate" a specific location within this UTR sequence?

Let's say I would like to mutate the nucleotide at genomic coordinates X to G within the 5'UTR sequence of a given transcript.

Now I have the problem that I don't know the exact coordinates between the start and end of the UTR (due to splicing).

Cheers, omit

Reply by omit3333 • 3.4 years ago

Hi omit,

I'm afraid I can't think of any obvious options. You may have to write a custom script that uses the genomic coordinates of the exons spanned by the 5' UTR to work out where a given genomic position falls within the spliced UTR, e.g. the genomic coordinate of the position X bp from the start codon.
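One possible shape for such a script, as a rough sketch only. It assumes the 5'UTR blocks are available as numeric start/end vectors (e.g. after splitting the semicolon-separated values), that strand is coded as 1 or -1, and that the "5utr" sequence reads 5' to 3' along the transcript. The function names and the toy data are hypothetical.

```r
# Hypothetical helper: map a genomic coordinate to its 1-based index in the
# spliced 5'UTR sequence (read 5' -> 3'), or NA if it lies outside all blocks.
genomic_to_utr_index <- function(pos, starts, ends, strand) {
  # Walk the blocks in transcript order: ascending on +, descending on -
  ord <- order(starts, decreasing = (strand == -1))
  starts <- starts[ord]
  ends <- ends[ord]
  offset <- 0L
  for (i in seq_along(starts)) {
    if (pos >= starts[i] && pos <= ends[i]) {
      within <- if (strand == 1) pos - starts[i] else ends[i] - pos
      return(offset + within + 1L)
    }
    offset <- offset + (ends[i] - starts[i] + 1L)
  }
  NA_integer_
}

# Hypothetical helper: replace the base at a genomic position in the UTR string.
mutate_utr <- function(utr_seq, pos, new_base, starts, ends, strand) {
  idx <- genomic_to_utr_index(pos, starts, ends, strand)
  if (is.na(idx)) stop("position is not inside a 5'UTR block")
  substr(utr_seq, idx, idx) <- new_base
  utr_seq
}

# Toy example: two blocks (100-104 and 200-202) on the + strand, 8 nt in total
mutate_utr("ACGTACGT", 200L, "G", c(100L, 200L), c(104L, 202L), 1)
```

On the minus strand the new base would need to be given as it reads on the transcript, i.e. the complement of the forward-strand base, if the sequence is transcript-oriented as assumed here.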

Cheers

Ben

Reply by Ben_Ensembl ★ 2.4k • 3.4 years ago

Hi Ben,

I translated this problem into a "real world" problem by creating a so-called "bridge game".

See here: https://stackoverflow.com/questions/65003498/r-programming-row-wise-data-frame-calculation-with-custom-script-for-every-i

Let's see if the community can solve it.

Cheers, omit

Reply by omit3333 • 3.4 years ago