Hello everyone,
I am working with biomaRt to access Ensembl annotation (see more info here: http://127.0.0.1:29459/library/biomaRt/doc/accessing_ensembl.html). I am trying to retrieve the 5'UTR sequence ("5utr") for each "ensembl_transcript_id" (using filters), together with the "5_utr_start" and "5_utr_end" positions.
Example code (RStudio):
library(biomaRt)
# 'ensembl' is a Mart object created beforehand (e.g. with useEnsembl())
query <- getBM(
  attributes = c("ensembl_gene_id", "ensembl_transcript_id",
                 "5_utr_start", "5_utr_end", "5utr"),
  filters = c("transcript_biotype", "chromosome_name"),
  values = list("protein_coding", 1),
  mart = ensembl
)
For some "ensembl_transcript_id" entries this query returns multiple "5_utr_start" and "5_utr_end" positions (separated by semicolons). However, I get only a single 5'UTR sequence ("5utr") per "ensembl_transcript_id" for these entries. This means that I don't know which "5_utr_start" and "5_utr_end" positions actually correspond to the displayed 5'UTR sequence ("5utr"). This is a problem for me because I need to know the exact start and end positions of the displayed UTR sequence for subsequent analysis.
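To illustrate what the output looks like, here is a minimal sketch with made-up values (not real Ensembl output) that splits the semicolon-separated fields into one start/end pair per row. The helper name split_field is hypothetical, and pairing the sorted starts with the sorted ends is an assumption that only holds if the segments do not overlap.

```r
# Made-up example values, NOT real Ensembl output
utr_start <- "1000;1200"
utr_end   <- "1049;1229"

# Hypothetical helper: split one semicolon-separated field into numbers
split_field <- function(x) {
  as.numeric(strsplit(as.character(x), ";", fixed = TRUE)[[1]])
}

# Assumption: segments do not overlap, so sorted starts pair with sorted ends
segments <- data.frame(
  start = sort(split_field(utr_start)),
  end   = sort(split_field(utr_end))
)

# Length of each segment (coordinates are 1-based and inclusive)
segments$width <- segments$end - segments$start + 1
print(segments)
```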
Thank you for your help!
Hi Ben,
Thank you for your reply.
Do you know of any workaround to "mutate" a specific position within this UTR sequence?
Let's say I would like to mutate the nucleotide at genomic coordinate X to G within the 5'UTR sequence of a given transcript.
Now I have the problem that I don't know the exact coordinates of the bases between the start and end of the UTR (due to splicing).
Cheers, omit
Hi omit,
I'm afraid I can't think of any obvious options. You may have to write a custom script that uses the genomic coordinates of the exons spanned by the 5' UTR to calculate the genomic coordinate of the position X bp from the start codon.
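A minimal sketch of what such a script could look like, going in the direction you describe (from a genomic coordinate to an index in the spliced UTR sequence). Everything here is an assumption rather than a tested solution: the function names genomic_to_utr_index and mutate_utr are hypothetical, coordinates are taken to be 1-based and inclusive, the segments must not overlap, the strand (1 or -1) has to be known for the transcript, and the new base is given in the orientation of the returned sequence.

```r
# Hypothetical helper: map a genomic coordinate to an index in the spliced
# 5' UTR sequence, given the genomic start/end of each UTR segment.
genomic_to_utr_index <- function(pos, starts, ends, strand = 1) {
  ord <- order(starts)
  starts <- starts[ord]
  ends <- ends[ord]
  if (strand == -1) {
    # On the reverse strand the transcript runs from high to low coordinates
    starts <- rev(starts)
    ends <- rev(ends)
  }
  widths <- ends - starts + 1
  hit <- which(pos >= starts & pos <= ends)
  if (length(hit) != 1) return(NA_integer_)  # not inside any UTR segment
  offset <- if (strand == 1) pos - starts[hit] else ends[hit] - pos
  # Bases in all earlier segments, plus the offset within this one
  as.integer(sum(widths[seq_len(hit - 1)]) + offset + 1)
}

# Hypothetical helper: replace the base at a genomic coordinate.
# Assumption: new_base is in the same orientation as utr_seq.
mutate_utr <- function(utr_seq, pos, new_base, starts, ends, strand = 1) {
  i <- genomic_to_utr_index(pos, starts, ends, strand)
  if (is.na(i)) stop("position is not inside the 5' UTR segments")
  substr(utr_seq, i, i) <- new_base
  utr_seq
}

# Made-up example: two segments of 5 bp and 3 bp, NOT real Ensembl data
starts <- c(1000, 1200)
ends   <- c(1004, 1202)
utr    <- "AAAAACCC"
mutated <- mutate_utr(utr, 1201, "G", starts, ends, strand = 1)
print(mutated)
```

It would be worth checking that the segment widths add up to the length of the returned sequence before trusting the index.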
Cheers
Ben
Hi Ben,
I translated this problem into a "real world" problem by creating a so-called "bridge game".
See here: https://stackoverflow.com/questions/65003498/r-programming-row-wise-data-frame-calculation-with-custom-script-for-every-i
Let's see if the community can solve it.
Cheers omit