Question

Extracting Flanking Regions across TSS using R scripts

1

Entering edit mode

6.6 years ago

r.tor ▴ 50

Hi,

Using biomaRt package in R, I have extracted the Transcription Start/Stop Sites of protein coding genes in human genome. So, I have a huge list of TSS. The next step is to get sequences of 1000 bp region upstream and downstream of retrieved TSS coordinates. I figured out that there isn't a function in biomaRt to download sequence upstream and downstream of the TSS. I do appreciate if you could make a suggestion to me.

TSS R Bioconductor Coding ensembl • 6.0k views

ADD COMMENT • link updated 6.6 years ago by bernatgel ★ 3.4k • written 6.6 years ago by r.tor ▴ 50

0

Entering edit mode

You will need to verify that this still works: A: How Do I Use Biomart To Get Upstream Flanking Sequence For A Gene?

ADD REPLY • link 6.6 years ago by GenoMax 152k

score 4 · Accepted Answer · 2018-11-30

4

Entering edit mode

6.6 years ago

bernatgel ★ 3.4k

You can use the appropiate BSGenome from Bioconductor (probably BSgenome.Hsapiens.UCSC.hg19 or BSgenome.Hsapiens.UCSC.hg38. You should:

Transform your list of list of TSS to a GRanges object
Get the upstrem or downstream region with flank
Use getSeq to retrieve the actual sequence from BSgenome

With this approach you'll only download the sequence once, when downloading the BSgenome package instead retrieveing it from the Biomart server every time you rerun the script.

ADD COMMENT • link 6.6 years ago by bernatgel ★ 3.4k

0

Entering edit mode

Thank you for your answer and the tip! actually I didn't get the way to present the list of TSS to a GRanges object. I have a list of TSS coordinates that has been retrieved by the query like this:

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)
grep("transcript", att$name, value=TRUE)
getBM(attributes=c("chromosome_name", "transcript_start", "transcript_end", "ensembl_gene_id","gene_biotype", "ensembl_transcript_id"),
    filters ="biotype",
    values  =c("protein_coding"),
    mart    =mart)

and now, can you guide me how to transform the output to the BSgenome as you have mentioned. Thank you in advance.

ADD REPLY • link 6.6 years ago by r.tor ▴ 50

1

Entering edit mode

An easy way to build GRanges is uding the toGRanges function from the package regioneR. It will read most "bed-like" formats (files, data.frames...) and transform them into GRanges objects.

library(biomaRt)
library(regioneR)
library(BSgenome.Hsapiens.UCSC.hg19)

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)

transcripts <- getBM(attributes=c("chromosome_name", "transcript_start", "transcript_end", "ensembl_gene_id","gene_biotype", "ensembl_transcript_id"),
      filters ="biotype",
      values  =c("protein_coding"),
      mart    =mart)

#Build a GRanges with regioneR's toGRanges function
transcripts.gr <- toGRanges(transcripts)

#Filter out non-standard chromosomes
transcripts.gr <- keepSeqlevelstranscripts.gr, c(1:22,"X", "Y"), pruning.mode = "coarse")

#get the flanking regions from the start of the transcripts and from the end
flanking <- cflanktranscripts.gr, 1000, start = TRUE),
              flanktranscripts.gr, 1000, start = FALSE))

#Change the chromosomes names from Ensembl (1,2,3...) to UCSC (chr1, chr2, ch3...) so they match the chromosome names in the BSgenome
seqlevelsStyle(flanking) <- "UCSC"


#And get the sequences
seqs <- getSeq(BSgenome.Hsapiens.UCSC.hg19, flanking)

Take into account that you have many duplicate and overlaping regions, since multiple transcripts may start at the same position. Take also into account that you'll be getting 300k seqs, so it might be a bit slow.

Hope this helps

Bernat

ADD REPLY • link 6.6 years ago by bernatgel ★ 3.4k

0

Entering edit mode

Thank you so much! it helped me a lot. The code worked well after a minor modifications. And as you've mentioned, the next would be considering the overlapping regions. So, I'm wondering about merging all sequences to get an understandable result. I think that I won't need the sequences and it would be possible to do with only coordinates of flanking regions. Please correct me if I am wrong.

ADD REPLY • link 6.6 years ago by r.tor ▴ 50

1

Entering edit mode

Yes, to merge the regions or to manage the overlaps in any other way you don't need the sequences, you can do it with the coordinates only.

ADD REPLY • link 6.6 years ago by bernatgel ★ 3.4k