Question: Extracting Flanking Regions across TSS using R scripts
0
gravatar for r.tor
4 months ago by
r.tor40
r.tor40 wrote:

Hi,

Using biomaRt package in R, I have extracted the Transcription Start/Stop Sites of protein coding genes in human genome. So, I have a huge list of TSS. The next step is to get sequences of 1000 bp region upstream and downstream of retrieved TSS coordinates. I figured out that there isn't a function in biomaRt to download sequence upstream and downstream of the TSS. I do appreciate if you could make a suggestion to me.

ensembl coding tss bioconductor R • 375 views
ADD COMMENTlink modified 4 months ago by bernatgel1.9k • written 4 months ago by r.tor40

You will need to verify that this still works: A: How Do I Use Biomart To Get Upstream Flanking Sequence For A Gene?

ADD REPLYlink modified 4 months ago • written 4 months ago by genomax65k
3
gravatar for bernatgel
4 months ago by
bernatgel1.9k
Barcelona, Spain
bernatgel1.9k wrote:

You can use the appropiate BSGenome from Bioconductor (probably BSgenome.Hsapiens.UCSC.hg19 or BSgenome.Hsapiens.UCSC.hg38. You should:

  1. Transform your list of list of TSS to a GRanges object

  2. Get the upstrem or downstream region with flank

  3. Use getSeq to retrieve the actual sequence from BSgenome

With this approach you'll only download the sequence once, when downloading the BSgenome package instead retrieveing it from the Biomart server every time you rerun the script.

ADD COMMENTlink written 4 months ago by bernatgel1.9k

Thank you for your answer and the tip! actually I didn't get the way to present the list of TSS to a GRanges object. I have a list of TSS coordinates that has been retrieved by the query like this:

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)
grep("transcript", att$name, value=TRUE)
getBM(attributes=c("chromosome_name", "transcript_start", "transcript_end", "ensembl_gene_id","gene_biotype", "ensembl_transcript_id"),
    filters ="biotype",
    values  =c("protein_coding"),
    mart    =mart)

and now, can you guide me how to transform the output to the BSgenome as you have mentioned. Thank you in advance.

ADD REPLYlink modified 4 months ago • written 4 months ago by r.tor40
1

An easy way to build GRanges is uding the toGRanges function from the package regioneR. It will read most "bed-like" formats (files, data.frames...) and transform them into GRanges objects.

library(biomaRt)
library(regioneR)
library(BSgenome.Hsapiens.UCSC.hg19)

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)

transcripts <- getBM(attributes=c("chromosome_name", "transcript_start", "transcript_end", "ensembl_gene_id","gene_biotype", "ensembl_transcript_id"),
      filters ="biotype",
      values  =c("protein_coding"),
      mart    =mart)

#Build a GRanges with regioneR's toGRanges function
transcripts.gr <- toGRanges(transcripts)

#Filter out non-standard chromosomes
transcripts.gr <- keepSeqlevelstranscripts.gr, c(1:22,"X", "Y"), pruning.mode = "coarse")

#get the flanking regions from the start of the transcripts and from the end
flanking <- cflanktranscripts.gr, 1000, start = TRUE),
              flanktranscripts.gr, 1000, start = FALSE))

#Change the chromosomes names from Ensembl (1,2,3...) to UCSC (chr1, chr2, ch3...) so they match the chromosome names in the BSgenome
seqlevelsStyle(flanking) <- "UCSC"


#And get the sequences
seqs <- getSeq(BSgenome.Hsapiens.UCSC.hg19, flanking)

Take into account that you have many duplicate and overlaping regions, since multiple transcripts may start at the same position. Take also into account that you'll be getting 300k seqs, so it might be a bit slow.

Hope this helps

Bernat

ADD REPLYlink written 4 months ago by bernatgel1.9k

Thank you so much! it helped me a lot. The code worked well after a minor modifications. And as you've mentioned, the next would be considering the overlapping regions. So, I'm wondering about merging all sequences to get an understandable result. I think that I won't need the sequences and it would be possible to do with only coordinates of flanking regions. Please correct me if I am wrong.

ADD REPLYlink modified 4 months ago • written 4 months ago by r.tor40
1

Yes, to merge the regions or to manage the overlaps in any other way you don't need the sequences, you can do it with the coordinates only.

ADD REPLYlink written 4 months ago by bernatgel1.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1512 users visited in the last hour