Question

How to get downstream promoter sequences using BiomaRt

0

Entering edit mode

9.2 years ago

FLDal ▴ 10

Hi,

I have a list of several thousand genes for which I would like to get the -1000:1000 promoter regions, using the start of the 5'utr as a mark defining upstream and downstream. I've been using the getsequence command in BiomaRt in R, and can easily retrieve the upstream sequences and 5'utr using this code

ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
Hs = getSequence(id = human, type="ensembl_gene_id",seqType="coding_gene_flank",upstream = 1000, mart=ensembl)

where "human" is a list of IDs. Additionally I can capture data excluding the 5'utr by using the seqType ="gene_flank". What I can't seem to do is capture the 1000 bp downstream from the start of the 5'utr and into the coding sequence. The program seems to only offer downstream regions that flank the gene. I've tried combinations of downloading coding_gene_flank and exons seqtypes, but the combined length of these sequences will differ from gene to gene as the 5'utr lengths differ.

Moreover, even if I could get the sequences I want, I can't seem to get both upstream and downstream sequences downloaded as a single sequences.

Is there some variation on my current code that would get me -1000:1000 slices that I'm missing? Should I simply move on to another program.

BiomaRt RNA-Seq Promoter-regions • 6.6k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by FLDal ▴ 10

0

Entering edit mode

Thanks SES. I actually don't know how to highlight the code in grey. Is it just a table cell?

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.2 years ago by FLDal ▴ 10

Ram · Answer 1 · 2015-02-23

0

Entering edit mode

9.2 years ago

Emily 23k

I'm afraid there isn't a function in BioMart to download sequence upstream and downstream of the TSS, only upstream and downstream of the whole gene. This might be something you would want to do using the Ensembl Perl API. You could use the API to find your gene and get its start coordinate, then you would define a slice by that coordinate ± 1000. I can help you get started with the API if you like.

ADD COMMENT • link 9.2 years ago by Emily 23k

0

Entering edit mode

Emily_Ensembl that would be fantastic. I've never used it and not well experienced in Perl. I'm happy for any advice or tips you can offer.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.2 years ago by FLDal ▴ 10

0

Entering edit mode

To brush up on your Perl, I recommend this free online course, but other BioStars might have other suggestions. You can then take a look at the online course on the Ensembl API, from which you'll only need the Core module.

What you'll be aiming to do with your script is (and this will sound like garbage until you've had a look at the online course) use the GeneAdaptor to fetch_by_stable_id. Then you'll want to get the strand and chromosome (seq_region_name) of the gene. To get the TSS, you'll want the seq_region_start of positive strand genes or the seq_region_end of negative strand genes, then you can ± 1000 to get your start and end coordinates of your region of interest. You can then get the SliceAdaptor and use it to fetch_by_region using your locus (chromosome, strand, coordinates). Then you can get the seq for your slice.

Have a look at the online course, see if that makes sense, and we'll see where we can go from there.

ADD REPLY • link 9.2 years ago by Emily 23k

0

Entering edit mode

Thanks so much for your help!

ADD REPLY • link 9.2 years ago by FLDal ▴ 10

Ram · Answer 2 · 2015-02-24

0

Entering edit mode

9.2 years ago

Diego D. ▴ 50

I don't really sure if I understood well, but do you want to get the +- 1000 region from 5utr_start? right?

If that is the case, maybe you should try the "flank" function from GenomicRanges package.

For example,

ensembl=useMart("ensembl")
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
#assuming you have ensembl ids
filters=c("ensembl_gene_id")

#We retrieve only the necessary attributes to create the GenomicRanges object. We set the start and end positions as the same (5utr start), because it is the only genomic coordinates we need from the genes.
attr=c("chromosome_name","5_utr_start","5_utr_start","external_gene_name","strand")
#your human ids
values= human

#Retrieve information from ensembl
genes.annot <- getBM(attr,filters=filters,values=values,ensembl)
genes.annot$chromosome_name <- paste0("chr",ncRNA.annot$chromosome_name)
#Create GenomicRanges object
genes.ranges<-with(genes.annot,GRanges(chromosome_name,IRanges(5_utr_start,5_utr_start,names=external_gene_name),strand=strand))
#Get +-1000 region that you need.
flank.ranges <- flank(genes.ranges,1000,both=T, use.names=T)

Hope that helps!

ADD COMMENT • link 9.2 years ago by Diego D. ▴ 50

0

Entering edit mode

Excellent. I'll give this a try and update the post.

ADD REPLY • link 9.2 years ago by FLDal ▴ 10

0

Entering edit mode

Hi @Diego D. I'm having a little trouble with the code.

genes.ranges<-with(genes.annot,GRanges(chromosome_name,IRanges(5_utr_start,5_utr_start,names=external_gene_name),strand=strand))

returns this error

Error: unexpected input in "genes.ranges<-with(genes.annot,GRanges(chromosome_name,IRanges(5_"

I've spent some time troubleshooting this and can't figure it out. It seems to want numeric maybe? Can you send advice?

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by FLDal ▴ 10

0

Entering edit mode

Can someone advise me on why the IRanges part of this code is not working? I've been at it a few days and can't seem to figure it out.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by FLDal ▴ 10