Question: How to get downstream promoter sequences using BiomaRt
gravatar for FLDal
4.2 years ago by
FLDal10 wrote:


I have a list of several thousand genes for which I would like to get the -1000:1000 promoter regions, using the start of the 5'utr as a mark defining upstream and downstream. I've been using the getsequence command in BiomaRt  in R, and can easily retrieve the upstream sequences and 5'utr using this code  

ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
Hs = getSequence(id = human, type="ensembl_gene_id",seqType="coding_gene_flank",upstream = 1000, mart=ensembl)          

where "human" is a list of IDs. Additionally I can capture data excluding the 5'utr by using the seqType ="gene_flank". What I can't seem to do is capture the 1000 bp downstream from the start of the 5'utr and into the coding sequence. The program seems to only offer downstream regions that flank the gene. I've tried combinations of downloading coding_gene_flank and exons seqtypes, but the combined length of these sequences will differ from gene to gene as the 5'utr lengths differ. 

 Moreover, even if I could get the sequences I want, I can't seem to get both upstream and downstream sequences downloaded as a single sequences. 

Is there some variation on my current code that would get me -1000:1000 slices that I'm missing? Should I simply move on to another program. 

ADD COMMENTlink modified 4.1 years ago • written 4.2 years ago by FLDal10

Thanks SES. I actually don't know how to highlight the code in grey. Is it just a table cell?

ADD REPLYlink written 4.2 years ago by FLDal10
gravatar for Emily_Ensembl
4.2 years ago by
Emily_Ensembl18k wrote:

I'm afraid there isn't a function in BioMart to download sequence upstream and downstream of the TSS, only upstream and downstream of the whole gene. This might be something you would want to do using the Ensembl Perl API. You could use the API to find your gene and get its start coordinate, then you would define a slice by that coordinate ± 1000. I can help you get started with the API if you like.

ADD COMMENTlink written 4.2 years ago by Emily_Ensembl18k

Emily_Ensembl that would be fantastic. I've never used it and not well experienced in Perl. I'm happy for any advice or tips you can offer. 

ADD REPLYlink written 4.2 years ago by FLDal10

To brush up on your Perl, I recommend this free online course, but other BioStars might have other suggestions. You can then take a look at the online course on the Ensembl API, from which you'll only need the Core module.

What you'll be aiming to do with your script is (and this will sound like garbage until you've had a look at the online course) use the GeneAdaptor to fetch_by_stable_id. Then you'll want to get the strand and chromosome (seq_region_name) of the gene. To get the TSS, you'll want the seq_region_start of positive strand genes or the seq_region_end of negative strand genes, then you can ± 1000 to get your start and end coordinates of your region of interest. You can then get the SliceAdaptor and use it to fetch_by_region using your locus (chromosome, strand, coordinates). Then you can get the seq for your slice.

Have a look at the online course, see if that makes sense, and we'll see where we can go from there.

ADD REPLYlink written 4.2 years ago by Emily_Ensembl18k

Thanks so much for your help!

ADD REPLYlink written 4.1 years ago by FLDal10
gravatar for Diego D.
4.2 years ago by
Diego D.40
Diego D.40 wrote:

I don't really sure if I understood well, but do you want to get the +- 1000 region from 5utr_start? right?

If that is the case, maybe you should try the "flank" function from GenomicRanges package.

For example,

ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
#assuming you have ensembl ids

#We retrieve only the necessary attributes to create the GenomicRanges object. We set the start and end positions as the same (5utr start), because it is the only genomic coordinates we need from the genes.
#your human ids
values= human

#Retrieve information from ensembl
genes.annot <- getBM(attr,filters=filters,values=values,ensembl)
genes.annot$chromosome_name <- paste0("chr",ncRNA.annot$chromosome_name)
#Create GenomicRanges object
#Get +-1000 region that you need.
flank.ranges <- flank(genes.ranges,1000,both=T, use.names=T)

Hope that helps!

ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by Diego D.40

Excellent. I'll give this a try and update the post.

ADD REPLYlink written 4.1 years ago by FLDal10

Hi @Diego D.  I'm having a little trouble with the code. 


returns this error 

Error: unexpected input in "genes.ranges<-with(genes.annot,GRanges(chromosome_name,IRanges(5_"

I've spent some time troubleshooting this and can't figure it out. It seems to want numeric maybe? Can you send advice?

ADD REPLYlink written 4.1 years ago by FLDal10

Can someone advise me on why the IRanges part of this code is not working? I've been at it a few days and can't seem to figure it out. 

ADD REPLYlink written 4.1 years ago by FLDal10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 936 users visited in the last hour