Question: Get the location of a promoter from a gene
1
gravatar for Will Yam
9 weeks ago by
Will Yam10
Will Yam10 wrote:

Hello,

I'm trying to estimate the methylation of the cg sites located on the promoters of some genes.

So i have a csv with the cgsites and their beta.

I have a list of genes. now I'm trying to find the promoters of said genes.

Taking for example "TSPAN6", going to the ensemble website, I see the promoter tagged as

Stable ID   ENSR00000911389  
Type    Promoter  
Core bp Chromosome X: 100,635,001-100,637,600

I tried in R:

promoters(EnsDb.Hsapiens.v86, filter = GeneNameFilter("TSPAN6"))

or

promoters(TxDb.Hsapiens.UCSC.hg38.knownGene)

but it doesn't show this information

I also tried the biomart package:

regulation <- useEnsembl(biomart="regulation", dataset="hsapiens_regulatory_feature")
reg.attr <- listAttributes(regulation)
test.promo <- getBM(attributes = reg.attr$name, 
                    filters = 'regulatory_stable_id', 
                    values = "ENSR00000911389", 
                    mart = regulation)

but I can't find a field to relate the promoter to the gene...

In each case it seems i'm getting the transcripts (ie ENST*) associated to TSPAN6, but I need the start and end of the promoter so i can filter my cgsites/beta.

Anyone could please advise how to achieve that?

Regards,

biomart bioconductor R gene • 233 views
ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by Will Yam10
2

The promoters command you are using in R shifts the range to +2000 to -200 around the start of the gene. You might consider changing one parameter downstream=0 for your analysis. Promoters are not well annotated for a huge variety of reasons. Outside of specific promoters listed in the literature it will be tough to find the location of all the true promoters.

Either way using those ranges or the CAGE data as suggested by Alex should be a good starting point for your analysis.

Here is a snippet that should help your R code:

library(biomaRt)

genes <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
bm <- getBM(attributes = c("external_gene_name",'entrezgene_id'), values=names(genes),filters ='entrezgene_id', mart = mart)
names(genes) <- bm$external_gene_name[match(genes$gene_id,bm$entrezgene_id)]

promoters(genes,downstream=0)['TSPAN6']
ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by benformatics1.9k

Hello Ben,

Thank you for the head start, much obliged! I have adapted your code to map the region found using promoters(), and I do end up with a list of cg sites:

> cgsites.tspan6$Name
[1] "cg16397002" "cg12801769" "cg02998933" "cg03791917" "cg08355317" "cg01519985"
[7] "cg00126698"

> cgsites.tspan6$pos
[1] 100641870 100644557 100640796 100641287 100642128 100644559 100640503

> cgsites.tspan6$Islands_Name
[1] "chrX:100645780-100646107" "chrX:100645780-100646107" ""                        
[4] ""                         "chrX:100645780-100646107" "chrX:100645780-100646107"
[7] ""

The problem is at this point i do not know if my results make sense or not. I tried looking at the region identified on the ensembl site, and could not see a promoter in this range. Would you know how I can tell if these are indeed the cg sites located on the promoter or not? also, it looks like the range returned by the promoters() function is the site of the upstream parameter. Am I correct in assuming that I cannot locate the cgsites on a promoter and that your code is basically returning a region of a size of my choice, that is supposed to contain the promoter?

code attached for the record:

library(dplyr)
library(minfi)
library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)

annotation <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
an.df <- as.data.frame(annotation)

genes <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
bm <- getBM(attributes = c("external_gene_name", 'entrezgene_id'), 
            values=names(genes),filters ='entrezgene_id', 
            mart = mart)
names(genes) <- bm$external_gene_name[match(genes$gene_id,bm$entrezgene_id)]

promoter.tspan6 <- promoters(genes, downstream=0, upstream = 5000)['TSPAN6']
gene.tspan6 <- genes['TSPAN6']

promoter.tspan6 <- as.data.frame(promoter.tspan6)
gene.tspan6 <- as.data.frame(gene.tspan6)


print(paste0(gene.tspan6$seqnames, ":", gene.tspan6$start, "-", gene.tspan6$end))
print(paste0(gene.tspan6$seqnames, ":", promoter.tspan6$start, "-", promoter.tspan6$end))

cgsites.tspan6 <- an.df %>%
  dplyr::filter(chr == gene.tspan6$seqnames) %>%
  dplyr::filter(pos >= promoter.tspan6$start) %>%
  dplyr::filter(pos <= promoter.tspan6$end)

cgsites.tspan6$Name
cgsites.tspan6$pos
ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Will Yam10
1

There wouldn't/shouldn't be any annotation for promoters on the Ensembl site. (Maybe they have CAGE data but even then it wouldn't be annotated specifically as promoters.) Yes you are correct that the regions are the size of your choice upstream from annotated genes. (That is where biologically the promoter is supposed to exist.)

One thing to note is that your IlluminaHumanMethylationEPICanno.ilm10b4.hg19 is (i assume) relative to the hg19/37 genome annotation coordinates whereas the code we used is in the more recent hg38 genome coordinate annotations. That is a a big no-no. You should use TxDb.Hsapiens.UCSC.hg19.knownGene instead if you plan to stick with hg19.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by benformatics1.9k

ok, then i guess i'll just have to try to look at the average methylation levels of the cg sites in different range sizes (1kb -> 5kb) see if I can get anything exciting.
Noted reg not mixing hg19 vs hg38, I will have to rerun some analysis with the proper txdb.
All right, I don't think I will be able to figure out the cage thing at this point, but I feel I should be able to get somewhere with the bioconductor packages.

Thank you very much for all the tips and explanations!

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Will Yam10
1

Could you use CAGE TSS peaks, and then go upstream of the midpoint by some number of bases to define a proximal promoter (e.g., 1kb, 2kb, 5kb, etc.)?

ADD REPLYlink written 9 weeks ago by Alex Reynolds30k
1

Currently, promoters are not formally linked to genes in the Ensembl database. You can look at these two data types in parallel in the browser, but you have to make your own conclusions about which promoter is associated with a gene. We hope to be able to provide a direct linkage between genes and their promoters in the future.

ADD REPLYlink written 9 weeks ago by Astrid_Ensembl330

Hello Alex,

Thank you for the tip. If I understand you correctly, im supposed to locate the TSS from a file on the fantom site, and then look at all the cg sites in the - for example - 5kb upstream. Could you please advise which specific file I should look at? there are many of them and I'm struggling a little - well, a lot! I found also a reference to biomart on the site but could not either access it through R... Any hint appreciated!

ADD REPLYlink written 9 weeks ago by Will Yam10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 827 users visited in the last hour