I want to extract the genomic locations of all promoters for mm9, together with the corresponding transcript id, gene id, and symbol. However, I have found that there are duplicate ranges, and that sometimes two promoters correspond to one gene.
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
mm9 <- TxDb.Mmusculus.UCSC.mm9.knownGene
promoter <- promoters(mm9)
> head(promoter)
GRanges object with 6 ranges and 2 metadata columns:
  seqnames             ranges strand |     tx_id     tx_name
     <Rle>          <IRanges>  <Rle> | <integer> <character>
      chr1 [4795974, 4798173]      + |         1  uc007afg.1
      chr1 [4845775, 4847974]      + |         3  uc007afi.2
      chr1 [4846409, 4848608]      + |         5  uc011whu.1
So, here are the duplicate ranges. What is the reason for having these duplicate ranges/promoters?
Next, I need the corresponding gene id/symbol for each promoter.
promoter = unique(promoter)
gene_id_promoter = select(mm9, keys = as.character(promoter$tx_id), columns = c("TXNAME", "GENEID"), keytype = "TXID")
> head(gene_id_promoter)
  TXID GENEID     TXNAME
1    1  18777 uc007afg.1
2    3  21399 uc007afi.2
3    5  21399 uc011whu.1
4    6 108664 uc007afm.1
5    8  18387 uc007afo.1
6   10  18387 uc007afq.1
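The rows printed by head(gene_id_promoter) can be tallied per gene to see which gene ids carry more than one transcript. This is only a minimal base R sketch: the six rows are typed in by hand from the output above, and the names `tx_per_gene`, `multi_tx_genes`, and `tx_by_gene` are made up for illustration.

```r
# The six rows shown by head(gene_id_promoter), typed in by hand.
gene_id_promoter <- data.frame(
  TXID   = c(1, 3, 5, 6, 8, 10),
  GENEID = c("18777", "21399", "21399", "108664", "18387", "18387"),
  TXNAME = c("uc007afg.1", "uc007afi.2", "uc011whu.1",
             "uc007afm.1", "uc007afo.1", "uc007afq.1"),
  stringsAsFactors = FALSE
)

# Count how many transcripts (and hence promoter ranges) each gene id has.
tx_per_gene <- table(gene_id_promoter$GENEID)

# Gene ids that appear with more than one transcript.
multi_tx_genes <- names(tx_per_gene)[tx_per_gene > 1]

# Transcript names grouped by gene id.
tx_by_gene <- split(gene_id_promoter$TXNAME, gene_id_promoter$GENEID)

print(multi_tx_genes)
print(tx_by_gene[multi_tx_genes])
```

On the full table the same tally would use the real select() result instead of the hand-typed data frame.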
Different transcripts of a gene have the same gene id. But how is it possible that one gene has two promoters? Basically, two promoters (uc007afi.2 and uc011whu.1) correspond to one gene id (21399) and to two different transcripts of the same gene. So, I took another look at my ranges.
chr1 [4845775, 4847974] + | 3 uc007afi.2
chr1 [4846409, 4848608] + | 5 uc011whu.1
The range of uc007afi.2 overlaps the range of uc011whu.1. How can this be explained? I have two promoters corresponding to one gene and two transcripts, but their ranges overlap. Is the reason that a promoter region has no exact definition? Which region should I take as the promoter region for gene 21399?
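To make the overlap concrete, here is a minimal base R sketch that works backwards from the two ranges above. It assumes promoters() was called with its default arguments (upstream = 2000, downstream = 200), which matches the width of 2200 seen in the output; the data frame and the variable names are made up for illustration.

```r
# Ranges copied from the head(promoter) output above (both on the + strand).
prom <- data.frame(
  tx_name = c("uc007afi.2", "uc011whu.1"),
  start   = c(4845775, 4846409),
  end     = c(4847974, 4848608),
  stringsAsFactors = FALSE
)

# Assumption: promoters() defaults, upstream = 2000 and downstream = 200,
# so on the + strand start = TSS - 2000 and end = TSS + 199.
upstream   <- 2000
downstream <- 200

prom$width <- prom$end - prom$start + 1
prom$tss   <- prom$start + upstream

# Two closed intervals overlap when each starts before the other ends.
overlaps <- prom$start[1] <= prom$end[2] && prom$start[2] <= prom$end[1]

# Distance between the two inferred transcription start sites.
tss_distance <- abs(diff(prom$tss))

print(prom)
print(overlaps)
print(tss_distance)
```

Under that assumption each range is just a fixed window around one transcript's start site, so two transcripts of the same gene whose start sites lie closer together than the window width will always give overlapping ranges.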