I want to extract the genomic locations of all promoters for mm9, together with the corresponding transcript id, gene id and symbol. However, I have found that there are duplicate ranges, and that sometimes two promoters correspond to one gene.
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
mm9 = TxDb.Mmusculus.UCSC.mm9.knownGene
promoter <- promoters(mm9)
> head(promoter)
GRanges object with 6 ranges and 2 metadata columns:
  seqnames             ranges strand |     tx_id     tx_name
     <Rle>          <IRanges>  <Rle> | <integer> <character>
      chr1 [4795974, 4798173]      + |         1  uc007afg.1
      chr1 [4845775, 4847974]      + |         3  uc007afi.2
      chr1 [4846409, 4848608]      + |         5  uc011whu.1
So, here are the duplicate ranges. What is the reason for these duplicate ranges/promoters?
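One possible explanation, sketched below in base R: promoters() builds a fixed window around each transcript's start site (by default 2000 bp upstream and 200 bp downstream, which matches the 2200 bp widths in the output), so two transcripts that share a start site get the same range. The transcript names "a", "b", "c" and the second start site are made up for illustration; only the first start site is derived from the first range shown above.

```r
# Hypothetical transcripts: "a" and "b" share a TSS, "c" starts elsewhere.
tx <- data.frame(
  tx_name = c("a", "b", "c"),
  tss     = c(4797974, 4797974, 4847775),
  strand  = "+",
  stringsAsFactors = FALSE
)

upstream   <- 2000  # default of promoters()
downstream <- 200   # default of promoters()

# Plus-strand window: [TSS - upstream, TSS + downstream - 1]
tx$start <- tx$tss - upstream
tx$end   <- tx$tss + downstream - 1

# Flag windows that repeat an earlier one
tx$duplicate <- duplicated(tx[, c("start", "end", "strand")])
tx
```

Transcript "a" reproduces the first range above, and "b" is flagged as a duplicate because its window is identical even though it is a different transcript.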
Then, for each promoter, I need the corresponding gene id/symbol.
promoter = unique(promoter)
gene_id_promoter = select(mm9, keys=as.character(promoter$tx_id), columns = c("TXNAME","GENEID"), keytype = "TXID")
> head(gene_id_promoter)
  TXID GENEID     TXNAME
1    1  18777 uc007afg.1
2    3  21399 uc007afi.2
3    5  21399 uc011whu.1
4    6 108664 uc007afm.1
5    8  18387 uc007afo.1
6   10  18387 uc007afq.1
Different transcripts of a gene have the same gene id. But how is it possible that one gene has two promoters? Here, two promoters (uc007afi.2, uc011whu.1) correspond to one gene id (21399) and to two different transcripts of that gene. So, I took another look at my ranges.
 chr1 [4845775, 4847974] + | 3 uc007afi.2
 chr1 [4846409, 4848608] + | 5 uc011whu.1
The range of uc007afi.2 overlaps that of uc011whu.1 (the two windows are offset by 634 bp). How can this be explained? I have two promoters that correspond to one gene and two transcripts, yet their ranges overlap. Is this because a promoter region has no exact definition? Which region should I use as the promoter for gene 21399?
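To make the two options concrete, here is a minimal sketch using GenomicRanges on just the two ranges listed above. It assumes the default promoters() window (2000 bp upstream) when recovering the start sites, and it is only one way to frame the choice: keep one promoter per transcript, or merge the overlapping windows into a single region for the gene with reduce().

```r
library(GenomicRanges)

# The two promoter ranges reported for gene 21399 (copied from the output above).
prom <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(4845775, 4846409), end = c(4847974, 4848608)),
  strand   = "+",
  tx_name  = c("uc007afi.2", "uc011whu.1")
)

# Recover each transcript start site, assuming upstream = 2000 on the + strand.
tss <- start(prom) + 2000

# Option A: keep `prom` as is, one window per transcript (alternative start sites).
# Option B: merge overlapping windows into a single region for the gene.
merged <- reduce(prom)
merged
```

Under these assumptions the two start sites differ, which would be consistent with the gene having two alternative transcription start sites, each with its own promoter window.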