How to deal with multiple Transcription Start Site in a gene?
1
0
Entering edit mode
5.8 years ago

In order to measure the conservation (Phastcons score) of each gene, i am trying to look at -2000:2000 base pairs below and above from transcription start site (TSS), which will represent my promoter. But the problem is, there are more than one TSS for each gene. What should i do? which one should i choose?

Promoter Phastcons Score TSS • 2.6k views
1
Entering edit mode
5.8 years ago

If you want to avoid confusion you can simply use the gene coordinates from the Known Gene UCSC track:

> library(Homo.sapiens)
> library(phastCons100way.UCSC.hg19)
> mypromoters = promoters(genes(TxDb.Hsapiens.UCSC.hg19.knownGene), 2000, -2000)
> mypromoters$cons100way = scores(phastCons100way.UCSC.hg19, mypromoters) # takes a while > mypromoters GRanges object with 23056 ranges and 2 metadata columns: seqnames ranges strand | gene_id cons100way <Rle> <IRanges> <Rle> | <character> <numeric> 1 chr19 [ 58874015, 58876214] - | 1 0.10872727 10 chr8 [ 18246755, 18248954] + | 10 0.36422727 100 chr20 [ 43280177, 43282376] - | 100 0.04400000 1000 chr18 [ 25757246, 25759445] - | 1000 0.01645455 10000 chr1 [244006687, 244008886] - | 10000 0.03836364 ... ... ... ... ... ... ... 9991 chr9 [115095745, 115097944] - | 9991 0.25700000 9992 chr21 [ 35734323, 35736522] + | 9992 0.29409091 9993 chr22 [ 19109768, 19111967] - | 9993 0.01236364 9994 chr6 [ 90537619, 90539818] + | 9994 0.07536364 9997 chr22 [ 50964706, 50966905] - | 9997 0.15281818 ------- seqinfo: 93 sequences (1 circular) from hg19 genome  ADD COMMENT 0 Entering edit mode Hi again The code below is not working for me, mypromoters$cons100way = scores(phastCons100way.UCSC.hg19, mypromoters) what should i do? another question is, instead of gene_id i put tx_name column. but can i convert these UCSC ids to ensmble gene ID. actually i convert it by getBM package of R, but it is not seems good.what do you think?