I have mouse RNA-seq and ATAC-seq data, and I am trying to correlate changes in gene expression with changes in promoter accessibility using HOMER. To do this I need a BED file of TSSs around which I will analyze accessibility and then plot it with deepTools.
I am a bit confused about the number of TSSs compared to the number of genes. I have a list of RefSeq TSSs which I downloaded from https://ccg.epfl.ch/mga/mm10/refseq/refseq.html and another which is included with HOMER. Both files contain around 23,000 TSSs, whereas my RNA-seq count file has ~46,000 Ensembl IDs. I am not sure how to reconcile this difference.
If I select only the TSSs that are present in both the RNA-seq count file and the RefSeq TSS file, I will not be analyzing accessibility for almost half of the entries in the RNA-seq data. Alternatively, if I download TSSs for all Ensembl IDs from BioMart, I get almost 100k entries, because each gene can have multiple transcripts. What explains the large difference between the number of RefSeq TSSs and the number of Ensembl IDs, and what is the best way to do this analysis? I would appreciate any tips.
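
To make the BioMart option concrete, here is a minimal sketch of one way the per-transcript table could be collapsed to a single TSS per gene. It rests on assumptions I have not validated: that each row carries a gene ID, transcript ID, chromosome, 1-based TSS position, and strand coded as 1 or -1, and that keeping the most upstream (5'-most) TSS per gene is an acceptable rule. Choosing the canonical or most highly expressed transcript would be an equally plausible rule. The IDs and coordinates below are made up for illustration, and chromosome names may need adjusting to match the naming used in the ATAC-seq files.

```python
from collections import namedtuple

# One row per transcript; the field layout is an assumption about the export.
Transcript = namedtuple("Transcript", "gene_id transcript_id chrom tss strand")


def most_upstream_tss(transcripts):
    """Keep one transcript per gene: the one with the 5'-most TSS.

    On the + strand (1) that is the smallest coordinate,
    on the - strand (-1) it is the largest coordinate.
    """
    best = {}
    for t in transcripts:
        cur = best.get(t.gene_id)
        if cur is None:
            best[t.gene_id] = t
        elif t.strand == 1 and t.tss < cur.tss:
            best[t.gene_id] = t
        elif t.strand == -1 and t.tss > cur.tss:
            best[t.gene_id] = t
    return best


def to_bed_lines(best):
    """Format the chosen TSSs as 6-column BED lines (0-based, half-open)."""
    lines = []
    ordered = sorted(best.items(), key=lambda kv: (kv[1].chrom, kv[1].tss))
    for gene_id, t in ordered:
        strand = "+" if t.strand == 1 else "-"
        # A 1-based position p becomes the BED interval [p - 1, p).
        lines.append(f"{t.chrom}\t{t.tss - 1}\t{t.tss}\t{gene_id}\t0\t{strand}")
    return lines


# Hypothetical records, for illustration only.
example = [
    Transcript("geneA", "txA1", "chr1", 1000, 1),
    Transcript("geneA", "txA2", "chr1", 1200, 1),
    Transcript("geneB", "txB1", "chr2", 5000, -1),
    Transcript("geneB", "txB2", "chr2", 5300, -1),
]

bed_lines = to_bed_lines(most_upstream_tss(example))
print("\n".join(bed_lines))
```

With this rule the four example transcripts reduce to two BED entries, one per gene, so the gene IDs in the name column could be matched directly against the IDs in the count file.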