Question

RefSeq TSS vs Ensembl TSS

1

4.6 years ago

srhic ▴ 60

Hello,

I have some mouse RNA-seq and ATAC-seq data, and I am trying to correlate changes in gene expression with changes in promoter accessibility using HOMER. To do this I need a BED file of TSSs around which I will analyze accessibility and then plot it with deepTools.

I am a bit confused about the number of TSSs compared to the number of genes. I have a list of RefSeq TSSs which I downloaded from https://ccg.epfl.ch/mga/mm10/refseq/refseq.html and another which is included with HOMER. Both files have around 23 thousand TSSs. My RNA-seq count file has ~46,000 Ensembl IDs, and I am not sure how to reconcile this difference.

If I only select TSSs that are present in both the RNA-seq count file and the RefSeq TSS file, I will not be analyzing accessibility for almost half of the entries in the RNA-seq data. Alternatively, if I download TSSs for all Ensembl IDs from BioMart, I get almost 100k entries, as each gene can have multiple transcripts. I do not understand this huge difference between the number of RefSeq TSSs and the number of Ensembl IDs, or what the best way of doing this analysis is. I would appreciate any tips.

Thanks

HOMER TSS ENSEMBL RefSeq • 3.0k views

ADD COMMENT • link updated 2.8 years ago by Ram 43k • written 4.6 years ago by srhic ▴ 60

score 2 · Answer 1 · 2019-08-27

2

4.6 years ago

ATpoint 81k

GENCODE/Ensembl contains far more transcripts than RefSeq, such as additional isoforms and non-coding transcripts; see e.g. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2

That means that for a given gene you get multiple TSSs in GENCODE but often only one or a few in RefSeq.

[Image in original post: example gene]

To make your life easier you can limit your analysis to the principal isoform of every protein-coding gene, e.g. as listed in the APPRIS database. Still, there is always a bit of uncertainty, as genes sometimes have multiple principal isoforms. I would filter for what APPRIS calls PRINCIPAL:1, then use something like -500 to +50 bp around the TSS (mind the strand!). If there are multiple principal isoforms, simply use all of them and merge the regions in case of overlaps. A few genes will have principal isoform TSSs that are quite far away from each other due to very long or very short transcripts of the same gene. It is up to you what you do with them: either use all of them or discard them. They will probably be a small minority of genes and probably will not make a big difference. It is more important, imho, to capture the bulk of genes properly than to spend too much time dealing with these outliers.
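The windowing and merging described above could be sketched roughly as follows. This is only a minimal illustration in plain Python, not part of HOMER or APPRIS: the function names `promoter_window` and `merge_windows` are hypothetical, the TSS is assumed to be given as a 1-based position, and the output is assumed to be 0-based, half-open BED-style coordinates.

```python
def promoter_window(chrom, tss, strand, upstream=500, downstream=50):
    """Return a BED-style (chrom, start, end, strand) window around a 1-based TSS.

    Hypothetical helper: 'upstream' and 'downstream' are relative to the
    direction of transcription, so they swap sides on the minus strand.
    """
    pos = tss - 1  # 0-based coordinate of the TSS base
    if strand == "+":
        start, end = pos - upstream, pos + downstream + 1
    elif strand == "-":
        start, end = pos - downstream, pos + upstream + 1
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    return chrom, max(start, 0), end, strand  # clamp at the chromosome start


def merge_windows(windows):
    """Merge windows on the same chromosome and strand that overlap or touch."""
    merged = []
    ordered = sorted(windows, key=lambda w: (w[0], w[3], w[1], w[2]))
    for chrom, start, end, strand in ordered:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and last[3] == strand and start <= last[2]:
            merged[-1] = (chrom, last[1], max(last[2], end), strand)
        else:
            merged.append((chrom, start, end, strand))
    return merged


# Toy example: two nearby principal-isoform TSSs and one distant TSS.
windows = [promoter_window("chr1", t, "+") for t in (1000, 1100, 5501)]
merged = merge_windows(windows)
```

In practice a tool such as bedtools can do the same job; the point here is only that the upstream side flips on the minus strand, and that merging should happen per gene (or at least per chromosome and strand).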

ADD COMMENT • link 4.6 years ago by ATpoint 81k

0

Thanks, that is very helpful and I will try it.

The problem I am confused about is that my RNA-seq count file has ~45000 Ensembl IDs, and if I filter it in any way it will bias the differential expression analysis. Just to elaborate a bit more, I am interested in finding out whether the differentially expressed genes from my RNA-seq data also show changes at the chromatin level or whether they are being regulated by a chromatin-independent mechanism. Standard DESeq2 analysis gives me around 15000 Ensembl IDs that show differential expression with p value <0.05. Analysis with HOMER with the default RefSeq TSS file, to identify the TSSs that show significant changes at the promoter (-500 bp to +100 bp), gives me only around 400 sites. My preliminary conclusion is that the majority of DEGs are not being regulated at the chromatin level, but I am concerned by the fact that the HOMER analysis was done on ~23000 RefSeq genes while the DESeq2 analysis was done on ~45000 Ensembl IDs.
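One way to make the two result sets comparable is to count overlaps only among genes that were tested in both analyses. The sketch below is a rough illustration under several assumptions: all IDs have already been mapped to a single ID system (the mapping itself is not shown), the function names are hypothetical, and the toy IDs are made up.

```python
def strip_version(gene_id):
    """Drop a trailing version suffix, e.g. 'ENSMUSG00000000001.4' -> 'ENSMUSG00000000001'."""
    return gene_id.split(".")[0]


def overlap_in_shared_universe(deg_ids, changed_promoter_ids,
                               rnaseq_tested_ids, atac_tested_ids):
    """Count DEGs and changed promoters among genes tested in BOTH analyses."""
    universe = ({strip_version(g) for g in rnaseq_tested_ids}
                & {strip_version(g) for g in atac_tested_ids})
    degs = {strip_version(g) for g in deg_ids} & universe
    changed = {strip_version(g) for g in changed_promoter_ids} & universe
    return {
        "universe": len(universe),
        "degs": len(degs),
        "changed_promoters": len(changed),
        "degs_with_changed_promoter": len(degs & changed),
    }


# Toy example with made-up IDs.
summary = overlap_in_shared_universe(
    deg_ids=["g1.1", "g2.3", "g3.1"],
    changed_promoter_ids=["g3", "g5"],
    rnaseq_tested_ids=["g1.1", "g2.3", "g3.1", "g4.2"],
    atac_tested_ids=["g2", "g3", "g4", "g5"],
)
```

Genes outside the shared universe (here `g1` and `g5`) are simply left out of the comparison rather than counted as "not regulated at the chromatin level".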

ADD REPLY • link 4.6 years ago by srhic ▴ 60

1

I have little insight into the details of your analysis, but I strongly suggest you stay consistent with the reference annotation you use. Don't mix them, as this only adds uncertainty. HOMER typically (at least its motif search functions) can accept custom references. Maybe try to give it the Ensembl annotations.
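As a rough sketch of what building a TSS list from the Ensembl annotation might look like, the snippet below takes `transcript` lines in GTF format and emits one BED6-style record per unique TSS per gene. It assumes the standard nine-column GTF layout with a `gene_id "..."` attribute; the function name `tss_bed_from_gtf` is hypothetical, and how the resulting file is passed to HOMER is not shown here.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def tss_bed_from_gtf(lines):
    """Return unique (chrom, start, end, gene_id, score, strand) TSS records.

    The TSS is the transcript start on the plus strand and the transcript
    end on the minus strand; GTF is 1-based, the output is 0-based half-open.
    """
    seen = set()
    records = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript features define a TSS here
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        match = GENE_ID.search(fields[8])
        if match is None or strand not in ("+", "-"):
            continue
        tss = start if strand == "+" else end
        key = (match.group(1), chrom, tss, strand)
        if key in seen:
            continue  # several transcripts of a gene can share one TSS
        seen.add(key)
        records.append((chrom, tss - 1, tss, match.group(1), 0, strand))
    return records


# Toy example with made-up IDs: three transcripts of G1 share two TSSs.
example = [
    'chr1\tsrc\ttranscript\t1000\t2000\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tsrc\ttranscript\t1000\t1800\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr1\tsrc\ttranscript\t1200\t2000\t.\t+\t.\tgene_id "G1"; transcript_id "T3";',
    'chr2\tsrc\ttranscript\t500\t900\t.\t-\t.\tgene_id "G2"; transcript_id "T4";',
    'chr2\tsrc\texon\t500\t900\t.\t-\t.\tgene_id "G2"; transcript_id "T4";',
]
tss_records = tss_bed_from_gtf(example)
```

Collapsing identical TSSs per gene this way already reduces the number of entries compared with one row per transcript, while keeping the same gene IDs as the RNA-seq count file.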

ADD REPLY • link 4.6 years ago by ATpoint 81k

1

Thanks. I was avoiding the Ensembl annotations because they give a huge number of TSSs, but as you said, just downloading all Ensembl TSSs and supplying them to HOMER may be the best way to do it.

ADD REPLY • link 4.6 years ago by srhic ▴ 60