Question: RefSeq TSS vs Ensembl TSS
5 months ago by
srhic0
srhic0 wrote:

Hello,

I have some mouse RNA-seq and ATAC-seq data and I am trying to correlate changes in gene expression with changes in promoter accessibility using HOMER. To do this I need a BED file of TSSs around which I will analyze accessibility and then plot it with deepTools.

I am a bit confused about the number of TSSs compared to the number of genes. I have a list of RefSeq TSSs which I downloaded from https://ccg.epfl.ch/mga/mm10/refseq/refseq.html and another which is included with HOMER. Both of these files have around 23 thousand TSSs. My RNA-seq count file has ~46,000 Ensembl IDs, and I am not sure how to reconcile this difference.

If I only select TSSs that overlap between the RNA-seq count file and the RefSeq TSS file, I will not be analyzing accessibility for almost half of the entries present in the RNA-seq data. Alternatively, if I download TSSs for all Ensembl IDs from BioMart, I get almost 100k entries, as each gene can have multiple transcripts. I do not understand this huge difference between the number of RefSeq TSSs and the number of Ensembl IDs, or what the best way of doing this analysis is. I would appreciate any tips.

Thanks

modified 5 months ago • written 5 months ago by srhic0
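
A small sketch of why the three counts in the question differ: transcript-level exports have one row per transcript, several transcripts of a gene can share a TSS, and a gene can have more than one distinct TSS. The gene and transcript IDs, coordinates, and the row layout below are made up for illustration and are not the actual BioMart export format.

```python
from collections import defaultdict


def distinct_tss_per_gene(transcripts):
    """Collapse transcript-level rows to the distinct TSS positions of each gene.

    Each row is (gene_id, transcript_id, chrom, strand, tss); this layout is
    an assumption made for the example.
    """
    tss_by_gene = defaultdict(set)
    for gene_id, _transcript_id, chrom, strand, tss in transcripts:
        tss_by_gene[gene_id].add((chrom, strand, tss))
    return {gene: sorted(sites) for gene, sites in tss_by_gene.items()}


# Toy rows with hypothetical IDs and coordinates.
rows = [
    ("geneA", "txA1", "chr1", "+", 1000),
    ("geneA", "txA2", "chr1", "+", 1000),  # same TSS as txA1
    ("geneA", "txA3", "chr1", "+", 1450),  # alternative TSS
    ("geneB", "txB1", "chr2", "-", 5000),
]

collapsed = distinct_tss_per_gene(rows)
n_transcripts = len(rows)
n_genes = len(collapsed)
n_distinct_tss = sum(len(sites) for sites in collapsed.values())
print(n_transcripts, n_genes, n_distinct_tss)
```

Here four transcript rows correspond to two genes and three distinct TSSs, so the number of rows depends on whether you count transcripts, TSSs, or genes.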
5 months ago by
ATpoint28k
Germany
ATpoint28k wrote:

GENCODE/Ensembl contains far more transcripts than RefSeq, such as additional isoforms and non-coding transcripts; see e.g. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2

That means that for a given gene you get multiple TSSs in GENCODE but often only one or a few in RefSeq.

To make your life easier you can limit your analysis to the principal isoform of every protein-coding gene, e.g. as listed in the APPRIS database. Still, there is always a bit of uncertainty, as some genes have multiple principal isoforms. I would filter for what APPRIS calls PRINCIPAL:1, then use something like -500 to +50 bp around the TSS (mind the strand!). If there are multiple principal isoforms, simply use all of them and merge the regions in case of overlaps. A few genes will have principal-isoform TSSs that are quite far away from each other, due to very long or very short transcripts of the same gene. It is up to you what you do with them: either use all of them or discard them. They will probably be a small minority of genes, so the choice probably does not make a big difference. It is more important imho to capture the bulk of genes properly than to spend too much time dealing with these outliers.

written 5 months ago by ATpoint28k
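
The windowing and merging suggested in the answer above could be sketched roughly as follows. This is only an illustration: it assumes the input has already been filtered to the principal isoforms, that each TSS is given as a 0-based position, and that the output should be BED-like rows (0-based, half-open). The function names and the toy records are hypothetical.

```python
from collections import defaultdict


def promoter_window(tss, strand, upstream=500, downstream=50):
    """Return a 0-based, half-open (start, end) window around a 0-based TSS.

    "Upstream" means lower coordinates on the + strand and higher
    coordinates on the - strand.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    return max(start, 0), end  # do not run off the start of the chromosome


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def promoter_bed_rows(tss_records, upstream=500, downstream=50):
    """Build merged promoter regions per gene as BED6-like tuples.

    tss_records: iterable of (gene_id, chrom, strand, tss) tuples.
    """
    windows = defaultdict(list)
    for gene_id, chrom, strand, tss in tss_records:
        window = promoter_window(tss, strand, upstream, downstream)
        windows[(gene_id, chrom, strand)].append(window)
    rows = []
    for (gene_id, chrom, strand), gene_windows in windows.items():
        for start, end in merge_intervals(gene_windows):
            rows.append((chrom, start, end, gene_id, 0, strand))
    return sorted(rows)


# Toy records with made-up gene names and coordinates.
records = [
    ("geneA", "chr1", "+", 1000),
    ("geneA", "chr1", "+", 1200),  # second principal isoform, overlapping window
    ("geneB", "chr2", "-", 1000),
]
for row in promoter_bed_rows(records):
    print("\t".join(str(field) for field in row))
```

Regions are merged per gene only, so promoters of neighbouring genes stay separate even if they overlap. Whether that is desirable depends on the downstream tool.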

Thanks, that is very helpful and I will try it.

What I am confused about is that my RNA-seq count file has ~45,000 Ensembl IDs, and if I filter it in any way it will bias the differential expression analysis. To elaborate a bit more, I am interested in finding out whether the differentially expressed genes from my RNA-seq data also show changes at the chromatin level or whether they are being regulated by a chromatin-independent mechanism. A standard DESeq2 analysis gives me around 15,000 Ensembl IDs that show differential expression with p value < 0.05. Analysis with HOMER, using the default RefSeq TSS file to identify the TSSs which show significant changes at the promoter (-500 bp to +100 bp), gives me only around 400 sites. My preliminary conclusion is that the majority of DEGs are not being regulated at the chromatin level, but I am concerned by the fact that the HOMER analysis was done on ~23,000 RefSeq genes while the DESeq2 analysis was done on ~45,000 Ensembl IDs.

written 5 months ago by srhic0
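
One way to make the comparison in the comment above less sensitive to the annotation mismatch is to restrict both result sets to the genes that were actually tested in both analyses before computing the overlap. The sketch below assumes that all identifiers have already been mapped to one common ID type (that mapping is not shown), and the toy IDs are made up.

```python
def overlap_in_shared_universe(deg_ids, rna_tested_ids, atac_sig_ids, atac_tested_ids):
    """Compare DEGs and promoters with changed accessibility within the
    set of genes tested in both analyses.

    All four arguments are iterables of gene identifiers of the same type.
    """
    universe = set(rna_tested_ids) & set(atac_tested_ids)
    deg = set(deg_ids) & universe
    atac_changed = set(atac_sig_ids) & universe
    both = deg & atac_changed
    return {
        "universe": len(universe),
        "deg": len(deg),
        "atac_changed": len(atac_changed),
        "both": len(both),
        "frac_deg_with_atac_change": len(both) / len(deg) if deg else float("nan"),
    }


# Toy example with hypothetical gene IDs.
rna_tested = ["g1", "g2", "g3", "g4", "g5", "g6"]
atac_tested = ["g1", "g2", "g3", "g4"]
degs = ["g1", "g2", "g5"]          # g5 has no TSS in the ATAC analysis
atac_significant = ["g2", "g3"]

summary = overlap_in_shared_universe(degs, rna_tested, atac_significant, atac_tested)
print(summary)
```

In the toy data, g5 is differentially expressed but was never tested at the chromatin level, so it is excluded rather than counted as "no change in accessibility".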

I have little insight into the details of your analysis, but I strongly suggest you stay consistent with the reference annotation you use. Don't mix them; this only adds uncertainty. HOMER typically (at least its motif search functions) can accept custom references. Maybe try giving it the Ensembl annotations.

written 5 months ago by ATpoint28k

Thanks. I was avoiding Ensembl annotations because they give a huge number of TSSs, but as you said, just downloading all Ensembl TSSs and supplying them to HOMER may be the best way to do it.

written 5 months ago by srhic0
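
For the approach settled on in this thread, TSS positions can be derived from the same Ensembl/GENCODE GTF used for the RNA-seq quantification, so that both analyses share one annotation. A minimal sketch follows, assuming a standard 9-column GTF with `transcript` feature lines and a `gene_id` attribute. The toy lines and IDs are made up, and how the resulting BED rows are passed to HOMER is not shown here (check its documentation for the accepted input formats).

```python
import re

# Matches quoted GTF attributes such as: gene_id "G1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gtf_to_tss_bed(gtf_lines):
    """Return one BED6-like row per distinct TSS, from GTF transcript lines.

    GTF coordinates are 1-based and inclusive; the rows returned here are
    0-based and half-open, one base wide, located at the TSS.
    """
    seen = set()
    rows = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header / comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue
        chrom, strand = fields[0], fields[6]
        if strand not in ("+", "-"):
            continue  # a TSS is undefined without a strand
        start, end = int(fields[3]), int(fields[4])
        attrs = dict(ATTR_RE.findall(fields[8]))
        tss = start if strand == "+" else end  # 1-based TSS position
        row = (chrom, tss - 1, tss, attrs.get("gene_id", "."), 0, strand)
        if row not in seen:  # transcripts sharing a TSS yield a single row
            seen.add(row)
            rows.append(row)
    return rows


# Toy GTF content with hypothetical IDs.
toy_gtf = [
    "#!genome-build toy\n",
    "\t".join(["chr1", "toy", "gene", "1001", "2000", ".", "+", ".", 'gene_id "G1";']) + "\n",
    "\t".join(["chr1", "toy", "transcript", "1001", "2000", ".", "+", ".",
               'gene_id "G1"; transcript_id "T1";']) + "\n",
    "\t".join(["chr1", "toy", "transcript", "1001", "1800", ".", "+", ".",
               'gene_id "G1"; transcript_id "T2";']) + "\n",
    "\t".join(["chr2", "toy", "transcript", "501", "900", ".", "-", ".",
               'gene_id "G2"; transcript_id "T3";']) + "\n",
]

tss_rows = gtf_to_tss_bed(toy_gtf)
for row in tss_rows:
    print("\t".join(str(field) for field in row))
```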