Question

How to get genes associated with open chromatin regions?

0

11 months ago

Chris ▴ 280

Hi all, I ran an ATAC-seq pipeline such as nf-core and got output files such as BAM, bigWig, and broadPeak. Could you suggest a way to get the genes associated with open chromatin regions? I have used ChIPpeakAnno, but only for DiffBind results. Thank you so much!

ATAC-seq • 1.6k views

ADD COMMENT • link 10 months ago by Chris ▴ 280

score 3 · Accepted Answer · 2023-06-18

3

11 months ago

ATpoint 82k

ATAC-seq annotation to gene names

It's one of the most unsatisfying problems in bioinformatics since, imo, none of the existing solutions is good; they are only crude approximations of reality. The false assignment rate is probably gigantic.

There is plenty of literature on this problem discussing several approaches (pattern-based, correlation-based, etc.), but I cannot say that any of these has crystallized as a gold standard. It usually comes down to what is written in the linked answer. These distal-to-promoter associations are cell-type-specific and may change upon perturbation. It's really a tricky problem.

ADD COMMENT • link 11 months ago by ATpoint 82k

0

Thank you so much for your help! I am quite surprised that 2.4 years later we still don't have a better solution, as you said. So if we are not sure which genes are associated with open chromatin regions, what is the most helpful information we can get from an ATAC-seq experiment if I have only one condition and don't perform a differential accessibility analysis?

ADD REPLY • link 11 months ago by Chris ▴ 280

1

You can approximate which regulatory elements control a gene. With only one condition you don't know how accessibility changes, so you would naively need to assign all called peaks within a given window to that gene, while using differential regions narrows it down by quite a lot. You can scan differential regions for motifs, thereby approximating the transcription factors involved. You can identify whether the treatment causes specific changes, in terms of mainly restricting or mainly promoting accessibility. I mean... you should know why you did the experiment, shouldn't you?
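
A minimal sketch of the naive window-based assignment mentioned above, in plain Python. The gene names, TSS positions, peak coordinates, and the 50 kb window are all made-up assumptions for illustration, not values from this thread, and the function name is hypothetical. As discussed above, this is a crude approximation with a probably high false assignment rate.

```python
from collections import defaultdict


def assign_peaks_to_genes(peaks, tss, window=50_000):
    """Assign each peak to every gene whose TSS lies within `window` bp
    of the peak midpoint (hypothetical helper, naive approach)."""
    # Group TSS positions by chromosome so we only compare within a chromosome.
    tss_by_chrom = defaultdict(list)
    for gene, (chrom, pos) in tss.items():
        tss_by_chrom[chrom].append((pos, gene))

    assignments = defaultdict(list)
    for chrom, start, end in peaks:
        midpoint = (start + end) // 2
        for pos, gene in tss_by_chrom.get(chrom, []):
            if abs(midpoint - pos) <= window:
                assignments[gene].append((chrom, start, end))
    return dict(assignments)


# Toy input: coordinates are invented for the example.
tss = {"GeneA": ("chr1", 100_000), "GeneB": ("chr1", 400_000)}
peaks = [
    ("chr1", 99_500, 100_500),    # overlaps the GeneA TSS
    ("chr1", 140_000, 141_000),   # about 40 kb from the GeneA TSS
    ("chr1", 250_000, 251_000),   # outside the window of both genes
    ("chr2", 100_000, 100_400),   # no annotated gene on this chromosome
]
result = assign_peaks_to_genes(peaks, tss)
for gene, gene_peaks in result.items():
    print(gene, len(gene_peaks))
```

Note that a peak can be assigned to several genes at once with this approach, and a gene with no peak in its window does not appear in the result at all.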

ADD REPLY • link 11 months ago by ATpoint 82k

0

Yes, the point of the experiment is to figure out how chromatin from a disease sample has changed in comparison with a control sample. I used ChIPpeakAnno to get the genes for the peaks that DiffBind reported as different in chromatin accessibility. But DiffBind gave me around 100k peaks and 20k genes after annotating, so it seems I haven't gotten much useful information yet. I hope to get something like a list of genes that are closed in the control but more open in the diseased sample.

ADD REPLY • link 11 months ago by Chris ▴ 280

1

You might want to review your DiffBind analysis. 100k differential peaks is unlikely to be meaningful. Many ATAC-seq experiments do not even yield 100k peaks in total.

ADD REPLY • link 11 months ago by ATpoint 82k

0

Thank you! So 115941 ranges are not correct? [image]

ADD REPLY • link 11 months ago by Chris ▴ 280

1

Read my comment: I said differential. The way you phrased it, it seemed that DiffBind gave you 100k differential peaks. The total number of peaks is not interesting; how many differential peaks (e.g. FDR < 0.05, abs(logFC) > 1) do you have? What does "20k genes after annotating" mean? Which data do you have: is it ATAC-seq and something else?
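
As a rough sketch of the count being asked for here, the snippet below filters a table of per-peak results by the example cutoffs (FDR < 0.05, abs(logFC) > 1) and reports how many peaks survive in each direction. The rows are toy values, and the keys `fdr` and `log2fc` are assumed names; they would need to be mapped to whatever columns the exported results table actually contains. Which sign means "more open" depends on how the contrast was set up.

```python
def count_differential(rows, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Return the rows passing both cutoffs, plus counts by sign of the fold change."""
    kept = [
        row for row in rows
        if row["fdr"] < fdr_cutoff and abs(row["log2fc"]) > lfc_cutoff
    ]
    n_positive = sum(1 for row in kept if row["log2fc"] > 0)
    n_negative = len(kept) - n_positive
    return kept, n_positive, n_negative


# Toy results table: every value here is invented for the example.
rows = [
    {"peak": "peak_1", "fdr": 0.001, "log2fc": 2.3},
    {"peak": "peak_2", "fdr": 0.200, "log2fc": 3.0},   # fails the FDR cutoff
    {"peak": "peak_3", "fdr": 0.010, "log2fc": -1.5},
    {"peak": "peak_4", "fdr": 0.030, "log2fc": 0.4},   # fails the fold-change cutoff
    {"peak": "peak_5", "fdr": 0.040, "log2fc": -0.9},  # fails the fold-change cutoff
]
kept, n_positive, n_negative = count_differential(rows)
print(f"{len(kept)} of {len(rows)} peaks pass; "
      f"{n_positive} with positive and {n_negative} with negative log2FC")
```

Comparing the number of passing peaks with the total number of tested peaks gives a quick sanity check on whether "everything changes".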

ADD REPLY • link 11 months ago by ATpoint 82k

0

When I applied abs(log2FC) > 2.5, I got around 115941 peaks with FDR < 0.05, and with abs(logFC) > 1 I have a lot. [image] I tried to annotate each peak with a gene by adding a symbol column, and I got around 20k unique gene names.

ADD REPLY • link 11 months ago by Chris ▴ 280

1

I have my doubts that this is meaningful. It seems that everything changes, which makes no intuitive sense, but without having the data I cannot really say what is wrong. I advise, if at all possible, collaborating with someone experienced locally (or having your PI find a good collaborator) to look at this analysis and figure out what is going wrong. Continuing with this excessive number of "DE regions" is unlikely, imo, to yield anything substantial. You need to narrow down the DE regions, for example by doing more prefiltering and prioritizing large FCs, e.g. with lfcShrink in DESeq2 or glmTreat in edgeR. If this is all new to you, I again recommend finding a collaborator to make sure you don't waste many weeks on results from a potentially flawed analysis. Hope that helps.
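
To illustrate what "more prefiltering" can look like, here is a minimal sketch of a generic count-based filter applied before any differential testing: a peak is kept only if it has at least `min_count` reads in at least `min_samples` samples. The function name, both thresholds, and the counts are arbitrary assumptions for the example. This is not what lfcShrink or glmTreat do; those address the fold-change side within DESeq2 and edgeR respectively.

```python
def prefilter_peaks(counts, min_count=10, min_samples=2):
    """Keep peaks with at least `min_count` reads in at least `min_samples` samples."""
    return {
        peak: sample_counts
        for peak, sample_counts in counts.items()
        if sum(count >= min_count for count in sample_counts) >= min_samples
    }


# Toy count matrix: peak -> read counts per sample (invented values).
counts = {
    "peak_1": [50, 62, 8, 5],     # well covered in two samples: kept
    "peak_2": [3, 1, 4, 2],       # low everywhere: dropped
    "peak_3": [12, 9, 7, 6],      # passes in only one sample: dropped
    "peak_4": [30, 28, 33, 41],   # well covered in all samples: kept
}
filtered = prefilter_peaks(counts)
print(f"kept {len(filtered)} of {len(counts)} peaks: {sorted(filtered)}")
```

Setting `min_samples` to the size of the smallest group is one common choice, so that a peak present in only one condition is not discarded.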

ADD REPLY • link 10 months ago by ATpoint 82k

0

I have been looking for a mentor for more than a year, but no luck yet. I reached out to everyone I know locally, but helping someone is not an obligation, so even though I got a few replies, I only learned a little more and could not completely solve the question. I can't agree more that this has been very ineffective. For some reason, my PI doesn't want a collaborator anymore. I used DiffBind, which I remember was recommended to me, and its code is quite straightforward, so I guess any errors come from the sample sheet or the nf-core output. [image] Another bioinformatician, who works at a biotech company, analyzed the same data and also got around 100k peaks.
Does this sample sheet look correct?

ADD REPLY • link 10 months ago by Chris ▴ 280