Hello,
I have two cell lines (clones derived from a patient with a mutation; the mutation was rescued in one cell line but not the other) with ChIP, ATAC, and RNA-Seq in each -- and I want to compare them. The differential analysis yielded lots of pathways like "Chr9p13" and "Amplicon Chr6q22 in Breast Cancer", which led us to consider that there might be CNVs between these cell lines. We did a high-coverage WGS run to check this -- and we did find large-scale copy number changes between them.
Now that I know what the copy number changes are, I am wondering if anyone is aware of tools/approaches for using that information when re-analyzing our ChIP, ATAC, and RNA-Seq data?
This was the approach I was considering at first:
(1a) Normalize the ChIP and ATAC-Seq by using the WGS as the Input for peak calling, OR
(1b) Take the results from something like DiffBind and weight the fold change and p-value for differential peaks in CNV regions by simply dividing the -log10(pVal) and the fold change by the copy number fold change at that location.
(2) Maybe don't normalize the RNA-Seq by CNV, because the CNVs lead to biologically meaningful transcriptomic differences.
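To make the fold-change part of the DiffBind option concrete, here is a rough Python sketch. It relies on the fact that dividing a linear fold change by the copy number ratio is the same as subtracting log2(copy ratio) from the log2 fold change. The function name and the example numbers are made up for illustration, and this is not a DiffBind feature. The p-value is left untouched here.

```python
import math

def cnv_adjusted_log2fc(log2_fc, copies_a, copies_b):
    """Subtract the log2 copy-number ratio (A over B) from a peak's
    log2 fold change (A over B).

    Hypothetical helper: assumes signal scales linearly with copy number.
    """
    if copies_a <= 0 or copies_b <= 0:
        # Homozygous deletions have no defined ratio; handle them separately.
        raise ValueError("copy numbers must be positive")
    return log2_fc - math.log2(copies_a / copies_b)

# Made-up peak: 2.8-fold more signal in clone A, sitting in a region
# with 3 copies in clone A and 2 copies in clone B.
adjusted = cnv_adjusted_log2fc(math.log2(2.8), copies_a=3, copies_b=2)
print(round(adjusted, 3))
```

Whether linear scaling with copy number is a fair assumption for a given mark or assay is something I would check on the data before trusting the adjusted values.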
Thank you for your time, Henry Miller
Maybe you can get some inspiration from the Bioc thread where Aaron Lun (author of csaw and other packages) comments on a similar scenario. In that thread I asked how to handle trisomy of an entire chromosome during normalization. https://support.bioconductor.org/p/127168/ The offset matrix solution discussed there might help if your CNVs are large enough that you have enough counts for them. With it you could specifically remove the effect of the CNVs on the counts of your ATAC/ChIP-seq experiment. Aaron is typically very responsive if you invest some effort into your questions, so if you have a specific strategy in mind and want expert opinions you might post it over at Bioc and hope he has a look. He is (from what I know) not active here at Biostars though. Still, these CNVs could of course be biologically meaningful, as you say, and my comment only addresses how to reduce the effect of CNVs if you regard them as a source of bias.
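For illustration, here is a rough numpy sketch of what a copy-number-aware offset matrix could look like. This is not csaw or edgeR code and not necessarily what is proposed in the linked thread. It assumes a log-link count model where the offset is added to the linear predictor, and that expected counts scale linearly with copy number relative to a diploid baseline. The function name and the `ploidy` parameter are hypothetical.

```python
import numpy as np

def cnv_offsets(lib_sizes, copy_number, ploidy=2.0):
    """Build a regions x samples matrix of natural-log offsets.

    lib_sizes:   shape (n_samples,), total counts per sample
    copy_number: shape (n_regions, n_samples), copy number per region/sample
    ploidy:      baseline copy number (assumed diploid here)
    """
    lib_sizes = np.asarray(lib_sizes, dtype=float)
    copy_number = np.asarray(copy_number, dtype=float)
    if np.any(copy_number <= 0):
        # log of zero copies is undefined; such regions need special handling
        raise ValueError("copy numbers must be positive")
    # Per-sample log library size, shifted per region by the log copy ratio
    return np.log(lib_sizes)[np.newaxis, :] + np.log(copy_number / ploidy)

# Made-up example: two samples, two regions; region 2 is gained in sample 1
offsets = cnv_offsets([1e6, 1e6], [[2, 2], [3, 2]])
print(offsets)
```

With equal library sizes, the two samples differ only in the gained region, and there by log(1.5), which is the part of the count difference the model would then attribute to copy number rather than to binding or accessibility.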
Thank you for the response! I really like the offset matrix idea -- in my case it would require normalizing dozens of different regions separately, but it sounds like it still wouldn't violate the assumptions of csaw. I'm going to try that out, report the results here, and probably post at Bioc as well.
It might be worth investing some time to see if the CNV regions even contain candidate peaks that have a fair chance of being differential. If these CNVs only contain something like 10 peaks, or if they contain many peaks but with really low counts, it might not even be worth the hassle. What I want to say is: before investing a lot of effort, try to justify that it is worth the time. I have often overcomplicated things in the past.
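A quick sanity check along those lines could look like the Python sketch below. The coordinates, counts, and the `min_count` cutoff are all made up, and the function name is hypothetical. In practice you would read the peaks and CNV segments from your own files.

```python
def peaks_in_cnvs(peaks, cnvs):
    """Return the peaks that overlap any CNV segment.

    peaks: list of (chrom, start, end, read_count)
    cnvs:  list of (chrom, start, end)
    Intervals are treated as half-open, BED-style.
    """
    hits = []
    for chrom, start, end, count in peaks:
        overlaps = any(
            chrom == c_chrom and start < c_end and c_start < end
            for c_chrom, c_start, c_end in cnvs
        )
        if overlaps:
            hits.append((chrom, start, end, count))
    return hits

# Made-up example data
cnvs = [("chr9", 30_000_000, 40_000_000), ("chr6", 120_000_000, 130_000_000)]
peaks = [
    ("chr9", 35_000_100, 35_000_600, 12),
    ("chr9", 50_000_000, 50_000_500, 240),
    ("chr6", 125_500_000, 125_500_400, 310),
    ("chr1", 1_000_000, 1_000_500, 95),
]
min_count = 50  # arbitrary cutoff for "enough reads to test"

hits = peaks_in_cnvs(peaks, cnvs)
well_covered = [p for p in hits if p[3] >= min_count]
print(f"{len(hits)} of {len(peaks)} peaks overlap a CNV; "
      f"{len(well_covered)} have >= {min_count} reads")
```

If only a handful of well-covered peaks fall inside CNVs, simply flagging or dropping them is probably easier than building a custom normalization.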
I expect that this is a significant confounding factor in ChIP-seq analyses for cancer samples, particularly if you're doing anything with super enhancers. It's not published, but I can say that a not insignificant portion of "cancer" super enhancers are due to amplifications, which is somehow glossed over in most publications. Good on ya for thinking about this, though as mentioned, it is a challenge to deal with using standard tools.
Thank you so much for your perspective on this -- after digging further into the genes with genuine CNVs, the list was actually quite small (around 500 or so). When I simply removed these genes from all my analyses, the results really didn't change much. This saved me a ton of time -- thanks again!