Correlating ChIP-Seq peaks with isoform level expression
1
2
Entering edit mode
7.8 years ago

I have H3K4me3 and H3K27me3 datasets (replicates) from mouse ESCs. I generated a list of

1. active promoters (K4me3+ K27me3-)
2. bivalent promoters (K4me3+ K27me3+)
3. nonActive promoters (K4me3- K27me3-)

I am working with protein coding Ensembl annotations, where unique TSS are included and I made a promoter set of it. I also have expression profiling from mouse ESCs and appended the isoform level expression values (using Tuxedo-Cufflinks/Cuffdiff) to the above promoter sets. What I see is a bit of discrepancy and would like your inputs.

64% of my promoters in the active list (according to histone marks) have significant H3K4me3 peak on them but are not expressed according to expression dataset. So, my questions are:

1. How Cufflinks/Cuffdiff calculates and how well it performs at isoform level calculations (considering lot of overlapping transcripts with nearby TSS from same or different genes). For example, 7 protein coding transcripts of gene Lypla1, 3 are expressed but 4 is not.

chr1    4806788    4808788    Lypla1    ENSMUST00000134384    +    1.7341
chr1    4806823    4808823    Lypla1    ENSMUST00000027036    +    66.4909
chr1    4806830    4808830    Lypla1    ENSMUST00000150971    +    0.6529 #this one
chr1    4806896    4808896    Lypla1    ENSMUST00000119612    +    18.9053
chr1    4806898    4808898    Lypla1    ENSMUST00000137887    +    3.5388
chr1    4806911    4808911    Lypla1    ENSMUST00000115529    +    15.4683
chr1    4807237    4809237    Lypla1    ENSMUST00000131119    +    3.6185


2. According to this screenshot, if loaded the Ensembl transcript id (highlighted in red). This id according to expression data has a value of <1 FPKM (definitely not expressed) but there is a K4me3 sig. peak here. So, again how well are these isoform level calculations. For sure with normal ChIP protocols we cannot differentiate that his K4me3 peak is because of which trancsript but combining with expression profiling, we can depending upon how robust and credible the values are.

RNA-Seq sequencing cufflinks ensembl ChIP-Seq • 2.8k views
0
Entering edit mode

May I know how you generated the gene list with those three categories? I have having similar dataset and want to perform this similar analysis. Thanks in advance!

3
Entering edit mode
7.8 years ago
John 13k

This doesn't seem unusual to me.

From your data, isoform ENSMUST00000150971 was not expressed in your sample because the reads that were sequenced did not favour the presence of that transcript. Reads might have "fit" in ENSMUST00000150971 - they might have even fit nicely - but those reads where much better explained by the high abundance of the transcript ENSMUST00000027036, and so thats why it has the much higher FPKM (or whatever your expression units are).

Furthermore, be warned about making intersections between ChIP data and RNA data. There is a misconception in some areas of Epigenetics that you can compare RNA data from the transcriptome directly to DNA data (ChIP-Seq/WGBS/DNAse) from the Genome like they are the same thing. They are not at all the same thing. They don't even occupy in the same space in the cell. Intersections of one with the other will only cause pain, for exactly the reason shown here. There is no reason to even believe that high ChIP signal == high RNA expression, or anything like that. It does not even stand that ChIP signal should overlap with RNA expression when both are mapped to the genome.

Perhaps you would be better off looking at a total RNA count for the whole gene, then comparing to the ChIP signal at the promotor of that gene?

2
Entering edit mode

Thanks John for your comments. So, it means if the transcripts are sharing a major chunk of DNA b/w themselves (good overlaps), the read abundance calculations (FPKM) could be stochastic.

Gene level expression was my second thought, but while classifying the genome into several promoter classes, I observed there are legit transcripts that are not expressed and quite far (>2KB) from the expressed transcripts of the same gene. This could signify the alternative promoters.

Now, using gene level values, both of these will be marked expressed, while other way around it isn't. Would you still say the same thing, or a condition crafting would be better, like in the above case of overlapping transcripts, take an average but when far apart retain the respective values.

2
Entering edit mode

Yes, kind of - reads aren't assigned (to my knowledge) to isoforms by overlap alone. Overlaps is a genomic thing. For example, a read might fall into both isoform A and B. However, the 99 other reads that seemed to fit into this gene mainly fall into isoform A only. For this reason, we are far more likely to assign the first read to isoform A also, even though it overlapped with B.

The reverse is also true - imagine isoform A is a subset of isoform B. They share the first 3 exons, but not the 4th. If no reads mapping to the 4th are found, all the reads will likely be assigned to isoform A and not B.

The problem of 'where is the promotor' is a annoying one. It is so obvious when you look via the genome browser, and so difficult when you want to do it programatically. One things for sure, a static >2KB away the TSS is not the most intuitive thing to do. In fact, as I mentioned before, anything to do with overlapping RNA signal onto the genome will probably not work.

2
Entering edit mode

Thanks John, these pointers will guide my research wagon. Sometimes we can take hard decisions, like genomic classifications using static distances for a prime dynamic system, but we have to start somewhere unless much better resolution is blessed upon.

0
Entering edit mode

Yes! hahah, i have no solutions, only criticisms - and im trying to do the same thing ;)

0
Entering edit mode

H3K4me3 is associated with transcriptional activation. If there is enrichment for this histone mark, it is expected that the gene should be activated expressed. Am I right?

0
Entering edit mode

Yes, it also depends on what cell type and organism you are talking about but for most systems, H3K4me3 mark directly correlates with gene expression.