Question: Understanding Chipseq Data - Unique Mapped Reads.
1
5.0 years ago by
Affan290
Affan290 wrote:

I am currently a first year grad student learning bioinformatics. I come from a purely mathematical background and I am learning biology and biological terms/functions as I go along (this can be challenging).

In a paper I am reading titled "A Hidden Markov Model to Identify Combinatorial Epigenetic Regulation Patterns for Estrogen Receptor α Target Genes" by Bonneville and Jin, I am a bit confused about what something they said. Let me give you some background: the paper talks about two cell lines (MCF7, MCF7-T) where the MCF7-T is the cell line that is resistant to Tamoxifen (a breast cancer medicine). They are using Hidden Markov Models to identify and analyze epigenetic patterns in these cell lines.

Question:

In section 2.1, they divide each chromosome in the entire genome (hg18) into 1000 bp bins. For each dataset, the unique matched reads were assigned to a bin according to their 5' end. They determined the mark status for each bin for each of the eight datasets as either 0 (non-mark) or 1 if for mark if the number of reads in the bin is sufficient such that P < 10^-4. They tell us to look at the supplementry data which is below:

``````+---------------------------+-----+-------------+-------------+
|           Mark            | ID  |    MCF7     |   MCF7-T    |
+---------------------------+-----+-------------+-------------+
| ERα/E2                    |   1 | 8,852,289   | 4,445,403   |
| ERα/Control               |   2 | 1,731,183   | 5,161,146   |
| PolII/E2                  |   4 | 6,782,648   | 5,712,811   |
| PolII/Control             |   8 | 6,439,650   | 7,434,175   |
| H3K4me2                   |  16 | 2,417,878   | 6,650,379   |
| H3K27me3                  |  32 | 49,900,845  | 53,481,772  |
| H3K9me2                   |  64 | 4,916,106   | 5,228,383   |
| DNA methylation (MBD-seq) | 128 | 27,395,480  | 23,725,331  |
| Total                     |     | 108,436,079 | 111,839,400 |
+---------------------------+-----+-------------+-------------+
``````

My question(s) are related to the bolded parts.

1) What are unique mapped reads? If I understand correctly, Chip-Seq for Erα and ER2 will yeild the DNA sequence of where the transcription factor and the DNA is bound. Then these are split into "reads" which are short (often overlapping) sequences by a NGS machine. Does the table imply that for every region where ERα is bound, you take all the reads and add it together for ALL the regions and together they are ~8 million reads? Can someone help me understand this better?

2) What is the mark status? If I understand correctly, a Mark status is just saying what we Chip-Seq'ed? (note that you may not have to worry about ID as that comes later in the paper).

3) What are the 1000bp bins? If I understand correctly, there are ~8 million reads for Mark 1. Those reads (by sequence alignment) will fall under some bin.

chipseq • 2.5k views
modified 5.0 years ago by Devon Ryan88k • written 5.0 years ago by Affan290
0
5.0 years ago by
Devon Ryan88k
Freiburg, Germany
Devon Ryan88k wrote:
1. There are actually two definitions I've seen floating around. The definition I prefer is that an alignment is considered unique if the next best alignment has a lower score than it (there are various ways of scoring an alignment and the parameters are user tunable, though most people don't touch them). The other definition that you might run into is that an alignment is only considered unique if there are no other "valid alignments" (where an alignment is considered valid if its score is above some largely arbitrary threshold). You actually can't determine which of these the authors used without looking at their data yourself.

2. This is probably easiest explained with an example. Suppose we're looking at H3K4me2 in some bin. H3K4me2 is considered the mark, since it's presence in a given area can mark that area as an active gene promoter (the word "generally" might be an over-statement). The status, then, is whether the p-value <10e-4 in that bin. If it is, then they're just saying that a histone 3 at that position is likely to be dimethylated on lysine 4. The same holds for the other marks.

3. The 8 million reads for mark1 a the total number for the whole genome, not just one bin. A bin is just a stretch of 1000bp on a chromosome. The bins are simply tiled over each chromosome. If it helps, you can think of the bins as being pairwise disjoint sets of 1000 sequential elements. The datasets will be pretty sparse, with most bins being empty and a relative few bins containing the majority of uniquely mapped reads.

BTW, a general critique of assigning 0/1 status to some bin/region is that, at the population level, things are almost never actually that clear and you're often better off just dealing with the data as-is rather than trying to force everything into being nice and clean like that.

1

Thanks, For number 3 just so I understand correctly: They took the genome from hg18 and separated into 1000bp bins. So if each dot represents some bp, we have .....|.......|.......| where the "|" separates them. Now, say for mark1, we look at the Chip-Seq data. Chip-Seq gives us "reads" which are just short fragements. We see where these fragments align with in the bins, and so thats why you say that most bins will be empty, correct?

Exactly. In fact I should add the the goal of Chip-seq is to have relatively few bins contain any reads, since you expect your "mark" to be primarily localized to a few regions (this is much less the case for MBD-seq, but that's a quite different thing all together)

Okay, so suppose I have a bin. |............|. Underneath this bin I will likely try to align my reads, I could potentially have a LOT of them. So how do I tell that these reads express the mark (like in your point #2)?

Just for clarification, the reads are mapped to the whole genome, not a bin at a time (perhaps you knew that, but it's better to clarify it at the outset). The bins are just post-hoc arbitrary partitions of the genome.

It's not that the reads express the mark, but rather that the presence of the mark in an area will increase the number of reads that will map there (if you're unfamiliar with how Chip-seq works, find someone local to draw it out for you on a whiteboard, that's probably the easiest way to understand it). The enrichment of reads mapping to an area is imperfect for a variety of regions. For this reason, one would normally run multiple replicates and negative controls (to gauge the background alignment rate) and use that together to declare the presence/absence of a mark in an area, which seems to not have been done in this paper in all cases (there are at least 2 controls).