ChIP-Seq: When reducing paired end reads to one coordinate, should I use the midpoint?
3
3
Entering edit mode
5.9 years ago
endrebak ▴ 900

I am updating the SICER algorithm ( https://github.com/endrebak/epic ) and have added paired end support. The original SICER does not support paired end reads, but for the single end case it reduces each read to a single coordinate, which is the start of the read plus half the fragment size.

I have no paired end ChIP-Seq data myself to try it on, so I might have implemented a too naive paired end mode. It reduces each paired end read to a point by taking the leftmost and rightmost coordinate of (both the starts and ends of) a pair and finding its midpoint.

So for the (fake) read pair

chr7    20246668    20246669    chr7    20246693    20246694    U0  0   +   +


the coordinate is

20246694 + (20246694-20246668)/2 = 20 246 707


It seems like this might lead to a problem though:

If the two mates are very far apart, the midpoint might be in a bad (ie heterochromatic) or uninteresting (ie blacklisted) region.

What is the best way to solve this?

I can think of two solutions, but do not understand all their up- and downsides:

1) Discard read pairs more than say 100 bp apart 2) Treat each mate in a pair as an individual read

I lean towards the first solution since it seems like the paired end libraries my users use contains much much more data than a typical single end library, and doubling that amount seems like it would be a lot of pain (waiting) for little gain (better results).

ChIP-Seq epic SICER • 1.9k views
5
Entering edit mode
5.9 years ago

What a well-timed and IMHO interesting post :-)

I am working on adding paired end (PE) support to EaSeq, and I am going through exactly the same considerations and do not have any prior experience with PE as we have not used it for ChIP at our institute (yet). EaSeq also works with a single integer coordinate internally to allow many data sets to be loaded in memory simultaneously.

Back to your question: Midpoint sounds best to me, but I am not entirely sure why you would like to filter the reads as you propose. As I understand it then some people actually use PE ChIP-seq to map some reads to areas that are normally hard to map (e.g. heterochromatic regions). So at first thought I would leave those within the data set, but I would be very happy to learn from other opinions on this.

Moreover, for many ChIP-seq experiments you would probably discard a large fraction of the reads by setting the cutoff at 100 bp. I would like to know when/why you would prefer keeping/discarding these reads - in many cases additional (but larger) pairs would give stronger statistics, but reduce resolution. However, the question is whether this is a real-life issue for most applications? From my own experience the app. 20 bp resolution that I typically get from my peakfinding algorithm (see e.g. figure 5C here) is sufficient to get really nice de novo motif discovery with MEME and good central motif distribution with Centrimo - even when working on bulk single read ChIP-seq with an experimentally determined main fragment lenght of >100 bp.

Regarding the blacklisted regions, why not just remove read pairs that overlap with these in the upstream process? Although, I can also see that this might create a strange 'event horizon' in the vicinity of blacklisted regions as an increasing fraction of the pairs are discarded. It might be preferable to keep peak-finding simple and filter peaks from blacklisted regions later on?

For EaSeq, I think that the best solution would be to rewrite the central part of the code to accomodate two (or four) coordinates when that is prefered over an efficient memory footprint. That would 1) eliminate the annoying rounding error for odd sized fragments (and potentially a 1 bp shift when exporting data again) as well as allow for downstream visualization and filtering of reads based on distance between the pairs.

0
Entering edit mode

Good points. Perhaps I should let users choose between several options:

• midpoint
• using one of the mates
• using both

In addition I'll have a cutoff that users can set themselves or use Carlo Yagues suggestion.

Implementing all should be trivial, but most users will probably be confused about what they should choose.

1
Entering edit mode

That is always the question - how much flexibility should you add on the cost of reducing user-friendlyness. What is your target group? What level of experience do you expect them to have? And how often do you believe they will run the software? And also, how central is that part of the workflow for the overall output quality?

An trained group of users using the software frequently would prefer having options, whereas non-expert people using it a few times a year would have a problem remembering what the options mean.

Ideally, you would make some tests to see how important various size cutoffs are for finding validated (or motif containing) peaks in terms of % true positives found and distances from peak to motif. If the cutoff is of minor importance or reduces output quality, then leave it out. But I know that this a lot of work to make for design choices...:-P

For the sake of simplicity, I think that I would leave out the options to use the mate coordinates unless you have experienced any scenarios where the procedure would benefit from that option.

1
Entering edit mode

I agree, IMO only the midpoints are making sense.

4
Entering edit mode
5.9 years ago

Option #2 looks bad to me, exactly for the reasons you pointed out.

Concerning option #1, the cutoff (100 bp in your example, which is very stringent IMO) could be set independantly for each library or each experiment. Indeed, the expected fragment size can vary depending of experimental procedure (sonication, size selection, ...) For instance :

cutoff = median(fragments_size) + 2*sd(fragments_size)


Finally, even if you don't own PE data, you could download some on SRA and test the robustess of your method on various datasets.

2
Entering edit mode
5.9 years ago
endrebak ▴ 900

Here is how someone added pe support to SICER previously:

For paired-end reads, the HiChIP pipeline keeps the first end and extends by the fragment length estimated from mapping positions of the two ends, rather than by the average fragment length of the library. Given the variability of fragment lengths across a complex genome like human genome, the use of actual coordinates of mapped pairs is expected to achieve better resolution in signal visualization. The bed file is then used to generate a bedGraph file by the genomeCoverageBed command from BEDTools.

Since they probably did not change the code, but rather tried to work around it, this might be just the locally optimal solution.