I am updating the SICER algorithm ( https://github.com/endrebak/epic ) and have added paired end support. The original SICER does not support paired end reads, but for the single end case it reduces each read to a single coordinate, which is the start of the read plus half the fragment size.
I have no paired end ChIP-Seq data of my own to test it on, so my paired end mode might be too naive. It reduces each read pair to a single point: it takes the leftmost and rightmost coordinates among the starts and ends of both mates and uses the midpoint between them.
So for the (fake) read pair
chr7 20246668 20246669 chr7 20246693 20246694 U0 0 + +
the coordinate is
20246668 + (20246694 - 20246668)/2 = 20246681
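As a minimal sketch of this reduction (this is not the actual epic code; the function name `pair_midpoint` and the whitespace-separated column layout are assumptions based on the example line above), the midpoint is the leftmost coordinate plus half the distance to the rightmost one:

```python
def pair_midpoint(line):
    """Reduce one read pair to (chromosome, midpoint).

    Assumes the columns are: chrom1 start1 end1 chrom2 start2 end2 ...
    (hypothetical layout, inferred from the example pair above).
    """
    fields = line.split()
    chrom1, start1, end1 = fields[0], int(fields[1]), int(fields[2])
    chrom2, start2, end2 = fields[3], int(fields[4]), int(fields[5])
    if chrom1 != chrom2:
        # A midpoint is undefined when the mates map to different chromosomes.
        raise ValueError("mates map to different chromosomes")
    left = min(start1, end1, start2, end2)
    right = max(start1, end1, start2, end2)
    # Integer division keeps the result a valid genomic coordinate.
    return chrom1, left + (right - left) // 2


print(pair_midpoint("chr7 20246668 20246669 chr7 20246693 20246694 U0 0 + +"))
```

By construction the result always lies between the leftmost and rightmost coordinate of the pair.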
It seems like this might lead to a problem though:
If the two mates are very far apart, the midpoint might fall in a bad (i.e. heterochromatic) or uninteresting (i.e. blacklisted) region that neither mate actually maps to.
What is the best way to solve this?
I can think of two solutions, but do not understand all their up- and downsides:
1) Discard read pairs more than, say, 100 bp apart
2) Treat each mate in a pair as an individual read
I lean towards the first solution, since the paired end libraries my users work with seem to contain far more data than a typical single end library, and doubling that amount seems like a lot of pain (waiting) for little gain (better results).
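To make the two options concrete, here is a rough sketch of each. Everything in it is an assumption for illustration, not epic's implementation: the tuple layout, the function names, the `max_span` cutoff (measured from the leftmost to the rightmost coordinate of the pair), and the `fragment_size` default. For option 2, the minus-strand handling (shifting back from the read end) is also an assumption about how a single end read would be treated.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record: chrom1, start1, end1, chrom2, start2, end2, strand1, strand2
Pair = Tuple[str, int, int, str, int, int, str, str]


def discard_distant_pairs(pairs: Iterable[Pair], max_span: int = 100) -> Iterator[Tuple[str, int]]:
    """Option 1: one midpoint per pair, dropping pairs that span more than max_span bp."""
    for chrom1, s1, e1, chrom2, s2, e2, _, _ in pairs:
        if chrom1 != chrom2:
            continue  # no meaningful midpoint across chromosomes
        left, right = min(s1, e1, s2, e2), max(s1, e1, s2, e2)
        if right - left > max_span:
            continue  # mates too far apart; midpoint may land in an unrelated region
        yield chrom1, left + (right - left) // 2


def split_mates(pairs: Iterable[Pair], fragment_size: int = 150) -> Iterator[Tuple[str, int]]:
    """Option 2: treat each mate as a single end read shifted by half the fragment size."""
    shift = fragment_size // 2
    for chrom1, s1, e1, chrom2, s2, e2, strand1, strand2 in pairs:
        for chrom, start, end, strand in ((chrom1, s1, e1, strand1), (chrom2, s2, e2, strand2)):
            # Shift towards the fragment centre: forward from the start on '+',
            # backward from the end on '-'.
            yield chrom, (start + shift if strand == "+" else end - shift)


pairs = [
    ("chr7", 20246668, 20246669, "chr7", 20246693, 20246694, "+", "+"),
    ("chr7", 100, 101, "chr7", 5000, 5001, "+", "-"),  # mates far apart
]
print(list(discard_distant_pairs(pairs)))
print(list(split_mates(pairs)))
```

Option 1 emits at most one coordinate per pair, while option 2 emits exactly two, which is where the doubling of the data comes from.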