Question

Filtering duplicates with MACS2 after re-sequencing a sample to increase read depth


2.8 years ago

catabuloc • 0

Hi all,

I am analyzing a ChIP-seq dataset. To increase read depth and obtain enough uniquely mapped reads, I decided to re-sequence my sample and merge the new data with the data I already have. However, I am concerned that MACS2 will flag more reads as duplicates and omit them during the peak calling step. Is there any way to avoid this? Any help will be greatly appreciated.

Thank you!

macs2 chip-seq • 1.0k views

Updated 2.8 years ago by Ram • written 2.8 years ago by catabuloc


What is it you think should happen with the duplicates? Why shouldn't it throw them out, given that what you're looking for (more unique reads, more read depth) will be increasing regardless?

Reply • 2.8 years ago by seidel


Duplicates are due to PCR overamplification, whereas uniquely mapped reads are reads that align to only one place in the genome rather than several. My concern is that, since I re-sequenced my sample, more reads will share the same start and end coordinates, so the program will treat them as PCR duplicates and exclude them from peak calling.

Reply • 2.8 years ago by catabuloc

Ram · Answer 1 · 2021-06-21

Perhaps we should clarify what you mean when you say you "re-sequenced" your sample. Did you (1) go back to your original IP and generate a new library, and then sequence that? (1 sample, 2 libraries, 2 sequencing runs, thus each PCR is independent), or (2) generate more sequence data from an existing library? (1 sample, 1 library, 2 sequencing runs). You would have less to worry about with method 1. With method 2 you're simply generating more depth on an existing sample with an already observed duplication rate - so there's nothing to be done about that. And either way, the whole point is to find an increased density of unique DNA fragments observed for a given location, so duplicates would do little to help in any real sense, as they represent already observed fragments whether from (1) or (2). Anyway, you can keep them or set a limit:

MACS provides different options for dealing with duplicate tags at the exact same location, that is, tags with the same coordinates and the same strand. The default is to keep a single read at each location. The auto option, which is very commonly used, tells MACS to calculate the maximum number of tags at the exact same location based on a binomial distribution using 1e-5 as the p-value cutoff. An alternative is to set the all option, which keeps every tag. If an integer is specified, then at most that many tags will be kept at the same location. This redundancy setting is applied consistently to both the ChIP and input samples.

(From: Introduction to ChIP-Seq using high-performance computing, Meeta Mistry, Radhika Khetani)
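
In MACS2 these choices are set with the `--keep-dup` option of `callpeak` (`1` by default, or `auto`, `all`, or an integer). To make the behaviour concrete, here is a minimal Python sketch of what the quoted passage describes. It is an illustration only, not MACS2's actual code: the function names `cap_duplicates` and `auto_max_dup`, the toy tags, and the choice of `1 / genome_size` as the per-position probability in the binomial model are all assumptions made for this example.

```python
import math
from collections import Counter


def cap_duplicates(tags, max_dup=1):
    """Keep at most `max_dup` tags per (chrom, position, strand).

    `max_dup=None` mimics the "all" setting and keeps every tag.
    Hypothetical helper for illustration, not part of MACS2.
    """
    seen = Counter()
    kept = []
    for tag in tags:
        if max_dup is None or seen[tag] < max_dup:
            kept.append(tag)
        seen[tag] += 1
    return kept


def auto_max_dup(n_tags, genome_size, pvalue=1e-5):
    """Sketch of an "auto"-style limit (assumes genome_size > 1).

    Assumption: each tag lands on a given position with probability
    1 / genome_size, so the count at one position is Binomial(n_tags, p).
    Returns the smallest k with P(count <= k) >= 1 - pvalue, at least 1.
    """
    p = 1.0 / genome_size
    target = 1.0 - pvalue
    pmf = math.exp(n_tags * math.log1p(-p))  # P(count == 0)
    cdf = pmf
    k = 0
    while cdf < target and k < n_tags:
        # Binomial recurrence: pmf(k+1) = pmf(k) * (n-k)/(k+1) * p/(1-p)
        pmf *= (n_tags - k) / (k + 1) * p / (1.0 - p)
        k += 1
        cdf += pmf
    return max(1, k)


# Toy tags from two sequencing runs of the same sample (made-up coordinates).
run_1 = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 250, "-")]
run_2 = [("chr1", 100, "+"), ("chr1", 400, "+")]
merged = run_1 + run_2

print(len(cap_duplicates(merged, max_dup=1)))     # 3 distinct locations kept
print(len(cap_duplicates(merged, max_dup=2)))     # 4
print(len(cap_duplicates(merged, max_dup=None)))  # 5, every tag kept
print(auto_max_dup(n_tags=20_000_000, genome_size=2_700_000_000))
```

In the toy data, the second run adds one new location (`chr1:400`) and one more copy of a fragment that was already seen. With a limit of 1 the extra copy is dropped, but the new location still counts, which is the point made in the answer above: merging runs adds unique fragments even when duplicates are filtered.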