I would like to ask for help regarding an issue we have with NGS data.
We are performing a ChIP-seq experiment. We have two replicate libraries that were prepared independently with the NEBNext Ultra II FS Library Prep Kit for Illumina and sequenced paired-end. Mapping works as expected, but when we remove duplicates we find a duplication rate of around 80-90%. Some duplication was expected because these are low-input libraries and therefore not very complex. However, when inspecting the duplicates visually in the browser, we saw that they form very sharp peaks made up of 3-4 distinct reads, each duplicated many times. What surprised us most is that most of those reads have exact duplicates in the other replicate. By this I mean that both replicates, which were fragmented independently, contain exactly the same fragments from the 5' end to the 3' end (the data are paired-end, so it is not only the read but the whole fragment that is identical). I attach a picture showing some of these matches.
The only explanation I can find is that fragmentation with this kit is not random at all. Does anyone have experience with this kit? Is there any other possible explanation for this phenomenon?
Our concern is that if non-random fragmentation generates identical fragments between replicates, the same must be happening within each replicate, so by removing duplicates we are discarding an enormous amount of information.
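To put numbers on this rather than rely on the browser, something like the following minimal sketch could be used. It assumes the properly paired fragments of each replicate have already been extracted from the alignments as (chromosome, start, end) tuples; the function names and the toy data are made up for illustration and are not part of any existing tool.

```python
def duplication_rate(fragments):
    """Fraction of fragments that are copies of an already seen fragment."""
    fragments = list(fragments)
    if not fragments:
        return 0.0
    return 1.0 - len(set(fragments)) / len(fragments)


def shared_fraction(rep_a, rep_b):
    """Fraction of the distinct fragments in rep_a that also occur in rep_b."""
    distinct_a, distinct_b = set(rep_a), set(rep_b)
    if not distinct_a:
        return 0.0
    return len(distinct_a & distinct_b) / len(distinct_a)


# Toy data (hypothetical coordinates): each tuple is one sequenced fragment.
rep1 = [("chr1", 100, 250)] * 4 + [("chr1", 300, 480), ("chr2", 50, 210)]
rep2 = [("chr1", 100, 250)] * 3 + [("chr1", 300, 480), ("chr2", 75, 230)]

print("duplication rate, replicate 1:", duplication_rate(rep1))
print("duplication rate, replicate 2:", duplication_rate(rep2))
print("distinct rep1 fragments also in rep2:", shared_fraction(rep1, rep2))
```

If fragmentation were random, the shared fraction between two independent libraries should be close to zero outside of extremely high-coverage regions, so a large value would support the non-random fragmentation idea.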
Thanks in advance for the help,
I didn't specify it in the original post, but fragmentation is part of the kit workflow and it is enzymatic. The documentation does not say which enzyme mix is used. I asked the manufacturer, and they only told me that a nicking enzyme is part of it.
From what I have been able to find, I think they use Nt.CviPII to introduce many nicks in both strands all along the genome, plus a nick-translating polymerase that moves those nicks until nicks on opposite strands meet and create a double-strand break (DSB). Two nearby DSBs then release a fragment. Apparently this system gives a more reproducible fragment length regardless of the amount of input DNA, and the length can be tuned with the incubation time.
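One rough way to probe this hypothesis would be to tally the sequence at fragment starts and see whether any motif stands out. The sketch below is only an illustration: it assumes the reference is available as a dict of chromosome name to sequence, that fragments are (chromosome, start, end) tuples with 0-based starts, and that Nt.CviPII recognizes CCD (D = A, G or T), which should be checked against the enzyme's datasheet. All names and the toy data are hypothetical. Note that if nick translation moves the break away from the nicking site, the enrichment could be shifted or absent, so a negative result would not rule the mechanism out.

```python
from collections import Counter


def end_kmers(reference, fragments, k=3):
    """Count the k-mer at each fragment's start on the forward strand."""
    counts = Counter()
    for chrom, start, _end in fragments:
        kmer = reference[chrom][start:start + k].upper()
        if len(kmer) == k:  # skip fragments too close to the sequence end
            counts[kmer] += 1
    return counts


def matches_ccd(kmer):
    """True if the 3-mer fits the assumed CCD motif (D = A, G or T)."""
    return len(kmer) == 3 and kmer[:2] == "CC" and kmer[2] in "AGT"


def motif_fraction(counts):
    """Fraction of counted fragment starts whose 3-mer fits the assumed motif."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for kmer, n in counts.items() if matches_ccd(kmer))
    return hits / total


# Toy reference and fragments (hypothetical).
reference = {"chr1": "AACCATTGCCGTTACCCAGT"}
fragments = [("chr1", 2, 12), ("chr1", 8, 18), ("chr1", 5, 15), ("chr1", 14, 20)]

counts = end_kmers(reference, fragments)
print(counts.most_common())
print("fraction of starts matching the assumed motif:", motif_fraction(counts))
```

For a real test, the same tally would need to be compared against a background, for example the 3-mer frequencies at random positions in the same regions, and repeated for the reverse strand at fragment ends.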