ChIP-Seq duplicates metrics/qa
7.8 years ago
Anna S ▴ 500

Hello,

BACKGROUND:

ENCODE has the following metric for duplicates:

"PCR Bottleneck Coefficient (PBC): PBC = N1/Nd (where N1 = number of genomic locations to which EXACTLY one unique mapping read maps, and Nd = the number of genomic locations to which AT LEAST one unique mapping read maps, i.e. the number of non-redundant, unique mapping reads)

Provisionally, 0-0.5 is severe bottlenecking, 0.5-0.8 is moderate bottlenecking, 0.8-0.9 is mild bottlenecking, while 0.9-1.0 is no bottlenecking."
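
For reference, the definition quoted above can be sketched in a few lines of Python. This is only an illustration, not ENCODE's own implementation: the function names and the `(chrom, pos, strand)` read representation are assumptions, and the input is assumed to be already filtered to uniquely mapping reads.

```python
from collections import Counter


def pbc(positions):
    """Return N1 / Nd for an iterable of (chrom, pos, strand) tuples.

    Assumes one tuple per uniquely mapping read (hypothetical input format).
    """
    counts = Counter(positions)          # reads per genomic location
    n_distinct = len(counts)             # Nd: locations with at least one read
    if n_distinct == 0:
        raise ValueError("no reads supplied")
    n_one = sum(1 for c in counts.values() if c == 1)  # N1: exactly one read
    return n_one / n_distinct


def classify_pbc(value):
    """Map a PBC value onto the provisional ranges quoted above."""
    if value < 0.5:
        return "severe"
    if value < 0.8:
        return "moderate"
    if value < 0.9:
        return "mild"
    return "none"


# Toy example: three distinct locations, two of them hit exactly once.
reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 200, "+"), ("chrX", 5, "-")]
print(classify_pbc(pbc(reads)))
```

How the boundary values (exactly 0.5, 0.8, 0.9) are assigned is a choice made here; the quoted ranges do not say which side they belong to.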

QUESTION:

Based on this metric the datasets I'm looking at have "moderate bottlenecking." However, I wrote a script to look at the mapped duplicates themselves for one of the samples and they have the following distribution:

Duplicates at the same chr pos (number of positions): 2: 10,051; 3: 2,996; 4: 922; 5: 303; 6: 102; 7: 53; 8: 30; 9: 25; 10: 12; 11: 14; 12: 8; 13: 12; 14: 9; 15: 7; 16: 1; 17: 5

and there are 39 cases with over 17 duplicates to same chr pos, with the maximum being 143 duplicates to a pos on chrX.
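
As a rough way to put numbers on this distribution, the counts above can be tallied as follows. This is a sketch only: the number of singleton positions (N1) is not reported in this post, so `pbc_given_singletons` takes it as a hypothetical input, and all names here are made up for illustration.

```python
# Positions grouped by how many reads map to them, copied from the counts above.
DUP_HIST = {2: 10051, 3: 2996, 4: 922, 5: 303, 6: 102, 7: 53, 8: 30, 9: 25,
            10: 12, 11: 14, 12: 8, 13: 12, 14: 9, 15: 7, 16: 1, 17: 5}
OVER_17 = 39  # positions with more than 17 reads


def low_multiplicity_share(hist, n_over, max_mult=4):
    """Fraction of duplicated positions carrying at most max_mult reads."""
    total = sum(hist.values()) + n_over
    low = sum(n for mult, n in hist.items() if mult <= max_mult)
    return low / total


def pbc_given_singletons(n_one, hist, n_over):
    """PBC = N1 / Nd, where Nd = N1 + number of duplicated positions.

    n_one is NOT given in the post; supply it from your own data.
    """
    n_dup_positions = sum(hist.values()) + n_over
    return n_one / (n_one + n_dup_positions)


print(f"{low_multiplicity_share(DUP_HIST, OVER_17):.3f}")
```

Because every duplicated position counts once in Nd regardless of how many reads pile up there, the many 2-, 3- and 4-read positions pull PBC down just as much as the rare 143-read position does.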

My question is: do the 2, 3 and 4 duplicates, which constitute the majority of the cases, indicate PCR bottlenecking?

Ultimately, we want to know whether there is a problem with the sequencing itself and whether it needs to be redone.

Thanks a lot for any insight!!

Anna

7.8 years ago
Ryan Dale 5.0k

The number of duplicates to expect in part depends on the ChIP-ed factor and the library size. As an extreme case, imagine you did ChIP-seq for a protein that binds only 10 places genome-wide, and you sequenced 10 million reads from that library. Just because of the underlying biology, in this case you'd get extremely high numbers of duplicates.
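
A toy simulation makes the point concrete. It is a sketch under strong simplifying assumptions (every possible read start position is equally likely, no PCR at all, and the position counts are invented), so the duplicates it produces come purely from sampling a small space deeply:

```python
import random


def duplicate_fraction(n_positions, n_reads, seed=0):
    """Fraction of reads that land on an already-seen position.

    n_positions: number of distinct read start positions available
                 (small for a factor with few binding sites).
    n_reads:     sequencing depth.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    seen = {rng.randrange(n_positions) for _ in range(n_reads)}
    return 1 - len(seen) / n_reads


# Hypothetical numbers: few binding sites vs. a broadly distributed signal,
# both sequenced to the same depth.
narrow = duplicate_fraction(n_positions=2_000, n_reads=20_000)
broad = duplicate_fraction(n_positions=1_000_000, n_reads=20_000)
print(f"narrow factor: {narrow:.2f}, broad signal: {broad:.2f}")
```

With only 2,000 available positions, at least 90% of 20,000 reads must be duplicates no matter how good the library is, while the broad case stays at a few percent or less.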

It's frustrating that to interpret a lot of QC metrics you have to know the biology . . . which is the thing you're trying to learn about with the experiment in the first place!

I've found ChIPQC (paper, Bioconductor page) to be helpful. The documentation is great, and the paper is informative. It might help give some perspective on ChIP QC metrics like this one and give you ideas for other metrics to try.


Thank you, Daler!!

I have since run the picard tool MarkDuplicates and I got the following results:

READ_PAIRS_EXAMINED 52548
PERCENT_DUPLICATION    0.297518
ESTIMATED_LIBRARY_SIZE  118144


Without the optical duplicates, the duplication would be well within the "mild bottlenecking" range. Does this mean that the sequencing parameters need to be changed (e.g. OPTICAL_DUPLICATE_PIXEL_DISTANCE)? I don't do the sequencing itself; I just found this parameter by poking around.

Thanks!


I frequently get percent duplication in the range of what you're seeing, but I rarely get any optical duplicate calls when running MarkDuplicates. It could reflect something askew in the sequencing, but it's also possible that the library is still useful despite these issues. The only way to find out is to run the analysis to completion (i.e. to called peaks) and see how everything looks in the end.