Hello,
BACKGROUND:
ENCODE has the following metric for duplicates:
"PCR Bottleneck Coefficient (PBC): PBC = N1/Nd (where N1 = number of genomic locations to which EXACTLY one unique mapping read maps, and Nd = the number of genomic locations to which AT LEAST one unique mapping read maps, i.e. the number of non-redundant, unique mapping reads)
Provisionally, 0-0.5 is severe bottlenecking, 0.5-0.8 is moderate bottlenecking, 0.8-0.9 is mild bottlenecking, while 0.9-1.0 is no bottlenecking."
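For reference, the quoted definition can be sketched in a few lines of Python. This is only an illustration, not ENCODE's own code: the function names `pbc` and `classify_pbc` are made up here, each read is assumed to be already reduced to a (chromosome, position, strand) tuple for uniquely mapping reads, and the quoted ranges are treated as inclusive at the lower bound because the text does not say which side the boundaries fall on.

```python
from collections import Counter


def pbc(read_locations):
    """Return N1/Nd for an iterable of hashable read locations.

    Each element is assumed to be a (chrom, pos, strand) tuple for one
    uniquely mapping read (an assumption of this sketch).
    """
    reads_per_location = Counter(read_locations)
    # Nd: locations with at least one uniquely mapping read.
    n_d = len(reads_per_location)
    if n_d == 0:
        raise ValueError("no reads given")
    # N1: locations with exactly one uniquely mapping read.
    n_1 = sum(1 for count in reads_per_location.values() if count == 1)
    return n_1 / n_d


def classify_pbc(value):
    """Map a PBC value onto the provisional ranges quoted above."""
    if value < 0.5:
        return "severe"
    if value < 0.8:
        return "moderate"
    if value < 0.9:
        return "mild"
    return "none"


# Toy data: three distinct locations, two of them hit exactly once.
toy_reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 200, "+"),
    ("chrX", 50, "-"),
]
toy_pbc = pbc(toy_reads)
print(round(toy_pbc, 3), classify_pbc(toy_pbc))
```

In the toy data Nd is 3 and N1 is 2, so the value falls in the "moderate" range. Note that a location with two reads and a location with 143 reads count the same here: both simply fail to contribute to N1.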
QUESTION:
Based on this metric the datasets I'm looking at have "moderate bottlenecking." However, I wrote a script to look at the mapped duplicates themselves for one of the samples and they have the following distribution:
Number of duplicates at the same chr pos, followed by the number of chr positions (cases): 2: 10,051; 3: 2,996; 4: 922; 5: 303; 6: 102; 7: 53; 8: 30; 9: 25; 10: 12; 11: 14; 12: 8; 13: 12; 14: 9; 15: 7; 16: 1; 17: 5
and there are 39 cases with over 17 duplicates to same chr pos, with the maximum being 143 duplicates to a pos on chrX.
My question is: do the positions with 2, 3 and 4 duplicates, which constitute the majority of the cases, indicate PCR bottlenecking?
Ultimately we want to know whether there is a problem with the sequencing itself and whether it needs to be redone.
Thanks a lot for any insight!!
Anna
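The distribution in the question can be tallied with a short Python sketch to see how much of it the small stacks account for. The counts are copied from the post; the name `summarize_stacks` is made up for this illustration, and the 39 positions with more than 17 duplicates are only counted as positions because their individual sizes are not given.

```python
# Number of chr positions for each duplicate count, as listed in the post.
stack_sizes = {
    2: 10051, 3: 2996, 4: 922, 5: 303, 6: 102, 7: 53, 8: 30, 9: 25,
    10: 12, 11: 14, 12: 8, 13: 12, 14: 9, 15: 7, 16: 1, 17: 5,
}
# Positions with more than 17 duplicates (individual sizes not reported).
over_17_positions = 39


def summarize_stacks(hist, extra_positions=0, small_max=4):
    """Count duplicated positions and the share with at most small_max duplicates."""
    total = sum(hist.values()) + extra_positions
    small = sum(n for size, n in hist.items() if size <= small_max)
    return {
        "positions_with_duplicates": total,
        "positions_with_2_to_4": small,
        "share_2_to_4": small / total,
    }


summary = summarize_stacks(stack_sizes, over_17_positions)
for key, value in summary.items():
    print(key, value)
```

This only describes the duplicated positions. PBC cannot be recomputed from it, since the number of positions with exactly one read (N1) is not part of the posted distribution.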
Thank you, Daler!!
I have since run the picard tool MarkDuplicates and I got the following results:
Without the optical duplicates the duplication would be well within the "mild bottlenecking" metric. Does this mean that the sequencing parameters need to be changed (e.g. OPTICAL_DUPLICATE_PIXEL_DISTANCE)? I don't do the sequencing itself, I've just found this parameter by poking around. Thanks!
I frequently get percent duplication in the range of what you're seeing, but I rarely get any optical duplicate calls when running MarkDuplicates. It could reflect something askew in the sequencing, but it's also possible that the library is still useful despite these issues. The only way to find out is to run the analysis to completion (i.e. to called peaks) and see how everything looks in the end.