Is there any precedent for considering marked duplicates when calculating coverage in high-depth whole-exome sequencing?
1
3
Entering edit mode
4.6 years ago
Dan D 7.3k

I'm working with some high-depth whole-exome sequencing data. It's paired-end. Some samples are targeted at 200X and others are 400X average read depth across the exome. The duplicate rate is very high after alignment. I suspect a good portion of these marked duplicates are not actual PCR duplicates, but are simply synchronous alignments of distinct fragments. However, I'm not sure about the best way to properly calculate coverage estimates over the exome territory, because I can theoretically justify both omitting and incorporating marked duplicates into the calculation.

I haven't yet found anything in the literature addressing this specific issue. I'm thinking there's probably some middle ground that factors in the territory size, number of reads, and perhaps insert size profile in the case of paired-end sequencing. Does anyone have any guidance?

EDIT: I should add that each sample has a single library preparation.

exome exome-seq coverage duplicates • 2.8k views
1
Entering edit mode

Hello Dan,

could you please tell a little bit more about your data. How was the library prep done? How high is the duplicate rate? What do you plan to do with the coverage information?

fin swimmer

1
Entering edit mode

I suspect a good portion of these marked duplicates are not actual PCR duplicates, but are simply synchronous alignments of distinct fragments

Is that your feeling or do you have any observations to support that? Have you tried using clumpify.sh from from BBMap on this data (A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files )? Was the data sequenced on patterned FC or non-patterned? Unless one uses UMI's that tag individual fragments it would be difficult to conclusively define the origin of these in practice.

1
Entering edit mode

I don't know how exome library prep works - what is the "unit" captured? exons? genes? I'll assume / guess exons. So the shorter the exon, the more likely to have false-positive PCR duplicates. The R/BioConductor package dupRadar investigates this issue examining PCR duplication by gene length. Probably the same approach can be adapted to exome libraries.

0
Entering edit mode

@finswimmer. This is a standard whole-exome library prep with tight insert size distribution around 225. The duplicate rate is showing between 25-30% from the 200X-targeted and 40-45% for the 400X targeted. I'm looking to report an accurate coverage estimate per library which appropriately considers

@genomax This is something that occurred to me as I was considering the nature of high-depth exome sequencing. Just doing the math, if you have a narrow insert size distribution and a relatively small territory due to whole-exome capture, then there's an elevated chance that truly distinct fragments will end up aligning to the same coordinates, even in a paired-end context.

I haven't done anything in-depth yet though because I wanted to see what the literature had to say to inform my investigation. I'm not seeing many results on my current search terms (various combinations of high-depth exome duplicate coverage keywords)

These were sequenced on a patterned flowcell. Agreed w/r/t UMIs.

0
Entering edit mode

@Dan Do you mind trying clumpify.sh on this data? Do both total duplicates and optical duplicate estimates.

0
Entering edit mode

Oh, right, I didn't address that in my reply. I will give it a try right now.

0
Entering edit mode

OK, I ran it and with the default of mismatch of two it found dup rates between 38-42% on a set of four paired-end datasets. After thinking on this for a bit I come back to my core premise: it's entirely possible that these are not duplicates because of the hundreds of millions of input reads and the capture territory being on the order of tens of millions of bases. It seems it requires a systematic study to answer; I'm hoping it's been done already and the results are available in publication form.

1
Entering edit mode

Hello,

I have no literature but maybe a suggestion how to get a "feeling" for it, how many false positive duplicates there are. One can create random reads for the reference and look than how many duplicates are in. I did this just for chrM as my computer power is limited.

For creating random reads one can use randomreads.sh from BBTools. I used the option coverage to simulate how many DNA copies I have at the beginning of my library prep.

randomreads.sh ref=chrM.fa coverage=1000 minlength=150 maxlength=150 mininsert=150 maxinsert=300 gaussian=t paired=t illuminanames=t


I've repeated it for coverage 10.000 and 100.000. Afterwards I used clumpify to estimate the number of duplicates:

clumpify.sh in=chrM-1000X_R1.fq.gz in2=chrM-1000X_R2.fq.gz out=R1_dedup.fq.gz out2=R2_dedup.fq.gz dedupe


Here's the result:

1.000X

• Number of duplicates: 968
• Percentage Duplicate: 0,9%

10.000X

• Number of duplicates: 90850
• Percentage Duplicate: 8,2%

100.000X

• Number of duplicates: 4946248
• Percentage Duplicate: 44,8%

Of course chrM is much to small to conclude on the other chromomes. But it should show quite good how the number of duplicate fragments heavily depends on the initial number of dna fragments.

fin swimmer

1
Entering edit mode

@Dan: You probably want to be strict (dupesubs=0) if you are looking for true duplicates. That would reduce the number down further.

@finswimmer: your result is not surprising. Small size leads to more duplication.

0
Entering edit mode

Hello genomax,

yes it's not suprising, that small reference sizes leads to more duplicates. Also smaller fragments will produce more duplicates and the higher the input the higher the duplicate rate (as one see above).

But let's take the approach above to investigate the initial hypothesis ("I suspect a good portion of these marked duplicates are not actual PCR duplicates, but are simply synchronous alignments of distinct fragments") a little bit more.

The simulated data get's the initial fragments in the library prep. But during the preparation, we select a subset of all these fragments. The smaller the subset, the higher the chance that we capture just one distinct fragment and not the duplicates.

So, let's start with the initial coverage of 100.000X and let's select a number of reads, which result in an average coverage of 300X:

x = 11047334*(300/100000) x = 33142

As we have paired reads, we just need the half of it per direction resulting in 16571.

zcat chrM_R1.fq.gz|paste - - - -|shuf|head -n 16571|tr "\t" "\n"|bgzip -c > R1_300X.fq.gz


repair.sh in in=R1_300X.fq.gz in2=chrM_R2.fq.gz out=R1_300X_rep.fq.gz out2=R2_300X_rep.fq.gz


Check for duplicates:

clumpify.sh in=R1_300X_rep.fq.gz in2=R2_300X_rep.fq.gz out=R1_300X_dedup.fq.gz out2=R2_300X_dedup.fq.gz dedupe


That's the result:

• Number of duplicates: 88
• Percentage duplicates: 0,3%

So, in conclusion one can see, that the number of duplicates is decreased dramatically. In case of the other chromosomes we wouldn't have such high duplicate rate in the beginning, as the reference is much longer and the input is lower (I guess something about 20.000). So in the end of our library prep we will have much lower false positive duplicates.

To come back to the hypothesis:

"I suspect a good portion of these marked duplicates are not actual PCR duplicates, but are simply synchronous alignments of distinct fragments"

I would conclude: No, the number of "simply synchronous alignments of distinct fragments" is negligible.

fin swimmer

0
Entering edit mode

Good theoretical exercise. Reasonable conclusion.

While people try to be careful in experimental biology (there must be SOPs) about controlling non-biological variations there is likely to be many sources of it. Even a single extra PCR cycle, a 5% error in pipetting, wrongly calibrated instrument etc can lead to significant effects.

Some samples are targeted at 200X and others are 400X average read depth across the exome.

This seems to indicate that the deep coverage is due to reads alone. We don't have any information about how the sample was initially targeted/obtained/prepped.

Having UMI's and/or using little/no amplification/minimum handling would be practical ways of addressing this issue. Since the results will likely be confirmed independently (sanger sequencing) it should all be well in the end.

0
Entering edit mode
4.1 years ago
marco_ul • 0

Just in case this is still a relevant issue, I just came across this tool (I did not test it yet):

https://github.com/vibansal/PCRduplicates