Duplicates In The Context Of Deep Sequencing
KCC ★ 4.1k · 11.5 years ago

Common advice in DNA-seq experiments is to remove duplicate reads. These are presumed to be optical or PCR duplicates. However, when samples are sequenced deeply (more than 10X), it becomes entirely expected for distinct fragments to yield reads with identical coordinates. If we stick with the idea of throwing away every duplicate, we effectively cap the usable depth at one read per start position.
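To make that concrete, here is a back-of-the-envelope sketch (illustrative parameters only; strand and paired-end information are ignored, so it overstates the loss somewhat) of how many single-end reads would be flagged as duplicates purely by chance at a given depth:

```python
import math

def coincidental_duplicate_fraction(n_reads, genome_size):
    """Fraction of uniformly sampled single-end reads removed purely by
    chance when every read beyond the first at a start coordinate is
    discarded (occupancy problem, Poisson approximation)."""
    lam = n_reads / genome_size                      # expected read starts per position
    expected_unique_starts = genome_size * (1.0 - math.exp(-lam))
    return 1.0 - expected_unique_starts / n_reads

genome_size = 3.1e9    # human-sized genome, illustrative
read_length = 100      # bp, illustrative
for depth in (1, 10, 30, 100):
    n_reads = depth * genome_size / read_length
    frac = coincidental_duplicate_fraction(n_reads, genome_size)
    print(f"{depth:>3}X: ~{100 * frac:.1f}% of reads lost as coincidental duplicates")
```

With these made-up numbers the loss is roughly 5% at 10X and climbs steeply with depth, which is the effect I am worried about.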

In cases where there is a dramatic shift toward higher GC relative to the input and a strongly skewed distribution, one clearly feels inclined to remove the duplicates.

However, in many of the deep-sequencing datasets I work with, I see very little shift toward higher GC: the histograms of GC content are very nearly symmetric and quite close to the input.

In these cases, I often feel I should leave the duplicated reads in. On the other hand, for certain regions of the genome I see huge numbers of tags, leading to overlapping-tag counts in the tens of thousands; these do not seem to represent genuine biology.

What solutions are there for those of us who would like to use deep sequencing but want a principled way to filter out some of these clear artifacts?
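For concreteness, the kind of filter I have in mind might look something like the sketch below: flag positions whose read-start counts are wildly improbable under a Poisson model of the genome-wide depth. The numbers and the toy list of 5' coordinates are made up for illustration, and I have not validated this.

```python
import math
from collections import Counter

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam); adequate for the small lam used here."""
    term = math.exp(-lam)           # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def max_starts_per_position(mean_depth, read_length, p_cutoff=1e-6):
    """Smallest read-start count whose Poisson tail probability drops below p_cutoff."""
    lam = mean_depth / read_length  # expected read starts per position
    k = 1
    while poisson_sf(k, lam) > p_cutoff:
        k += 1
    return k

# Toy usage: 'starts' would normally be the 5'-most mapping coordinates pulled
# from a BAM (e.g. with pysam); here it is a hand-made list for one chromosome.
starts = [100, 100, 101, 205, 205, 205, 205, 205, 205, 205, 307]
cap = max_starts_per_position(mean_depth=30, read_length=100)
counts = Counter(starts)
flagged = sorted(pos for pos, c in counts.items() if c > cap)
kept = [pos for pos in starts if counts[pos] <= cap]   # drop whole pileups at flagged positions
print(f"cap = {cap} read starts per position; flagged positions: {flagged}; reads kept: {len(kept)}")
```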

sequencing duplicates • 9.6k views

Read this post. You have misunderstood the purpose of duplicate removal. As to GC bias, you can barely detect the bias by comparing the GC content of the genome with that of the reads. For Illumina, typical GC bias means significantly lower coverage at extremely high or low GC.


@lh3: Sorry to be a bother. Can you specify which aspect I am misunderstanding?


Removing duplicates has nothing to do with GC content. Read that post first. EDIT: don't stop at the first few posts; finish the whole thread. There is quite a lot of information there from others and from me.


One of the early posts says "The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction". The first thing I say is "Common advice in DNA-seq experiments is to remove duplicate reads. These are presumed to be optical or PCR duplicates."


@lh3: As to your comment about GC content: PCR duplicates cause a change in the distribution of GC content of tags. Look at this paper: http://nar.oxfordjournals.org/content/early/2012/02/08/nar.gks001.long -- "This empirical evidence strengthens the hypothesis that PCR is the most important cause of the GC bias". We can quibble about whether it is the most important cause, but it seems reasonable to consider it a contributor to the distribution of GC.


Also, I do get that there are other reasons in play, like increasing the sensitivity and specificity of peak calls; at least, that's what I take from this paper: http://www.nature.com/nmeth/journal/v9/n6/full/nmeth.1985.html


@lh3: Also, I just wanted to say I have actually read that thread before. Rereading it now, I'm thinking I probably learned about optical duplicates from you!

lh3 33k · 11.5 years ago

You are talking about two largely unrelated issues: duplicates and GC bias. Removing duplicates has little to do with GC bias. For resequencing, we remove duplicates mainly to reduce recurrent errors. When the duplicate rate is high, duplicates lead to a higher false positive rate for SNP calling and also for SV detection. Given ~30X paired-end data, duplicate removal barely affects non-duplicates (I showed the formula on how to compute that). For 10X single-end data, duplicate removal will only remove a small fraction of non-duplicates (the formula is even simpler).

Then there is Illumina GC bias. Library construction is a major cause of the bias, but it is not the only one. Sequencing machines also introduce bias, to a lesser extent; when the same library is sequenced on different machines/runs, you may sometimes observe GC differences. The typical way to see the GC bias is to compute the coverage in different GC bins. Because, for human, only a tiny fraction of the genome is in >80% or <20% GC, you can barely see the bias by comparing the overall GC content of the reads with that of the genome -- that comparison is dominated by the vast majority of normal-GC regions. Nonetheless, although there are few GC-extreme regions, the bias leads to false CNV calls, breaks de novo assemblies and causes false negatives. It also makes it much harder to sequence some small genomes with extreme GC. It would be good if there were no such bias.
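As a rough illustration of computing coverage in GC bins (a sketch only: the file names are placeholders, it assumes pysam with a coordinate-sorted, indexed BAM and an indexed FASTA, and reads spanning window boundaries are counted approximately):

```python
import pysam                      # assumes pysam is installed
from collections import defaultdict

# Placeholder file names: any coordinate-sorted, indexed BAM plus the matching
# indexed reference FASTA.
bam = pysam.AlignmentFile("sample.bam", "rb")
fasta = pysam.FastaFile("reference.fa")

window = 10_000      # bp per window
bin_width = 5        # percent GC per bin
cov_sum = defaultdict(float)
cov_n = defaultdict(int)

for contig, length in zip(fasta.references, fasta.lengths):
    if contig not in bam.references:
        continue
    for start in range(0, length - window, window):
        seq = fasta.fetch(contig, start, start + window).upper()
        if seq.count("N") > window // 10:          # skip gap-rich windows
            continue
        gc = 100.0 * (seq.count("G") + seq.count("C")) / window
        # Approximate depth: aligned bases of all reads overlapping the window
        # (reads crossing the window boundary are counted in full).
        bases = sum(read.query_alignment_length
                    for read in bam.fetch(contig, start, start + window))
        gc_bin = int(gc // bin_width) * bin_width
        cov_sum[gc_bin] += bases / window
        cov_n[gc_bin] += 1

for gc_bin in sorted(cov_sum):
    print(f"GC {gc_bin:2d}-{gc_bin + bin_width}%: mean depth "
          f"{cov_sum[gc_bin] / cov_n[gc_bin]:.2f} over {cov_n[gc_bin]} windows")
```

If the library is biased, the mean depth in the extreme-GC bins drops off relative to the mid-GC bins even when the overall read GC histogram looks unremarkable.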

EDIT: You are actually talking about a third issue, weird regions in the alignment, which has little to do with duplicates or GC bias. Most of these regions are caused by the imperfect reference genome, and some by structural variation.


@lh3: Thanks for your reply. Can you clarify your formula, 0.28*m/s/L? In particular, where does the 0.28 come from, and in what order is m/s/L evaluated -- is it m/(s/L) or (m/s)/L?
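My guess, which may well be wrong, is that the 0.28 is 1/(2*sqrt(pi)), from the chance of two independent normally distributed insert sizes coinciding, and that the division is read left to right as (0.28*m)/(s*L). A quick numerical check with made-up values:

```python
import math

# Assumption, not confirmed in this thread: 0.28 ~= 1 / (2*sqrt(pi)), the
# probability factor for two independent Normal(mu, s^2) insert sizes landing
# on the same integer value, with the expression read as (0.28 * m) / (s * L).
print(f"1 / (2*sqrt(pi)) = {1.0 / (2.0 * math.sqrt(math.pi)):.4f}")   # ~0.2821

m = 1e9        # read pairs (illustrative)
s = 30.0       # insert-size standard deviation in bp (illustrative)
L = 3.1e9      # genome length in bp
print(f"estimated fraction of non-duplicate pairs removed: {0.28 * m / (s * L):.2e}")
```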
