Question

Read Duplicates

9

11.9 years ago

fanx ▴ 80

I am doing differential gene expression analysis by next-generation sequencing. NGS generates read duplicates, and several programs are available for removing them. I suspect that removing these duplicates may affect the final results, since a large dynamic range is one of the advantages of NGS over microarrays. There are few reports on this in the literature, so I would appreciate help from anyone who knows this topic well. Thanks

454 • 18k views

updated 2.3 years ago by Ram 43k • written 11.9 years ago by fanx ▴ 80

1

I'd refer you to this reply from seqanswers by lh3. In short, for DGE analysis, I wouldn't remove PCR duplicates. There is no way of knowing whether two identical reads are PCR duplicates or simply come from independent fragments that happened to be identical. Of course, paired-end data helps resolve this to a certain extent, since it is even less likely for two independent fragments to share both the same start and the same end.

However, most of the pipelines constructed so far deal with removal of duplicates for SNP calling and not for DGE. And I think this is the way to go. But then, I also understand this could be very subjective.
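To make the paired-end point concrete, here is a minimal sketch that compares the apparent duplicate fraction when reads are keyed by start only (as in single-end data) against keying by start and end (as paired-end data allows). The function name `duplicate_fraction` and the fragment coordinates are made up for illustration, and using the start coordinate as the key on both strands is a simplification.

```python
def duplicate_fraction(keys):
    """Fraction of records whose key was already seen (0.0 for empty input)."""
    keys = list(keys)
    if not keys:
        return 0.0
    return 1.0 - len(set(keys)) / len(keys)


# Hypothetical fragments as (chrom, start, end, strand); not real data.
fragments = [
    ("chr1", 100, 250, "+"),
    ("chr1", 100, 250, "+"),  # same start and same end as the first
    ("chr1", 100, 310, "+"),  # same start, different end
    ("chr1", 100, 275, "+"),  # same start, different end
    ("chr1", 480, 640, "-"),
]

# Single-end view: only chrom, start and strand are known.
single_end = duplicate_fraction((c, s, strand) for c, s, _e, strand in fragments)

# Paired-end view: the fragment end is known as well.
paired_end = duplicate_fraction(fragments)

print(f"single-end key: {single_end:.2f}, paired-end key: {paired_end:.2f}")
```

In this toy input, three of the four reads that share a start look like duplicates under the single-end key, but only one remains a duplicate once the fragment end is taken into account.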

11.9 years ago by Arun 2.4k

0

Thanks Arun and Istvan Albert! Both paired-end data and the read distribution help to solve this issue to some degree. However, for meta-sequencing without references, looking at the read distribution is not possible. I spoke with several people in the field and there is no clear answer. I think the final answer may depend on a series of model experiments, including the estimation of several parameters such as coverage depth, initial template amount and many others.

11.9 years ago by fanx ▴ 80

score 17 · Answer 1 · 2012-06-18

Read duplication may be natural (the same DNA fragment occurs and is sequenced twice) or artificial (during the sequencing procedure a copy of the same read is created and sequenced).

Some approaches are more sensitive to read duplication than others. I have also noticed that samples coming from labs with less experience with NGS library preparation typically show very high rates of read duplication (80% or more!). Perhaps this is due to starting from insufficient DNA, which then needs to be amplified for the protocol.

My personal opinion is to investigate the duplication rates and remove the duplicates if there is an indication that they are artificial (rates way above what a natural duplication level would be). That being said, very accurate ChIP-Seq type technologies (like ChIP-Exo) could produce very high rates of natural duplicates, often indistinguishable from artificial ones.

Looking at the read distribution around high duplication sites is a way to evaluate whether that location is naturally or artificially enriched. A natural site would exhibit a smoother distribution at the site, with a roughly equal number of reads on both strands. An artificial site tends to show heavy imbalances by strand, with most reads being exactly the same rather than showing a distribution around the site.
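The check described above can be sketched in a few lines. This is only an illustration: the function name `site_profile`, the `(start, strand)` input format and the two example sites are assumptions, and no cutoff is proposed for what counts as "imbalanced", since that depends on the experiment.

```python
from collections import Counter


def site_profile(reads):
    """Summarise reads at one site; reads is a list of (start, strand) tuples."""
    n = len(reads)
    if n == 0:
        raise ValueError("no reads at this site")
    plus = sum(1 for _start, strand in reads if strand == "+")
    # Number of reads sharing the single most common (start, strand) pair.
    top_count = Counter(reads).most_common(1)[0][1]
    return {
        "plus_fraction": plus / n,                # near 0.5 means balanced strands
        "top_identical_fraction": top_count / n,  # near 1.0 means one "tower"
        "distinct_starts": len({start for start, _strand in reads}),
    }


# Hypothetical sites, not real data.
natural_like = [(100, "+"), (102, "+"), (105, "+"), (110, "+"),
                (160, "-"), (163, "-"), (168, "-"), (171, "-")]
artificial_like = [(100, "+")] * 9 + [(140, "-")]

print(site_profile(natural_like))
print(site_profile(artificial_like))
```

A site with balanced strands and many distinct start positions looks like the natural case; a site where nearly all reads are the same read on one strand looks like the artificial case.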

Ram · Answer 2 · 2012-06-18

2

11.9 years ago

Istvan Albert 100k

I meant "site" as a location in the genome that could produce natural duplicates, for example a binding site that may have a high level of occupancy in a ChIP-Seq experiment, or a short gene that is very highly expressed. For whole genome sequencing via random DNA shearing there are some simple formulas (those that describe coverage) to estimate how likely a given level of coverage is. The higher the coverage, the more likely it is that you will get natural duplicates.

11.9 years ago by Istvan Albert 100k

0

Hi Istvan!

Could you point to some of the formulas you mentioned above? Today I got a request to determine whether the duplicates in some high-duplication samples are artifacts, and your answer is very helpful!

Thanks a lot,
Anna

updated 2.3 years ago by Ram 43k • written 9.6 years ago by Anna S ▴ 510

2

Look for the Lander-Waterman equation and you'll find the formula for the coverage distribution. That being said, it is usually way too optimistic and would only be valid for random shotgun sequencing (not ChIP-Seq or RNA-Seq). Natural duplicates tend to ratchet up and down in smooth patterns, like steps on both sides of a highly covered region. Artificial duplicates shoot up as a huge tower at just a single location.
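As a rough baseline for random shotgun data only, one can treat read starts as a Poisson process, in the spirit of the Lander-Waterman model, and ask what fraction of reads would share a start position and strand with another read purely by chance. The sketch below is an assumption-laden illustration, not the exact formula referred to above: `expected_duplicate_fraction` is a made-up name, starts are assumed uniform over the genome, and mappability, read length and fragment ends are ignored.

```python
import math


def expected_duplicate_fraction(n_reads, genome_size, stranded=True):
    """Expected chance-duplicate fraction under uniform random read starts.

    With lam = reads per possible start (position and, if stranded, strand),
    the expected number of occupied starts is positions * (1 - exp(-lam)),
    so the duplicate fraction is 1 - (1 - exp(-lam)) / lam.
    """
    positions = genome_size * (2 if stranded else 1)
    lam = n_reads / positions
    if lam == 0:
        return 0.0
    # expm1(-lam) equals exp(-lam) - 1, computed accurately for small lam.
    return 1.0 + math.expm1(-lam) / lam


# Hypothetical run: 1e6 reads on a 1e8 bp genome.
low = expected_duplicate_fraction(1_000_000, 100_000_000)
# Same genome with 100 times more reads.
high = expected_duplicate_fraction(100_000_000, 100_000_000)
print(f"{low:.4f} {high:.4f}")
```

An observed duplication rate far above this kind of baseline in a shotgun library points towards artificial duplicates; for ChIP-Seq or RNA-Seq the uniform assumption does not hold, so the baseline is not meaningful there.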

updated 2.3 years ago by Ram 43k • written 9.6 years ago by Istvan Albert 100k