I am doing differential gene expression analysis by next-generation sequencing. NGS generates read duplicates, and several programs are available for removing them. I suspect that removing these duplicates may affect the final results, since a large dynamic range is an advantage of NGS over microarrays. Reports on this topic are rare in the literature, so I would appreciate help from anyone who knows it well. Thanks
Read duplication may be natural (the same DNA fragment occurs twice in the library and both copies are sequenced) or artificial (during the sequencing procedure a copy of the same read is created and sequenced).
Some approaches are more sensitive to read duplication than others. I have also noticed that samples coming from labs with less experience in NGS library preparation typically show very high duplication rates (80% or more!). Perhaps this is because too little input DNA is produced, which then has to be heavily amplified during the protocol.
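As a rough sketch of what duplicate-marking tools measure (the function name and single-end logic are illustrative, not taken from any specific program), the duplication rate can be estimated by counting reads that share the same chromosome, start position, and strand:

```python
from collections import Counter

def duplication_rate(alignments):
    """Estimate the duplicate-read fraction from aligned reads.

    alignments: iterable of (chrom, start, strand) tuples, one per read.
    Reads sharing all three fields are treated as copies of one fragment
    (single-end logic; paired-end tools also consider the mate position).
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unique = len(counts)  # one representative read kept per position
    return (total - unique) / total

# Toy example: 6 reads, two positions each sequenced twice.
reads = [("chr1", 100, "+"), ("chr1", 100, "+"),
         ("chr1", 250, "-"), ("chr1", 250, "-"),
         ("chr2", 10, "+"), ("chr2", 99, "-")]
# 6 reads, 4 unique positions -> duplication rate 2/6
```

In practice you would get these rates from a dedicated tool (e.g. Picard MarkDuplicates or samtools markdup) rather than computing them by hand; the point is just what the number means.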
My personal opinion is to investigate the duplication rates and remove duplicates only if there is an indication that they are artificial (rates well above what a natural duplication level would be). That said, very accurate ChIP-Seq-type technologies (like ChIP-Exo) can produce very high rates of natural duplicates, often indistinguishable from artificial ones.
Looking at the read distribution around high-duplication sites is one way to evaluate whether a location is naturally or artificially enriched. A natural site exhibits a smoother distribution, with roughly equal numbers of reads on both strands. An artificial site tends to show a heavy strand imbalance, with most reads being exactly identical rather than forming a distribution around the site.
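The strand-balance part of that check can be sketched very simply (the function and the 0.5 expectation are an illustrative heuristic, not a published statistic):

```python
def strand_balance(strands):
    """Fraction of reads on the forward strand at a candidate site.

    A naturally enriched site should give a value near 0.5 (reads fall
    on both strands around the site); a value near 0 or 1 suggests an
    artificial stack of identical reads on a single strand.
    """
    if not strands:
        raise ValueError("no reads at site")
    fwd = sum(1 for s in strands if s == "+")
    return fwd / len(strands)

# A balanced (likely natural) site vs. a one-sided (suspect) stack:
natural = ["+", "-", "+", "-", "+", "-"]
artifact = ["+"] * 40 + ["-"] * 2
```

A real check would also look at the spread of start positions, not just strand, but the strand ratio alone already separates the two toy cases above.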
I meant "site" as a location in the genome that could produce natural duplicates: for example, a binding site with a high level of occupancy in a ChIP-Seq experiment, or a short gene that is very highly expressed. For whole-genome sequencing via random DNA shearing there are some simple formulas (those that describe coverage) for estimating how likely high coverage is to occur. The higher the coverage, the more likely you are to get natural duplicates.
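One back-of-envelope version of those coverage formulas, assuming single-end reads whose start positions are drawn uniformly at random from the possible genomic sites (a birthday-problem approximation, not an exact model of shearing), is: expected unique starts = G * (1 - (1 - 1/G)^N), so the expected natural duplicate fraction is 1 minus unique/N.

```python
def expected_natural_dup_rate(n_reads, genome_positions):
    """Expected duplicate fraction if each read's start position is
    drawn uniformly at random from `genome_positions` possible sites
    (a birthday-problem approximation for random-shearing WGS).
    """
    g, n = genome_positions, n_reads
    expected_unique = g * (1.0 - (1.0 - 1.0 / g) ** n)
    return 1.0 - expected_unique / n

# Sparse sampling of a large genome: natural duplicates are negligible.
low = expected_natural_dup_rate(1_000, 10**9)
# As many reads as positions: natural duplicates become very common.
high = expected_natural_dup_rate(10**6, 10**6)
```

Under this approximation, the expected natural duplicate rate grows with depth, which is exactly why an observed rate far above this expectation points to PCR artifacts.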