Question: Handling PCR duplicate reads?
4.9 years ago, geek_y wrote:

I have been wondering how to handle PCR duplicate reads in single-end RNA-Seq or ChIP-Seq data sets. For single-end data, it is not advisable to remove duplicates just by looking at the start position of reads. Many posts/blogs I have read suggest accounting for duplicate reads while counting instead of removing them. Has anybody done this before, i.e. accounted for duplicate reads while counting (at gene/exon level)? Are there any tools?

One more thing I am wondering about: if we pre-filter the data, some reads will be trimmed at the beginning or at the end because of poor-quality bases. These reads will then not have the same start/end positions when mapped to the genome. Tools like Picard MarkDuplicates would not recognise them as duplicates, even if they are PCR duplicates, because they have different start and end coordinates (in fact, different CIGAR strings, because the read lengths differ). How is everyone handling this? Is assuming that PCR duplicates will not have a significant effect on the end results one way to go?
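To make the trimming concern concrete, here is a minimal sketch of how trimming shifts a duplicate key. It assumes a simplified key of (chromosome, strand, 5' coordinate) and ungapped alignments; the `Read` type and both helper functions are hypothetical, and this is not Picard's actual implementation.

```python
from typing import NamedTuple, Tuple


class Read(NamedTuple):
    chrom: str
    start: int     # 0-based leftmost aligned position on the reference
    length: int    # aligned length (assumes no indels or clipping)
    reverse: bool  # True if the read aligns to the reverse strand


def five_prime_key(read: Read) -> Tuple[str, bool, int]:
    """Simplified duplicate key: chromosome, strand, 5' coordinate."""
    if read.reverse:
        # On the reverse strand the read's 5' end is its rightmost base.
        five_prime = read.start + read.length - 1
    else:
        five_prime = read.start
    return (read.chrom, read.reverse, five_prime)


def trim(read: Read, five_prime_bases: int = 0, three_prime_bases: int = 0) -> Read:
    """Hard-trim bases from the read's own 5' and/or 3' end."""
    new_length = read.length - five_prime_bases - three_prime_bases
    # The leftmost reference position moves only when the trimmed end
    # is the one that maps to the left side of the alignment.
    shift = three_prime_bases if read.reverse else five_prime_bases
    return Read(read.chrom, read.start + shift, new_length, read.reverse)


original = Read("chr1", 100, 50, reverse=False)
print(five_prime_key(original))
print(five_prime_key(trim(original, three_prime_bases=5)))  # same key
print(five_prime_key(trim(original, five_prime_bases=5)))   # different key
```

Under this assumed key, 3' quality trimming leaves the key unchanged on either strand, whereas trimming at the 5' end moves it, so only the latter would hide a duplicate from a 5'-keyed comparison.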

rna-seq chip-seq • 2.9k views

It can depend quite a lot on the antibody you use. Sharp peaks have a limited number of reads that could possibly be under them, and thus reaching saturation is more likely. Deleting duplicates here would be a bad thing. On broader marks, however, I would just delete them, because you'll never hit saturation (at least not in a meaningful place).
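A rough way to see the saturation argument is the sketch below. It assumes single-end reads whose 5' ends fall uniformly at random on the distinct (position, strand) sites under a peak, with no PCR duplication at all; the function names and the example numbers are hypothetical.

```python
def expected_unique(n_reads: int, n_sites: int) -> float:
    """Expected number of distinct sites hit when n_reads are placed
    uniformly at random over n_sites (classic occupancy formula)."""
    return n_sites * (1.0 - (1.0 - 1.0 / n_sites) ** n_reads)


def expected_duplicate_fraction(n_reads: int, peak_width: int, strands: int = 2) -> float:
    """Fraction of reads that look like duplicates purely by chance."""
    if n_reads == 0:
        return 0.0
    n_sites = peak_width * strands  # distinct 5' positions on both strands
    return 1.0 - expected_unique(n_reads, n_sites) / n_reads


# Same number of reads under a sharp peak vs a broad domain (made-up widths).
print(expected_duplicate_fraction(1000, peak_width=200))
print(expected_duplicate_fraction(1000, peak_width=5000))
```

Under these assumptions a narrow peak produces many coincidental "duplicates" that are real signal, while the same number of reads spread over a broad region produces few.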

I imagine you could probably get a good estimate of the duplication rate by looking at reads in regions with little coverage across the genome. Like, I don't know, taking all reads in regions that, when piled up, have less than X reads' worth of signal (where X is 2^c, and c is the number of PCR cycles run to make the libraries). Use these reads to determine the duplication frequency. I know deepTools can correct for GC bias (something you probably want to do anyway), so maybe it can also be given a static value in addition when correcting biases. I don't know. Definitely do your duplicate filtering before trimming, for all the reasons you suggested. Maybe do it afterwards too, since you might detect "new" duplicates after trimming.
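The low-coverage estimate suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: "regions" are taken to be fixed-size bins (the `bin_size` default is arbitrary), a duplicate is any read sharing (chromosome, strand, 5' position) with an earlier read, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple

ReadKey = Tuple[str, str, int]  # (chrom, strand, 5' position)


def low_coverage_duplication_rate(
    reads: Iterable[ReadKey], pcr_cycles: int, bin_size: int = 1000
) -> float:
    """Duplication rate measured only in bins holding fewer than
    2**pcr_cycles reads."""
    max_reads = 2 ** pcr_cycles  # X = 2^c from the text above
    bins = defaultdict(list)
    for chrom, strand, pos in reads:
        bins[(chrom, pos // bin_size)].append((chrom, strand, pos))

    total = 0
    duplicates = 0
    for bin_reads in bins.values():
        if len(bin_reads) >= max_reads:
            continue  # skip high-coverage bins, where duplicates may be real signal
        total += len(bin_reads)
        duplicates += len(bin_reads) - len(set(bin_reads))
    return duplicates / total if total else 0.0


# Toy input: one sparse bin with a triplicated read, one deep bin that is skipped.
toy = [("chr1", "+", 10)] * 3 + [("chr1", "+", 20)] + [("chr1", "+", 5000)] * 10
print(low_coverage_duplication_rate(toy, pcr_cycles=3))
```

With a realistic cycle number the threshold 2^c is large, so in practice a stricter cutoff may be needed; that choice is left open here.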

Comment written 4.9 years ago by John; modified 10 months ago by RamRS.
4.9 years ago, dariober (WCIP | Glasgow | UK) wrote:

My thoughts on this:

  • For RNA-Seq I keep everything, duplicates or not, SE or PE.
  • For ChIP-Seq and similar enrichment experiments (FAIRE, ATAC etc.) I remove duplicates. In theory you do expect duplication, since you sequence a smallish proportion of the genome quite deeply. In practice I get a better signal-to-noise ratio without duplicates, at least up to ~100M reads per library (mammalian genome). This is based on visual inspection, nothing sophisticated.
  • About marking duplicates in single-end reads, I started a discussion on this here: "Mark duplicates for single end reads: Why only 5'end?"