Handling PCR duplicate reads?
8.0 years ago

I have been wondering how to handle PCR duplicate reads in single-end RNA-Seq or ChIP-Seq data sets. For single-end data, it is not advisable to remove duplicates just by looking at the start position of reads. Many posts and blogs I have read suggest accounting for duplicate reads while counting instead of removing them. Has anybody done this before, i.e. accounted for duplicate reads while counting at the gene/exon level? Are there any tools for it?

One more thing I am wondering about: if we pre-filter the data, some reads will be trimmed at the beginning or at the end due to poor-quality bases. These reads will then not have the same start/end positions when mapped to the genome. Tools like Picard MarkDuplicates would not recognise them as duplicates, even if they are PCR duplicates, because they have different start and end coordinates (in fact, a different CIGAR string, due to the difference in read length). How is everyone handling this? Is assuming that PCR duplicates will not have a significant effect on the end results one way to go?
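To make the trimming concern concrete, here is a minimal sketch in plain Python. It assumes a tool that keys single-end duplicates on chromosome, strand and 5' coordinate only; the `Alignment` record and the `trim` and `five_prime_key` helpers are made up for illustration and are not any tool's API. Under that assumption, 3' trimming leaves the key unchanged, while 5' trimming shifts it.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical, simplified alignment record (no soft clipping, no gaps)."""
    chrom: str
    start: int   # 0-based leftmost aligned position
    end: int     # exclusive rightmost aligned position
    strand: str  # "+" or "-"


def five_prime_key(aln):
    """Key based on the 5' end: leftmost base for '+', rightmost for '-'."""
    pos = aln.start if aln.strand == "+" else aln.end
    return (aln.chrom, pos, aln.strand)


def trim(aln, five_prime=0, three_prime=0):
    """Alignment as it would look if bases were trimmed from the read before mapping."""
    if aln.strand == "+":
        return aln._replace(start=aln.start + five_prime, end=aln.end - three_prime)
    # On the reverse strand the read's 5' end is the rightmost coordinate.
    return aln._replace(start=aln.start + three_prime, end=aln.end - five_prime)


fwd = Alignment("chr1", 100, 150, "+")
# 3' trimming: same 5' key, so the read would still group with its duplicates.
print(five_prime_key(trim(fwd, three_prime=10)) == five_prime_key(fwd))  # True
# 5' trimming: the key shifts, so the duplicate would be missed.
print(five_prime_key(trim(fwd, five_prime=5)) == five_prime_key(fwd))    # False
```

Since low-quality bases tend to sit at the 3' end of reads, the problem described above would mostly bite when bases are removed from the 5' end.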

RNA-Seq ChIP-Seq • 3.9k views

It can depend quite a lot on the antibody you use. Sharp peaks have a limited number of reads that could possibly be under them, so reaching saturation is more likely. Deleting duplicates here would be a bad thing. On broader marks, however, I would just delete them, because you'll never hit saturation (at least not in a meaningful place).

I imagine you could probably get a good estimate of the duplication rate by looking at reads in regions with little coverage across the genome. For example, take all reads in regions that, when piled up, have fewer than X reads' worth of signal (where X is 2^c and c is the number of PCR cycles run to make the libraries), and use these reads to determine the duplication frequency. I know deepTools can correct for GC bias (something you probably want to do anyway), so maybe it can also be given a static value when correcting biases. I don't know. Definitely do your duplicate filtering before trimming, for all the reasons you suggested. Maybe do it afterwards too, since you might detect "new" duplicates after trimming.
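A rough sketch of that estimate in plain Python, in case it helps. Everything here is an assumption for illustration: the function name, the fixed window size used to define a "region", and the input format (one `(chrom, five_prime_pos, strand)` tuple per mapped read). It only shows the arithmetic, not a validated method.

```python
from collections import defaultdict


def low_coverage_duplication_rate(read_keys, pcr_cycles, window=1000):
    """Fraction of duplicate reads within windows holding fewer than 2**pcr_cycles reads.

    read_keys: iterable of (chrom, five_prime_pos, strand) tuples, one per read.
    window: hypothetical bin size in bp used to define a region.
    Returns None if no window falls under the threshold.
    """
    max_reads = 2 ** pcr_cycles  # X = 2^c from the suggestion above

    # Bin reads into fixed-size windows along each chromosome.
    by_window = defaultdict(list)
    for chrom, pos, strand in read_keys:
        by_window[(chrom, pos // window)].append((chrom, pos, strand))

    total = unique = 0
    for reads in by_window.values():
        if len(reads) < max_reads:      # keep only low-coverage windows
            total += len(reads)
            unique += len(set(reads))   # identical keys count once

    if total == 0:
        return None
    return 1 - unique / total


# Toy data: one sparse window (3 reads, 1 duplicate), one dense window (8 identical reads).
keys = [("chr1", 100, "+")] * 2 + [("chr1", 200, "+")] + [("chr1", 5000, "-")] * 8
rate = low_coverage_duplication_rate(keys, pcr_cycles=2)  # threshold is 4 reads
```

With `pcr_cycles=2` only the sparse window is used, so the estimate is 1 duplicate out of 3 reads. A realistic cycle number gives a much larger threshold, and the window size would need tuning to the data.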

8.0 years ago

My thoughts on this:

  • For RNA-Seq I keep everything, duplicates or not, SE or PE.
  • For ChIP-Seq and similar enrichment experiments (FAIRE, ATAC etc.) I remove duplicates. In theory you do expect duplication, since you sequence a smallish proportion of the genome quite deeply. In practice I get a better signal-to-noise ratio without duplicates, at least up to ~100M reads per library (mammalian genome). This is based on visual inspection, nothing sophisticated.
  • On marking duplicates in single-end reads, I started a discussion here: "Mark duplicates for single end reads: Why only 5'end?"
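For the last point, marking single-end duplicates by 5' end can be sketched roughly as below. This is plain Python on made-up read records (dicts with hypothetical `name`, `chrom`, `start`, `end`, `strand` and `qual_sum` fields), not the actual implementation of Picard or any other tool. Reads sharing chromosome, strand and 5' coordinate form one group; the read with the highest summed base quality is kept and the rest are flagged.

```python
from collections import defaultdict


def mark_duplicates_single_end(reads):
    """Return the names of reads flagged as duplicates.

    reads: list of dicts with keys name, chrom, start (0-based), end (exclusive),
    strand ("+" or "-") and qual_sum (sum of base qualities). All hypothetical.
    """
    groups = defaultdict(list)
    for read in reads:
        # The 5' end is the leftmost base on "+" and the rightmost on "-".
        pos = read["start"] if read["strand"] == "+" else read["end"]
        groups[(read["chrom"], pos, read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the best-quality read as the representative; ties go to the first seen.
        best = max(group, key=lambda r: r["qual_sum"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


reads = [
    {"name": "a", "chrom": "chr1", "start": 100, "end": 150, "strand": "+", "qual_sum": 500},
    {"name": "b", "chrom": "chr1", "start": 100, "end": 140, "strand": "+", "qual_sum": 450},
    {"name": "c", "chrom": "chr1", "start": 105, "end": 150, "strand": "+", "qual_sum": 480},
    {"name": "d", "chrom": "chr1", "start": 60, "end": 110, "strand": "-", "qual_sum": 300},
    {"name": "e", "chrom": "chr1", "start": 70, "end": 110, "strand": "-", "qual_sum": 350},
]
print(sorted(mark_duplicates_single_end(reads)))  # ['b', 'd']
```

Read `b` is flagged even though it is shorter than `a`, because only the 5' end is compared. Read `c` shares its 3' end with `a` but is not flagged.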
