I have been wondering how to handle PCR duplicate reads in single-end RNA-Seq or ChIP-Seq data sets. For single-end data, it is not advisable to remove duplicates just by looking at the start position of reads. Many posts and blogs I have read suggest accounting for duplicate reads at the counting step instead of removing them. Has anybody done this before, i.e. accounted for duplicate reads while counting at the gene/exon level? Are there any tools for it?
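To make concrete what I mean by accounting for duplicates at the counting step, here is a minimal Python sketch. It assumes reads have already been assigned to genes and are available as (gene, chrom, strand, five_prime_pos) tuples; that layout and the function name are hypothetical, chosen only for illustration, and do not come from any existing tool. It reports the raw count next to the number of distinct start positions per gene, so the two can be compared instead of discarding reads up front.

```python
from collections import defaultdict

def count_with_duplicates(reads):
    """Return {gene: (raw_count, distinct_start_count)}.

    `reads` is an iterable of (gene, chrom, strand, five_prime_pos) tuples.
    This record layout is a hypothetical one used for illustration only.
    """
    raw = defaultdict(int)        # every read assigned to the gene
    distinct = defaultdict(set)   # unique (chrom, strand, 5' position) keys
    for gene, chrom, strand, pos in reads:
        raw[gene] += 1
        distinct[gene].add((chrom, strand, pos))
    return {gene: (raw[gene], len(distinct[gene])) for gene in raw}

# Toy input: geneA has two reads starting at the same position.
reads = [
    ("geneA", "chr1", "+", 100),
    ("geneA", "chr1", "+", 100),
    ("geneA", "chr1", "+", 150),
    ("geneB", "chr2", "-", 900),
]
counts = count_with_duplicates(reads)
for gene, (n_raw, n_distinct) in sorted(counts.items()):
    print(gene, n_raw, n_distinct)
```

A large gap between the two numbers for a gene could mean PCR duplication, but for highly expressed genes in single-end data it can just as well be genuine independent fragments, which is exactly why I am hesitant to remove them by start position alone.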
One more thing I am wondering about: if we pre-filter the data, some reads will be trimmed at the beginning or the end because of poor-quality bases. These reads will then not share the same start/end positions when mapped to the genome. As far as I understand, tools like Picard MarkDuplicates would not recognise them as duplicates, even if they are PCR duplicates, because they have different start and end coordinates (and in fact different CIGAR strings, since the read lengths differ). How is everyone handling this? Is assuming that PCR duplicates will not have a significant effect on the end results one way to go?
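To illustrate the trimming concern, here is a small Python sketch. It assumes duplicates are keyed by chromosome, strand and the 5' coordinate of the read, which is my reading of how position-based duplicate marking generally works and not a reproduction of Picard's code. It also assumes ungapped alignments and 0-based leftmost start positions; all function names are hypothetical.

```python
def aligned_span(strand, start, length, trim5=0, trim3=0):
    """Return (start, length) of an ungapped alignment after trimming.

    trim5/trim3 are bases removed from the read's own 5' and 3' ends.
    On the '-' strand the read's 3' end is the leftmost reference base.
    """
    left_trim = trim5 if strand == "+" else trim3
    return start + left_trim, length - trim5 - trim3

def five_prime_key(chrom, strand, start, length):
    """Duplicate key: leftmost base on '+', rightmost base on '-'."""
    pos = start if strand == "+" else start + length - 1
    return (chrom, strand, pos)

# One original fragment and two trimmed copies of it on the '+' strand.
original = five_prime_key("chr1", "+", *aligned_span("+", 1000, 50))
trimmed_3p = five_prime_key("chr1", "+", *aligned_span("+", 1000, 50, trim3=10))
trimmed_5p = five_prime_key("chr1", "+", *aligned_span("+", 1000, 50, trim5=5))

# The same 3' trimming applied to a '-' strand read.
minus_original = five_prime_key("chr1", "-", *aligned_span("-", 1000, 50))
minus_trimmed_3p = five_prime_key("chr1", "-", *aligned_span("-", 1000, 50, trim3=10))

print(original == trimmed_3p)              # 3' trimming keeps the key
print(original == trimmed_5p)              # 5' trimming changes the key
print(minus_original == minus_trimmed_3p)  # same holds on the '-' strand
```

Under these assumptions only trimming at the 5' end moves a read out of its duplicate group, so the question is mostly about reads that lose bases at the beginning. Soft-clipping during alignment is a different case, since the clipped bases are still recorded in the CIGAR string.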