I am curious about deduplication as a step in processing sequencing reads. I have done it a handful of times so far, and it always helped in the end, but I am aware that there is a debate about whether it is actually biologically correct to do.
What I usually do is map the reads, get the BAM file, and submit it to Picard to
I have three questions:
- How do people deduplicate by mapping position when the alignments are in a PSL file?
- When would you say that deduplication is too risky?
- I developed my own tool (although some already exist) to remove duplicates by sequence identity. Without going into the details of the algorithm, I can tell you that the reads removed by Picard and by my script overlap by 99% (not 100%, though; a few reads differ). Is this approach theoretically correct?
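
To make the first and third questions concrete, here is a minimal Python sketch of the two strategies being compared. It is not my actual tool and not what Picard does internally; the function names `dedup_by_sequence` and `dedup_by_psl_position` are hypothetical. It assumes exact sequence matches, keeps the first read seen, ignores reverse complements and base qualities, and assumes one alignment per read in the standard 21-column PSL layout (strand in column 9, qName in column 10, tName in column 14, tStart and tEnd in columns 16 and 17).

```python
def dedup_by_sequence(reads):
    """Keep the name of the first read seen for each exact sequence.

    `reads` is an iterable of (name, sequence) pairs.
    Assumption: case-insensitive exact identity, no reverse-complement check.
    """
    seen = set()
    kept = []
    for name, seq in reads:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


def dedup_by_psl_position(psl_lines):
    """Keep the first query per (target, strand, tStart, tEnd).

    `psl_lines` is an iterable of tab-separated PSL lines.
    Assumption: each read has a single alignment in the file.
    """
    seen = set()
    kept = []
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the optional PSL header and any malformed line.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        strand, q_name = fields[8], fields[9]
        t_name = fields[13]
        t_start, t_end = int(fields[15]), int(fields[16])
        key = (t_name, strand, t_start, t_end)
        if key not in seen:
            seen.add(key)
            kept.append(q_name)
    return kept
```

Comparing `set(dedup_by_sequence(...))` with `set(dedup_by_psl_position(...))` on the same reads shows where the two strategies disagree: for example, reads with identical sequence that map to different positions, or reads at the same position that differ by a sequencing error.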