I have HLA-based NGS data from a MiSeq. How should I deal with overlapping paired-end (PE) reads, where read 1 and read 2 of a pair overlap by more than 90% or even contain exactly the same sequence? I am working on a pre-processing script that plugs into an existing pipeline.
Merge them. The merging program will keep the nucleotide with the better quality at each position.
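To make the idea concrete, here is a minimal Python sketch of that kind of merge, not the code of any particular merging tool. It assumes Phred+33 quality strings, that read 2 is the reverse complement of the far end of the fragment, and that the fragment is at least as long as each read (no read-through into adapter). The function names and the `min_overlap` and `max_mismatch_frac` parameters are made up for illustration. Real mergers also recompute the quality of the merged base, which this sketch does not do.

```python
# Hypothetical helper names; a sketch only, not a replacement for a real read merger.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_overlap(read1, read2_rc, min_overlap=10, max_mismatch_frac=0.1):
    """Longest overlap between the end of read1 and the start of read2_rc.

    Returns the overlap length, or 0 if no overlap of at least min_overlap
    bases has a mismatch fraction of max_mismatch_frac or less.
    """
    for length in range(min(len(read1), len(read2_rc)), min_overlap - 1, -1):
        tail = read1[-length:]
        head = read2_rc[:length]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * length:
            return length
    return 0


def merge_pair(seq1, qual1, seq2, qual2, min_overlap=10):
    """Merge a read pair, keeping the higher-quality base in the overlap.

    Returns (sequence, quality) or None if the reads do not overlap.
    """
    seq2_rc = reverse_complement(seq2)
    qual2_rev = qual2[::-1]  # qualities must follow the reversed bases
    n = find_overlap(seq1, seq2_rc, min_overlap)
    if n == 0:
        return None

    offset = len(seq1) - n  # where the overlap starts in read 1
    merged_seq = list(seq1[:offset])
    merged_qual = list(qual1[:offset])
    for i in range(n):
        base1, q1 = seq1[offset + i], qual1[offset + i]
        base2, q2 = seq2_rc[i], qual2_rev[i]
        # Phred+33 characters sort in the same order as their scores.
        if q1 >= q2:
            merged_seq.append(base1)
            merged_qual.append(q1)
        else:
            merged_seq.append(base2)
            merged_qual.append(q2)
    merged_seq.extend(seq2_rc[n:])
    merged_qual.extend(qual2_rev[n:])
    return "".join(merged_seq), "".join(merged_qual)
```

With pairs that overlap by 100%, the merged read is simply the fragment itself, with the better base call chosen at every position.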
Is this amplicon sequencing data (i.e. sequenced PCR products)? With libraries made by DNA fragmentation, I think this problem is much less likely to occur.
After mapping PE reads, I usually soft-clip the overlapping part of one of the two reads. There is a nice program for this: clipOverlap. I think it is better to clip after mapping than to merge the reads as Alvaro suggests. Also be aware that if read pairs overlap by 100%, some aligners might not flag them as "mapped in proper pair", although I think they should be.
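If it helps for the pre-processing script, here is a small Python sketch for spotting such pairs after mapping. It only does the arithmetic: it assumes you have already pulled the SAM flag and the 0-based, half-open reference interval of each mate out of the BAM file with whatever parser you use. The function names and the 0.9 threshold are my own choices for illustration; the only fixed values are the SAM flag bits 0x1 (paired) and 0x2 (properly aligned according to the aligner).

```python
# SAM flag bits as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2


def overlap_length(start1, end1, start2, end2):
    """Number of reference bases covered by both mates (0 if none)."""
    return max(0, min(end1, end2) - max(start1, start2))


def overlap_fraction(start1, end1, start2, end2):
    """Overlap as a fraction of the shorter mate's aligned length."""
    shorter = min(end1 - start1, end2 - start2)
    if shorter <= 0:
        return 0.0
    return overlap_length(start1, end1, start2, end2) / shorter


def heavily_overlapping_not_proper(flag, start1, end1, start2, end2, threshold=0.9):
    """True for paired reads that overlap heavily but lack the proper-pair flag."""
    if not flag & PAIRED:
        return False
    heavy = overlap_fraction(start1, end1, start2, end2) >= threshold
    return heavy and not flag & PROPER_PAIR
```

Pairs reported by the last function are the ones worth checking by hand before any downstream step filters on the proper-pair flag.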