I have a specific enriched DNA-seq library to analyze (2x76 bp, sequenced on a NextSeq 500).
The library is defined as:
R1                              R2
==============>-----------------<===========#####@@@@@

=== : DNA fragment (should correctly align to the genome)
### : barcode
@@@ : some random sequence we introduce to increase the library complexity
Important things to know:
- The barcode and the random sequence always have the same lengths (12 and 14 bases, respectively).
- Each pair of reads has a different barcode (only PCR duplicates should have the same barcode and read sequences).
My goal is to remove the barcode and the random sequence from R2, but also from R1, since R1 and R2 can overlap if the DNA fragment to sequence is small (less than 2x76 = 152 bp).
Example of R1 and R2 overlapping. In this case, R1 contains sequence from the barcode:
R1 =====================>
                     ||||
R2       <===========#####@@@@@
Is there a tool that handles such cases? My first idea would be to write an R script that extracts the barcode and random sequence from R2 and aligns them against R1 in a local manner.
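To make the idea concrete, here is a minimal sketch in Python (rather than R) of what such a script might look like. It rests on several assumptions that are not confirmed above: that R2 starts with the 26-base tag (random sequence followed by barcode, as read from the right-hand end of the diagram), that R1 therefore sees the reverse complement of that tag right after the fragment, and that a simple mismatch count between the tail of R1 and the start of the tag is good enough instead of a true local alignment. The names `trim_pair`, `min_overlap` and `max_error_rate` are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def trim_pair(r1, r2, tag_len=26, min_overlap=5, max_error_rate=0.1):
    """Remove the barcode + random tag from one read pair.

    Assumption: the first `tag_len` bases of R2 are the tag (12 + 14 = 26).
    Returns (trimmed R1, trimmed R2, tag in R1 orientation).
    """
    r2_trimmed = r2[tag_len:]
    # The tag as it would appear in R1 if R1 reads through the fragment.
    tag = reverse_complement(r2[:tag_len])

    # Scan R1 from left to right and cut at the first position where the
    # remainder of R1 (up to tag_len bases) matches the start of the tag.
    for start in range(len(r1) - min_overlap + 1):
        n = min(tag_len, len(r1) - start)
        mismatches = hamming(r1[start:start + n], tag[:n])
        if mismatches <= int(max_error_rate * n):
            return r1[:start], r2_trimmed, tag

    # No read-through detected: R1 is left untouched.
    return r1, r2_trimmed, tag
```

Returning the tag alongside the trimmed reads keeps it available for later PCR-duplicate detection. Very short overlaps (a few bases of barcode at the end of R1) can match by chance, so `min_overlap` and `max_error_rate` would need tuning on real data.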