4.0 years ago

Hi,

I have a specific enriched DNA-seq library to analyze (2x76 bp, sequenced on a NextSeq 500).

The library is structured as follows:

R1                                                  R2
==============>-----------------<===========#####@@@@@

=== : DNA fragment (should correctly align to the genome)
### : barcode
@@@ : some random sequence we introduce to increase the library complexity


Important things to know:

• The barcode and the random sequence always have the same lengths (12 and 14 bases, respectively)
• Each read pair has a different barcode (only PCR duplicates should share the same barcode and read sequences)
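
To make the duplicate rule above concrete, here is a minimal Python sketch that groups read pairs sharing the same barcode and the same read sequences. It is only an illustration under assumptions: the function name `group_duplicates` is hypothetical, R2 is assumed to start with the 14-base random sequence followed by the 12-base barcode (as the diagram suggests), and matching is exact, so sequencing errors would split true duplicates into separate groups.

```python
from collections import defaultdict


def group_duplicates(pairs, random_len=14, barcode_len=12):
    """Group read pairs that look like PCR duplicates.

    `pairs` is an iterable of (r1, r2) sequence strings. Assumption: R2
    starts with the random sequence, followed by the barcode, followed
    by the DNA fragment. Returns a list of index lists, one per group
    that contains more than one pair.
    """
    tag_len = random_len + barcode_len
    groups = defaultdict(list)
    for index, (r1, r2) in enumerate(pairs):
        barcode = r2[random_len:tag_len]
        # Key on the barcode plus both read sequences (R2 without its tag).
        key = (barcode, r1, r2[tag_len:])
        groups[key].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]
```

Pairs that end up alone in their group are treated as unique molecules; a real pipeline would probably tolerate a mismatch or two in the key.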

My goal is to remove the barcode and the random sequence from R2, and also from R1, since R1 and R2 can overlap when the sequenced DNA fragment is short (less than 2x76 = 152 bp).

Example of R1 and R2 overlapping. In this case, R1 contains sequence from the barcode:

R1 =====================>
                    ||||
R2      <===========#####@@@@@


Is there a tool that handles such cases? My first idea would be to write an R script that extracts the barcode and random sequence from R2 and aligns them locally against R1.
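
For illustration, a minimal sketch of that idea, written in Python rather than R. Everything here is an assumption rather than an existing tool: the name `trim_pair` is hypothetical, R2 is assumed to start with the 14-base random sequence followed by the 12-base barcode (as the diagram suggests), the "local alignment" is reduced to an ungapped comparison that tolerates a fraction of mismatches, and `min_overlap` and `max_mismatch_rate` are arbitrary defaults. Sequences are expected in upper case.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_pair(r1, r2, random_len=14, barcode_len=12,
              min_overlap=8, max_mismatch_rate=0.1):
    """Strip the random sequence and barcode from one read pair.

    Returns (trimmed_r1, trimmed_r2, barcode, random_seq).
    """
    tag_len = random_len + barcode_len
    random_seq = r2[:random_len]
    barcode = r2[random_len:tag_len]
    trimmed_r2 = r2[tag_len:]

    # If R1 reads through the fragment, it continues into the reverse
    # complement of the tag that sits at the start of R2.
    tag_rc = reverse_complement(r2[:tag_len])

    cut = len(r1)  # default: nothing found, keep R1 as it is
    for start in range(0, len(r1) - min_overlap + 1):
        window = r1[start:start + tag_len]  # may be partial near the 3' end
        expected = tag_rc[:len(window)]
        mismatches = sum(a != b for a, b in zip(window, expected))
        if mismatches <= max_mismatch_rate * len(window):
            cut = start
            break
    return r1[:cut], trimmed_r2, barcode, random_seq
```

Quality strings would need to be cut at the same positions, and a short `min_overlap` raises the risk of trimming a chance match, so both thresholds are worth tuning on real data.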

Not what you are asking for, but chances are that you don't actually have to remove these sequences: you can just align the reads as they are, and the extra bases will get soft-clipped.
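
If you go that route, one way to check whether the extra bases really were soft-clipped is to look at the CIGAR strings of the aligned reads. The following Python sketch is only an illustration: the helper name `soft_clipped_ends` is hypothetical, and whether soft clipping happens at all depends on the aligner working in a local mode, which is an assumption here.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_ends(cigar):
    """Return (left, right) soft-clipped base counts for a CIGAR string.

    Hard clips (H) may sit outside the soft clips and are ignored.
    An unavailable CIGAR ("*") yields (0, 0).
    """
    ops = [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]
    core = [item for item in ops if item[1] != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return left, right
```

Keep in mind that reads mapped to the reverse strand are stored reverse-complemented in SAM, so the 26 extra bases from the start of R2 can show up on either side depending on the alignment strand.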

Yes, I know, but it would be nice to have clean reads for further analysis ;)

I think you can use cutadapt; if I'm not mistaken, it will remove the #### and the following nucleotides from R1.

Yes, but in this case each read will have a different adapter to trim.

You can give only the #### sequence as input to cutadapt, allow it to occur anywhere along the read, and have it remove that sequence together with everything that follows it.

Yes, but each read will have a different #### sequence.

Oh, I missed that part on first reading :). Chances are good that you'll end up coding it yourself.