Question: Searching for a conserved pattern containing a barcode of variable length
4.2 years ago
mazepago0 wrote:

Hi all,

I want to look for a pattern in a sequence that contains conserved flanks with a variable-length wildcard stretch between them.

In particular, I am checking RADseq paired-end data for short loci, aiming to trim off the ligation_adapter from R1 and the cut_site_1-barcode-ligation_adapter from R2.

Such reads look like this:

R1: cut_site_1-NNNNNNNNNN-cut_site_2-ligation_adapter

R2: cut_site_2-NNNNNNNNNN-cut_site_1-barcode-ligation_adapter

The problem is with trimming R2: the cut_site_1 and ligation_adapter sequences are conserved, but there are also 96 different barcodes, whose sequences can be 4-8 bp long. I think I should use a wildcard, but how do I specify a wildcard of varying length?







4.2 years ago
Cardiff University
Daniel3.8k wrote:

A regex like this should work, if I understand you correctly. It looks for A, C, T or G repeated between 4 and 8 times:

[ACTG]{4,8}

You can tweak the visual representation of this here (great tool!)
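To put that quantifier in context, here is a minimal Python sketch of how the 4-8 bp barcode wildcard could sit between the two conserved flanks on R2. The sequences in CUT_SITE_1 and LIGATION_ADAPTER are made-up placeholders, and trim_r2 is a hypothetical helper name, so substitute the real sequences for your library.

```python
import re

# Placeholder sequences (assumptions, not from the original post);
# replace them with the real cut site and adapter of your library.
CUT_SITE_1 = "TGCAG"
LIGATION_ADAPTER = "AGATCGGAAGAGC"

# cut_site_1, then a barcode of 4 to 8 bases, then the ligation adapter.
# re.escape keeps the flanks literal; the barcode is captured as group 1.
R2_TAIL = re.compile(
    re.escape(CUT_SITE_1) + r"([ACGT]{4,8})" + re.escape(LIGATION_ADAPTER)
)


def trim_r2(read):
    """Return (trimmed_read, barcode).

    The read is cut at the start of cut_site_1; barcode is None and the
    read is returned unchanged when the pattern is not found.
    """
    match = R2_TAIL.search(read)
    if match is None:
        return read, None
    return read[:match.start()], match.group(1)


if __name__ == "__main__":
    # Toy R2 read: cut_site_2 + insert + cut_site_1 + barcode + adapter.
    example = "CATG" + "ACGTTTGACC" + CUT_SITE_1 + "ACGTA" + LIGATION_ADAPTER
    print(trim_r2(example))
```

If you apply this to FASTQ records, remember to cut the quality string to the same length as the trimmed sequence.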



