Searching for a conserved pattern containing a barcode of variable length
8.2 years ago
mazepago • 0

Hi all,

I want to search a sequence for a pattern that consists of conserved flanks with a variable-length wildcard region between them.

In particular, I am checking RADseq paired-end data and looking for short loci, aiming to trim the ligation_adapter from R1 and the cut_site_1-barcode-ligation_adapter from R2.

Such reads look like this:

R1: cut_site_1-NNNNNNNNNN-cut_site_2-ligation_adapter
R2: cut_site_2-NNNNNNNNNN-cut_site_1-barcode-ligation_adapter

The problem is with trimming R2: the cut_site_1 and ligation_adapter sequences are conserved, but there are 96 different barcodes, each 4-8 bp long. I think I should use a wildcard, but how do I specify a wildcard of varying length?

Glib

8.2 years ago
Daniel ★ 4.0k

If I understand you correctly, a regex like this should work. It matches any of A, C, T or G repeated between 4 and 8 times between the conserved flanks.

cut_site_1-[ACTG]{4,8}-cut_site_2-ligation_adapter

You can tweak the visual representation of this here (great tool!)
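The same {4,8} quantifier can be applied to the R2 layout from the question (cut_site_1-barcode-ligation_adapter). Here is a minimal Python sketch of that, assuming exact matches only. The CUT_SITE_1 and LIGATION_ADAPTER values and the trim_r2 helper are hypothetical placeholders that are not from the original post, so substitute your real sequences.

```python
import re

# Placeholder sequences (assumptions, not from the post): replace with
# the real cut site 1 and ligation adapter used in your library prep.
CUT_SITE_1 = "TGCAG"
LIGATION_ADAPTER = "AGATCGGAAGAGC"

# cut_site_1, then a 4-8 bp barcode (captured), then the adapter.
# The doubled braces produce a literal {4,8} inside the f-string.
R2_TAIL = re.compile(f"{CUT_SITE_1}([ACGT]{{4,8}}){LIGATION_ADAPTER}")


def trim_r2(read):
    """Return (trimmed_read, barcode); barcode is None if nothing matched."""
    match = R2_TAIL.search(read)
    if match is None:
        return read, None
    # Keep only the bases before cut_site_1.
    return read[:match.start()], match.group(1)


if __name__ == "__main__":
    example = "AATTCCGGAA" + CUT_SITE_1 + "ACGTAC" + LIGATION_ADAPTER + "TTTT"
    print(trim_r2(example))
```

Note that a plain regex like this does not tolerate sequencing errors in the flanks, so reads with a mismatch in the cut site or adapter are returned untrimmed.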
