Question

in silico enrichment

0

9.5 years ago

kamhamea • 0

I'm looking for a fast tool to scan high-throughput sequencing reads for patterns, so that the final, thorough analysis can be performed on only the few reads that belong to the cluster.

In practice, I manually generated a set of oligos (~20-mers) that belong uniquely to a cluster of genes, and now I want to find all the reads that match them. The next step is to find neighboring reads, but even the first step, which I implemented in Python as a regex string search, takes weeks on whole-genome sequencing data.

sequencing alignment • 1.3k views

score 1 · Answer 1 · 2016-06-01

1

9.5 years ago

Brian Bushnell 20k

This sounds like a job for BBDuk, which filters reads by matching k-mers against a reference set. It's extremely fast. For example, using 20-mers:

bbduk.sh in=reads.fq outm=matching.fq ref=oligos.fa k=20 mm=f

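Here "outm" receives the reads that share at least one 20-mer with the oligos, and "mm=f" (maskmiddle) turns off the default masking of the middle base of each k-mer, so matches must be exact. Note that if your oligos contain degenerate IUPAC symbols like "N" you should add the flag "copyundefined".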
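
If it helps to see why k-mer matching is so much faster than running one regex per oligo, here is a minimal Python sketch of the general idea: put every 20-mer from the oligos (plus its reverse complement) into a hash set, then slide a window over each read and do set lookups. This is only an illustration, not BBDuk's actual implementation; the function names are made up for this example, it assumes plain 4-line FASTQ records, and it does exact matching with no handling of degenerate IUPAC symbols.

```python
from typing import Iterable, Iterator, Set, Tuple

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_kmer_set(oligos: Iterable[str], k: int = 20) -> Set[str]:
    """Collect every k-mer of every oligo, in both orientations."""
    kmers: Set[str] = set()
    for oligo in oligos:
        oligo = oligo.upper()
        for i in range(len(oligo) - k + 1):
            kmer = oligo[i:i + k]
            kmers.add(kmer)
            kmers.add(reverse_complement(kmer))
    return kmers


def read_matches(read: str, kmers: Set[str], k: int = 20) -> bool:
    """True if any k-length window of the read is in the k-mer set."""
    read = read.upper()
    return any(read[i:i + k] in kmers for i in range(len(read) - k + 1))


def filter_fastq(path: str, kmers: Set[str], k: int = 20) -> Iterator[Tuple[str, str, str, str]]:
    """Yield matching records from a FASTQ file (assumes 4 lines per record)."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:  # end of file
                break
            seq = handle.readline().rstrip()
            plus = handle.readline().rstrip()
            qual = handle.readline().rstrip()
            if read_matches(seq, kmers, k):
                yield header, seq, plus, qual
```

Each read costs one set lookup per window position regardless of how many oligos you have, whereas a per-oligo search grows with the number of oligos. For whole-genome data you would still want a compiled tool like the one above; the sketch is just to show the principle.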
