Matching Strings With Mismatches
11.0 years ago
Krisr ▴ 470

I am using perl to match short nucleotide sequences against fasta sequences...

($GeneFasta =~ /$searchSeq/g) I would like to perform this match, but allow for a mismatch in the search. Does anyone know whether, and how, Perl can accomplish this?

This is a bad idea. Why don't you use a short-read aligner instead?

The Bio::Grep module is pretty good, as it provides a common interface to several different fuzzy matchers, my favorite being Vmatch.

agrep (i.e., approximate grep) is a nice tool for this sort of thing. It's not a standard Linux tool, but it is a good one. Here's one implementation: ftp://ftp.cs.arizona.edu/agrep/ From the README at the above URL: "...for example, "agrep -2 homogenos foo" will find homogeneous as well as any other word that can be obtained from homogenos with at most 2 substitutions, insertions, or deletions."

Thanks. I'm impressed by the quality of this tool.

Yeah, believe it or not, 3 years ago I hacked it briefly as a short-read aligner.

You are looking for a fuzzy pattern matching program. Try the Perl module String::Approx: "Perl extension for approximate matching (fuzzy matching)". For fuzzy pattern matching exercises and scripts, go through the VCU bioinformatics notes on pattern matching.

I've had some issues with that module - both false positives and misses.
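If only substitutions matter (no insertions or deletions), the search can also be done without any module by sliding a window along the sequence and counting mismatches. The following is just a rough sketch of that idea, not what String::Approx does internally; the helper name find_with_mismatches and the example sequences are made up for illustration, and it assumes both strings are in the same case.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): return the 0-based start of
# every window of $seq that differs from $query in at most $max_mm positions.
# Substitutions only; insertions and deletions are not considered.
sub find_with_mismatches {
    my ($seq, $query, $max_mm) = @_;
    my $qlen = length $query;
    my @hits;
    for my $start (0 .. length($seq) - $qlen) {
        my $window = substr($seq, $start, $qlen);
        # XOR of two equal-length strings gives "\0" wherever the characters
        # agree, so counting the non-NUL bytes counts the mismatches.
        my $mm = (($window ^ $query) =~ tr/\0//c);
        push @hits, $start if $mm <= $max_mm;
    }
    return @hits;
}

# Made-up example data.
my @hits = find_with_mismatches("TTACGTTTAGGTAA", "ACGT", 1);
print "hit at $_\n" for @hits;
```

This checks every window, so it is fine for short queries against single genes but will be slow on large sequence sets, which is where the aligners mentioned above make more sense.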

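Assigning a regexp to a scalar as a plain string is fragile in Perl for sub-sequence pattern matches (the string quoting processes backslash escapes before the regex engine sees them), e.g. $searchSeq = "AAA[TA]";

Instead, use the quote-regex (qr) operator, which compiles the pattern when it is defined: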

$searchSeq = qr/AAA[TA]/;
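
The same operator can be used to answer the original question for a single mismatch: build one alternative per position with that position replaced by a wildcard, and compile the lot with qr. This is only a sketch under a few assumptions: the helper name one_mismatch_regex and the sequences are invented for the example, the query is non-empty, and only substitutions are allowed.

```perl
use strict;
use warnings;

# Hypothetical helper: build a compiled regex that matches $query exactly
# or with any single position substituted by another character.
sub one_mismatch_regex {
    my ($query) = @_;
    my @alts;
    for my $i (0 .. length($query) - 1) {
        my $pre  = quotemeta(substr($query, 0, $i));
        my $post = quotemeta(substr($query, $i + 1));
        # '.' stands in for the one position allowed to differ.
        push @alts, $pre . '.' . $post;
    }
    my $pattern = join '|', @alts;
    return qr/(?:$pattern)/;
}

# Made-up example data.
my $searchSeq = one_mismatch_regex("ACGT");
my $GeneFasta = "TTACGTTTAGGTAA";
my @found;
while ($GeneFasta =~ /$searchSeq/g) {
    my $start = $-[0];
    my $match = substr($GeneFasta, $start, $+[0] - $start);
    push @found, "$start:$match";
    print "match '$match' at $start\n";
}
```

Note that a /g loop like this reports non-overlapping matches only, and the pattern grows with the query length, so it is practical for short search sequences and one mismatch rather than as a general fuzzy matcher.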