Using python FlashText to do pattern matching in nucleotide sequences
Entering edit mode
9 weeks ago
Leendert ▴ 20

Hi all,

I'm playing with the idea of using FlashText (instead of RegEx) to do some pattern finding in nucleotide sequences. My idea came from the massive speed up seen in the post below:

My basic idea is this; given lets say a sequence AGTCTCTCGCAGGTGCA, I want to scan through all reads in a Fastq file, and extract the coordinates where this sequence occurred. FlashText should quite quickly be able to search for this, and replace with lets say -------------, which I can then just scan through sequences and extract start and end positions of these dashes (or something in line with that concept).

But now comes the problem of FuzzyMatching. FlashText can only handle exact matches (this is where RegEx wins). And as we know very well, nucleotide sequences can mutate (insertions/deletions/base changes). If FlashText is really that fast however, maybe I could just write a function which could simulate all possible combinations of indels and base changes (similar to ie. RegEx {e=5}), and then pass all those strings to FlashText (to find in essence an exact match of all possible combinations).

My question is this: Does such an approach sound feasible, and does anyone know of any python package or software that can generate these mutations/combination strings?

python bioinformatics regex flashtext • 299 views
Entering edit mode

I don't see why indels and substitutions are relevant for pattern matching in this context?

I mean, if you have a string ATGC and you're trying to find this in a longer string, and you need to account for (as you stated) all possible combinations of indels and base changes that'd be basically looking for every possible string that can be ever created?

Entering edit mode

yes exactly, but will it actually find every single string of length lets say n=10 as a match? I don't think so (but please correct me if I'm wrong, which I might definitely be). So is it not then accounting for all possible combinations in the search, but then only finding the patterns that actually exists in the sequence reads?

Entering edit mode

If FlashText is truly that fast, then this approach will work to a point. Eventually the time taken and memory required to store all possible combinations of a sequence of even a fairly modest length will start to outstrip the time taken to actually search a file.

If you want to benchmark it yourself though, its easy to make all kmers of a given length in python with the itertools standard library: Finding 16 mer not present in GRCh38

Using the code at that link for example, the time to enumerate all 10mers is a pretty speedy half a second, but just jumping to 11mers triples the execution time. For all 12mers this is already at 7 seconds, and I am not even attempting to quantify the memory required to store all the possibilities. This doesn't consider even single INDELS however, and doing so would add 1 to the combination space making it blow up even faster - i.e. you're creating an 5^n rather than 4^n search space. Out of curiosity I just tried this with that code for a k=11, and it's an increase in computing time from 1.8s seconds to 20s.

Since itertools returns an iterator of each new combination though, its possible that you could feed FlashText the new strings one by one, which would avoid the memory issue, but suddenly you would be reading your FASTQ file over and over again and then IO would kill you.

This might be an option for very short strings, or for enormously large files where the trade off between enumerating the kmers and actually searching the file starts to become economical.

You may also be able to be clever about it and store a text file of kmers that you enumerate once for a given length, which you then just read in to memory, however that still won't be quick, and those files could also be large (and then you still have to store it in memory)


Login before adding your answer.

Traffic: 1943 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6