Is there any tool for fuzzy sequence matching?
1
0
Entering edit mode
2.5 years ago
marinegor ▴ 10

Is anyone aware of any tool that is able to perform error-tolerant pattern-matching search on protein FASTA files?

This is a duplicate of my question from bioinformatics.stackexchange.

For example, I want to know, which proteins in my fasta file to match ADNG..C.G regexp (which represents ADNGCG pattern). However, I want to be tolerant to matching errors, meaning that I'm good with any protein that differs in any 1 letter in the motif: ADNG..C.D, ADNG..D.G, MDNG..C.G etc are all good. Running all possible variants through grep is possible, but exponentially long for longer patterns (I usually have 15-20 letters in pattern, and scan ~10e6-7 sequences).

I am aware of the tool agrep (docs). As far as I know, it's not supported anymore, and also does not distinguish letters in which it's allowed to make error. Also, it does not support long enough patterns (with more than 9 errors -- and yes, I tried to recompile it with myself from here and could not get it work). Also, it's not designed for proteins specifically.

I am also aware of ScanProsite and Protein Pattern Find services. The first one is not error-prone, the second one is seems to be a web interface for grep.

sequence • 1.5k views
0
Entering edit mode

I'd recommend against cross-posting. You've mentioned the cross-post here, I'd recommend pasting a link to this post on Bioinfo SE as well.

0
Entering edit mode

Done it, thanks for suggestion. And sorry for cross-posting indeed -- I have very little idea about SE and biostars audience overlap. What would be the correct way to do that?

0
Entering edit mode

Choose one site and post there only.

1
Entering edit mode

Will go for biostars next time -- feels like several times more alive platform than SE.

1
Entering edit mode

For bioinformatics, that's the smart choice ;)

0
Entering edit mode

Are these DNA or protein sequences? You say protein but all the examples look to be DNA.

0
Entering edit mode

Unless I don't know something, sequences I mention are protein -- e.g. they have M, D aminoacids, which I assume are not possible in DNA sequences.

But generally, it shouldn't matter much.

0
Entering edit mode

I saw mostly AGC so I assumed DNA, possibly with degenerate characters.

As you say, it won't matter really, it just changes the program you might use.

0
Entering edit mode

What is the origin of the patterns of interest? Can't you create HMM profiles of these patterns?

You say ScanProsite is not "error-prone" (by "not error prone", I understand you mean do not allow mismatches in the pattern), I think your assertion is not completely correct. If you want to give a pattern (e.g. P-x(2)-G-E-S-G(2)-[AS]), indeed the specified amino acids have to be present in the pattern. However, ScanProsite also accepts generalized profiles, which will allow for mismatches between target sequence and pattern.

0
Entering edit mode

What is the origin of the patterns of interest?

I am basically looking for certain structural features on several different transmembrane helixes in a family.

Can't you create HMM profiles of these patterns?

Well, I could, but I don't know how well are they working with long deletions (which in my case represent long loops between helixes).

You say ScanProsite is not "error-prone" (by "not error prone", I understand you mean do not allow mismatches in the pattern), I think your assertion is not completely correct. If you want to give a pattern (e.g. P-x(2)-G-E-S-G(2)-[AS]), indeed the specified amino acids have to be present in the pattern.

Thanks for pointing that out, I indeed have not read the documentation careful enough!

However, ScanProsite also accepts generalized profiles, which will allow for mismatches between target sequence and pattern

Could you share the generalized pattern documentation? I now see that this one is down, although the official prosite documentation points there.

0
Entering edit mode

There is https://pypi.org/project/regex/ which allows fuzzy matching where you can specify how many "errors" you accept and what type of errors you accept.

0
Entering edit mode

Was my first shot (forgot to mention in the original post). Terribly slow :)

Also, could not figure out how to make mismatches work exactly the way I want.

0
Entering edit mode

try seqkit @ marin

0
Entering edit mode

Seems to also fit, but the code explicitly warns me that more than 4 mismatches would really slow down the process -- but I'll test it anyway, thanks.

Also, are you aware of the compexity of algorithm that they use? They do not say that explicitly, and I'm not really proficient with Go to find it myself.

0
Entering edit mode

@ marinegor author shenwei356 regular visitor here.

2
Entering edit mode
2.5 years ago
GenoMax 115k

fuzzpro from EMBOSS should fit the bill.

0
Entering edit mode

Fits the bill indeed, as far as I can see. Are you aware of any implementation detail and/or related publication?

0
Entering edit mode

All the documentation you could need is at the link genomax provided.

0
Entering edit mode

I couldn't find the algorithm description, or at least it's assymptotic complexity estimates.

And a bummer -- fuzzpro does not allow one to make "not more than N mismatches", only "exactly N mismatches". So if I want to find sequences diverging on not more than 10, I effectively have to do for i in seq 1 1 10; do fuzzpro ... -pmismatch "\$i" ...; done

0
Entering edit mode

You might have to get creative. There are other tools out there for this kind of work, but it does feel an awful lot like a job for regex.

Perhaps take a look at, for example, https://github.com/seatgeek/fuzzywuzzy