Is there any tool for fuzzy sequence matching?
1
0
Entering edit mode
2.5 years ago
marinegor ▴ 10

Is anyone aware of any tool that is able to perform error-tolerant pattern-matching search on protein FASTA files?

This is a duplicate of my question from bioinformatics.stackexchange.

For example, I want to know, which proteins in my fasta file to match ADNG..C.G regexp (which represents ADNGCG pattern). However, I want to be tolerant to matching errors, meaning that I'm good with any protein that differs in any 1 letter in the motif: ADNG..C.D, ADNG..D.G, MDNG..C.G etc are all good. Running all possible variants through grep is possible, but exponentially long for longer patterns (I usually have 15-20 letters in pattern, and scan ~10e6-7 sequences).

I am aware of the tool agrep (docs). As far as I know, it's not supported anymore, and also does not distinguish letters in which it's allowed to make error. Also, it does not support long enough patterns (with more than 9 errors -- and yes, I tried to recompile it with myself from here and could not get it work). Also, it's not designed for proteins specifically.

I am also aware of ScanProsite and Protein Pattern Find services. The first one is not error-prone, the second one is seems to be a web interface for grep.

sequence • 1.5k views
ADD COMMENT
0
Entering edit mode

I'd recommend against cross-posting. You've mentioned the cross-post here, I'd recommend pasting a link to this post on Bioinfo SE as well.

ADD REPLY
0
Entering edit mode

Done it, thanks for suggestion. And sorry for cross-posting indeed -- I have very little idea about SE and biostars audience overlap. What would be the correct way to do that?

ADD REPLY
0
Entering edit mode

Choose one site and post there only.

ADD REPLY
1
Entering edit mode

Will go for biostars next time -- feels like several times more alive platform than SE.

ADD REPLY
1
Entering edit mode

For bioinformatics, that's the smart choice ;)

ADD REPLY
0
Entering edit mode

Are these DNA or protein sequences? You say protein but all the examples look to be DNA.

ADD REPLY
0
Entering edit mode

Unless I don't know something, sequences I mention are protein -- e.g. they have M, D aminoacids, which I assume are not possible in DNA sequences.

But generally, it shouldn't matter much.

ADD REPLY
0
Entering edit mode

I saw mostly AGC so I assumed DNA, possibly with degenerate characters.

As you say, it won't matter really, it just changes the program you might use.

ADD REPLY
0
Entering edit mode

What is the origin of the patterns of interest? Can't you create HMM profiles of these patterns?

You say ScanProsite is not "error-prone" (by "not error prone", I understand you mean do not allow mismatches in the pattern), I think your assertion is not completely correct. If you want to give a pattern (e.g. P-x(2)-G-E-S-G(2)-[AS]), indeed the specified amino acids have to be present in the pattern. However, ScanProsite also accepts generalized profiles, which will allow for mismatches between target sequence and pattern.

ADD REPLY
0
Entering edit mode

What is the origin of the patterns of interest?

I am basically looking for certain structural features on several different transmembrane helixes in a family.

Can't you create HMM profiles of these patterns?

Well, I could, but I don't know how well are they working with long deletions (which in my case represent long loops between helixes).

You say ScanProsite is not "error-prone" (by "not error prone", I understand you mean do not allow mismatches in the pattern), I think your assertion is not completely correct. If you want to give a pattern (e.g. P-x(2)-G-E-S-G(2)-[AS]), indeed the specified amino acids have to be present in the pattern.

Thanks for pointing that out, I indeed have not read the documentation careful enough!

However, ScanProsite also accepts generalized profiles, which will allow for mismatches between target sequence and pattern

Could you share the generalized pattern documentation? I now see that this one is down, although the official prosite documentation points there.

ADD REPLY
0
Entering edit mode

There is https://pypi.org/project/regex/ which allows fuzzy matching where you can specify how many "errors" you accept and what type of errors you accept.

ADD REPLY
0
Entering edit mode

Was my first shot (forgot to mention in the original post). Terribly slow :)

Also, could not figure out how to make mismatches work exactly the way I want.

ADD REPLY
0
Entering edit mode

try seqkit @ marin

ADD REPLY
0
Entering edit mode

Seems to also fit, but the code explicitly warns me that more than 4 mismatches would really slow down the process -- but I'll test it anyway, thanks.

Also, are you aware of the compexity of algorithm that they use? They do not say that explicitly, and I'm not really proficient with Go to find it myself.

ADD REPLY
0
Entering edit mode

@ marinegor author shenwei356 regular visitor here.

ADD REPLY
2
Entering edit mode
2.5 years ago
GenoMax 115k

fuzzpro from EMBOSS should fit the bill.

ADD COMMENT
0
Entering edit mode

Fits the bill indeed, as far as I can see. Are you aware of any implementation detail and/or related publication?

ADD REPLY
0
Entering edit mode

All the documentation you could need is at the link genomax provided.

ADD REPLY
0
Entering edit mode

I couldn't find the algorithm description, or at least it's assymptotic complexity estimates.

And a bummer -- fuzzpro does not allow one to make "not more than N mismatches", only "exactly N mismatches". So if I want to find sequences diverging on not more than 10, I effectively have to do for i in `seq 1 1 10`; do fuzzpro ... -pmismatch "$i" ...; done

which is sad.

ADD REPLY
0
Entering edit mode

You might have to get creative. There are other tools out there for this kind of work, but it does feel an awful lot like a job for regex.

Perhaps take a look at, for example, https://github.com/seatgeek/fuzzywuzzy

ADD REPLY

Login before adding your answer.

Traffic: 594 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6