Question

Is there any tool for fuzzy sequence matching?

0

Entering edit mode

4.4 years ago

marinegor ▴ 10

Is anyone aware of any tool that is able to perform error-tolerant pattern-matching search on protein FASTA files?

This is a duplicate of my question from bioinformatics.stackexchange.

For example, I want to know, which proteins in my fasta file to match ADNG..C.G regexp (which represents ADNGCG pattern). However, I want to be tolerant to matching errors, meaning that I'm good with any protein that differs in any 1 letter in the motif: ADNG..C.D, ADNG..D.G, MDNG..C.G etc are all good. Running all possible variants through grep is possible, but exponentially long for longer patterns (I usually have 15-20 letters in pattern, and scan ~10e6-7 sequences).

I am aware of the tool agrep (docs). As far as I know, it's not supported anymore, and also does not distinguish letters in which it's allowed to make error. Also, it does not support long enough patterns (with more than 9 errors -- and yes, I tried to recompile it with myself from here and could not get it work). Also, it's not designed for proteins specifically.

I am also aware of ScanProsite and Protein Pattern Find services. The first one is not error-prone, the second one is seems to be a web interface for grep.

sequence • 2.6k views

ADD COMMENT • link updated 4.4 years ago by GenoMax 141k • written 4.4 years ago by marinegor ▴ 10

0

Entering edit mode

I'd recommend against cross-posting. You've mentioned the cross-post here, I'd recommend pasting a link to this post on Bioinfo SE as well.

ADD REPLY • link 4.4 years ago by Ram 43k

0

Entering edit mode

Done it, thanks for suggestion. And sorry for cross-posting indeed -- I have very little idea about SE and biostars audience overlap. What would be the correct way to do that?

ADD REPLY • link 4.4 years ago by marinegor ▴ 10

0

Entering edit mode

Choose one site and post there only.

ADD REPLY • link 4.4 years ago by Joe 21k

1

Entering edit mode

Will go for biostars next time -- feels like several times more alive platform than SE.

ADD REPLY • link 4.4 years ago by marinegor ▴ 10

1

Entering edit mode

For bioinformatics, that's the smart choice ;)

ADD REPLY • link 4.4 years ago by Joe 21k

0

Entering edit mode

Are these DNA or protein sequences? You say protein but all the examples look to be DNA.

ADD REPLY • link 4.4 years ago by Joe 21k

0

Entering edit mode

Unless I don't know something, sequences I mention are protein -- e.g. they have M, D aminoacids, which I assume are not possible in DNA sequences.

But generally, it shouldn't matter much.

ADD REPLY • link 4.4 years ago by marinegor ▴ 10

0

Entering edit mode

I saw mostly AGC so I assumed DNA, possibly with degenerate characters.

As you say, it won't matter really, it just changes the program you might use.

ADD REPLY • link 4.4 years ago by Joe 21k

0

Entering edit mode

What is the origin of the patterns of interest? Can't you create HMM profiles of these patterns?

You say ScanProsite is not "error-prone" (by "not error prone", I understand you mean do not allow mismatches in the pattern), I think your assertion is not completely correct. If you want to give a pattern (e.g. P-x(2)-G-E-S-G(2)-[AS]), indeed the specified amino acids have to be present in the pattern. However, ScanProsite also accepts generalized profiles, which will allow for mismatches between target sequence and pattern.

ADD REPLY • link 4.4 years ago by h.mon 35k

0

Entering edit mode

What is the origin of the patterns of interest?

I am basically looking for certain structural features on several different transmembrane helixes in a family.

Can't you create HMM profiles of these patterns?

Well, I could, but I don't know how well are they working with long deletions (which in my case represent long loops between helixes).

You say ScanProsite is not "error-prone" (by "not error prone", I understand you mean do not allow mismatches in the pattern), I think your assertion is not completely correct. If you want to give a pattern (e.g. P-x(2)-G-E-S-G(2)-[AS]), indeed the specified amino acids have to be present in the pattern.

Thanks for pointing that out, I indeed have not read the documentation careful enough!

However, ScanProsite also accepts generalized profiles, which will allow for mismatches between target sequence and pattern

Could you share the generalized pattern documentation? I now see that this one is down, although the official prosite documentation points there.

ADD REPLY • link 4.4 years ago by marinegor ▴ 10

0

Entering edit mode

There is https://pypi.org/project/regex/ which allows fuzzy matching where you can specify how many "errors" you accept and what type of errors you accept.

ADD REPLY • link 4.4 years ago by massa.kassa.sc3na ▴ 600

0

Entering edit mode

Was my first shot (forgot to mention in the original post). Terribly slow :)

Also, could not figure out how to make mismatches work exactly the way I want.

ADD REPLY • link 4.4 years ago by marinegor ▴ 10

0

Entering edit mode

try seqkit @ marin

ADD REPLY • link 4.4 years ago by cpad0112 21k

0

Entering edit mode

Seems to also fit, but the code explicitly warns me that more than 4 mismatches would really slow down the process -- but I'll test it anyway, thanks.

Also, are you aware of the compexity of algorithm that they use? They do not say that explicitly, and I'm not really proficient with Go to find it myself.

ADD REPLY • link 4.4 years ago by marinegor ▴ 10

0

Entering edit mode

@ marinegor author shenwei356 regular visitor here.

ADD REPLY • link 4.4 years ago by cpad0112 21k

score 2 · Answer 1 · 2019-11-21

2

Entering edit mode

4.4 years ago

GenoMax 141k

fuzzpro from EMBOSS should fit the bill.

ADD COMMENT • link 4.4 years ago by GenoMax 141k

0

Entering edit mode

Fits the bill indeed, as far as I can see. Are you aware of any implementation detail and/or related publication?

ADD REPLY • link 4.4 years ago by marinegor ▴ 10

0

Entering edit mode

All the documentation you could need is at the link genomax provided.

ADD REPLY • link 4.4 years ago by Joe 21k

0

Entering edit mode

I couldn't find the algorithm description, or at least it's assymptotic complexity estimates.

And a bummer -- fuzzpro does not allow one to make "not more than N mismatches", only "exactly N mismatches". So if I want to find sequences diverging on not more than 10, I effectively have to do for i in `seq 1 1 10`; do fuzzpro ... -pmismatch "$i" ...; done

which is sad.

ADD REPLY • link 4.4 years ago by marinegor ▴ 10

0

Entering edit mode

You might have to get creative. There are other tools out there for this kind of work, but it does feel an awful lot like a job for regex.

Perhaps take a look at, for example, https://github.com/seatgeek/fuzzywuzzy

ADD REPLY • link 4.4 years ago by Joe 21k