Question: EC number regular expression
6.3 years ago by
Lee Katz3.0k
Atlanta, GA
Lee Katz3.0k wrote:

Has anyone posted a regular expression for EC numbers?  I think this should work but I'm not 100% sure.

/\(EC (\d+(\.n?[\d\-]+){3})\)/
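One quick way to check the pattern above is to run it over a few strings in Python. This is only a sketch: the sample strings and the `extract_ec` helper are made up for illustration and are not taken from real data.

```python
import re

# Pattern from the question, written as a Python raw string.
EC_PAREN = re.compile(r"\(EC (\d+(\.n?[\d\-]+){3})\)")

# Hypothetical test strings, invented for illustration.
samples = [
    "alcohol dehydrogenase (EC 1.1.1.1)",     # complete number
    "some oxidoreductase (EC 1.1.-.-)",       # dashes in the last two levels
    "preliminary entry (EC 1.1.1.n1)",        # 'n' prefix on the last level
    "homoserine dehydrogenase [EC:1.1.1.3]",  # different bracket style
]

def extract_ec(text):
    """Return the captured EC number, or None when the pattern does not match."""
    m = EC_PAREN.search(text)
    return m.group(1) if m else None

for s in samples:
    print(s, "->", extract_ec(s))
```

The first three strings match and the last does not, because the pattern requires the literal "(EC " opener. Note that `[\d\-]+` lets digits and dashes mix within one level, so a string like "(EC 1.1.1-2.3)" would also match; whether that matters depends on the input text.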

I also found this other page but it doesn't actually post a comprehensive regular expression for ECs.  http://www.enzyme-database.org/regex.php

modified 5 weeks ago by O.rka • written 6.3 years ago by Lee Katz
6.3 years ago by
_r_am31k
Baylor College of Medicine, Houston, TX
_r_am31k wrote:

Looks good. Any examples where the EC number might contain a '-' or an 'n'? 

Also, if I might suggest a change that matches the entire EC hierarchy:

EC\d{1,2}(\.\d{0,2}){0,3}

(Add in the '-' and the 'n' if necessary)
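To see what the hierarchy pattern accepts, here is a small Python sketch. The `match_text` helper and the test strings are hypothetical, chosen only to probe the edges of the pattern.

```python
import re

# Hierarchy pattern from this answer, as a Python raw string.
HIERARCHY = re.compile(r"EC\d{1,2}(\.\d{0,2}){0,3}")

def match_text(text):
    """Return the first substring the pattern matches, or None."""
    m = HIERARCHY.search(text)
    return m.group(0) if m else None

print(match_text("EC2"))          # root of the hierarchy
print(match_text("EC1.1"))        # two levels
print(match_text("EC1.1.1.3"))    # all four levels
print(match_text("EC1."))         # trailing dot is accepted, since \d{0,2} allows zero digits
print(match_text("EC 1.1.1.3"))   # None: a space after "EC" is not allowed
print(match_text("EC1.1.1.100"))  # last level is cut off after two digits
```

Two things stand out under these assumptions: the pattern needs "EC" immediately followed by a digit, and `\d{0,2}` truncates any level with three digits, so widening it to `\d+` may be worth considering if the data contains such numbers.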

modified 6.3 years ago • written 6.3 years ago by _r_am

I think this regex implies that EC numbers can have fewer than four tiers, but if I run it without the EC prefix, it will match literally any number.  So maybe {0,3} should be changed to {3}:

\d{1,2}(\.\d{1,2}){3}

And then, to allow each level after the first to be either a dash or one or two digits (sorry, I don't have an example right now, but I'm sure dashes occur):

\d{1,2}(\.(\-|\d{1,2})){3}

How does that look?  I took out the EC because I found examples in my text where EC doesn't exist.
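The three variants discussed in this reply can be compared side by side in Python. This is a rough sketch: the names `LOOSE`, `STRICT`, `WITH_DASH` and `first_match` are made up here, and the input strings are invented examples.

```python
import re

# The {0,3} suggestion with the "EC" prefix removed.
LOOSE = re.compile(r"\d{1,2}(\.\d{0,2}){0,3}")
# Exactly four levels, digits only.
STRICT = re.compile(r"\d{1,2}(\.\d{1,2}){3}")
# Exactly four levels, where levels two to four may be a dash.
WITH_DASH = re.compile(r"\d{1,2}(\.(\-|\d{1,2})){3}")

def first_match(pattern, text):
    """Return the first substring matched by pattern, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(first_match(LOOSE, "sampled 42 genes"))   # matches the bare number 42
print(first_match(STRICT, "sampled 42 genes"))  # None
print(first_match(STRICT, "1.1.1.3"))           # full four-level number
print(first_match(STRICT, "1.1.-.-"))           # None: dashes not allowed
print(first_match(WITH_DASH, "1.1.-.-"))        # dashes accepted
```

So without the prefix, the {0,3} form does match any one- or two-digit number, while the {3} forms insist on four dot-separated levels.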

modified 6.3 years ago • written 6.3 years ago by Lee Katz

Oh, the EC is optional in your data! Boy are we gonna wade through a bunch of false positives!
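To make the false-positive worry concrete, here is a short Python sketch that runs the prefix-free pattern over a sentence containing look-alikes. The sentence and the `find_all` helper are invented for illustration.

```python
import re

# Prefix-free pattern from the reply above, as a Python raw string.
NO_PREFIX = re.compile(r"\d{1,2}(\.(\-|\d{1,2})){3}")

def find_all(text):
    """Return every substring the prefix-free pattern matches."""
    return [m.group(0) for m in NO_PREFIX.finditer(text)]

# Hypothetical text mixing one EC number with a version string and an IP address.
text = "EC 1.1.1.3 was annotated with tool v2.3.0.1 from host 10.0.0.1"
print(find_all(text))
# ['1.1.1.3', '2.3.0.1', '10.0.0.1']
```

Only the first hit is an EC number. Anything with four short dot-separated numeric fields looks the same to this pattern, so some filtering on surrounding context is probably needed when the prefix is missing.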

The difference between our regexes is that mine matches the entire hierarchy, including the roots (like "EC2"), while yours requires all four levels of the hierarchy. If the data has only full EC numbers, your regex does the job better than mine.

I googled a bit and never encountered an EC number that used a '-', but it can surely be included.

modified 6.3 years ago • written 6.3 years ago by _r_am
5 weeks ago by
O.rka230
O.rka230 wrote:

I was parsing some KEGG descriptors and found this to be the most useful:

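# this is python
import re

query = 'homoserine dehydrogenase [EC:1.1.1.3]'

def f(x):
    # Raw string; the dots are escaped so they match only a literal '.'
    # (an unescaped '.' matches any character, e.g. the 'x' in '[EC:1x1x1x3]').
    pattern = r"\[(EC:\d+\.\d+\.\d+\.\d+)\]"
    match = re.search(pattern, x)
    if match is not None:
        # Group 1 is the text between the square brackets.
        return match.group(1)
    return None  # no bracketed EC number found

f(query)
# 'EC:1.1.1.3'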
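If a descriptor lists several EC numbers inside one bracket, the single-match approach above would return nothing. Below is a minimal sketch for that case, assuming the numbers are separated by spaces as in "[EC:2.7.2.4 1.1.1.3]". That layout is an assumption about the input, and the names `BRACKET`, `EC_NUMBER` and `ec_numbers` are hypothetical.

```python
import re

# Everything between "[EC:" and the closing "]".
BRACKET = re.compile(r"\[EC:([^\]]+)\]")
# A complete four-level EC number, digits only.
EC_NUMBER = re.compile(r"\d+\.\d+\.\d+\.\d+")

def ec_numbers(descriptor):
    """Return all four-level EC numbers found inside an [EC:...] block."""
    block = BRACKET.search(descriptor)
    if block is None:
        return []
    return EC_NUMBER.findall(block.group(1))

print(ec_numbers("homoserine dehydrogenase [EC:1.1.1.3]"))
# ['1.1.1.3']
print(ec_numbers("bifunctional enzyme [EC:2.7.2.4 1.1.1.3]"))
# ['2.7.2.4', '1.1.1.3']
```

Incomplete numbers such as "1.1.1.-" are skipped by this sketch, since `EC_NUMBER` requires digits at every level.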
written 5 weeks ago by O.rka