Question: A List Of Regular-Expressions For Recognizing Ids From Various Databases
12
7.7 years ago by
Will4.5k
United States
Will4.5k wrote:

I'm looking to create a list of regular expressions that can distinguish between the IDs of various databases. I know that some will be ambiguous, but at least it could help narrow down which databases to check.

For example:

KEGG IDs: \w{0,3}\d+

Entrez IDs: \d+

RefSeq IDs: \w{2}_\d+\.\d+

Anyone have any to add? This might be a useful community resource.
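
As a rough sketch of how such a list could be used, the Python snippet below checks an identifier against each pattern and returns every database that matches, so ambiguous IDs come back with more than one candidate. The patterns are the three examples from the question, written so that an empty string never matches; the names `ID_PATTERNS` and `candidate_databases` are hypothetical.

```python
import re

# Hypothetical lookup table built from the three example patterns in the question.
# Quantifiers are written so that an empty string never matches.
ID_PATTERNS = {
    "KEGG": re.compile(r"\w{0,3}\d+"),
    "Entrez": re.compile(r"\d+"),
    "RefSeq": re.compile(r"\w{2}_\d+\.\d+"),
}


def candidate_databases(identifier):
    """Return the names of all databases whose pattern matches the whole identifier."""
    return [name for name, pattern in ID_PATTERNS.items()
            if pattern.fullmatch(identifier)]


print(candidate_databases("NM_000546.5"))  # ['RefSeq']
print(candidate_databases("7157"))         # ['KEGG', 'Entrez'] (ambiguous)
print(candidate_databases("K00264"))       # ['KEGG']
```

Using `fullmatch` rather than `search` matters here: without anchoring, `\d+` would match inside almost any accession.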

database • 2.2k views
modified 2.5 years ago by Dchoy40 • written 7.7 years ago by Will4.5k
1

very useful topic

written 7.7 years ago by Casey Bergman18k

changed to community wiki.

written 7.7 years ago by Pierre Lindenbaum121k

Yeah, I've been trying to convert a 'mixed bag' of IDs and I had trouble even placing some of them. Hopefully this will help out.

written 7.7 years ago by Will4.5k
9
7.7 years ago by
Pablacious610
Cambridge, UK
Pablacious610 wrote:

Look at the MIRIAM registry:

http://www.ebi.ac.uk/miriam/main/collections/

They have assembled a large collection of regular expressions for the identifiers/accessions of a number of databases.

written 7.7 years ago by Pablacious610
2

very cool. Thanks !

written 7.7 years ago by Pierre Lindenbaum121k

That is awesome. I never saw that before.

written 7.7 years ago by Will4.5k

Yeah, it's cool; we use it for pretty much what you asked about here.

written 7.7 years ago by Pablacious610
2
7.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:
  • dbSNP: rs[0-9]+
  • Gene Ontology: GO:[0-9]+
  • DOI (from the Connotea bookmarklet): (doi:)?\s?(10\.\d{4}/\S+)
  • LSID (from http://goo.gl/D6PT1):

     String legalID = "[A-Za-z0-9][A-Za-z0-9()+,-.=@;$_!*\'\"%]*";
     String lsidRE = "^[uU][rR][nN]:[lL][sS][iI][dD]:(" + legalID + "):(" + legalID + "):(" + legalID + ")[:]?(" + legalID + ")?$";
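
For anyone working in Python rather than Java, the four patterns above might be ported roughly as follows. This is a sketch, not a reference implementation: the DOI pattern is written with explicit `\s`, `\d`, and `\S` escapes, which is an assumption about the intended expression, and the helper `parse_lsid` and the sample identifiers are made up for illustration.

```python
import re

DBSNP = re.compile(r"rs[0-9]+")
GO = re.compile(r"GO:[0-9]+")
# Assumed intended form of the DOI pattern (escapes written out explicitly).
DOI = re.compile(r"(doi:)?\s?(10\.\d{4}/\S+)")

# Python port of the Java LSID expression; ",\-." lists the same three
# characters that the ",-." range covers in the Java character class.
LEGAL_ID = r"[A-Za-z0-9][A-Za-z0-9()+,\-.=@;$_!*'\"%]*"
LSID = re.compile(
    r"^[uU][rR][nN]:[lL][sS][iI][dD]:(" + LEGAL_ID + r"):(" + LEGAL_ID + r"):("
    + LEGAL_ID + r")[:]?(" + LEGAL_ID + r")?$"
)


def parse_lsid(text):
    """Return (authority, namespace, object, revision) or None if text is not an LSID."""
    match = LSID.match(text)
    return match.groups() if match else None


# Illustrative inputs only.
print(parse_lsid("urn:lsid:example.org:names:42:2"))
print(DOI.fullmatch("doi:10.1234/example.5678").group(2))
print(bool(DBSNP.fullmatch("rs334")), bool(GO.fullmatch("GO:0008150")))
```

The revision part of an LSID is optional, so `parse_lsid` returns `None` in the fourth slot when it is absent.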
    
modified 7.7 years ago • written 7.7 years ago by Pierre Lindenbaum121k
1
7.7 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

For INSDC accession numbers, we have used: (([A-Z]{1}[0-9]{5})|([A-Z]{2}[0-9]{6})|([A-Z]{4}[0-9]{8,9})|([A-Z]{5}[0-9]{7}))(\.[0-9]{1,3})

(Credit to Guy Cochrane at EBI)
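
A minimal Python sketch of how this expression might be applied is shown below. It is a cleaned-up transcription of the pattern above, with ASCII hyphens and balanced parentheses; the function name `is_insdc_accession` is hypothetical, and the sample accessions are only meant to exercise each length class. Note that, as written, the version suffix (for example `.1`) is required; making it optional would be a change to the pattern.

```python
import re

# Transcription of the INSDC pattern: four accession layouts, then a version suffix.
INSDC = re.compile(
    r"(([A-Z]{1}[0-9]{5})|([A-Z]{2}[0-9]{6})|([A-Z]{4}[0-9]{8,9})|([A-Z]{5}[0-9]{7}))"
    r"(\.[0-9]{1,3})"
)


def is_insdc_accession(text):
    """True if the whole string is an accession.version in one of the four layouts."""
    return INSDC.fullmatch(text) is not None


print(is_insdc_accession("U49845.1"))        # True: 1 letter + 5 digits
print(is_insdc_accession("AB123456.2"))      # True: 2 letters + 6 digits
print(is_insdc_accession("AAAA01000001.1"))  # True: 4 letters + 8 digits
print(is_insdc_accession("U49845"))          # False: no version suffix
```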

modified 7.7 years ago • written 7.7 years ago by Casey Bergman18k
1
7.7 years ago by
France
Pierre Poulain440 wrote:

Protein Data Bank (PDB): [0-9][A-Z0-9]{3}

UniProt: [A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9] and [OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]
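
To illustrate, the two patterns could be combined into a small checker like the one below. This is only a sketch: `classify` is a hypothetical helper, the two UniProt alternatives are joined with `|`, and upper-casing the input first is an assumption (PDB codes are often written in lower case).

```python
import re

PDB = re.compile(r"[0-9][A-Z0-9]{3}")
# The two UniProt alternatives from above, joined into one expression.
UNIPROT = re.compile(
    r"[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]"
    r"|[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]"
)


def classify(identifier):
    """Return which of PDB / UniProt the identifier could belong to."""
    ident = identifier.upper()  # assumption: treat IDs case-insensitively
    hits = []
    if PDB.fullmatch(ident):
        hits.append("PDB")
    if UNIPROT.fullmatch(ident):
        hits.append("UniProt")
    return hits


print(classify("1tup"))    # ['PDB']
print(classify("P04637"))  # ['UniProt']
print(classify("A2BC19"))  # ['UniProt']
```

Because the PDB pattern is exactly four characters and the UniProt patterns exactly six, the two cannot both match the same string once the whole identifier is anchored.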

written 7.7 years ago by Pierre Poulain440
1
2.5 years ago by
Dchoy40
Dchoy40 wrote:

For gene annotations in KEGG databases such as

glutamate synthase Glt1, putative; K00264 glutamate synthase (NADPH/NADH) [EC:1.4.1.13 1.4.1.14]

the KEGG orthology (KO) number can be extracted with

(^|[ (\[])(K[0-9]{5})($|[ )\]])

This will work with:

  1. "K00264 glutamate synthase ... " ID at the start
  2. "...putative; K00264 glutamate..." ID in the middle
  3. "Glt1, putative; K00264" ID at the end
  4. "(K23102 K23010)" round brackets
  5. "[K23102 K23010]" square brackets

It will exclude:

  1. " K2041020 " back-extensions
  2. " AK29310 " front-extensions

The three parts are:

(^|[ (\[])

matches the start of the string, a preceding space, or an opening round/square bracket

(K[0-9]{5})

captures the KEGG ID, e.g. K10230 or K20310. This can be replaced with the MetaCyc ID format, etc.

($|[ )\]])

matches the end of the string, a following space, or a closing round/square bracket
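
One possible variation, sketched in Python below, replaces the delimiter groups with lookarounds. Groups that consume the surrounding space or bracket can cause a global search to skip the second of two IDs separated by a single space, because the shared space is already used up by the first match; lookarounds only inspect the neighbouring character. This version is slightly more permissive than the expression above (any non-alphanumeric neighbour counts as a boundary), and `extract_ko` is a hypothetical helper name.

```python
import re

# K + 5 digits, not preceded or followed by a letter or digit.
KO = re.compile(r"(?<![A-Za-z0-9])K[0-9]{5}(?![A-Za-z0-9])")


def extract_ko(annotation):
    """Return all KO numbers found in an annotation string."""
    return KO.findall(annotation)


line = ("glutamate synthase Glt1, putative; K00264 glutamate synthase "
        "(NADPH/NADH) [EC:1.4.1.13 1.4.1.14]")
print(extract_ko(line))               # ['K00264']
print(extract_ko("(K23102 K23010)"))  # ['K23102', 'K23010']
print(extract_ko(" K2041020 "))       # []  back-extension rejected
print(extract_ko(" AK29310 "))        # []  front-extension rejected
```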

written 2.5 years ago by Dchoy40
Powered by Biostar version 2.3.0