What would be a suitable regular expression for representing gene symbols of various species (Homo sapiens, Mus musculus, Rattus norvegicus, etc.)?
Based on the HUGO Gene Nomenclature Committee FAQ: "The "symbol" is a unique series of Latin letters (upper case in human), often with Arabic numerals, which should ideally be no longer than six characters in length."
That would result in a regex like [A-Za-z0-9]{1,6} (or [A-Za-z0-9]+). Looking at some real-world data, however, I have found gene names containing other characters as well, such as a dash ("-") *. Do you know of more such oddities that need to be taken care of?
* These seem to be mitochondrial genes. The names are of the form "mt-[A-Za-z0-9]+".
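To make the difference concrete, here is a minimal Python sketch comparing the strict pattern from the FAQ wording with a relaxed one that also allows the "mt-" prefix. The names STRICT, RELAXED, and is_symbol are made up for this illustration, and the relaxed pattern is only an assumption based on the examples above, not an official rule.

```python
import re

# Strict pattern from the question: 1 to 6 letters or digits.
STRICT = re.compile(r"[A-Za-z0-9]{1,6}")

# Relaxed pattern (an assumption, not an official convention):
# an optional "mt-" prefix followed by one or more letters or digits.
RELAXED = re.compile(r"(?:mt-)?[A-Za-z0-9]+")


def is_symbol(pattern, text):
    """Return True only if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


for candidate in ["TP53", "mt-Co1"]:
    print(candidate, is_symbol(STRICT, candidate), is_symbol(RELAXED, candidate))
```

Using fullmatch rather than search matters here: search would accept "mt-Co1" under the strict pattern because it finds "mt" inside the string.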
Probably the simplest approach is to download the gene info and gene synonyms tables from NCBI and design a regex that captures as much of that as you like.
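As a rough sketch of that approach, the Python below collects symbols and synonyms from a tab-delimited table and counts every character that is not a letter or digit, which tells you what the regex has to allow. It assumes the NCBI gene_info layout in which the third column is Symbol and the fifth is Synonyms (pipe-separated, "-" when empty). The SAMPLE rows are made up for illustration (the GeneID values are placeholders), and the function names are hypothetical.

```python
import io
import re
from collections import Counter

# Illustrative rows only, not copied from NCBI; GeneID values are placeholders.
SAMPLE = (
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n"
    "9606\t1\tTP53\t-\tP53|LFS1\n"
    "10090\t2\tmt-Co1\t-\tCOX1\n"
    "9606\t3\tHLA-A\t-\t-\n"
)


def collect_symbols(handle):
    """Gather official symbols and synonyms from a gene_info-style table."""
    symbols = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.rstrip("\n").split("\t")
        symbols.add(fields[2])
        if fields[4] != "-":  # "-" marks an empty Synonyms field
            symbols.update(fields[4].split("|"))
    return symbols


def odd_characters(symbols):
    """Count every character that is not an ASCII letter or digit."""
    counts = Counter()
    for symbol in symbols:
        counts.update(re.findall(r"[^A-Za-z0-9]", symbol))
    return counts


print(odd_characters(collect_symbols(io.StringIO(SAMPLE))))
```

On a real download you would pass an open file handle instead of the StringIO sample, then build the character class from whatever the counter reports.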
I second Sean here. There are no static conventions for naming gene symbols. It gets more complicated with the numerous designations: culture or cell strain, splicing variants, etc. My suggestion is to find all the possible permutations of your symbols and work your grep to catch all the symbols you need. It's a lot of work, but you can test them out on a tester such as REGex TESTER.
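If you would rather script that trial-and-error loop than use an online tester, something like the following Python sketch lists the symbols a candidate pattern fails to match, so you can refine the pattern until the list is empty. The helper name unmatched and the short symbol list are just for illustration; the second pattern is one possible relaxation, not a recommended standard.

```python
import re


def unmatched(pattern, symbols):
    """Return the symbols that the pattern does not match in full."""
    compiled = re.compile(pattern)
    return [s for s in symbols if compiled.fullmatch(s) is None]


# A small hand-picked list; in practice this would be your full symbol set.
symbols = ["TP53", "mt-Co1", "HLA-A", "Trp53"]

# First attempt: the strict 1-6 alphanumeric pattern.
print(unmatched(r"[A-Za-z0-9]{1,6}", symbols))

# Second attempt: alphanumeric runs joined by single hyphens.
print(unmatched(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", symbols))
```

The first call reports the two hyphenated symbols; the second reports nothing for this list, though a larger symbol set will likely turn up further exceptions.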