The Enzyme Commission (EC) number is used to classify enzymes by their function (see "Enzyme Commission number"). A number of enzyme databases use the EC number as their primary identifier.
Since EC numbers can only be applied to proteins, and the main protein sequence database is UniProtKB, you can start by searching UniProtKB with your EC number and taxon name. Then you'll need to map the proteins found onto an EST database.
To start with, let's see how many ESTs are in the INSDC databases. In my case I'll use EMBL-Bank on EMBL-EBI's SRS server, but GenBank or DDBJ will give the same answers:
- Go to http://srs.ebi.ac.uk/
- Click the "Library Page" tab.
- Select "EMBL" and click the "Extended Query Form" button.
- Un-check the "Use wildcards" option.
- For the "Data Class" field select the EST section (i.e. "est").
- In the "Taxon" field type your taxon (e.g. Asteraceae).
- Click the "Search" button at the top of the page.
This finds 1,081,600 entries in EMBL-Bank matching the query. Quickly checking how many of these have been annotated as having a coding sequence (CDS), by linking to EMBLCDS, I find no entries. So the easy way of doing this, by exploiting links to the corresponding protein sequences, is not going to work.
Okay, let's put that to one side and have a look at the protein part of the problem. This time I'm searching UniProtKB for the proteins from the selected taxon with the desired EC number annotation:
- Click the "Library Page" tab.
- Select "UniProtKB" and click the "Extended Query Form" button.
- In the "ECNumber" field type your EC number (e.g. "2.1.2.1").
- In the "Taxonomy" field type your taxon (e.g. Asteraceae).
- Click the "Search" button at the top of the page.
This gives two entries: P49357 and P49358.
FWIW the same query using UniProt.org is:
taxonomy:asteraceae AND ec:2.1.2.1
This gives the same two entries.
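If you want to script the UniProt.org query rather than type it into the search box, here is a minimal Python sketch that only assembles the query string and a URL. The function names are my own, and the endpoint in `ASSUMED_BASE_URL` is an assumption that does not come from the steps above; the field names accepted by the current service may also differ from the query shown here, so check the UniProt documentation before relying on it.

```python
from urllib.parse import urlencode

# Assumption: endpoint of the UniProt search service. Not taken from the
# answer above; verify against the current UniProt documentation.
ASSUMED_BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_query(taxon, ec_number):
    """Return the query string used above, e.g. taxonomy:asteraceae AND ec:2.1.2.1."""
    return f"taxonomy:{taxon.lower()} AND ec:{ec_number}"


def build_url(query, fmt="fasta", base_url=ASSUMED_BASE_URL):
    """URL-encode the query and append it to the (assumed) search endpoint."""
    return f"{base_url}?{urlencode({'query': query, 'format': fmt})}"


if __name__ == "__main__":
    query = build_query("Asteraceae", "2.1.2.1")
    print(query)
    print(build_url(query))
```

Nothing is sent over the network here; pass the resulting URL to whatever HTTP client you prefer once you have confirmed the endpoint and field names.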
Since there is no direct relationship between the proteins and the ESTs, the next step is to download the EST sequences in FASTA format, and use the NCBI BLAST or FASTA suite software to perform a sequence similarity search of the protein sequences vs. the EST sequences (for example with TBLASTN, which compares a protein query against translated nucleotide sequences). This will allow you to identify which ESTs match the proteins for your specific EC number, and thus build the mapping between the ESTs and EC numbers.
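As a rough illustration of the last step, the sketch below turns BLAST tabular output into an EST to EC number mapping. It assumes the search was run with the proteins as queries and the ESTs as the database, in the standard 12-column tabular format (for example something like `tblastn -query proteins.fasta -db asteraceae_ests -outfmt 6`, where the file and database names are hypothetical). The E-value cut-off of 1e-10 is an arbitrary placeholder, and the function name and EST identifiers are made up for the example.

```python
import csv
import io
from collections import defaultdict


def map_ests_to_ec(blast_tabular, protein_to_ec, max_evalue=1e-10):
    """Map EST IDs (subject) to EC numbers via the matching protein (query).

    blast_tabular: file-like object with standard 12-column tabular BLAST
    output (qseqid, sseqid, ..., evalue in column 11, bitscore in column 12).
    protein_to_ec: dict of protein accession -> EC number.
    max_evalue: arbitrary example threshold; tune it for your data.
    """
    est_to_ec = defaultdict(set)
    for row in csv.reader(blast_tabular, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        protein_id, est_id, evalue = row[0], row[1], float(row[10])
        if evalue <= max_evalue and protein_id in protein_to_ec:
            est_to_ec[est_id].add(protein_to_ec[protein_id])
    return {est: sorted(ecs) for est, ecs in est_to_ec.items()}


if __name__ == "__main__":
    # Hypothetical hits: EST_A and EST_B are invented identifiers.
    sample = io.StringIO(
        "P49357\tEST_A\t85.0\t100\t15\t0\t1\t100\t1\t300\t1e-50\t200\n"
        "P49358\tEST_A\t80.0\t100\t20\t0\t1\t100\t1\t300\t1e-45\t180\n"
        "P49357\tEST_B\t30.0\t50\t35\t0\t1\t50\t1\t150\t0.5\t30\n"
    )
    ec_by_protein = {"P49357": "2.1.2.1", "P49358": "2.1.2.1"}
    print(map_ests_to_ec(sample, ec_by_protein))
```

Collecting EC numbers in a set per EST means an EST hit by several proteins with the same EC number is reported once, while an EST matching proteins with different EC numbers keeps all of them.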
If, instead of EC numbers, you want to relate Gene Ontology (GO) terms to the Asteraceae EST sequences, then you would follow a similar process, but use GO terms rather than EC numbers to identify the proteins of interest. In UniProt.org you would use a query like:
taxonomy:asteraceae AND go:0006094
Alternatively, in SRS you would specify the GO term using the following fields of the "Link" subentry:
- "DbName" is "GO"
- "DBxref" is "GO:0006094"
and ensure that you request the "Entry" to be returned rather than the "Link" subentry.
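Note that the two forms of the GO query write the identifier differently: the UniProt.org query uses the bare digits, while the SRS "DBxref" field uses the "GO:" prefix. A small Python sketch that accepts either spelling and produces both forms is shown here; the helper names are hypothetical and only build strings, they do not talk to either service.

```python
import re


def normalise_go_id(go_id):
    """Return the seven-digit part of a GO ID given as 'GO:0006094' or '0006094'."""
    match = re.fullmatch(r"(?:GO:)?(\d{7})", go_id.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a valid GO identifier: {go_id!r}")
    return match.group(1)


def build_go_query(taxon, go_id):
    """Build the UniProt.org style query, e.g. taxonomy:asteraceae AND go:0006094."""
    return f"taxonomy:{taxon.lower()} AND go:{normalise_go_id(go_id)}"


def srs_link_fields(go_id):
    """Return the values to enter in the SRS 'Link' subentry fields."""
    return {"DbName": "GO", "DBxref": f"GO:{normalise_go_id(go_id)}"}


if __name__ == "__main__":
    print(build_go_query("Asteraceae", "GO:0006094"))
    print(srs_link_fields("0006094"))
```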
In order to automate these queries, see the relevant documentation for the services used above, and the documentation for your preferred library, e.g. BioJava, BioPerl, Biopython or BioRuby.
The title of your question is confusing. There is no such thing as a "KEGG GO id". There are GO IDs and there is KEGG. Also the title suggests that you want to map from GO (or KEGG?) to EST, but the question suggests that you want to map from EC number to EST. Please clarify.