I have a list of KEGG gene IDs for the fruit fly and want to retrieve the corresponding sequences. I think mapping these KEGG gene IDs to FlyBase gene IDs and then extracting the sequences from FlyBase would be a good way to do this. I am sure a mapping between KEGG gene IDs and FlyBase gene IDs exists (see below), but I do not know where to find it. Could anyone give me a hint? Thanks!
The CG identifiers are not KEGG gene IDs but FlyBase Computed Gene (CG) IDs. The CG nomenclature was inherited from Celera/Berkeley; gene model names were later rationalized into the FlyBase paradigm, in which every object type is given an FBxx ID, where FB = FlyBase and xx = object type (in this case gn, for gene).
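To make the naming scheme concrete, here is a minimal Python sketch that tells a CG identifier apart from an FBxx identifier by its prefix. The function name `classify`, the small type table, and the example ID `FBgn0000001` (used only to illustrate the format) are assumptions for illustration, not part of FlyBase itself.

```python
import re

# Two-letter FlyBase object-type codes; deliberately a small subset.
FB_TYPES = {
    "gn": "gene",
    "tr": "transcript",
    "pp": "polypeptide",
    "al": "allele",
}

def classify(identifier):
    """Return a rough description of an identifier based on its prefix."""
    # CG followed by digits: a computed-gene annotation symbol.
    if re.fullmatch(r"CG\d+", identifier):
        return "annotation symbol (CG)"
    # FB + two lowercase letters + digits: a FlyBase object ID.
    match = re.fullmatch(r"FB([a-z]{2})\d+", identifier)
    if match:
        code = match.group(1)
        return "FlyBase " + FB_TYPES.get(code, "object of type '%s'" % code)
    return "unknown"

print(classify("CG10219"))
print(classify("FBgn0000001"))
```

A check like this is only a quick sanity test on a list of IDs before any mapping is attempted; it says nothing about whether a given ID is still current.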
If you have KEGG gene IDs, and you want to get the sequences, why not simply download the sequences from KEGG and save yourself the painful and error-prone mapping exercise?
You can download them all from the KEGG FTP site: ftp://ftp.genome.jp/pub/kegg/genes/organisms/dme/
You are accessing the FlyBase database directly, right?
CG10219 and CG10320 are stored as synonyms in FlyBase, so you can look up the FlyBase gene ID from a synonym as follows:
SELECT DISTINCT f.uniquename FROM feature f, synonym s, feature_synonym fs WHERE s.name = 'CG10219' AND s.synonym_id = fs.synonym_id AND f.feature_id = fs.feature_id AND f.organism_id = 1;
- DISTINCT is needed because feature_synonym holds several rows with the same synonym_id and feature_id that differ only in pub_id; without it, the query returns the same FlyBase ID several times
- the organism_id filter is needed because otherwise you also get results for species other than D. melanogaster
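The effect of DISTINCT and of the organism_id filter can be reproduced on a toy database. The sketch below uses Python's built-in sqlite3 module with three stripped-down tables that have the same names and join columns as in the query above. The rows, the placeholder names `FBgn_toy_dmel` and `FBgn_toy_other`, and the helper `lookup` are invented for illustration and are not real FlyBase data.

```python
import sqlite3

# In-memory toy database; only the columns used by the query are modeled.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT, organism_id INTEGER);
CREATE TABLE synonym (synonym_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feature_synonym (feature_id INTEGER, synonym_id INTEGER, pub_id INTEGER);

-- Hypothetical rows: one dmel feature (organism_id = 1) and one from another species.
INSERT INTO feature VALUES (1, 'FBgn_toy_dmel', 1), (2, 'FBgn_toy_other', 2);
INSERT INTO synonym VALUES (10, 'CG10219');
-- The dmel feature is linked to the synonym twice, via two different pub_id values.
INSERT INTO feature_synonym VALUES (1, 10, 100), (1, 10, 101), (2, 10, 102);
""")

QUERY = """
SELECT {distinct} f.uniquename
FROM feature f, synonym s, feature_synonym fs
WHERE s.name = ?
  AND s.synonym_id = fs.synonym_id
  AND f.feature_id = fs.feature_id
  {organism}
ORDER BY f.uniquename
"""

def lookup(name, distinct=True, dmel_only=True):
    """Run the synonym lookup, optionally dropping DISTINCT or the organism filter."""
    sql = QUERY.format(
        distinct="DISTINCT" if distinct else "",
        organism="AND f.organism_id = 1" if dmel_only else "",
    )
    return [row[0] for row in conn.execute(sql, (name,))]

print(lookup("CG10219"))                   # full query
print(lookup("CG10219", distinct=False))   # duplicates from differing pub_id
print(lookup("CG10219", dmel_only=False))  # non-dmel feature shows up too
```

On this toy data the full query returns a single ID, dropping DISTINCT returns that ID once per pub_id row, and dropping the organism filter also returns the feature from the other species. Against the real database, the same SQL would be run through whatever client you already use to connect.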