I have downloaded a file that contains all human pathways. What I want to do is extract each pathway and all the genes in that pathway. I know this could be done with Perl or Python, but I don't know how to do it. The structure of the file looks like this:
#ENTRY hsa00001;
#NAME T01001;
#DEFINITION KEGG Orthology (KO) - Homo sapiens (human);
#--->;
!;
A<b>Metabolism</b>;
B;
B <b>Overview</b>;
C 01200 Carbon metabolism [PATH:hsa01200];
D 3101 HK3; hexokinase 3";"K00844 HK; hexokinase [EC:2.7.1.1]
D 3098 HK1; hexokinase 1";"K00844 HK; hexokinase [EC:2.7.1.1]
D 3099 HK2; hexokinase 2";"K00844 HK; hexokinase [EC:2.7.1.1]
D 80201 HKDC1; hexokinase domain containing 1";"K00844 HK; hexokinase [EC:2.7.1.1]
D 2645 GCK; glucokinase";"K12407 GCK; glucokinase [EC:2.7.1.2]
D 83440 ADPGK; ADP dependent glucokinase";"K08074 ADPGK; ADP-dependent glucokinase [EC:2.7.1.147]
D 2821 GPI; glucose-6-phosphate isomerase";"K01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
D 5213 PFKM; phosphofructokinase, muscle";"K00850 pfkA; 6-phosphofructokinase 1 [EC:2.7.1.11]
D 5214 PFKP; phosphofructokinase, platelet";"K00850 pfkA; 6-phosphofructokinase 1 [EC:2.7.1.11]
D 5211 PFKL; phosphofructokinase, liver type";"K00850 pfkA; 6-phosphofructokinase 1 [EC:2.7.1.11]
Yes, this works, and I get all the KEGG IDs. But I need the gene symbols.
Since some Entrez gene IDs have multiple gene symbols, I did not include gene symbol retrieval for KEGG. You can, however, convert the Entrez IDs in the 'KEGG_database.txt' file you got from GeneSCF to gene symbols using the information from KEGG.
For example, for human use http://rest.kegg.jp/list/hsa
For other organisms, use http://rest.kegg.jp/list/[KEGG_organism_codes]
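A minimal sketch of that conversion in Python, assuming the response has been saved to a text file and that each line is tab-separated, starts with an identifier such as `hsa:3101`, and ends with a field of the form `SYMBOL1, SYMBOL2; description`. The exact column layout is an assumption and should be checked against the actual response; the function name `parse_kegg_list` and the sample lines are made up for illustration.

```python
def parse_kegg_list(lines):
    """Map Entrez gene ID -> list of gene symbols.

    Assumes tab-separated lines whose first field looks like 'hsa:3101'
    and whose last field looks like 'SYMBOL1, SYMBOL2; description'.
    """
    id_to_symbols = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or ":" not in fields[0]:
            continue  # not a gene line
        if ";" not in fields[-1]:
            continue  # description only, no symbol part to extract
        entrez_id = fields[0].split(":", 1)[1]
        names = fields[-1].split(";", 1)[0]
        symbols = [s.strip() for s in names.split(",") if s.strip()]
        id_to_symbols[entrez_id] = symbols
    return id_to_symbols


# Illustrative lines only (not real KEGG output).
example = [
    "hsa:3101\tHK3, SYMBOL_ALIAS; hexokinase 3\n",
    "hsa:2645\tGCK; glucokinase\n",
]
for entrez_id, symbols in parse_kegg_list(example).items():
    print(entrez_id, symbols)
```

Because one ID can map to several symbols, the sketch keeps all of them as a list and leaves the choice of which one to use to the caller.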
It is simple to download all the gene IDs using GeneSCF. But, as you said, it is hard to decide which symbol to use when mapping the IDs to gene symbols. I would prefer to extract the gene symbols from the file that I already have. Thanks anyway.
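Extracting the symbols directly from a file like the one in the question could be sketched in Python as follows. This assumes that every pathway starts on a `C` line, that its genes follow on `D` lines, and that each `D` line begins with the Entrez ID followed by the gene symbol and a semicolon, as in the sample above. The function name `parse_pathways` is made up for illustration, and `D` lines without a `SYMBOL;` part are simply skipped.

```python
import re

# 'C 01200 Carbon metabolism [PATH:hsa01200];' -> number, name, optional path ID
C_LINE = re.compile(r"^C\s+(\d+)\s+(.*?)\s*(?:\[PATH:(\w+)\])?;?\s*$")
# 'D 3101 HK3; hexokinase 3...' -> Entrez ID, gene symbol
D_LINE = re.compile(r"^D\s+(\d+)\s+([^;\s]+);")


def parse_pathways(lines):
    """Return {pathway_id: {"name": str, "genes": [(entrez_id, symbol), ...]}}."""
    pathways = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        m = C_LINE.match(line)
        if m:
            number, name, path_id = m.groups()
            key = path_id or number  # fall back to the number if no [PATH:...]
            current = pathways.setdefault(key, {"name": name, "genes": []})
            continue
        m = D_LINE.match(line)
        if m and current is not None:
            gene = (m.group(1), m.group(2))
            if gene not in current["genes"]:  # avoid duplicates
                current["genes"].append(gene)
    return pathways


# Lines copied from the sample in the question.
sample = [
    "A<b>Metabolism</b>;",
    "B <b>Overview</b>;",
    "C 01200 Carbon metabolism [PATH:hsa01200];",
    'D 3101 HK3; hexokinase 3";"K00844 HK; hexokinase [EC:2.7.1.1]',
    'D 2645 GCK; glucokinase";"K12407 GCK; glucokinase [EC:2.7.1.2]',
]
pathways = parse_pathways(sample)
for path_id, info in pathways.items():
    symbols = [symbol for _, symbol in info["genes"]]
    print(path_id, info["name"], ",".join(symbols), sep="\t")
```

For the whole file, the list of sample lines can be replaced by an open file handle, for example `with open("your_file.txt") as fh: pathways = parse_pathways(fh)`, where the file name is a placeholder.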