Question

Replacing gene name in a csv file with corresponding GO/KEGG term

7.2 years ago

md.rahman ▴ 10

Hi there, I have a list of gene names in a CSV file (OrthoFinder output). I want to replace these names with their respective GO or KEGG terms. Any suggestions? I tried "genescf" but got the following error:

 [zillur@genomics Results_Nov15]$ ./../../../genescf/geneSCF-master-source-v1.1-p2/geneSCF -m=update -i=Orthogroups.csv -t=sym -db=KEGG -o=genescf_out -p=yes -org=pfa,pyo,pcb,pbe,pkn,pvx,pcy,cpv,cho,tgo,bbo,beq,tan,tpv

Error:background gene set information missing --background
Since you have selected 'update' mode. It will take a while to prepare new updated database
Connecting remote RUD..
processing started....Sun Feb 26 08:54:42 AST 2017
Retreiving 0 KEGG pathways for pfa,pyo,pcb,pbe,pkn,pvx,pcy,cpv,cho,tgo,bbo,beq,tan,tpv
Do not panic. The processing is going on...
Database retreived..You are now ready to use geneSCF with organism pfa,pyo,pcb,pbe,pkn,pvx,pcy,cpv,cho,tgo,bbo,beq,tan,tpv from --database KEGG
Done....Sun Feb 26 08:54:43 AST 2017
=>processing in update started....Sun Feb 26 08:54:43 AST 2017
=> Finished retriving database...
=> Calculating statistics...
find: ‘pfa,pyo,pcb,pbe,pkn,pvx,pcy,cpv,cho,tgo,bbo,beq,tan,tpv/class/lib/db/yes/kegg_database.txt’: No such file or directory
Note:Only KEGG and Geneontology supports multiple organisms (GeneSCF-xx/org_codes_help). If you choose REACTOME/NCG database please specify organism as 'Hs'. Currently REACTOME and NCG in GeneSCF only supports Human (Hs).
KEGG last updated
Example input types
gid | sym
=> Retreving gene list for yes from KEGG
sh: pfa,pyo,pcb,pbe,pkn,pvx,pcy,cpv,cho,tgo,bbo,beq,tan,tpv/mapping/DB/Orthogroups.csv_gene_list.txt: No such file or directory
curl: (23) Failed writing body (2717 != 2896)
=> Mapping user list
Can't open perl script "pfa,pyo,pcb,pbe,pkn,pvx,pcy,cpv,cho,tgo,bbo,beq,tan,tpv/class/scripts/mappingIDS.pl": No such file or directory
sh: pfa,pyo,pcb,pbe,pkn,pvx,pcy,cpv,cho,tgo,bbo,beq,tan,tpv/mapping/Orthogroups.csv_input_list.txt: No such file or directory
Note: There were 0 genes mapped from 15068 user provided unique genes (0 %) Please cross-check your gene identifier.Sun Feb 26 08:54:45 AST 2017
finished processing

I also tried eggNOG-mapper on my orthogroup FASTA files (~50,000 files), but it takes an extremely long time.

Here is my sample input file:

        CryptoDB-29_CparvumIowaII_AnnotatedProteins     PiroplasmaDB-28_BmicrotiRI_AnnotatedProteins    PiroplasmaDB-29_TparvaMuguga_AnnotatedProteins  PlasmoDB-28_PbergheiANKA_A$

OG0000000 PBANKA_0000600, PBANKA_0000701, PBANKA_0000801, PBANKA_0001001, PBANKA_0001101, PBANKA_0001201, PBANKA_0001301, PBANKA_0001401, PBANKA_000$
OG0000001 PmUG01_00010100.1-p1, PmUG01_00010200.1-p1, PmUG01_00010400.1-p1, PmUG01_0$
OG0000002 PF3D7_0100200, PF3D7_0100400, PF3D7_0100600, PF3D7_0100800, PF3D7_0100900, PF3D7_0101000, PF3D7_0101600, PF3D7_010$
OG0000003 PBANKA_0000901, PBANKA_0001200, PBANKA_0001601, PBANKA_0007501, PBANKA_0008101, PBANKA_0100100, PBANKA_0112661.1, PBANKA_0112701, PBANKA_0$
OG0000004 TP03_0403-t26_1-p1 PCYB_001410, PCYB_001660, PCYB_005410, PCYB_006920, PCYB_101490 PKNH_0000100, PKNH_0000200, PKNH_0$
OG0000005 PCYB_001700, PCYB_002110, PCYB_002240 PF3D7_0113100, PF3D7_0115000, PF3D7_0402200, PF3D7_0424400, PF3D7_0800700, PF3D7_0$
OG0000006 PCYB_001280, PCYB_001550, PCYB_001690, PCYB_002300, PCYB_002310, PCYB_002420, PCYB_002630, PCYB_002840, PCYB_003590, PCYB_$
OG0000007 PCYB_001020, PCYB_001140, PCYB_001270, PCYB_001290, PCYB_001300, PCYB_001310, PCYB_001370, PCYB_001400, PCYB

Any help with this would be appreciated.

Best regards, Zillur

software error gene genome genescf eggnog • 2.1k views

updated 7.2 years ago by EagleEye 7.5k • written 7.2 years ago by md.rahman ▴ 10

score 1 · Answer 1 · 2017-02-28

1) Your input must be one gene per line in a plain text file. Check Examples here

2) It is not possible to run multiple organisms the way you are running geneSCF. Instead, use a simple for loop over the GeneSCF command and run it once per organism. Also remember that once you have used 'update' mode, the database has already been retrieved and stored with your tool, so use 'normal' mode on later runs to save time.

3) Always follow the documentation/instructions or try the test dataset provided with GeneSCF to understand how the tool works.

4) Try providing the complete path for the 'input file' and the 'output folder' (the output folder must be created by the user, and the path you provide must end with "/").

5) If you have multiple files, always use the 'prepare_database' module first to prepare the database for your particular organism, then run GeneSCF in 'normal' mode. Running 'update' mode for every file takes a long time and is unnecessary, because the database does not need to be retrieved again each time. Check the instructions for running GeneSCF on multiple files (it is simple and fun, try it).
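To make points 1, 2 and 4 concrete, here is a minimal Python sketch, not an official GeneSCF helper. It assumes the OrthoFinder table is tab-separated with an empty first header cell, which is what the sample in the question suggests. The function names (`genes_per_species`, `one_per_line`, `genescf_command`) are made up for illustration. The flags are copied from the command line in the question, with the mode switched to 'normal' as suggested above; the background option mentioned in the error message is left out because its value is not shown in this thread. Which KEGG organism code belongs to which column is something you have to supply yourself (check org_codes_help), and whether `-t=sym` suits these identifiers is not established here.

```python
import csv
import io


def genes_per_species(table_text, delimiter="\t"):
    """Return {species_column: [unique gene IDs]} from an Orthogroups table."""
    reader = csv.reader(io.StringIO(table_text), delimiter=delimiter)
    header = next(reader)
    species = header[1:]  # skip the first header cell (orthogroup ID column)
    seen = {name: {} for name in species}  # dict keys keep first-seen order
    for row in reader:
        for name, cell in zip(species, row[1:]):
            for gene in cell.split(","):
                gene = gene.strip()
                if gene:
                    seen[name][gene] = None
    return {name: list(ids) for name, ids in seen.items()}


def one_per_line(gene_ids):
    """Format gene IDs as plain text, one gene per line (point 1)."""
    return "".join(gene + "\n" for gene in gene_ids)


def genescf_command(genescf, gene_list, org, out_dir, mode="normal"):
    """Build one GeneSCF call for a single organism (points 2 and 4).

    Flags are taken from the command in the question; pass absolute paths.
    """
    out = str(out_dir).rstrip("/") + "/"  # output folder path must end with "/"
    return [
        str(genescf),
        f"-m={mode}",
        f"-i={gene_list}",
        "-t=sym",
        "-db=KEGG",
        f"-o={out}",
        "-p=yes",
        f"-org={org}",
    ]
```

A possible way to use it: read Orthogroups.csv into a string, call `genes_per_species`, write `one_per_line(...)` for each column to its own text file, create the output folder, and then loop over your own column-to-organism-code mapping, passing each result of `genescf_command` to `subprocess.run`. The database for each organism still has to be prepared once beforehand, as described in point 5.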

And please open a new thread for eggNOG-mapper ('egg_nog mapper'), so that there is no confusion for future reference. Also mention GeneSCF in the title of your post.