Tool: Gene Set Clustering based on Functional annotation (GeneSCF)
3.0 years ago by
EagleEye4.4k
Sweden
EagleEye4.4k wrote:

Gene Set Clustering based on Functional annotation

======================================================================

GeneSCF is a command-line tool that clusters a user-supplied gene list based on functional annotation (Gene Ontology, KEGG, REACTOME and NCG 4.0). It takes a gene list of Entrez Gene IDs or official gene symbols as input. GeneSCF supports multiple organisms from v1.1 onwards. Examples of downloading a database as a simple text file with the GeneSCF "prepare_database" module: 1) E. coli, 2) sheep, 3) general usage.

The advantage of GeneSCF over other enrichment tools is that it performs enrichment analysis in real time (v1.1 and above) by accessing the source databases directly. Because it is a command-line tool, you can also run multiple gene lists simultaneously.

Please follow the GeneSCF news section for the latest updates on GeneSCF.

======================================================================

Home page:
http://genescf.kandurilab.org/

Requirement:
GeneSCF only works on Linux systems. It has been successfully tested on Ubuntu, Mint, CentOS and Windows 10 bash (version 1607). Other Linux distributions might work as well.

Documentation:
http://genescf.kandurilab.org/documentation.php

Cite using:
Subhash S and Kanduri C. GeneSCF: a real-time based functional enrichment tool with support for multiple organisms. BMC Bioinformatics 2016, 17:365, http://www.biomedcentral.com/1471-2105/17/365

Report issues on Biostars or on the GitHub project page. Please check the FAQs before reporting.


Discussions on Biostars
GeneSCF gives out more pathways for genes compared to DAVID
More .... https://www.biostars.org/local/search/page/?q=genescf


======================================================================

Advantages

  • Real-time analysis: you do not have to wait for an enrichment tool to update its databases.

  • Easy for computational biologists to integrate into their NGS pipelines.

  • GeneSCF supports more organisms.

  • Enrichment analysis for multiple gene lists in a single run.

  • Enrichment analysis for multiple gene lists against multiple source databases (GO, KEGG, REACTOME and NCG) in a single run.

  • Download complete GO terms/pathways/functions with their associated genes as a simple table in a plain text file (see "Two step process" in the "GeneSCF USAGE" section below).

======================================================================

Get organism codes for GeneSCF run

KEGG: Use the second column from the following link. For example, 'hsa' for human and 'mmu' for Mus musculus.

http://rest.kegg.jp/list/organism

Gene Ontology: Use the "id" from the following link. For example, "goa_human" for human and "mgi" for Mus musculus.

http://www.geneontology.org/gene-associations/go_annotation_metadata.all.json
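The KEGG lookup can also be done from the command line. The following is only a minimal sketch: it assumes the KEGG organism list is tab-separated with the organism code in the second column (as stated above) and the species name in the third column, and the function name kegg_org_code is hypothetical, not part of GeneSCF.

```shell
# Print the KEGG organism code (2nd column) of every row whose species name
# (assumed to be the 3rd column) contains the search text, ignoring case.
# The tab-separated organism list is read from standard input.
kegg_org_code() {
    awk -F'\t' -v q="$1" 'index(tolower($3), tolower(q)) { print $2 }'
}

# Typical use (needs network access and curl):
#   curl -s http://rest.kegg.jp/list/organism | kegg_org_code "danio rerio"
```

The printed code is the value to pass to '-org=' for KEGG runs.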

======================================================================

Comparison (updated on Tue Jul 26 16:01:08 CEST 2016)

For more comparisons please check GeneSCF article (Fig. 6).

======================================================================

GeneSCF USAGE

Example (using GeneSCF v1.1 and above)

I will use Mus musculus as the example, assuming you have Entrez Gene IDs.

Single step process,

Gene Ontology - Biological Process (downloads the currently available database for Mus musculus from Gene Ontology and runs the enrichment analysis)

./geneSCF -m=update -i=INPUTgene.list -t=gid -db=GO_BP -o=/ExistingOUTPUTfolder/ -org=mgi --plot=yes --background=15000

The above command downloads the complete GO database as a simple text file to 'geneSCF-tool/class/lib/db/mgi/' and also performs the enrichment analysis in the same run. The enrichment results can be found in the folder 'ExistingOUTPUTfolder'.

There is no need to use 'update' mode for subsequent runs, because the GO database for Mus musculus was already updated by the first run in 'update' mode.

Gene Ontology - Cellular Component

./geneSCF -m=normal -i=INPUTgene.list -t=gid -db=GO_CC -o=/ExistingOUTPUTfolder/ -org=mgi --plot=yes --background=15000

Gene Ontology - Molecular Function

./geneSCF -m=normal -i=INPUTgene.list -t=gid -db=GO_MF -o=/ExistingOUTPUTfolder/ -org=mgi --plot=yes --background=15000

Gene Ontology - Complete (BP+CC+MF)

./geneSCF -m=normal -i=INPUTgene.list -t=gid -db=GO_all -o=/ExistingOUTPUTfolder/ -org=mgi --plot=yes --background=15000


Two step process,

Download the currently available database for Mus musculus from Gene Ontology

./prepare_database -db=GO_all -org=mgi

The above command downloads the complete GO database as a simple text file to 'geneSCF-tool/class/lib/db/mgi/'.

Gene Ontology - Biological Process

./geneSCF -m=normal -i=INPUTgene.list -t=gid -db=GO_BP -o=/ExistingOUTPUTfolder/ -org=mgi --plot=yes --background=15000

Gene Ontology - Cellular Component

./geneSCF -m=normal -i=INPUTgene.list -t=gid -db=GO_CC -o=/ExistingOUTPUTfolder/ -org=mgi --plot=yes --background=15000

Gene Ontology - Molecular Function

./geneSCF -m=normal -i=INPUTgene.list -t=gid -db=GO_MF -o=/ExistingOUTPUTfolder/ -org=mgi --plot=yes --background=15000

Gene Ontology - Complete (BP+CC+MF)

./geneSCF -m=normal -i=INPUTgene.list -t=gid -db=GO_all -o=/ExistingOUTPUTfolder/ -org=mgi --plot=yes --background=15000

The enrichment results can be found in the folder 'ExistingOUTPUTfolder'.
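Since the three sub-ontology commands differ only in '-db=', they can be generated in a loop. This is just a sketch using the flags shown above; the function name genescf_go_commands is hypothetical, and it only prints the commands so they can be reviewed before anything is run.

```shell
# Print one geneSCF command per GO sub-ontology instead of running it.
# Arguments: gene list, ID type (gid or sym), output folder, organism code,
# number of background genes.
genescf_go_commands() {
    for db in GO_BP GO_CC GO_MF; do
        printf './geneSCF -m=normal -i=%s -t=%s -db=%s -o=%s -org=%s --plot=yes --background=%s\n' \
            "$1" "$2" "$db" "$3" "$4" "$5"
    done
}

# Example values taken from the commands above.
genescf_go_commands INPUTgene.list gid /ExistingOUTPUTfolder/ mgi 15000
```

Once the printed commands look right, they can be run from the GeneSCF folder, for example by piping the output to 'sh'.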


The parameters above should be changed according to your data. The following can be altered:

-t=sym (gene symbols as the input list)

-t=gid (Entrez Gene IDs as the input list)

--background=#NUM (the total number of background genes in your dataset. For example, use the total number of protein-coding genes with a detectable expression level, irrespective of their significance; for a transcriptome-wide or genome-wide study, use the total number of annotated protein-coding genes.)

For more information, please refer to the documentation: http://genescf.kandurilab.org/documentation.php
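If the background genes are available as a file with one identifier per line, the number for '--background=' can be counted rather than typed by hand. A minimal sketch, assuming such a file exists; the function name count_unique_genes and the file name expressed_genes.txt are hypothetical.

```shell
# Count distinct, non-empty gene identifiers in a one-ID-per-line file.
# Carriage returns are removed first so Windows-style line endings do not
# create spurious duplicates.
count_unique_genes() {
    tr -d '\r' < "$1" |
        awk 'NF { seen[$1] = 1 } END { n = 0; for (g in seen) n++; print n }'
}

# Typical use:
#   ./geneSCF ... --background="$(count_unique_genes expressed_genes.txt)"
```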

======================================================================

Instructions for running batch analysis (supported from GeneSCF v1.1 patch release 2, GeneSCF v1.1-p2, onwards)

  • Edit the script './geneSCF-master-source-v1.1-p2/geneSCF_batch' to set your input folder (files_path) and output folder (output_path).

files_path="/FOLDER/WHERE/GENE_LISTS/STORED"

output_path="/FOLDER/PATH/FOR/OUTPUT"

  • Edit the file './geneSCF-master-source-v1.1-p2/db_batch_config.txt' to configure the parameters for the batch run.

  • Execute [genescf_path]/geneSCF-master-source-v1.1-p2/geneSCF_batch.


Note:

  • It is recommended to keep all input files in the same folder.
  • Inside the specified output folder, GeneSCF automatically creates an individual sub-folder for each gene list.

======================================================================

ADD COMMENTlink modified 4 months ago • written 3.0 years ago by EagleEye4.4k

This works for plants ?

ADD REPLYlink written 3.0 years ago by pixie@bioinfo1.1k

Now GeneSCF v1.1 supports multiple species/organisms. Check out.

ADD REPLYlink modified 13 months ago • written 14 months ago by EagleEye4.4k

Hello, I have several questions here.

First, have you compared your tool to DAVID? I know that DAVID uses a bit outdated GO annotation, but still are we about to get more precise annotation with geneSCF?

Second, does your software allow integrating of annotation from different sources. I mean there are certain pieces of information that are in different annotation sets that are missing and therefore annotation sets could complement each other.

Third, any details, besides source code, how the clustering is performed. I've seen the term EASE in your wiki, which I believe is a clustering algorithm implemented by DAVID itself. How is this incorporated into your framework?

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by mikhail.shugay3.2k

Hi mikhail.shugay

I hope I have answered all your questions. Please let me know if you have more.

/santhilal

ADD REPLYlink written 3.0 years ago by EagleEye4.4k

Dear EagleEye, Hi

Is it possible to use your tool for functional annotation clustering of a de novo transcriptome assembly?

I have a fish de novo assembly (transcripts) and I don't know how to use geneSCF for my data (I usually use Blast2GO for annotation and then WEGO for visualization).

~ Thanks

ADD REPLYlink written 8 months ago by Farbod3.0k

GeneSCF works only with a simple gene list from a known organism or species covered by KEGG or Gene Ontology (GO). If your gene list was predicted from your own analysis, you can pick an organism close to yours (fish) from the lists below and use it as a model to predict function or perform enrichment analysis.

Organisms/species covered by KEGG

Organisms/species covered by GO

The links are also given in the post above under the heading 'Get organism codes for GeneSCF run'.

ADD REPLYlink modified 8 months ago • written 8 months ago by EagleEye4.4k

Dear EagleEye,

I guess the best-annotated organism close to my species is zebrafish.

I can BLAST my transcriptome.fasta against the ENSEMBL zebrafish gene (or protein) database and collect the related genes (in zebrafish).

1- Can I use the data I have described in GeneSCF?

2- Suppose I have this gene list; what is a simple command for running GeneSCF on my data?

Thank you

ADD REPLYlink written 8 months ago by Farbod3.0k
  1. Yes. You can use your list of related genes collected from zebrafish (Danio rerio) in GeneSCF.

  2. The GeneSCF command line for zebrafish (Danio rerio) is shown below; the organism code from KEGG is 'dre'.

    ./geneSCF -m=update -i=INPUTgene.list -t=gid -db=KEGG -o=/ExistingOUTPUTfolder/ -org=dre --plot=yes --background=15000

  • 'INPUTgene.list' is a file containing your list of genes.

  • 'ExistingOUTPUTfolder' is an already existing folder where your output will be stored.

  • For the background, instead of 15,000 use the total number of genes found by the transcriptome assembly.

  • '-t=gid' should be changed according to your input type.

  • Also check the system requirements for running GeneSCF.
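For the BLAST-based workflow described in the question, the gene list could be built from tabular BLAST output. This is only a sketch under stated assumptions: the hits were written in BLAST's tabular format (-outfmt 6, query ID in column 1 and subject ID in column 2) and the best hit for each query is listed first. The function name best_hit_genes and the file name hits.tsv are hypothetical. The resulting identifiers would still need to be converted to Entrez Gene IDs or official gene symbols before being used with GeneSCF.

```shell
# From tabular BLAST output, keep the first (assumed best) hit of each query
# and print the distinct subject IDs, one per line.
best_hit_genes() {
    awk -F'\t' '!seen[$1]++ { print $2 }' "$1" | sort -u
}

# Typical use:
#   best_hit_genes hits.tsv > INPUTgene.list
```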

ADD REPLYlink modified 8 months ago • written 8 months ago by EagleEye4.4k

How long does it take to run? Is there any benchmarking? It took 5 minutes to analyze 2000 genes. How does it scale?

ADD REPLYlink written 7 months ago by Lluís R.500

Sorry for my late response (On vacation :-))

Simulations performed in March 2016 using GeneSCF v1.1

ADD REPLYlink modified 7 months ago • written 7 months ago by EagleEye4.4k
3.0 years ago by
EagleEye4.4k
Sweden
EagleEye4.4k wrote:

Hello mikhail.shugay,

1) Yes, I have compared the results from geneSCF and DAVID on different experimental data. I will soon include more statistical information on the comparisons in my documentation and improve the tool's output by adding graphical presentation. For now, as you requested, I include only a basic comparison.

[Figure: Cell Cycle comparison graph]

2) This tool treats each annotation source (database) independently. You get results from the database you specify in the parameter '--database='. So it does not actually integrate (for example, pile up) different databases as you describe; sorry if any of my documentation gave that impression. I hope I understood your question and answered it properly.

3) EASE is a scoring method, originally implemented in DAVID, that this tool uses to measure how significantly a set of genes is associated with the corresponding process. I have given a reference (link) for the EASE score, where you can find a better explanation with an example.

I hope I have answered all your questions. Please let me know if anything in my explanation is unclear.

Your suggestions are always welcome.

Regards,

/Santhilal

 

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by EagleEye4.4k
16 months ago by
jan60
Malaysia
jan60 wrote:

Hi, I'm getting this error when I try to use your program

Illegal division by zero at /geneSCF-master/geneSCF-master-v1.0/class/lib/List/Vectorize/lib/List.pl line 599, <IN2> chunk 1.

This is the command that I used

perl geneSCF --infile=test_entrezID.txt --gytpe=gid --outpath=~/EC_gene --database=GO_all

my input

5243
8647
1244
10257
25
27
ADD COMMENTlink written 16 months ago by jan60

./geneSCF --infile=test_entrezID.txt --gtype=gid --outpath=~/EC_gene/ --database=GO_all

Please rewrite the command as above.

Also, you wrote 'gytpe' instead of 'gtype'.

ADD REPLYlink modified 14 months ago • written 16 months ago by EagleEye4.4k

I am also getting the same error when I run the following command with the same input file as above, test_entrezID.txt:

./geneSCF -m=normal --infile=test_entrezID.txt --gtype=gid --outpath=test/output/ --database=GO_all --plot=yes -bg=20000 -org=goa_human

also tried with -m=update

Illegal division by zero at /geneSCF-master-source-v1.1/class/lib/List/Vectorize/lib/List.pl line 599, <in2> chunk 1.

ADD REPLYlink modified 13 months ago • written 13 months ago by Mike660

Is it possible to provide a few lines of your input gene list?

ADD REPLYlink modified 13 months ago • written 13 months ago by EagleEye4.4k

Check whether you have provided the gene list in the proper format; I am afraid there might be a problem with your input. Examples of the preferred formats are at this link:

https://github.com/santhilalsubhash/geneSCF/tree/master/geneSCF-master-v1.0/test

Entrez GeneID format: sample_gene_list_id ( format supported by --gtype=gid )

Gene symbol format: sample_gene_list_sym ( format supported by --gtype=sym )

ADD REPLYlink modified 13 months ago • written 13 months ago by EagleEye4.4k

I used the same input as in the example above: test_entrezID.txt

test_entrezID.txt

5243
8647
1244
10257
25
27
ADD REPLYlink written 13 months ago by Mike660

Please try using the full path or, if the file is in the current directory, use:

./geneSCF -m=normal --infile=./test_entrezID.txt --gtype=gid --outpath=./test/output/ --database=GO_all --plot=yes -bg=20000 -org=goa_human

Let me know if it works.

ADD REPLYlink modified 13 months ago • written 13 months ago by EagleEye4.4k
cd geneSCF-master-source-v1.1

ls

GeneSCF-Documentation_v1.1.pdf class                          gpl-3.0.txt                    org_codes_help                 test
README.txt                     geneSCF                        mapping                        prepare_database               test_entrezID.txt

Here are the same error and the command I used:

Illegal division by zero at /geneSCF-master-source-v1.1/class/lib/List/Vectorize/lib/List.pl line 599, <in2> chunk 1.

./geneSCF -m=normal --infile=./test_entrezID.txt --gtype=gid --outpath=./test/output/ --database=GO_all --plot=yes -bg=20000 -org=goa_human
ADD REPLYlink written 13 months ago by Mike660

Can you please re-download the tool and try the tutorial on the test dataset from GeneSCF v1.1?

http://genescf.kandurilab.org/downloads.php

http://genescf.kandurilab.org/documentation.php

I also want to point out that GeneSCF only works on Linux systems. It has been successfully tested on Ubuntu, Mint and CentOS. Other Linux distributions might work as well.

ADD REPLYlink modified 13 months ago • written 13 months ago by EagleEye4.4k

Thanks for your prompt response. Actually, I was trying it in the macOS terminal.

ADD REPLYlink written 13 months ago by Mike660

We are sorry for not being clear; we will soon update this information in the documentation and on the website as well. A future version of GeneSCF may be made available for the OS X environment (current versions work only in a Linux environment).

ADD REPLYlink written 13 months ago by EagleEye4.4k
5 months ago by
md.rahman0
md.rahman0 wrote:

Hi there, I was trying to run GeneSCF. First I tried my raw CSV file for multiple organisms, but got errors. I then changed my CSV to a text file with one gene name per line and ran it for a single organism, but I still get the error. Can anybody help me? Is there any way to run it for multiple organisms?

Best Regards Zillur

 ./../../../../genescf/geneSCF-master-source-v1.1-p2/geneSCF -m=update -i=test_2.txt -o=genescf_out -db=GO_all -p=yes -bg=70379 -org=pfa

error:

gzip: /home/zillur/Desktop/zillur/phd/orthofinder/genescf/geneSCF-master-source-v1.1-p2/class/lib/db/pfa/gene_association.pfa.gz: unexpected end of file

cat: /home/zillur/Desktop/zillur/phd/orthofinder/genescf/geneSCF-master-source-v1.1-p2/class/lib/db/pfa/gene_association.pfa: No such file or directory
cat: /home/zillur/Desktop/zillur/phd/orthofinder/genescf/geneSCF-master-source-v1.1-p2/class/lib/db/pfa/gene_association.pfa: No such file or directory
Updating gene information... Do not panic. The processing is going on..

Illegal division by zero at /home/zillur/Desktop/zillur/phd/orthofinder/genescf/geneSCF-master-source-v1.1-p2/class/lib/List/Vectorize/lib/List.pl line 599, <IN2> chunk 1.

Tue Feb 28 02:21:18 AST 2017 finished processing

ADD COMMENTlink written 5 months ago by md.rahman0

Please provide the complete path to the input file (a single-column file containing one gene ID/symbol per line) and to the output folder (the output path should end with "/").

Always check organism codes before running geneSCF. The code for your organism is "GeneDB_Pfalciparum".

Note: For organism codes, please check the link provided in the documentation or the org_code_help folder, and use the link given in those files to get updated organism codes.

Please also refer to the detailed answers to your question on the other thread.
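The input and output requirements above can be checked before starting a run. The following is a minimal sketch of such a pre-flight check, not part of GeneSCF itself; the function name check_genescf_inputs is hypothetical, and it tests only the points listed above (a non-empty single-column input file, and an existing output folder whose path ends with "/").

```shell
# Return 0 if the input list and output folder look usable; otherwise print
# each problem found and return 1.
check_genescf_inputs() {
    infile=$1
    outdir=$2
    status=0
    if [ ! -s "$infile" ]; then
        echo "input file is missing or empty: $infile"
        status=1
    elif awk 'NF > 1 || /,/ { bad = 1 } END { exit !bad }' "$infile"; then
        # A line with whitespace-separated fields or a comma suggests a
        # multi-column or CSV file rather than one ID per line.
        echo "input file has more than one column: $infile"
        status=1
    fi
    case $outdir in
        */) ;;
        *) echo "output path should end with '/': $outdir"; status=1 ;;
    esac
    if [ ! -d "$outdir" ]; then
        echo "output folder does not exist: $outdir"
        status=1
    fi
    return "$status"
}

# Typical use:
#   check_genescf_inputs /full/path/test_2.txt /full/path/genescf_out/ && echo "looks fine"
```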

ADD REPLYlink modified 5 months ago • written 5 months ago by EagleEye4.4k
4 months ago by
md.rahman0
md.rahman0 wrote:

Thank you very much. It works perfectly, exactly what I wanted. My list also contains gene names from other organisms (Plasmodium, Cryptosporidium, Toxoplasma, Babesia, etc.). How can I map my gene list for those organisms? The organism codes include only GeneDB_Pfalciparum. Is there any way to map the other organisms as well? Thanks again for your help.

Best Regards Zillur

ADD COMMENTlink written 4 months ago by md.rahman0