Question: Where can I download GO terms and their associated E. coli genes?
O.rka wrote (22 days ago):

I'm trying to download a flat file that has the following info:

  1. GOTERM
  2. GOTERM DESCRIPTION
  3. GOTERM SET (biological process, molecular function, cellular component)
  4. GENE LIST (as EcoCyc (e.g. EG10894), UniProt (e.g. P0A8V2), or Blattner (e.g. b3987) identifiers)

Preferably this would be a flat file that I could download from a website, but I'm open to Python or R solutions as well.

I have access to the EcoCyc flat files, but I can't find anything about GO terms in them, even though the terms are shown on the website.

Does anyone know where/how I can do this?

EagleEye wrote (22 days ago):

Have a look at these posts:

A: How To Get Gene List From Each Gene Ontology Term?

A: How to look up GO terms associated to a certain organism?


Thanks for this. GeneSCF looks good, but the formatting is extremely unusual. For example: GO:0000049tyajQ,tsaC,trmA,selB,truA,trmO,rlmN,dusC,tmcA,truB,arfA,thiI,rplP,epmA,lysS,lysU,tmolecular_function~ There seem to be multiple delimiters, such as t and ,. I could write a parser for this, but I don't want to introduce errors without knowing all of the rules (this is only a single case). Is there a way to get this into a more consistent format that I could load into a dataframe?

written 21 days ago by O.rka

Hi,

Glad that GeneSCF was helpful. The file downloaded with 'prepare_database' follows this format:

GOID1~GONAME1<TAB>Gene1,Gene2
GOID2~GONAME2<TAB>Gene1,Gene2
written 21 days ago by EagleEye
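
The format described above can be loaded with a few lines of Python. This is only a minimal sketch, assuming every line is exactly `GOID~GONAME<TAB>Gene1,Gene2` as stated in the reply; `parse_genescf_db` is a hypothetical helper name, and the sample lines are made-up placeholders rather than real GeneSCF output.

```python
import io


def parse_genescf_db(handle):
    """Yield (go_id, go_name, genes) from lines shaped like GOID~GONAME<TAB>Gene1,Gene2.

    Assumption: the layout is exactly the one described in the thread.
    """
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        term, sep, gene_field = line.partition("\t")
        if not sep:
            # A missing tab usually means the file was written incorrectly
            raise ValueError(f"no tab separator in line: {line!r}")
        go_id, _, go_name = term.partition("~")
        genes = [g for g in gene_field.split(",") if g]
        yield go_id, go_name, genes


# Hypothetical sample in the stated format (placeholder names, not real data)
sample = (
    "GO:0000001~example_term_a\tgeneA,geneB\n"
    "GO:0000002~example_term_b\tgeneC\n"
)
records = list(parse_genescf_db(io.StringIO(sample)))
```

Raising an error on a line without a tab is a deliberate choice here: a file where the tab was written as a literal t is rejected rather than silently misparsed.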

There is a literal t instead of a tab character in my download, but I think it should be OK. Are GO characters always 7 digits?

written 20 days ago by O.rka

Are GO characters always 7 digits? YES. A GO ID is always "GO:" followed by exactly 7 digits (e.g. GO:0000049).
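
Because the width is fixed, a simple check can flag malformed IDs before a file is loaded. A minimal sketch, where `is_go_id` is a hypothetical helper name and not part of GeneSCF:

```python
import re

# "GO:" followed by exactly seven ASCII digits
GO_ID_PATTERN = re.compile(r"GO:[0-9]{7}")


def is_go_id(text):
    """Return True if text is a well-formed GO identifier such as GO:0000049."""
    return GO_ID_PATTERN.fullmatch(text) is not None
```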

There is a t instead of tab character on my download but I think it should be ok.

Warning: Make sure to check the system requirements for running GeneSCF. GeneSCF only works on Linux systems; it has been successfully tested on Ubuntu, Mint, CentOS and Windows 10 bash (version 1607 and above). Other Linux distributions might work as well.

I just downloaded a fresh version of GeneSCF to check for the problem you mentioned. I am not able to reproduce your error or the formatting issues. I am attaching a screenshot of the sample results downloaded from GeneSCF for ecocyc.

./prepare_database -db=GO_all -org=ecocyc
Downloading GO database....
Extracting ecocyc information...
Updating gene information...
Do not panic. The processing is going on...
Database retreived..You are now ready to use geneSCF with organism ecocyc from --database GO
Done....Mon May 27 22:52:47 CEST 2019

[Screenshot: downloaded sample results from GeneSCF for ecocyc]

written 20 days ago by EagleEye

Thanks again, this is extremely helpful. I ran it on OS X, but I will run it again on my Linux machine when I get to the lab tomorrow.

written 20 days ago by O.rka

Yes, that will solve the issue.

written 20 days ago by EagleEye

SMK wrote (22 days ago):

A slightly tricky approach using R that works well:

> library(tidyverse)
> library(GO.db)
> uniprot_id2go <-
+   read_tsv("https://www.uniprot.org/uniprot/?query=organism:83333&format=tab&columns=id,go-id") %>%
+   separate_rows(., `Gene ontology IDs`, sep = "; ") %>%
+   as.data.frame()
Parsed with column specification:
cols(
  Entry = col_character(),
  `Gene ontology IDs` = col_character()
)
> uniprot_id2go$Desc <- Term(uniprot_id2go$`Gene ontology IDs`)
> uniprot_id2go$Ontology <- Ontology(uniprot_id2go$`Gene ontology IDs`)
> str(uniprot_id2go)
'data.frame':   21436 obs. of  4 variables:
 $ Entry            : chr  "P07813" "P07813" "P07813" "P07813" ...
 $ Gene ontology IDs: chr  "GO:0002161" "GO:0004823" "GO:0005524" "GO:0005829" ...
 $ Desc             : chr  "aminoacyl-tRNA editing activity" "leucine-tRNA ligase activity" "ATP binding" "cytosol" ...
 $ Ontology         : chr  "MF" "MF" "MF" "CC" ...
> uniprot_id2go %>% filter(Entry == "P0A8V2")
   Entry Gene ontology IDs                                       Desc Ontology
1 P0A8V2        GO:0003677                                DNA binding       MF
2 P0A8V2        GO:0003899 DNA-directed 5'-3' RNA polymerase activity       MF
3 P0A8V2        GO:0005737                                  cytoplasm       CC
4 P0A8V2        GO:0005829                                    cytosol       CC
5 P0A8V2        GO:0006351               transcription, DNA-templated       BP
6 P0A8V2        GO:0016020                                   membrane       CC
7 P0A8V2        GO:0032549                     ribonucleoside binding       MF

Hope it helps.
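
Since the question also mentions Python, the row-splitting step of the R session above can be sketched with the standard library alone. This is a rough equivalent under stated assumptions: the column names `Entry` and `Gene ontology IDs` are taken from the R output, the sample rows are abbreviated from that output, and `explode_go_ids` and `group_by_go` are hypothetical helper names. It does not download anything from UniProt, and it does not look up term descriptions or ontologies the way GO.db does.

```python
import csv
import io
from collections import defaultdict


def explode_go_ids(handle, id_col="Entry", go_col="Gene ontology IDs"):
    """Yield one (entry, go_id) pair per GO ID in a semicolon-separated column."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for go_id in (row[go_col] or "").split(";"):
            go_id = go_id.strip()
            if go_id:
                yield row[id_col], go_id


def group_by_go(pairs):
    """Invert (entry, go_id) pairs into a {go_id: [entries]} mapping."""
    groups = defaultdict(list)
    for entry, go_id in pairs:
        groups[go_id].append(entry)
    return dict(groups)


# Sample rows abbreviated from the R output above (P07813 is truncated there)
sample = (
    "Entry\tGene ontology IDs\n"
    "P07813\tGO:0002161; GO:0004823; GO:0005524; GO:0005829\n"
    "P0A8V2\tGO:0003677; GO:0003899; GO:0005737; GO:0005829; "
    "GO:0006351; GO:0016020; GO:0032549\n"
)
pairs = list(explode_go_ids(io.StringIO(sample)))
go_to_entries = group_by_go(pairs)
```

Grouping by GO ID gives the "GO term with gene list" layout asked for in the question, with UniProt accessions as the gene identifiers.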
