Question: extract information from Uniprot
0
Learner 130 wrote:

I am wondering if anyone knows of a program or script that can retrieve information for more than 100 genes from UniProt. Basically, I want to get the annotations listed under "Biological process", "Molecular function" and "Cellular component".

Thanks a bunch

genome • 290 views
ADD COMMENTlink modified 10 weeks ago by sammer.kamal9110 • written 10 weeks ago by Learner 130

Can you explain what your input is? It may be a grep on a file in ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN but I can't tell from your question.

ADD REPLYlink written 10 weeks ago by Alex Reynolds27k

@Alex Reynolds the input can be either protein names or gene names. For instance, let's use a list of 7 human genes:

ERVMER34-1
BMP4 
DNAJA1
ELANE
GZMB
RACK1
DNAJB1
ADD REPLYlink written 10 weeks ago by Learner 130

https://www.uniprot.org/uniprot/?query=gene:BMP4+AND+reviewed:yes+AND+organism:9606#goViewBy
https://www.uniprot.org/uniprot/?query=gene:ELANE+AND+reviewed:yes+AND+organism:9606#goViewBy

Construct others as needed.

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by genomax62k

@genomax this requires going through UniProt one gene at a time and then copying and pasting the info from there. That is impractical when you have 100 or more genes. Do you know a better way?

ADD REPLYlink written 10 weeks ago by Learner 130

These queries can be constructed programmatically. You will find help from UniProt here. They may also have a downloadable file on their FTP site that could be queried. As Alex said, other resources may have this information more readily available.
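As a rough illustration of building these queries programmatically, the sketch below prints one URL per gene symbol. It reuses the URL pattern from the two links posted earlier in the thread (human, reviewed entries only); the function name `build_urls` is made up for this example, and it only prints the URLs rather than fetching anything.

```shell
#!/bin/sh
# Build one UniProt query URL per gene symbol read from stdin.
# The URL pattern is copied from the links earlier in the thread;
# reviewed:yes and organism:9606 (human) are carried over from those links.
# build_urls is a hypothetical helper name, not a UniProt tool.
build_urls() {
    while IFS= read -r gene; do
        # Strip spaces, tabs and carriage returns; skip blank lines.
        gene=$(printf '%s' "$gene" | tr -d ' \t\r')
        [ -n "$gene" ] || continue
        printf 'https://www.uniprot.org/uniprot/?query=gene:%s+AND+reviewed:yes+AND+organism:9606\n' "$gene"
    done
}

# Example with three of the gene symbols from the question.
printf 'BMP4 \nELANE\nGZMB\n' | build_urls
```

With a real list, `build_urls < /tmp/list.txt` would do the same for every gene in the file, assuming one symbol per line.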

ADD REPLYlink written 10 weeks ago by genomax62k

Google: retrieve uniprot mapping. Any luck?

Tell us what you have as your identifiers/ file formats. Print the head of your list/file.

ADD REPLYlink written 10 weeks ago by Biogeek340

@Biogeek I gave an example above: a list of genes. And of course I could not find anything on Google. Please use the following gene names as an example:

ERVMER34-1
BMP4 
DNAJA1
ELANE
GZMB
RACK1
DNAJB1

The format can be txt, xls, or whatever else is needed.

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Learner 130
3
Alex Reynolds27k wrote:

Given a list of IDs:

$ cat /tmp/list.txt 
ERVMER34-1
BMP4
DNAJA1
ELANE
GZMB
RACK1
DNAJB1

Grab the GAF file of UniProt id-to-GO mappings:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human.gaf.gz | gunzip -c > /tmp/goa_human.gaf

Query your list of identifiers:

$ grep -wf /tmp/list.txt /tmp/goa_human.gaf > /tmp/query_results.txt
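Because `grep` looks at the whole line, a symbol can also match text in other columns. A stricter variant, sketched here with `awk`, compares each listed gene only against the symbol column, which is field 3 of the tab-separated GAF format. The function name `filter_by_symbol` and the toy rows (placeholder accessions and references) are made up for illustration.

```shell
#!/bin/sh
# Keep GAF rows whose symbol column (field 3) exactly matches a listed gene.
# Usage: filter_by_symbol LIST_FILE GAF_FILE
# filter_by_symbol is a hypothetical helper name.
filter_by_symbol() {
    awk -F '\t' '
        NR == FNR {                      # first file: the gene list
            gsub(/[ \t\r]+/, "", $0)     # drop stray whitespace
            if ($0 != "") wanted[$0] = 1
            next
        }
        /^!/ { next }                    # skip GAF header/comment lines
        ($3 in wanted)                   # print rows whose symbol is listed
    ' "$1" "$2"
}

# Toy demonstration: the accessions below are placeholders, not real entries.
tmp=$(mktemp -d)
printf 'BMP4 \nELANE\n' > "$tmp/list.txt"
printf '!gaf-version: 2.2\n' > "$tmp/sample.gaf"
printf 'UniProtKB\tACC1\tBMP4\tinvolved_in\tGO:0008150\n' >> "$tmp/sample.gaf"
printf 'UniProtKB\tACC2\tXBMP4\tinvolved_in\tGO:0008150\n' >> "$tmp/sample.gaf"
filter_by_symbol "$tmp/list.txt" "$tmp/sample.gaf"
rm -rf "$tmp"
```

On the real data this would be called as `filter_by_symbol /tmp/list.txt /tmp/goa_human.gaf > /tmp/query_results.txt`.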

Use GO.db in R to read in GO data, and read your query results into a data frame to get mapped GO terms:

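> library("GO.db")
> go_term_table <- toTable(GOTERM)
> # read.table() splits on any whitespace by default, so column positions can shift between rows
> # (in the tab-separated GAF file itself, the GO ID is field 5 and the Qualifier is field 4)
> df <- read.table("/tmp/query_results.txt", header=FALSE, fill=TRUE)
> ids <- unique(df$V4)
> # keep only the values that are GO identifiers
> unique_go_ids <- ids[grepl("^GO:", ids)]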

You can then query the GO term table against your identifiers; for example, for the Biological Process ontology:

> biological_process <- go_term_table[go_term_table$Ontology == "BP" & go_term_table$go_id %in% unique_go_ids, ]

Repeat as needed for the other ontologies ("MF" and "CC"). Use write.table or a similar function to write R results to a file, if needed.
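The ontology split can also be done straight from the query results, since the GAF format carries an aspect code in field 9: P for Biological Process, F for Molecular Function and C for Cellular Component. The following is a minimal sketch under the assumption that the results file is still tab-separated GAF; `go_by_aspect` is a made-up name and the rows are toy data with placeholder accessions.

```shell
#!/bin/sh
# Print unique "symbol<TAB>GO ID" pairs for one GAF aspect code: P, F or C.
# Usage: go_by_aspect ASPECT QUERY_RESULTS
# go_by_aspect is a hypothetical helper name.
go_by_aspect() {
    awk -F '\t' -v aspect="$1" '$9 == aspect { print $3 "\t" $5 }' "$2" | LC_ALL=C sort -u
}

# Toy demonstration with 9-field rows (placeholder accession and reference).
tmp=$(mktemp -d)
printf 'UniProtKB\tACC1\tBMP4\tinvolved_in\tGO:0008150\tREF\tND\t\tP\n' > "$tmp/query_results.txt"
printf 'UniProtKB\tACC1\tBMP4\tenables\tGO:0003674\tREF\tND\t\tF\n' >> "$tmp/query_results.txt"
printf 'UniProtKB\tACC1\tBMP4\tlocated_in\tGO:0005575\tREF\tND\t\tC\n' >> "$tmp/query_results.txt"
for aspect in P F C; do
    echo "aspect $aspect:"
    go_by_aspect "$aspect" "$tmp/query_results.txt"
done
rm -rf "$tmp"
```

This gives GO identifiers per gene and ontology; the term names and definitions still have to come from a lookup such as GO.db.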

See: http://bioconductor.org/packages/release/data/annotation/html/GO.db.html for information on how to install GO.db.

ADD COMMENTlink written 10 weeks ago by Alex Reynolds27k
1

This was exceptionally helpful, and I appreciate you taking the time to write this out. For anyone who comes across this later with a gene list similar to the OP's: using fgrep (equivalent to grep -F) instead of plain grep can be substantially faster when the list.txt file is long.

https://stackoverflow.com/questions/13913014/grepping-a-huge-file-80gb-any-way-to-speed-it-up

ADD REPLYlink written 4 weeks ago by Collin10

@Alex Reynolds what about "Molecular function" and "Cellular component"? I think I should use MF and CC.

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Learner 130

Seems reasonable to use.

ADD REPLYlink written 10 weeks ago by Alex Reynolds27k

@Alex Reynolds do you know how to find out which info I can extract from go_term_table? I tried to list it using ?go_term_table or help, but that does not show anything. I also googled it with no success. I would appreciate it if you could direct me to some info. Basically, I want to add the gene name to the GO ID, definition, etc.

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Learner 130

go_term_table is the name of a variable, so you're not going to get anything out of R from running ?go_term_table.

Run ?toTable if you want to learn about that command, but maybe start with the vignette and then read documentation about specific commands:

• https://www.bioconductor.org/packages/release/bioc/vignettes/annotate/inst/doc/GOusage.pdf

• http://bioconductor.org/packages/release/data/annotation/manuals/GO.db/man/GO.db.pdf

ADD REPLYlink written 10 weeks ago by Alex Reynolds27k

@Alex Reynolds Thanks for the links. Is it possible to keep the information from "query_results" merged with the GO data, or at least to see the gene name? I think what you get from the first part is the GO IDs, and then you extract the data from GO.db.

ADD REPLYlink written 10 weeks ago by Learner 130

Maybe use join functions to connect the go_term_table lookup with results from df (query_results.txt): https://dplyr.tidyverse.org/reference/join.html

I'd think you could join on the GO:xyz identifier, for instance.
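One way to picture that join outside of R is the `awk` sketch below. It assumes the GO term lookup has been exported to a three-column, tab-separated file (GO ID, term name, ontology) and that the query results are tab-separated GAF with the symbol in field 3 and the GO ID in field 5. The file layout, the function name `annotate_go` and the toy rows are all assumptions for illustration.

```shell
#!/bin/sh
# Attach term name and ontology to each query row, keeping the gene symbol.
# Usage: annotate_go GO_TERMS_TSV QUERY_RESULTS
# GO_TERMS_TSV columns (assumed export layout): GO ID, term name, ontology.
# annotate_go is a hypothetical helper name.
annotate_go() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { term[$1] = $2; onto[$1] = $3; next }   # load the lookup
        ($5 in term) { print $3, $5, onto[$5], term[$5] }  # symbol, GO ID, ontology, term
    ' "$1" "$2"
}

# Toy demonstration using the three GO root terms and placeholder accessions.
tmp=$(mktemp -d)
printf 'GO:0008150\tbiological_process\tBP\n' > "$tmp/go_terms.tsv"
printf 'GO:0003674\tmolecular_function\tMF\n' >> "$tmp/go_terms.tsv"
printf 'UniProtKB\tACC1\tBMP4\tinvolved_in\tGO:0008150\n' > "$tmp/query_results.txt"
printf 'UniProtKB\tACC2\tELANE\tenables\tGO:0003674\n' >> "$tmp/query_results.txt"
annotate_go "$tmp/go_terms.tsv" "$tmp/query_results.txt"
rm -rf "$tmp"
```

Rows whose GO ID is missing from the lookup are dropped, which corresponds to an inner join.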

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Alex Reynolds27k

@Alex Reynolds I think many genes are assigned to one GO term. Do you think it is possible to merge in the gene names before you do this step? ids <- unique(df$V4)
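A possible way to keep every gene attached to its GO ID, before any de-duplication discards the gene names, is to collapse the query results into one line per GO ID. This sketch again assumes tab-separated GAF rows (symbol in field 3, GO ID in field 5); `genes_per_go` is a made-up name and the rows are toy data.

```shell
#!/bin/sh
# Collapse query results into "GO ID<TAB>gene1,gene2,..." lines.
# Usage: genes_per_go QUERY_RESULTS
# genes_per_go is a hypothetical helper name.
genes_per_go() {
    awk -F '\t' '{ print $5 "\t" $3 }' "$1" | LC_ALL=C sort -u |
    awk -F '\t' -v OFS='\t' '
        $1 != prev {                       # a new GO ID starts
            if (NR > 1) print prev, genes  # flush the previous group
            prev = $1; genes = $2; next
        }
        { genes = genes "," $2 }           # same GO ID: append the gene
        END { if (NR > 0) print prev, genes }
    '
}

# Toy demonstration with placeholder accessions.
tmp=$(mktemp -d)
printf 'UniProtKB\tACC1\tBMP4\tinvolved_in\tGO:0008150\n' > "$tmp/query_results.txt"
printf 'UniProtKB\tACC2\tELANE\tinvolved_in\tGO:0008150\n' >> "$tmp/query_results.txt"
printf 'UniProtKB\tACC3\tGZMB\tlocated_in\tGO:0005575\n' >> "$tmp/query_results.txt"
genes_per_go "$tmp/query_results.txt"
rm -rf "$tmp"
```

The first column of this output is already a unique list of GO IDs, so it could stand in for the `unique()` step while the second column keeps the gene names.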

ADD REPLYlink written 8 weeks ago by Learner 130
0
sammer.kamal9110 wrote:

You can use UniProt for a list: click on Retrieve/ID mapping (https://www.uniprot.org/uploadlists/).

1- Enter your list as a file or as pasted text.
2- Specify the identifier type of your list. For gene names, you can optionally specify a species; otherwise, every species that has these gene names will be included in your results.

You can control what appears in your results table. You need BP, MF, and CC, so edit the columns and tick them under the Gene Ontology (GO) tab.

https://www.uniprot.org/uniprot/?query=yourlist:M201812066746803381A1F0E0DB47453E0216320D06CFD34&sort=yourlist:M201812066746803381A1F0E0DB47453E0216320D06CFD34

ADD COMMENTlink written 10 weeks ago by sammer.kamal9110