Extract information from UniProt
3.8 years ago
Learner ▴ 250

I am wondering if anyone knows of a program or script that can retrieve information for more than 100 genes. Basically, I want the information related to "Biological process", "Molecular function" and "Cellular component".

Thanks a bunch

genome • 1.6k views

Can you explain what your input is? It may be a grep on a file in ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN but I can't tell from your question.


@Alex Reynolds the input can be either a protein name or a gene name. For instance, let's use a list of 7 human genes:

ERVMER34-1
BMP4
DNAJA1
ELANE
GZMB
RACK1
DNAJB1


@genomax this requires going through UniProt one entry at a time and copying and pasting the information from there. That is impractical when you have 100 or more genes. Do you know a better way?


These queries can be constructed programmatically. You will find help from UniProt here. They may also have a downloadable file on their FTP site that could be queried. As Alex said, other resources may have this information more readily available.
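
As a rough illustration of what "programmatically constructed" could look like, here is a minimal shell sketch that builds one search URL per gene. The endpoint (rest.uniprot.org), the query fields (gene_exact, organism_id) and the return fields (gene_primary, go_p, go_f, go_c) are assumptions on my part and should be checked against UniProt's own documentation; build_url is a hypothetical helper name.

```shell
# Assumed endpoint and field names; verify against UniProt's documentation before use.
base='https://rest.uniprot.org/uniprotkb/search'
fields='accession,gene_primary,go_p,go_f,go_c'

# build_url is a hypothetical helper: $1 is a gene symbol.
# 9606 is the NCBI taxonomy ID for human.
build_url() {
  printf '%s?query=gene_exact:%s+AND+organism_id:9606&fields=%s&format=tsv\n' \
    "$base" "$1" "$fields"
}

for gene in BMP4 GZMB RACK1; do
  url=$(build_url "$gene")
  echo "$url"
  # To actually fetch the table, uncomment the next line:
  # curl -s "$url"
done
```

The loop only prints the URLs, so it can be inspected before any request is sent.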


Google: retrieve uniprot mapping. Any luck?


@Biogeek I gave an example above: a list of genes. And of course I could not find anything on Google. Please use the following gene names as an example:

ERVMER34-1
BMP4
DNAJA1
ELANE
GZMB
RACK1
DNAJB1


The format can be txt, xls or whatever else is needed.

3.8 years ago

Given a list of IDs:

$ cat /tmp/list.txt
ERVMER34-1
BMP4
DNAJA1
ELANE
GZMB
RACK1
DNAJB1

Grab the GAF file of UniProt id-to-GO mappings:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human.gaf.gz | gunzip -c > /tmp/goa_human.gaf


$ grep -wf /tmp/list.txt /tmp/goa_human.gaf > /tmp/query_results.txt

Use GO.db in R to read in GO data, and read your query results into a data frame to get mapped GO terms:

> library("GO.db")
> go_term_table <- toTable(GOTERM)
> df <- read.table("/tmp/query_results.txt", header=F, fill=T)
> ids <- unique(df$V4)
> unique_go_ids <- ids[grepl("^GO:", ids)]


You can then query the GO term table against your identifiers; for example, for the Biological Process ontology:

> biological_process <- go_term_table[go_term_table$Ontology == "BP" & go_term_table$go_id %in% unique_go_ids, ]


Repeat as needed for the other ontologies. Use write.table and similar to write R results to a file, if needed.

See: http://bioconductor.org/packages/release/data/annotation/html/GO.db.html for information on how to install GO.db.
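
One caveat with grep -w is that it matches a symbol anywhere on the line, so a gene name that appears in another gene's description would also be pulled in. A minimal sketch of a stricter filter, assuming the standard GAF layout where column 3 is the gene symbol (DB_Object_Symbol), could look like this. The rows below are made-up toy data, not real annotations, and the file names under the temporary directory are placeholders.

```shell
# Toy inputs so the sketch runs on its own; in practice use /tmp/list.txt and /tmp/goa_human.gaf.
workdir=$(mktemp -d)
printf 'BMP4\nGZMB\n' > "$workdir/list.txt"
# Made-up tab-separated rows in GAF column order (not real annotations).
{
  printf '!gaf-version: 2.2\n'
  printf 'UniProtKB\tP00001\tBMP4\tenables\tGO:0000001\tPMID:1\tIDA\t\tF\ttoy protein A\n'
  printf 'UniProtKB\tP00002\tGZMB\tinvolved_in\tGO:0000002\tPMID:2\tIDA\t\tP\ttoy protein B\n'
  printf 'UniProtKB\tP00003\tXYZ1\tlocated_in\tGO:0000003\tPMID:3\tIDA\t\tC\tBMP4 binding toy protein\n'
} > "$workdir/toy.gaf"

# First file: remember each wanted symbol. Second file: skip "!" header lines
# and keep only rows whose third column is one of the wanted symbols.
awk -F'\t' 'NR == FNR { wanted[$1]; next } !/^!/ && ($3 in wanted)' \
  "$workdir/list.txt" "$workdir/toy.gaf" > "$workdir/query_results.tsv"

cat "$workdir/query_results.tsv"
```

On the toy data, the XYZ1 row is dropped even though its description mentions BMP4, whereas grep -wf would keep it.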


This was exceptionally helpful, and I appreciate you taking the time to write this out. I'll add, for future readers who come across this with gene lists similar to the OP's: using fgrep instead of grep can give a substantial speedup when the list.txt file is long.

https://stackoverflow.com/questions/13913014/grepping-a-huge-file-80gb-any-way-to-speed-it-up
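
For reference, fgrep is the older spelling of grep -F, which treats every pattern as a fixed string rather than a regular expression. A small sketch of the same filter with that flag, run on made-up toy rows (not real annotations) in a temporary directory:

```shell
# Toy inputs so the sketch runs on its own; swap in /tmp/list.txt and /tmp/goa_human.gaf.
workdir=$(mktemp -d)
printf 'BMP4\nRACK1\n' > "$workdir/list.txt"
# Made-up rows; BMP40 is an invented symbol used to show whole-word matching.
{
  printf 'UniProtKB\tP00001\tBMP4\tGO:0000001\n'
  printf 'UniProtKB\tP00002\tBMP40\tGO:0000002\n'
  printf 'UniProtKB\tP00003\tRACK1\tGO:0000003\n'
} > "$workdir/toy.gaf"

# -F: patterns are fixed strings, -w: match whole words only, -f: read patterns from a file
grep -F -w -f "$workdir/list.txt" "$workdir/toy.gaf" > "$workdir/query_results.txt"

cat "$workdir/query_results.txt"
```

The -w flag still applies with -F, so the invented BMP40 row is not matched by the BMP4 pattern.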


@Alex Reynolds what about "Molecular function" and "Cellular component"? I think I should use MF and CC.


Seems reasonable to use.
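
As a side note, the GAF file itself carries the same three-way split in its aspect column (column 9), coded as P, F and C for what GO.db calls BP, MF and CC. A minimal sketch that splits tab-separated query results by that column, using made-up toy rows and placeholder output file names:

```shell
# Toy query results so the sketch runs on its own (made-up rows, not real annotations).
workdir=$(mktemp -d)
{
  printf 'UniProtKB\tP00001\tBMP4\tenables\tGO:0000001\tPMID:1\tIDA\t\tF\n'
  printf 'UniProtKB\tP00001\tBMP4\tinvolved_in\tGO:0000002\tPMID:2\tIDA\t\tP\n'
  printf 'UniProtKB\tP00002\tGZMB\tinvolved_in\tGO:0000003\tPMID:3\tIDA\t\tP\n'
  printf 'UniProtKB\tP00002\tGZMB\tlocated_in\tGO:0000004\tPMID:4\tIDA\t\tC\n'
} > "$workdir/query_results.tsv"

# Column 9 is the GAF aspect: P = biological process, F = molecular function,
# C = cellular component. Each row goes to the file for its aspect.
awk -F'\t' -v dir="$workdir" '
  $9 == "P" { print > (dir "/biological_process.tsv") }
  $9 == "F" { print > (dir "/molecular_function.tsv") }
  $9 == "C" { print > (dir "/cellular_component.tsv") }
' "$workdir/query_results.tsv"

wc -l "$workdir"/*_*.tsv
```

This keeps the gene symbol on every row, which the unique GO ID vector in R does not.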


@Alex Reynolds do you know how to find out which information I can extract from go_term_table? I tried ?go_term_table and help, but they do not show anything. I also googled it with no success. I would appreciate it if you could point me to some documentation. Basically, I want to add the gene name to the gene ID, definition, etc.


go_term_table is the name of a variable, so you're not going to get anything out of R from running ?go_term_table.

Run ?toTable if you want to learn about that command, but maybe start with the vignette and then read documentation about specific commands:


@Alex Reynolds Thanks for the link. Is it possible to keep the information from "query_results" merged with the GO data, or at least to see the gene name? I think what you get from the first part is the GO IDs, and then you extract the data from GO.db.


Maybe use join functions to connect the go_term_table lookup with results from df (query_results.txt): https://dplyr.tidyverse.org/reference/join.html

I'd think you could join on the GO:xyz identifier, for instance.
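
Before joining in R, it may help to keep the gene symbol next to each GO ID instead of reducing the results to a vector of unique IDs. A minimal sketch, assuming tab-separated GAF rows (column 3 symbol, column 5 GO ID, column 9 aspect); the rows are made-up toy data and the output file names are placeholders:

```shell
# Toy query results so the sketch runs on its own (made-up rows, not real annotations).
workdir=$(mktemp -d)
{
  printf 'UniProtKB\tP00001\tBMP4\tenables\tGO:0000001\tPMID:1\tIDA\t\tF\n'
  printf 'UniProtKB\tP00002\tGZMB\tenables\tGO:0000001\tPMID:2\tIDA\t\tF\n'
  printf 'UniProtKB\tP00002\tGZMB\tinvolved_in\tGO:0000002\tPMID:3\tIDA\t\tP\n'
  printf 'UniProtKB\tP00002\tGZMB\tinvolved_in\tGO:0000002\tPMID:4\tIEA\t\tP\n'
} > "$workdir/query_results.tsv"

# Unique (symbol, GO ID, aspect) triples: a table that can be joined on the GO ID.
cut -f3,5,9 "$workdir/query_results.tsv" | sort -u > "$workdir/gene_go_pairs.tsv"

# One line per GO ID with a comma-separated list of the genes annotated to it.
awk -F'\t' '
  { if ($2 in genes) genes[$2] = genes[$2] "," $1; else genes[$2] = $1 }
  END { for (id in genes) print id "\t" genes[id] }
' "$workdir/gene_go_pairs.tsv" | sort > "$workdir/genes_per_go.tsv"

cat "$workdir/genes_per_go.tsv"
```

The first table has one row per gene and GO ID, so several genes sharing a term stay visible; the second collapses them into one row per term.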


@Alex Reynolds I think many genes are assigned to one GO term. Do you think it is possible to do the join before this step? ids <- unique(df$V4)

3.8 years ago

You can use UniProt for a list: click on Retrieve/ID mapping https://www.uniprot.org/uploadlists/
1- Enter your list as a file or as copied text.
2- Specify your list identifiers. For gene names you can optionally specify a species; otherwise all species that contain these gene names will be included in your results.

You can control what is in your results table. You need BP, MF and CC, so edit the columns to show them by ticking them in the Gene Ontology (GO) tab.

https://www.uniprot.org/uniprot/?query=yourlist:M201812066746803381A1F0E0DB47453E0216320D06CFD34&sort=yourlist:M201812066746803381A1F0E0DB47453E0216320D06CFD34