Get relationships and hierarchies for GO terms
2
1
Entering edit mode
15 months ago

Hello all,

This is a question for possible resources. I have obtained several GO terms (indicating function and biological process) for several proteins. It is somewhat hard to see common patterns for all of them, and I thought that given the existing hierarchies between GO terms, there should be some existing resource that can take them and return the connection- hierarchy among them.

Does anyone know of a software/resource that does this?

Thank you!

GO-terms Gene-onthology • 2.5k views
ADD COMMENT
2
Entering edit mode

For R-users might be interesting to look at notebook "CAFA5: train set GO terms" https://www.kaggle.com/code/antoninadolgorukova/cafa5-train-set-go-terms-eda

For Python users - the package "obonet" provides lots of functionalites, see its description https://github.com/dhimmel/obonet . And nice notebooks examples: https://www.kaggle.com/code/leonidkulyk/eda-cafa5-pfp-interactive-dags-plotly?scriptVersionId=126338755&cellId=32 or even a "live" widget: https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/discussion/402840

More notebooks exploring Gene Ontology can be found on Kaggle here: https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/code

ADD REPLY
1
Entering edit mode

Something like GO enrichment analysis?

ADD REPLY
3
Entering edit mode
15 months ago
LauferVA 4.4k

Hey Daniel,

First, one potential caveat. Above, you say "I have obtained several GO terms (indicating function and biological process) for several proteins". Does this mean you do, or do not have a genome-wide dataset? If you only have a handful of genes, it may not be possible to accurately define what is going on for pathways whose members are largely missing from your list. Also, are you referring to gene expression data, or proteomics?

The challenge What you are proposing is challenging because it is cell-type and cell-state specific. In other words, because gene expression varies according to cell type as well as current environmental conditions, there is not one connection hierarchy that describes how pathways relate, regulate, or influence each other - there is a spectrum of them.

To create a resource that did this, i.e., that defined connection hierarchies across a multitude of cell types and states would require high quality data from specific cell types that was well-annotated. At present, whether or not this resource already exists (and its quality, if it does exist) depends on what exactly you are studying. Even if you do find such a resource, the burden of proof is on that scientist to demonstrate why that pre-existing annotation scheme is thought accurate enough to use to annotate her or his data.

Approaches There are a few other general approaches to doing this, though.

1) If your sample size is large enough, you may be able to derive pathway relationships empirically using your own data. As an example for how this would work, consider 2 pathways, X and Y. Suppose across all 87 of your samples, in every case in which pathway X is upregulated, pathway Y is downregulated. Then, empirically, the data may be indicating that the pathways have feedback on one another (alternatively, it could mean something else is responsible for those effects).

2) One can go to the literature and manually curate the relationships between the pathways about which they are most interested. This has been done, including in high-impact publications, because often there is no sufficiently exact connection hierarchy to date. This may seem arbitrary. In other words, what keeps the scientist from cherry picking what pathways to include in their hierarchy? Having said that, keep in mind that in many cases the use of published pathway ontologies may be no less biased. Finally, many of the published gene sets themselves are actually curated in this way, biased or not. Consider signature set C2 from MSigDB.

Consider, for instance, the hallmark gene sets found in MSigDB. Note that, as the description on this page states, these gene sets are formed by overlapping gene signatures in many datasets in order to find pathways that are generally coherent across multiple experiments. However, this may mean that the unique biology found in one specific experiment isn't perfectly represented by the hallmark set.

Overall, these limitations mean that it is important to look at the dataset from which a given pathway was constructed in order to begin to understand its applicability. For instance, this pathway defines targets of TGFB in hepatocellular carcinoma cells. Because TGFB displays extreme variability in its range of effects depending upon the tissue in which it is found, use of this pathway may or may not be meaningful outside of hepatocellular carcinoma.

-- With all of that as background, here are some places to start --

1) Consider reading the WGCNA manuscript, an article with over 15,000 citations to date. This manuscript will discuss gene co-expression in a way I think that will be helpful to you, and if you identify the functionality you need from the paper (the introduction defines 10 or so analytical goals), you can try the WGCNA package.

2) It is possible with much less work to look at co-variance between lists of genes in your data if you have it narrowed down to a handful of pathways.

Hope some of this helps.

ADD COMMENT
0
Entering edit mode

Thank you so much for such a well structured answer. Indeed, many of your points make a lot of sense to me and I have been thinking about it (for example, the cherry-picking part; I have realised that I have biassed myself to whatever process involved in development and organogenesis and also that I have rather too many possibilities to choose from).

I will try to see if I can take something out of it in, as much as possible, an objective manner.

Thank you

ADD REPLY
1
Entering edit mode

GREAT response - a lot of times people just want to get going, and they want the code to make it work, not an understanding of what they are doing in the first place. But you, I think, are ready for the next steps.

Check this code out, this is a modestly simplified version of what I use in a development-grade pipeline:

########################################################################
############ Part 4: GSEA analysis and pathway visualization ###########
########################################################################
# Install or update, then load, the necessary libraries.
pathwayLibs<-c('limma', 'msigdb', 'ExperimentHub', 'GSEABase', 'GO.db', 'KEGGREST', 'ReactomePA', 'GOSemSim', 'AnnotationHub', 'EnsDb.Hsapiens.v86')
BiocManager::install(pathwayLibs, update = TRUE, ask = FALSE)
lapply(pathwayLibs, library, character.only = TRUE)

# Get up-to-date versions of all curated pathways and gene sets. 
msigdbHsSym7.5= getMsigdb(org = 'hs', id = 'SYM', version = '7.5')          # 45.16k pathways; 40.96k genes.
msigdbHsSym7.5<-appendKEGG(msigdbHsSym7.5)                  # Add KEGG. 45.43k pathways; 40.96k genes.

# Send all the ontologies to a .gmt formatted file:
subsetNames<-listCollections(msigdbHsSym7.5)
AllCollections<-subsetCollection(msigdbHsSym7.5, subsetNames) # transform the entire Db at once.
AllPathways<-geneIds(AllCollections)

I recommend using an implementation like the above for 3 related reasons:

1) Each is maintained through Bioconductor by the relevant consortium itself.

2) Attached metadata. Everything necessary to either understand what was used to construct the pathway (or ontology)ispre-attached to each record. For that reason, if a pathway catches your eye, you can, in the same Rstudio session rapidly (see second comment post, below, the vignette, for how to get that info) determine if its something you want to use or not.

3) Time - both in the sense of saving time and recording the the date of access, etc.

First, let's talk about another, less good, way to do it: I've seen people read pathway data from a flat file they found online or more often, were given by a friend or colleague. Instead of installing a package, they load the data from that file. That may work for one project, but it won't keep you up to date forever... if you generate more than one dataset, its pretty easy to eventually lose track and forget what version or even database your pathways came from. when I was a beginner, I did that. It was so hard enough to even just get code working that I would use any snippet I found that from here or there that seemed to work instead of taking time out to think how it should be done.

Ok, now back to this way. The ironic thing is, ultimately, using packages like these is easier, too. This full form of this same script writes the methods section of the manuscript. That saves the time that the flat file guy spends 1.5 years later, when they have to go try to find out what the hell they used post hoc. by contrast, if one takes the time to understand how to get everything out of each object, this is easy. therefore, at present, anyway this is as close to the way pathway analysis should be done as we can get right now (even if the ontologies themselves still need a lot of work) ...

ADD REPLY
1
Entering edit mode

Reply #2:

The below code is adapted and simplified from the manuals/vignettes of GO.db and msigdb:

# Install, Update, or load all packages needed (not all used in this tutorial):
pathwayLibs<-c('limma', 'msigdb', 'ExperimentHub', 'GSEABase', 'GO.db', 'KEGGREST', 'ReactomePA', 'GOSemSim', 'AnnotationHub', 'EnsDb.Hsapiens.v86')
BiocManager::install(pathwayLibs, update = TRUE, ask = FALSE); lapply(pathwayLibs, library, character.only = TRUE)

##################################################################################################################
########## MSigDb --> First, load MSigDb, KEGG, and Reactome using AnnotationHub() and library(msigdb)############
##################################################################################################################
# browseVignettes('msigdb') # for syntax and/or documentation
msigdbHsSym7.5= getMsigdb(org = 'hs', id = 'SYM', version = '7.5')          # 45.16k pathways; 40.96k genes.
msigdbHsSym7.5<-appendKEGG(msigdbHsSym7.5)                                  # Add KEGG. 45.43k pathways; 40.96k genes.

# Survey the number of ontologies in each at the level of signature (c1:c8) or hallmarks
table(sapply(lapply(msigdbHsSym7.5, collectionType), bcCategory))
table(sapply(lapply(msigdbHsSym7.5, collectionType), bcSubCategory))
hist(sapply(lapply(msigdbHsSym7.5, geneIds), length), main = 'MSigDB signature size distribution', xlab = 'Signature size') # make a hist showing signature size.

### Commands to List sets of ontologies.
listCollections(msigdbHsSym7.5)         # list out individual Collections singly.
listSubCollections(msigdbHsSym7.5). # list out individual SubCollections singly.

### Commands to obtain subsets of ontologies
hallmarkPathways<-subsetCollection(msigdbHsSym7.5, 'h') # this corresponds to pulling out the collections that the above commands list...
GO_Pathways<-subsetCollection(msigdbHsSym7.5, 'c5', 'GO:BP')

### Commands relating to a single pathway or ontology, including how to obtain metadata
gs = msigdbHsSym7.5[[1000]]                         # Access a single record (no matter the collection)
geneIds(gs)                                         # return the geneIDs in a given pathway
collectionType(gs)                                  # get metadata about the type of ontology it is
bcCategory(collectionType(gs)); bcSubCategory(collectionType(gs))
description(gs)                  # good single line summary 
details(gs)                         # more exhaustive metadata about the collection, corresponding to the same info. you'd see on a pathway webpage, *e.g.*: https://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_IL6_JAK_STAT3_SIGNALING

######### Send ontologies in msigdb database to GTF file or other programs #########
#  Any gene-set collection can be easily transformed to gtf or for usage with limma::fry 
#  1) Transform some subset of all pathways into a list of gene IDs 
hallmarks = subsetCollection(msigdbHsSym7.5, 'h')       # grab hallmark pathways as above (using subsetCollection() )
msigdb_ids = geneIds(hallmarks)                         # Similar to the output of reading using `gmtPathways()`

#  2) For limma::fry, we need second transform based on indices: 
fry_indices = ids2indices(msigdb_ids, rownames(emat)); fry_indices[1:2] # will return list of pathway names with character vector of gene names.
#> $HALLMARK_TNFA_SIGNALING_VIA_NFKB
#>   [1]   109   257   369   545   567   661   821   837   966   980  1299  1502
#>  [13]  1528  2155  2365  2873  2940  3090  3205  3208  3252  3931  3970  3974

##########################################################################
#############  GO: curated relationships between ontologies  #############
##########################################################################
# msigdb has GO in c5. However, it does NOT have the complex relationships found in GO.db:
ls('package:GO.db') # All of these are different relationships between ontologies.
#  [1] "GO"            "GO_dbconn"     "GO_dbfile"     "GO_dbInfo"     "GO_dbschema"   "GO.db"         "GOBPANCESTOR"  "GOBPCHILDREN"  "GOBPOFFSPRING" "GOBPPARENTS"   "GOCCANCESTOR"  "GOCCCHILDREN"  "GOCCOFFSPRING" "GOCCPARENTS"  
# [15] "GOMAPCOUNTS"   "GOMFANCESTOR"  "GOMFCHILDREN"  "GOMFOFFSPRING" "GOMFPARENTS"   "GOOBSOLETE"    "GOSYNONYM"     "GOTERM"

# Pull any of these out with as.list():
GOdbList <- as.list(GOTERM)

# Write an easily readable version of the session info to a text file.
fPath<-writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
ADD REPLY
0
Entering edit mode

Hello again,

I am so sorry for not having replied to this in a long time. I have lost the habit of keeping both stackoverflow and biostars open. I have to admit to you, since I have dealing with this same paper for so long (and with some pain attached), part of me really wants to get going and get it out of the way, making sure that I am not saying anything wrong of course. At the same time, given the dataset I have (you asked previously if it is genome-wide I can reply that the SNPs I get are in all chromosomes but they are just a handful), I just have not enough trust on finding out common, biologically-relevant-for-the-sake-of-the-study pathways. Once I had the hits from the blasting run, and a bunch of proteins, I wanted to see if out of luck they would show commonalities, then I posted the question. I will carefully go again through your full answer (such an amazing effort from your part, thank you so much). Otherwise, I believe I am just going to be honest and state that although I have protein hits, the data is not sufficient to make an informed inference on specific selection patterns (which I would say is true enough).

Good summer and thank you!

ADD REPLY
0
Entering edit mode

An add-on. The data I get is from genetic variation (DNA), not genetic expression (something that you could study with RNAseq). I believe from what you wrote "What you are proposing is challenging because it is cell-type and cell-state specific. In other words, because gene expression varies according to cell type as well as current environmental conditions, there is not one connection hierarchy that describes how pathways relate, regulate, or influence each other - there is a spectrum of them." you are understanding that I am talking about genetic expression results. I just wanted to clarify, the idea is to determine how likely are these sequences to be selective: having protein hits is a point in favour, showing that they are involved in similar pathways would be another point in favour. I hope this makes sense

ADD REPLY
2
Entering edit mode
15 months ago
rfran010 ★ 1.3k

Maybe stringdb? seems to have an option to do what you describe.

https://string-db.org/cgi/input?sessionId=bmYwdg48WePq&input_page_active_form=single_term

ADD COMMENT

Login before adding your answer.

Traffic: 1592 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6