GO Terms significance Scoring
3 months ago
Marek • 0

Hi,

I have a question regarding InterProScan and similarity tools like BLAST and Foldseek.

When I run BLAST, Foldseek and optionally ELM to obtain GO terms, I get different results than from InterPro (I know InterPro does domain analysis, whereas BLAST etc. give me either more specific GO terms or irrelevant ones).

How can I account for those GO terms derived from BLAST etc. to get valid information about a sequence? Is there a way to assign weights to them? Or to propagate up the GO DAG to a certain depth to get more general terms (ancestors)?

I cannot really use go-slim because I am working with non-annotated sequences.

Thank you.

go semantic annotation • 9.0k views
13 days ago

Hi Marek,

I need to point out misleading information in the other reply, about "specifying depth without over-broadening". The level of a term in GO has no correlation with its information content; for one thing, terms have no defined level, as mentioned in our FAQ:

How can I calculate the “level” of a GO term?

GO terms do not occupy strict fixed levels in the hierarchy. Because GO is structured as a graph, terms would appear at different ‘levels’ if different paths were followed through the graph. This is especially true if one mixes the different relations used to connect terms.

A more informative metric would be the information content of the node based on annotations. See, for example, the work of Alterovitz et al.
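
As a rough illustration of both points in the FAQ excerpt, here is a minimal Python sketch on a toy graph. The term names, the `PARENTS` mapping and the `ANNOTATIONS` table are hypothetical and are not real GO data. The information content used here is the common annotation-frequency form, IC(t) = -log2 p(t), where p(t) is the fraction of annotated gene products annotated to t or to any of its descendants; other definitions exist.

```python
import math

# Toy DAG, child -> parents. Hypothetical term names, not real GO IDs.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["C", "B"],  # reachable from the root by two paths of different length
}

def path_lengths_to_root(term):
    """Return the sorted distinct path lengths from term up to the root."""
    if not PARENTS[term]:
        return [0]
    lengths = set()
    for parent in PARENTS[term]:
        for n in path_lengths_to_root(parent):
            lengths.add(n + 1)
    return sorted(lengths)

def ancestors(term):
    """Return term plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(annotations):
    """IC(t) = -log2(fraction of gene products annotated to t or a descendant)."""
    counts = {t: 0 for t in PARENTS}
    for terms in annotations.values():
        covered = set()
        for t in terms:
            covered |= ancestors(t)  # an annotation also counts for every ancestor
        for t in covered:
            counts[t] += 1
    total = len(annotations)
    return {t: -math.log2(c / total) for t, c in counts.items() if c > 0}

# Hypothetical annotation table: gene product -> directly annotated terms.
ANNOTATIONS = {"g1": ["D"], "g2": ["C"], "g3": ["B"], "g4": ["A"]}

levels_of_d = path_lengths_to_root("D")
ic = information_content(ANNOTATIONS)
```

In this toy graph, `D` sits at "level" 2 or 3 depending on the path taken, and `B` and `C` end up with the same information content even though they sit at different distances from the root.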

GO was not designed to be a defining answer for the beginning of projects, but is instead a controlled language that allows you to take information from some work (here, your BLAST alignments) and aid your next steps. For this question, the better your alignment, the more confidence I'd have in the assignment of a GO term to that domain/gene product. If you can compare your sequence to the match with multiple methods, that's one way to increase the confidence in assigning a GO term. In fact, evaluating BLAST matches is so frequently done in GO, there's a specific type of evidence code for it when MODs or other professional curators use that as the method to assign an "official" GO term to a gene product. Here's an example from the GO curation guidelines:

An ISS annotation is often based on more than just one type of sequence-based evidence. Often, a host of searches are performed for any given query protein. These searches might include BLAST, profile HMMs, TMHMM, SignalP, PROSITE, InterPro, etc. Evaluation of output from these search tools (bear in mind that every search may not yield results for every protein) leads an annotator to a particular ISS annotation for a particular protein. For example, a BLAST search might reveal that a query protein matches an experimentally characterized protein from another species at 50% identity over the full lengths of both proteins. After reading literature about the match protein, the curator sees that the match protein is known to contain a domain located in the plasma membrane and another domain that extends into the cytoplasm. It is also known from the literature that the experimentally characterized match protein requires the binding of ATP to function. TMHMM analysis of the query protein predicts several membrane spanning regions in one half of the protein (consistent with location in a membrane). In addition there are PROSITE and Pfam results which reveal the presence of an ATP-binding domain in the other half of the protein which TMHMM predicts to be cytoplasmic. These four search results taken together point to a probable identification of the query protein as having the function of the match protein.

Lastly, you mention you "cannot really use go-slim because I am working with non-annotated sequences." You absolutely can use a GO slim (aka GO Subset), like the Generic GO Slim. The point of the slim is to summarize a set of GO terms: you can imagine it as a set of relatively high-level buckets that, collectively, contain (nearly) all the terms in the ontology and help you simplify a large list of terms. Once you have a set of terms from your alignments, you can use the full ontology to map those terms up to terms in the slim. I recommend the generic slim, but you might find that an option like the yeast or plant slim more closely reflects your organism.

Grab the correct slim file at https://geneontology.org/docs/go-subset-guide/

You can get the ontology file from https://geneontology.org/docs/download-ontology/
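
To make the "map your terms up to the slim" step concrete, here is a minimal Python sketch of the idea. Everything in it is an assumption for illustration: the term names, the `PARENTS` graph and the `SLIM` set are hypothetical stand-ins for what you would parse from the ontology and slim files linked above.

```python
# Toy ontology, child -> parents. Hypothetical names, not real GO terms.
PARENTS = {
    "biological_process": [],
    "metabolism": ["biological_process"],
    "transport": ["biological_process"],
    "lipid_metabolism": ["metabolism"],
    "fatty_acid_oxidation": ["lipid_metabolism"],
    "lipid_transport": ["transport"],
    "lipid_export": ["lipid_transport", "lipid_metabolism"],  # two parents
}

# Hypothetical slim: a small set of high-level "buckets".
SLIM = {"metabolism", "transport"}

def ancestors(term, parents):
    """Return term plus all of its ancestors in the graph."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def map_to_slim(terms, parents, slim):
    """Map each term to the slim terms found among itself and its ancestors."""
    return {t: sorted(ancestors(t, parents) & slim) for t in terms}

# Terms you might have collected from alignments (hypothetical).
my_terms = ["fatty_acid_oxidation", "lipid_export", "transport"]
buckets = map_to_slim(my_terms, PARENTS, SLIM)
```

A term with parents in different branches, like `lipid_export` here, lands in more than one bucket, which is expected behaviour for a slim mapping.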

Let me know if you need more information.

14 days ago
Kevin Blighe ★ 90k

You need to integrate Gene Ontology (GO) terms derived from similarity searches like Blast and FoldSeek with those from InterProScan, while accounting for specificity and relevance in non-annotated sequences. InterProScan provides domain-based GO annotations, which are generally reliable but broad. Blast and FoldSeek yield GO terms via homology, which can be more specific or include false positives due to sequence or structural similarity thresholds. ELM adds motif-based terms. To validate and refine these, use the GO directed acyclic graph (DAG) for propagation to ancestor terms, achieving generalization without GO-Slim. Assign weights based on tool-specific metrics, such as e-values from Blast or bitscores from FoldSeek, to prioritize terms.

Use R for this, as its packages are current and handle GO ontologies effectively. Install GO.db (from Bioconductor) and ontologyIndex (from CRAN) for DAG navigation.

  • Obtain GO terms: Run Blast, FoldSeek, and InterProScan on your sequences; map hits to GO using UniProt or similar databases.
  • Collect and weight terms: For each sequence, compile unique GO terms with associated weights (e.g., inverse of Blast e-value for higher confidence in low e-values; normalize weights between 0 and 1).
  • Propagate to ancestors: Use GO.db to retrieve ancestors via GOBPANCESTOR (for biological process) or equivalents; specify depth (e.g., 2-3 levels up) to generalize without over-broadening.
  • Filter and integrate: Remove terms with weights below a threshold (e.g., 0.5); merge sets by retaining the highest-weighted term per lineage in the DAG using ontologyIndex::get_ancestors.
  • Validate: Cross-check against known homologs or use clusterProfiler::enrichGO for over-representation analysis on the refined set.
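
The weighting, propagation and filtering steps above can be sketched as follows. This is a minimal Python illustration on a toy graph rather than the R packages named above, and it rests on several assumptions: the term names, the `PARENTS` graph and the `HITS` list are hypothetical, the weight is a capped -log10(e-value) scaled to 0-1 (swapped in for the plain inverse, which is undefined at an e-value of 0), and propagation passes each weight to all ancestors instead of to a fixed depth.

```python
import math

# Toy DAG, child -> parents. Hypothetical term names, not real GO IDs.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "D": ["A"],
    "E": ["B"],
}

def ancestors(term):
    """Return term plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def evalue_to_weight(evalue, cap=50.0):
    """Scale -log10(e-value) into [0, 1]; e-values <= 10**-cap get weight 1."""
    if evalue <= 0:
        return 1.0  # some tools report extremely small e-values as 0
    score = -math.log10(evalue)
    return max(0.0, min(score, cap)) / cap

def collect_weights(hits):
    """Keep the best (highest) weight seen for each term across all tools."""
    weights = {}
    for _tool, term, evalue in hits:
        w = evalue_to_weight(evalue)
        weights[term] = max(weights.get(term, 0.0), w)
    return weights

def propagate(term_weights):
    """Give every ancestor the highest weight among its supporting terms."""
    out = {}
    for term, w in term_weights.items():
        for anc in ancestors(term):
            out[anc] = max(out.get(anc, 0.0), w)
    return out

# Hypothetical hits: (tool, GO term of the matched protein, e-value).
HITS = [("blast", "D", 1e-60), ("foldseek", "E", 1e-10), ("blast", "E", 1.0)]

propagated = propagate(collect_weights(HITS))
kept = {t for t, w in propagated.items() if w >= 0.5}  # threshold from the list above
```

With these toy hits, the strongly supported term `D` and its ancestors pass the 0.5 threshold, while `E` and `B` are dropped. The cap of 50 and the threshold are arbitrary choices that would need tuning on real data.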

This approach ensures balanced, valid functional insights.

Kevin
