You can explore the Cancer Hallmarks Analytics Tool (CHAT), which employs text mining to categorize scientific literature according to the Hanahan and Weinberg cancer hallmarks. Although designed for PubMed abstracts, you may input pathway descriptions from KEGG or Reactome to classify them into hallmarks. Access it at http://chat.lionproject.net/.
If you would rather use a pretrained language model, BioBERT can score the semantic similarity between pathway descriptions and hallmark definitions, and it needs only a few lines of code rather than any model training. Retrieve the pathway descriptions with Biopython's Bio.KEGG module or the Reactome API, then use the Hugging Face Transformers library to generate an embedding for each text and compare them by cosine similarity.
Here is a basic Python notebook outline you can adapt:
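# Install required packages (run once; the leading "!" is notebook syntax)
!pip install transformers biopython torch scipy

import torch
from transformers import AutoTokenizer, AutoModel
from scipy.spatial.distance import cosine
from Bio.KEGG import REST  # For the KEGG example

# Load BioBERT (DMIS Lab checkpoint on the Hugging Face Hub)
MODEL_NAME = "dmis-lab/biobert-v1.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()  # inference only

# Define cancer hallmarks (example; expand with full definitions)
hallmarks = {
    "Sustaining proliferative signaling": "Cancer cells acquire the capability to sustain proliferative signaling...",
    # Add other hallmarks here
}

def get_embedding(text):
    """Return a mean-pooled BioBERT embedding for one text as a 1-D array."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy()

def get_kegg_field(record, field):
    """Collect one field (plus its continuation lines) from a KEGG flat-file record."""
    lines, active = [], False
    for line in record.splitlines():
        key = line[:12].strip()  # KEGG keeps field names in the first 12 columns
        if key:
            active = key == field
        if active:
            lines.append(line[12:].strip())
    return " ".join(lines)

# Example: fetch a KEGG pathway description
pathway_id = "hsa05200"  # Pathways in cancer
record = REST.kegg_get(pathway_id).read()
# Not every entry has a DESCRIPTION field, so fall back to NAME
pathway_desc = get_kegg_field(record, "DESCRIPTION") or get_kegg_field(record, "NAME")

# Compute cosine similarity between the pathway and each hallmark
pathway_emb = get_embedding(pathway_desc)
similarities = {}
for name, desc in hallmarks.items():
    hall_emb = get_embedding(desc)
    similarities[name] = 1 - cosine(pathway_emb, hall_emb)  # scipy returns a distance
print(similarities)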
The script should need only minor changes: fill in the remaining hallmark definitions and swap in your own pathway IDs. Google Colab is a convenient place to run it. For Reactome, use the Reactome Content Service API instead of Bio.KEGG.
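For the Reactome route, here is a minimal sketch of how the pathway text could be pulled from the Content Service. It assumes the data/query endpoint returns JSON with a "summation" list whose entries carry a "text" field, which is how pathway records looked when I last checked. The helper names (fetch_reactome_record, extract_summary) and the example ID are my own choices for illustration, so verify them against the current API documentation before relying on this.

```python
import json
import re
import urllib.request

# Assumed endpoint of the Reactome Content Service (check the current API docs)
REACTOME_QUERY_URL = "https://reactome.org/ContentService/data/query/{}"

def fetch_reactome_record(stable_id):
    """Download the JSON record for a Reactome stable ID (hypothetical helper)."""
    request = urllib.request.Request(
        REACTOME_QUERY_URL.format(stable_id),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))

def extract_summary(record):
    """Join the summation texts of a record into one plain-text description."""
    parts = [
        entry.get("text", "")
        for entry in record.get("summation", [])
        if isinstance(entry, dict)
    ]
    text = " ".join(parts)
    text = re.sub(r"<[^>]+>", " ", text)  # summaries may contain HTML tags
    return " ".join(text.split())         # collapse runs of whitespace

# Usage (needs network access; the ID is an example, assumed to be Apoptosis):
# record = fetch_reactome_record("R-HSA-109581")
# pathway_desc = extract_summary(record)
```

The resulting pathway_desc string can be passed to get_embedding exactly like the KEGG description.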
Kevin