To identify proteins in a new organism's genome that are structurally similar to variants of a specific gene, consider the following workflow. It goes beyond sequence-based tools like BLAST by adding structure prediction and structural comparison.
First, annotate the genome to identify open reading frames and predict protein sequences. Use Prokka for prokaryotes or AUGUSTUS for eukaryotes. Prokka writes the translated proteins as a .faa FASTA file; AUGUSTUS predictions can be converted to protein FASTA with its bundled getAnnoFasta.pl script.
Next, predict 3D structures for the predicted proteins and your gene variants. ColabFold, which runs AlphaFold2 with fast MMseqs2-based alignment generation, is effective for this and can be used through its notebook or a local installation. Install it with:
mamba create -n colabfold -c conda-forge -c bioconda colabfold
mamba activate colabfold
colabfold_batch input.fasta output_dir
colabfold_batch processes every sequence in input.fasta and writes ranked PDB models to output_dir.
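Before searching, it can help to check how confident each predicted model is. The sketch below assumes the models are AlphaFold-style PDB files, where the B-factor field (fixed columns 61-66) stores the per-residue pLDDT. The function name mean_plddt is an illustrative choice, not part of ColabFold.

```shell
# Mean pLDDT over C-alpha atoms of one AlphaFold-style PDB file.
# Assumption: pLDDT is stored in the B-factor field (fixed columns 61-66).
# Usage (hypothetical file name): mean_plddt output_dir/model.pdb
mean_plddt() {
  awk '
    substr($0, 1, 4) == "ATOM" {
      name = substr($0, 13, 4)      # atom name, fixed columns 13-16
      gsub(/ /, "", name)
      if (name == "CA") { sum += substr($0, 61, 6); n++ }
    }
    END { if (n > 0) printf "%.2f\n", sum / n; else print "NA" }
  ' "$1"
}
```

Models averaging below about 70 are often treated as low confidence, so structural hits involving them deserve a closer look.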
Then, perform structural similarity searches using Foldseek, which functions like a structural BLAST. It encodes structures as 3Di sequences for rapid comparisons. Download and install Foldseek from its GitHub repository:
git clone https://github.com/steineggerlab/foldseek.git
cd foldseek
mkdir build && cd build
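cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=. ..
make -j 4
make install
export PATH="$(pwd)/bin:$PATH"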
Create a database from your variant structures:
foldseek createdb variants_pdbs/ variants_db
foldseek createindex variants_db tmp
Search the new organism's predicted structures:
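foldseek easy-search new_proteins_pdbs/ variants_db results.tsv tmp --exhaustive-search 1 --format-output "query,target,fident,evalue,bits,alntmscore"
This writes one tab-separated line per alignment. The default columns do not include a TM-score, so alntmscore is requested explicitly via --format-output; values above roughly 0.5 generally indicate the same fold. --exhaustive-search 1 skips the prefilter, which is affordable here because the variant database is small. For even faster searches at scale, consider SSAlign (released in 2025), which supports large datasets and integrates with AlphaFold outputs.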
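To shortlist hits, the result table can be filtered and ranked by TM-score. This is a minimal sketch, assuming the search was run with --format-output "query,target,fident,evalue,bits,alntmscore" so that the TM-score sits in column 6. The function name filter_hits and the default cut-off of 0.5 are illustrative choices.

```shell
# Keep Foldseek hits with TM-score >= a threshold, best hits first.
# Assumed columns: query,target,fident,evalue,bits,alntmscore (TM-score = column 6).
# Usage: filter_hits results.tsv 0.5 > shortlisted.tsv
filter_hits() {
  # $1 = Foldseek TSV, $2 = minimum TM-score (defaults to 0.5)
  awk -F '\t' -v min="${2:-0.5}" '($6 + 0) >= (min + 0)' "$1" |
    sort -t "$(printf '\t')" -k6,6gr
}
```

If the columns were requested in a different order, the two occurrences of field 6 need to be changed to match.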
If computational resources are limited, pre-filter candidates with HH-suite for sensitive sequence searches before structural validation, as it captures distant homologs.
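One way to connect the pre-filter to the structure step is to subset the proteome FASTA to the candidate IDs, so only those sequences are folded. This is a minimal sketch, assuming the pre-filter produced a plain list of protein IDs, one per line. The file names candidates.txt and proteins.faa and the function name subset_fasta are hypothetical.

```shell
# Print only the FASTA records whose ID (first word after ">") appears
# in an ID file with one ID per line.
# Usage: subset_fasta candidates.txt proteins.faa > input.fasta
subset_fasta() {
  # $1 = ID list, $2 = protein FASTA
  awk -v ids="$1" '
    BEGIN { while ((getline line < ids) > 0) keep[line] = 1 }
    /^>/  { show = (substr($1, 2) in keep) }
    show
  ' "$2"
}
```

The resulting input.fasta can then be passed to colabfold_batch in place of the full proteome.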
Because structure tends to be conserved longer than sequence, this workflow can recover homologs that a sequence search alone would miss.
Kevin
Foldseek perhaps? They have a server for it too: https://search.foldseek.com/. There are also tools that sort of do this using sequence information alone.
I would consider looking at HMM-based methods for starters (hhsuite is really good). These intrinsically capture domain structure similarity in very distantly related domains/proteins (down to even 20-30% sequence similarity). This doesn't always capture the 'edge case' of convergent evolution of structures, but in my experience it works pretty well nevertheless.
If memory serves there are some TM-align/RMSD-based search methods but I'm blanking on what they are at the moment.
Worth bearing in mind that (AlphaFold predictions aside) there are a lot more proteins with good sequence data than there are resolved protein structures.
Along the same lines as working with HMMs, you could also consider using PSI-BLAST (though it's just as non-modern as HMM searches :) ).