Best way to do protein structure guided similarity search?
9 weeks ago
Mark ▴ 60

I don't have as much experience with protein-level analysis as I do with DNA analysis, so I'm not really sure what the best current tools or workflows are, but the problem is fairly straightforward.

I'm looking for a workflow that basically acts like BLAST, but instead of looking for sequence similarity, it checks for structural similarity between gene products.

Basically, I have many variants of a certain gene and I want to check whether this new organism's genome potentially encodes a protein that is structurally similar to any of these variants. I could use BLAST for this as well, but with modern tools I'm sure it's possible to do this at the structural level too, and I want to be thorough.

Does anyone have any advice?

alphafold BLAST structure proteins • 7.7k views

Foldseek perhaps? They have a server for it too: https://search.foldseek.com/ .


There are tools that sort of do this with sequence information still.

I would consider looking at HMM-based methods for starters (hhsuite is really good). These intrinsically capture domain structure similarity in very distantly related domains/proteins (down to even 20-30% sequence similarity).

This doesn't always capture the 'edge case' of convergent evolution of structures, but in my experience it works pretty well nevertheless.

If memory serves there are some TM-align/RMSD-based search methods but I'm blanking on what they are at the moment.

Worth bearing in mind that (save for AlphaFold predictions) there are a lot more proteins with good sequence data than there are with resolved structures.


Along the lines of working with HMMs: you could also consider using PSI-BLAST (though it is just as non-modern as HMM searches :) ).

4 days ago
Kevin Blighe ★ 90k

For your query on identifying structurally similar proteins in a new organism's genome to variants of a specific gene, consider the following workflow. This approach extends beyond sequence-based tools like BLAST by incorporating structural predictions and comparisons.

First, annotate the genome to identify open reading frames and predict protein sequences. Use Prokka for prokaryotes or AUGUSTUS for eukaryotes, as these tools generate reliable FASTA files of translated proteins.
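Before moving on to structure prediction, it can help to drop very short or very long predicted proteins, since they tend to be either uninformative or expensive to fold. The following is only a minimal sketch: the function name filter_fasta_by_length and the length bounds in the usage line are my own illustrative choices, not part of Prokka or AUGUSTUS.

```shell
# Keep only FASTA records whose sequence length lies in [min_len, max_len].
# Sequences are written on a single line; whitespace and '*' stop symbols
# are stripped before the length is measured.
# Usage (bounds are hypothetical): filter_fasta_by_length proteins.faa 50 1500 > input.fasta
filter_fasta_by_length() {
    awk -v min="$2" -v max="$3" '
        function emit(   n) {
            n = length(seq)
            if (header != "" && n >= min && n <= max) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { gsub(/[ \t\r*]/, ""); seq = seq $0 }         # accumulate residues
        END { emit() }                                 # flush the last record
    ' "$1"
}
```

The filtered file can then be passed on to the structure prediction step in place of the full proteome.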

Next, predict 3D structures for the predicted proteins and your gene variants. ColabFold, which runs AlphaFold2 with fast MMseqs2-based alignments and is available as a notebook or a local installation, is effective for this. Install and run it with:

mamba create -n colabfold -c conda-forge -c bioconda colabfold
mamba activate colabfold
colabfold_batch input.fasta output_dir

This handles batches efficiently.
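ColabFold writes several ranked models per protein, so you will probably want just the top-ranked one per protein in a single directory before searching. A rough sketch, assuming the usual "<name>_unrelaxed_rank_001_*.pdb" output naming (check your ColabFold version) and relying on the fact that AlphaFold-style PDB files store per-residue pLDDT in the B-factor column. The function names collect_top_models and mean_plddt are hypothetical helpers of mine:

```shell
# Copy the top-ranked unrelaxed model of each protein to <dest>/<name>.pdb.
# Assumption: files are named "<name>_unrelaxed_rank_001_<...>.pdb".
collect_top_models() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    for pdb in "$src"/*_unrelaxed_rank_001_*.pdb; do
        [ -e "$pdb" ] || continue                  # glob matched nothing
        base=$(basename "$pdb")
        name=${base%%_unrelaxed_rank_001_*}        # strip the ColabFold suffix
        cp "$pdb" "$dest/$name.pdb"
    done
}

# Print the mean pLDDT of a predicted model, averaged over C-alpha atoms.
# PDB fixed columns: atom name in 13-16, B-factor (pLDDT here) in 61-66.
mean_plddt() {
    awk 'substr($0, 1, 4) == "ATOM" && substr($0, 13, 4) == " CA " {
             sum += substr($0, 61, 6); n++
         }
         END { if (n) printf "%.1f\n", sum / n }' "$1"
}
```

For example, collect_top_models output_dir new_proteins_pdbs gathers the models, and mean_plddt on each file gives a quick way to flag low-confidence predictions, whose structural hits deserve more scepticism.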

Then, perform structural similarity searches using Foldseek, which functions like a structural BLAST. It encodes structures as 3Di sequences for rapid comparisons. Download and install Foldseek from its GitHub repository:

git clone https://github.com/steineggerlab/foldseek.git
cd foldseek
mkdir build && cd build
cmake ..
make

Create a database from your variant structures:

foldseek createdb variants_pdbs/ variants_db
foldseek createindex variants_db tmp

Search the new organism's predicted structures:

foldseek easy-search new_proteins_pdbs/ variants_db results.tsv tmp --exhaustive-search 1

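By default this writes a BLAST-style tab-separated table (query, target, identity, alignment coordinates, E-value, bit score). To report TM-scores for similarity assessment as well, add fields such as alntmscore via --format-output. For even faster searches at scale, consider SSAlign (released in 2025), which supports large datasets and integrates with AlphaFold outputs.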
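To boil the results down to one candidate per query protein, something like the following may be useful. It is only a sketch and assumes the default tabular column order (1 = query, 2 = target, 11 = E-value, 12 = bit score); if you change --format-output, adjust the column numbers. The function name best_hits and the cutoff in the usage line are my own illustrative choices.

```shell
# Print the highest-scoring hit per query, keeping only hits at or below
# an E-value cutoff. Output columns: query, target, evalue, bits.
# Usage (cutoff is hypothetical): best_hits results.tsv 1e-3
best_hits() {
    awk -F '\t' -v max_e="$2" '
        # "+ 0" forces numeric comparison, including values like 1e-20
        $11 + 0 <= max_e + 0 && (!($1 in best) || $12 + 0 > best[$1]) {
            best[$1] = $12 + 0
            line[$1] = $1 "\t" $2 "\t" $11 "\t" $12
        }
        END { for (q in line) print line[q] }
    ' "$1" | sort
}
```

Queries with no hit under the cutoff simply do not appear in the output, so the list doubles as a shortlist of proteins worth inspecting by eye.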

If computational resources are limited, pre-filter candidates with HH-suite for sensitive sequence searches before structural validation, as it captures distant homologs.

This workflow ensures thoroughness by prioritizing structure over sequence alone.

Kevin

4 days ago
yl759 ▴ 120

Besides predicting all of their structures and then searching with Foldseek, you could also use TM-Vec to predict structural similarity directly from sequence.
