Question: Identifying closest homologue of a protein sequence
5.7 years ago by
United Kingdom
nathanielsaxe10 wrote:



I have a list of proteins from a new genome project, so it's pretty much unannotated. However, the species is closely related to C. elegans, so I was thinking of trying to identify the closest C. elegans homologues.

What I've been doing so far is running a protein BLAST at NCBI with each protein sequence and then taking the top C. elegans hit. However, there are far too many sequences to do this one at a time, so I was wondering if there's a faster, automated way to do it, or a program that does it for me.



blast • 1.9k views
modified 5.7 years ago by Prakki Rama2.4k • written 5.7 years ago by nathanielsaxe10
5.7 years ago by
United Kingdom
alexjironkin10 wrote:

Try using HMMER.

The manual is available here:


IN BRIEF: for each C. elegans protein you build an HMM with the hmmbuild command. Concatenate all the HMMs into a single file to make a database, then run hmmpress on it to create the additional index files needed to search it. Now you can use either hmmscan, to scan your sequences against the database you have just created, or hmmsearch, to search an individual model against the sequences you have. (phmmer, by contrast, searches a protein sequence directly against a sequence database without building a model first.) The documentation describes the commands very well, but note the subtle difference between searching a model against a set of sequences and scanning a set of sequences against a database of models.
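The steps above could be sketched roughly as follows. This is only an illustration under assumptions of mine: the function names, the directory layout and the .sto (Stockholm) file extension are hypothetical, and hmmbuild is given one alignment file per model as its input. Check the options against the HMMER manual for your version.

```shell
# Sketch of the hmmbuild -> hmmpress -> hmmscan / hmmsearch workflow.
# ASSUMPTION (hypothetical layout): ALIGNMENT_DIR holds one alignment file
# per C. elegans protein or family, named *.sto.

# build_hmm_db ALIGNMENT_DIR DB_FILE
build_hmm_db() {
    : > "$2"                                   # start from an empty database file
    for aln in "$1"/*.sto; do
        [ -e "$aln" ] || continue              # nothing matched the glob
        hmmbuild "${aln%.sto}.hmm" "$aln" > /dev/null   # one profile per alignment
        cat "${aln%.sto}.hmm" >> "$2"          # concatenate profiles into the database
    done
    hmmpress -f "$2"                           # write the index files hmmscan needs
}

# scan_proteins DB_FILE QUERY_FASTA TABLE_OUT
# Many query sequences against a database of models.
scan_proteins() {
    hmmscan --tblout "$3" "$1" "$2" > /dev/null
}

# search_model MODEL_HMM SEQUENCE_FASTA TABLE_OUT
# One model against a set of sequences.
search_model() {
    hmmsearch --tblout "$3" "$1" "$2" > /dev/null
}
```

For example, build_hmm_db alignments/ celegans.hmmdb followed by scan_proteins celegans.hmmdb yourSeqs.fasta hits.tbl would leave a per-hit table in hits.tbl (all file names here are placeholders).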


If you have access to a parallel environment such as MPI (OpenMPI can usually be installed even on local machines to take full advantage of multiple cores), you can build HMMER with MPI support to increase throughput.


A rough idea of timings in our use: building and pressing a database of ~10k models takes about 10 minutes; scanning a coding sequence against a database of ~10k models takes 2-3 seconds. This is only a very rough guide from our own use and will undoubtedly differ for your use case.

5.7 years ago by
5heikki8.7k wrote:

Standalone blast


In brief:

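blastp -query yourSeqs.fasta -subject CelegansSeqs.fasta -seg yes -soft_masking true -use_sw_tback -out seqs-vs-Celegangs.tsv -outfmt 6

Alternatively, make a database from the C. elegans sequences (makeblastdb) and pass it with -db instead of -subject. You can then add -num_threads X, where X is the number of threads your CPU supports; multithreading only applies when searching a database.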


Output only best hits:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

sort -k1,1 -k12,12gr -k11,11g -k3,3gr seqs-vs-Celegangs.tsv | sort -u -k1,1 --merge > bestHits


There's a manual in the link too. The blastp flags are meant for best homolog detection. They come from a publication, although I can't remember which one.
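As an alternative to the two-step sort above, here is a minimal sketch that picks one row per query in a single awk pass, with the tie-breaking spelled out: highest bit score first, then lowest e-value, then highest percent identity. It assumes the default -outfmt 6 column order (query in column 1, identity in 3, e-value in 11, bit score in 12); best_hits is a name I made up for illustration.

```shell
# best_hits BLAST_TSV
# Print the single best row per query from default -outfmt 6 output.
best_hits() {
    awk -F '\t' '
        {
            q = $1; bits = $12 + 0; ev = $11 + 0; id = $3 + 0
            # A row wins if the query is new, or it beats the stored row on
            # bit score, then e-value, then percent identity.
            better = !(q in row) ||
                     bits > b[q] ||
                     (bits == b[q] && ev < e[q]) ||
                     (bits == b[q] && ev == e[q] && id > i[q])
            if (better) { row[q] = $0; b[q] = bits; e[q] = ev; i[q] = id }
        }
        END { for (q in row) print row[q] }
    ' "$1" | sort -k1,1
}
```

Usage would be along the lines of best_hits seqs-vs-Celegangs.tsv > bestHits. Because awk converts the e-value column numerically, values such as 1e-50 compare correctly without depending on the sort locale.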

modified 5.7 years ago • written 5.7 years ago by 5heikki8.7k
5.7 years ago by
David Fredman1.0k
University of Bergen, Norway
David Fredman1.0k wrote:

I would suggest calling orthologs and paralogs between your species and C. elegans using the offline version of Inparanoid (by the Sonnhammer lab), which essentially performs a bidirectional BLAST and calls orthologs with sensible cutoffs. It's very easy to run, and you can obtain it (by request) here:

Other alternatives include Proteinortho, OrthoMCL, or mapping your proteins to the pre-calculated orthologous groups in eggNOG.
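To illustrate only the bidirectional BLAST idea mentioned above (Inparanoid and the other tools do considerably more, such as score cutoffs and in-paralog clustering), a reciprocal-best-hit filter could look roughly like this. It assumes two default -outfmt 6 tables, one from searching your proteins against C. elegans and one from the reverse search; the function name and file names are hypothetical, and "best" here simply means highest bit score.

```shell
# reciprocal_best_hits FORWARD_TSV REVERSE_TSV
# Print "query<TAB>subject" pairs that are each other's top-scoring hit.
reciprocal_best_hits() {
    awk -F '\t' '
        FNR == 1 { nfile++ }                   # count which input file we are in
        nfile == 1 && (!($1 in fbits) || $12 + 0 > fbits[$1]) {
            fbits[$1] = $12 + 0; fwd[$1] = $2  # best forward hit per query
        }
        nfile == 2 && (!($1 in rbits) || $12 + 0 > rbits[$1]) {
            rbits[$1] = $12 + 0; rev[$1] = $2  # best reverse hit per subject
        }
        END {
            for (q in fwd) {
                s = fwd[q]
                if ((s in rev) && rev[s] == q) print q "\t" s
            }
        }
    ' "$1" "$2" | sort
}
```

For instance, reciprocal_best_hits mine-vs-celegans.tsv celegans-vs-mine.tsv would list the candidate one-to-one pairs, which you could then compare with what a dedicated orthology tool reports.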

5.7 years ago by
Prakki Rama2.4k
Prakki Rama2.4k wrote:

You can also take a look at reciprocal smallest distance.

modified 5.7 years ago • written 5.7 years ago by Prakki Rama2.4k