Identifying closest homologue of a protein sequence
7.0 years ago

Hi,

I have a list of proteins from a new genome project, so it's pretty much unannotated. However, the species is closely related to C. elegans, so I was thinking of trying to identify the closest C. elegans homologues.

What I've been doing so far is running a protein BLAST at NCBI with each protein sequence and taking the top C. elegans hit. However, there are far too many sequences to do this one at a time, so I was wondering if there's a faster, automated way to do it, or a program that does it for me.

Thanks!

blast • 2.3k views
7.0 years ago
alexjironkin ▴ 10

Try using HMMER (http://hmmer.janelia.org/).

The manual is available here: ftp://selab.janelia.org/pub/software/hmmer3/3.1b1/Userguide.pdf

In brief: for each protein sequence in C. elegans, build an HMM using the hmmbuild command. Concatenate all the HMMs into a single file to make a database file, then run hmmpress on it to create the additional index files needed to search the database. Now you can use either hmmscan if you want to scan your sequences against the database you have just created, or hmmsearch to scan individual models against the sequences you have. (phmmer is the single-sequence counterpart: it searches protein sequences directly against a sequence database, with no model-building step.) The documentation describes very well which commands you need, but note the subtle difference between scanning a model against a set of sequences and scanning a set of sequences against a database of models.
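The build, concatenate, press and scan steps could be sketched roughly as below. This is only an illustration, not a tested pipeline: the function names, the seq_N.fa naming scheme and the work directory layout are made up for the example, and it assumes your HMMER 3 version accepts a single-sequence FASTA file as hmmbuild input. hmmscan is the HMMER program that searches sequences against a pressed HMM database.

```shell
# Sketch only: assumes HMMER 3 (hmmbuild, hmmpress, hmmscan) is on PATH.
# All function, file and directory names here are placeholders.

# Split a multi-FASTA file into one file per sequence (seq_1.fa, seq_2.fa, ...).
split_fasta() {
    local fasta=$1 outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        /^>/ { if (out) close(out); n++; out = dir "/seq_" n ".fa" }
        out  { print > out }
    ' "$fasta"
}

# Print the sequence ID (first word of the header) of a single-sequence FASTA file.
fasta_id() {
    head -n 1 "$1" | sed 's/^>//; s/[[:space:]].*//'
}

# Build one HMM per target sequence, concatenate and press them,
# then scan the query proteins against the resulting database.
build_and_scan() {
    local targets=$1 queries=$2 work=$3 fa
    split_fasta "$targets" "$work/split"
    for fa in "$work"/split/*.fa; do
        # -n names each model after its source sequence ID
        hmmbuild -n "$(fasta_id "$fa")" "${fa%.fa}.hmm" "$fa" > /dev/null
    done
    cat "$work"/split/*.hmm > "$work/targets.hmm"
    hmmpress "$work/targets.hmm"
    hmmscan --tblout "$work/hits.tbl" "$work/targets.hmm" "$queries" > /dev/null
}
```

Something like `build_and_scan CelegansSeqs.fasta yourSeqs.fasta work` would then leave a per-hit table in work/hits.tbl (file names are placeholders).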

If you have access to a parallel environment such as MPI (OpenMPI can usually be installed even on local machines to take full advantage of multiple cores), you can build HMMER with MPI support to increase throughput.

A rough idea of timings from our use: building and pressing a database of ~10k models takes about 10 minutes; scanning a coding sequence against a database of ~10k models takes 2-3 seconds. This is only a very rough guide based on how we have used it, and it will undoubtedly differ for your use case.

7.0 years ago
5heikki 9.8k

Standalone blast

In brief:

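blastp -query yourSeqs.fasta -subject CelegansSeqs.fasta -seg yes -soft_masking true -use_sw_tback -out seqs-vs-Celegans.tsv -outfmt 6

Alternatively, make a database from the C. elegans sequences so you can multithread: replace -subject with -db and add -num_threads X, where X is the number of threads your CPU supports.

Output only best hits:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

sort -k1,1 -k12,12gr -k11,11g -k3,3gr seqs-vs-Celegans.tsv | sort -u -k1,1 --merge > bestHits

The first sort orders the hits by query, then by descending bit score, ascending e-value and descending percent identity; the second keeps only the first line for each query.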

There's a manual in the link too. The blastp flags are chosen for best homolog detection. They are from a publication, although I can't remember which one.
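For what it's worth, the database route and the best-hit filter could be wrapped up along these lines. This is a rough sketch rather than a drop-in script: the function names, the "_db" suffix and the default of 4 threads are made up for the example, and the awk filter is a swapped-in alternative to the sort pipeline above (one pass, highest bit score per query, ties broken by the lower e-value), not the same command.

```shell
# Sketch only: assumes BLAST+ (makeblastdb, blastp) is on PATH.
# Function names, file names and the thread count are placeholders.

# Build a protein database from the target proteome and run a multithreaded search.
run_blastp() {
    local queries=$1 targets=$2 out=$3 threads=${4:-4}
    makeblastdb -in "$targets" -dbtype prot -out "${targets%.*}_db"
    blastp -query "$queries" -db "${targets%.*}_db" \
        -seg yes -soft_masking true -use_sw_tback \
        -num_threads "$threads" -outfmt 6 -out "$out"
}

# Keep, for each query (column 1), the hit with the highest bit score (column 12);
# ties are broken by the lower e-value (column 11). Input: BLAST -outfmt 6.
# Queries are printed in the order they first appear in the input.
best_hits() {
    awk -F'\t' '
        !($1 in score) || $12 + 0 > score[$1] ||
        ($12 + 0 == score[$1] && $11 + 0 < evalue[$1]) {
            if (!($1 in score)) order[++n] = $1
            score[$1] = $12 + 0; evalue[$1] = $11 + 0; line[$1] = $0
        }
        END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$1"
}
```

Usage would be something like `run_blastp yourSeqs.fasta CelegansSeqs.fasta hits.tsv 8` followed by `best_hits hits.tsv > bestHits`, where hits.tsv is a placeholder name.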

7.0 years ago
David Fredman ★ 1.1k

I would suggest calling orthologs and paralogs between your species and C. elegans using the offline version of Inparanoid (by the Sonnhammer lab), which essentially performs a bidirectional BLAST and calls orthologs with sensible cutoffs. It's very easy to run, and you can obtain it (by request) here:

http://inparanoid.sbc.su.se/cgi-bin/index.cgi

Other alternatives would include

Proteinortho (https://www.bioinf.uni-leipzig.de/Software/proteinortho/),

OrthoMCL (http://orthomcl.org/orthomcl/),

or mapping your proteins to the pre-calculated orthologous groups in eggNOG (http://eggnog.embl.de/version_4.0.beta/index.html).
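To give a feel for what the bidirectional BLAST idea boils down to, here is a minimal reciprocal-best-hit filter. It is a simplified sketch, not Inparanoid's actual algorithm (which also applies score cutoffs and clusters in-paralogs). The function name is made up, and it assumes you already have two best-hit tables in BLAST tabular format with one row per query: one from searching your proteins against C. elegans, and one from the reverse search.

```shell
# Reciprocal best hits: print pairs (a, b) where b is the best hit of a in the
# forward search and a is the best hit of b in the reverse search.
# $1 = forward best-hit table, $2 = reverse best-hit table
# (BLAST -outfmt 6, one row per query; column 1 = query, column 2 = subject).
reciprocal_best_hits() {
    awk -F'\t' '
        NR == FNR { back[$1] = $2; next }   # first file read is the reverse table
        ($2 in back) && back[$2] == $1 { print $1 "\t" $2 }
    ' "$2" "$1"
}
```

Calling `reciprocal_best_hits forward.tsv reverse.tsv` (placeholder file names) prints one tab-separated pair per line.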

7.0 years ago
Prakki Rama ★ 2.5k

You can also take a look at the reciprocal smallest distance (RSD) algorithm.