Question

orthologous proteins clustering

0

8.7 years ago

biolab ★ 1.4k

Dear all,

I used the reciprocal best BLAST hit (RBH) method to find orthologous proteins across many species. My problem is that some individual clusters contain too many proteins.

Here is an example. Each pair below is a reciprocal BLAST top hit (taking the first line as an example: the best hit of species1 protein A is species2 protein B, and the best hit of species2 protein B is species1 protein A). These pairs yield a single cluster containing A, B, C, D, E, F, and G, because all of these proteins are connected in some way. As the number of genes and species increases, some clusters become huge (thousands of proteins in a single cluster). How can I filter the results so that these huge clusters are split into smaller ones?

THANK YOU!

species1 protein A <--> species2 protein B
species1 protein A <--> species3 protein C
species1 protein D <--> species4 protein E
species2 protein B <--> species3 protein F
species2 protein G <--> species4 protein E
species3 protein F <--> species4 protein E
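
To make the chaining effect concrete, here is a minimal sketch of how these six pairs collapse into one cluster when clusters are defined as connected components of the RBH graph. It assumes that is how the clusters were built (the post does not say which tool was used); the function name `cluster_by_connectivity` and the `(species, protein)` tuple layout are illustrative only.

```python
from collections import defaultdict

# The six reciprocal best hit pairs listed above, as (species, protein) tuples.
rbh_pairs = [
    (("species1", "A"), ("species2", "B")),
    (("species1", "A"), ("species3", "C")),
    (("species1", "D"), ("species4", "E")),
    (("species2", "B"), ("species3", "F")),
    (("species2", "G"), ("species4", "E")),
    (("species3", "F"), ("species4", "E")),
]


def cluster_by_connectivity(pairs):
    """Group proteins into connected components using union-find.

    Hypothetical helper: any two proteins linked by a chain of RBH
    pairs end up in the same cluster (single-linkage behaviour).
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


clusters = cluster_by_connectivity(rbh_pairs)
for cluster in clusters:
    print(sorted(protein for _, protein in cluster))
```

On this input the sketch yields a single cluster with all seven proteins, even though, for example, A and D were never reported as each other's best hit; they are joined only through intermediate pairs.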

Reciprocal-Best-Hits blast • 2.4k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.7 years ago by biolab ★ 1.4k

1

If you believe that reciprocal best hits give you valid orthologs, then I see no reason for splitting the groups: the genes in each group are orthologs. If you suspect the method is inaccurate for some reason, then you need to build a proper phylogenetic tree. To deal with the amount of data, you could try building a tree for each cluster. You might also want to use protein-guided nucleic acid sequence alignments for this.
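
As a rough way to decide which clusters deserve a tree first, one could flag clusters in which a species contributes more than one protein, since those are the ones where paralogs may have been chained together. This is only a heuristic sketch, not part of the RBH method itself; the function name `species_with_multiple_members` and the `(species, protein)` tuple layout are assumptions made for illustration.

```python
from collections import Counter


def species_with_multiple_members(cluster):
    """Return the species that contribute more than one protein to a cluster.

    Hypothetical helper: `cluster` is an iterable of (species, protein) tuples.
    """
    counts = Counter(species for species, _ in cluster)
    return sorted(species for species, n in counts.items() if n > 1)


# The seven-protein cluster from the question.
example_cluster = {
    ("species1", "A"), ("species1", "D"),
    ("species2", "B"), ("species2", "G"),
    ("species3", "C"), ("species3", "F"),
    ("species4", "E"),
}

duplicated = species_with_multiple_members(example_cluster)
print(duplicated)
```

For the example cluster, species1, species2 and species3 each appear twice, so this cluster would be a candidate for a per-cluster phylogenetic tree.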

ADD REPLY • link 8.7 years ago by Jean-Karim Heriche 27k

0

Hi Jean-Karim Heriche, thank you for your comments.

ADD REPLY • link 8.7 years ago by biolab ★ 1.4k

0

What kind of data do you have? Transcriptome assemblies? Predicted genes from your draft genome assemblies? Genes downloaded from NCBI?

ADD REPLY • link 8.7 years ago by h.mon 35k

0

Hi h.mon, they are CDS sequences downloaded from Ensembl.

ADD REPLY • link 8.7 years ago by biolab ★ 1.4k

0

In that case, why don't you use the orthology inference from Ensembl Compara?

ADD REPLY • link 8.7 years ago by Jean-Karim Heriche 27k

0

Thanks for your comments. I actually tried BioMart, but because of the large number of species, the orthologous pair dataset is huge and I could not download it. That is why I turned to the RBH approach.

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.7 years ago by biolab ★ 1.4k

1

Use the API then. As an alternative, you can probably also find the same species in TreeFam.