Question: orthologous proteins clustering
5.3 years ago, biolab wrote:

Dear all,

I used the reciprocal best BLAST hit (RBH) method to find orthologous proteins across many species. My problem is that some individual clusters contain too many proteins.

I give an example below. Each pair is a reciprocal best BLAST hit. Taking the first line as an example: the best hit of species1 protein A is species2 protein B, and the best hit of species2 protein B is species1 protein A. These pairs form a single cluster containing A, B, C, D, E, F and G, because all of these proteins are connected through some chain of pairs. As the number of genes and species increases, some clusters become huge (thousands of proteins in one cluster). How can I filter the result so that these huge clusters are split into smaller ones?

THANK YOU!

species1 protein A <--> species2 protein B
species1 protein A <--> species3 protein C
species1 protein D <--> species4 protein E
species2 protein B <--> species3 protein F
species2 protein G <--> species4 protein E
species3 protein F <--> species4 protein E
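
To make the clustering step concrete, here is a minimal sketch (not the script actually used) of how linking RBH pairs into connected components produces the cluster above. It assumes the pairs are held as (species, protein) tuples; the names `RBH_PAIRS`, `cluster_pairs` and `species_with_multiple_members` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Reciprocal best hit pairs copied from the example above.
# Each protein is represented as a (species, name) tuple (an assumed layout).
RBH_PAIRS = [
    (("species1", "A"), ("species2", "B")),
    (("species1", "A"), ("species3", "C")),
    (("species1", "D"), ("species4", "E")),
    (("species2", "B"), ("species3", "F")),
    (("species2", "G"), ("species4", "E")),
    (("species3", "F"), ("species4", "E")),
]


def cluster_pairs(pairs):
    """Group proteins into connected components (single linkage) via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    clusters = defaultdict(set)
    for protein in list(parent):
        clusters[find(protein)].add(protein)
    return list(clusters.values())


def species_with_multiple_members(cluster):
    """Return {species: sorted protein names} for species with more than one protein."""
    by_species = defaultdict(list)
    for species, name in cluster:
        by_species[species].append(name)
    return {s: sorted(names) for s, names in by_species.items() if len(names) > 1}


clusters = cluster_pairs(RBH_PAIRS)
for cluster in clusters:
    print(sorted(name for _, name in cluster))
    print(species_with_multiple_members(cluster))
```

With the six example pairs this gives one cluster holding all seven proteins, in which species1, species2 and species3 each contribute two proteins (A and D, B and G, C and F). A single pair is enough to merge two groups, which is why clusters built this way can grow very large as more species are added.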

 

Tags: reciprocal best hits, blast

If you believe that reciprocal best hits give you valid orthologs, then I see no reason to split the groups: the genes in each group are orthologs. If you suspect the method is inaccurate for some reason, then you need to build a proper phylogenetic tree. To deal with the amount of data, you could try building one tree per cluster. You might also want to use protein-guided nucleic acid sequence alignments for this.

modified 5.3 years ago • written 5.3 years ago by Jean-Karim Heriche

Hi, Jean-Karim Heriche, Thank you for your comments.

written 5.3 years ago by biolab

What kind of data do you have? Transcriptome assemblies? Predicted genes from your draft genome assemblies? Genes downloaded from NCBI?

written 5.3 years ago by h.mon

Hi h.mon, they are CDS sequences downloaded from Ensembl.

written 5.3 years ago by biolab

In that case, why don't you use the orthology inference from Ensembl Compara?

written 5.3 years ago by Jean-Karim Heriche

Thanks for your comments. I did try BioMart, but because of the large number of species, the orthologous pair dataset is huge and I cannot download it. That is why I turned to the RBH approach.

written 5.3 years ago by biolab

Use the API then. As an alternative, you can probably also find the same species in TreeFam.

written 5.3 years ago by Jean-Karim Heriche

Thanks a lot, Jean-Karim Heriche.

written 5.3 years ago by biolab