I used the reciprocal best BLAST hit method to find orthologous proteins across many species. My problem is that some individual clusters contain too many proteins.
I give an example below. Each pair is a reciprocal best BLAST hit. Taking the first pair as an example: the best hit of species1 protein A is species2 protein B, and the best hit of species2 protein B is species1 protein A. A cluster then contains A, B, C, D, E, F and G, because all of these proteins are connected to each other through some chain of pairs. As the number of genes and species increases, some clusters become huge (thousands of proteins in a single cluster). How can I filter the results so that these huge clusters are split into smaller ones?
species1 protein A <--> species2 protein B
species1 protein A <--> species3 protein C
species1 protein D <--> species4 protein E
species2 protein B <--> species3 protein F
species2 protein G <--> species4 protein E
species3 protein F <--> species4 protein E
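
To make the clustering step concrete, here is a minimal Python sketch that reproduces it on the six example pairs. It assumes the clusters are built as connected components (single linkage) of the reciprocal-best-hit graph, which is how I understand my own procedure. The names `PAIRS`, `connected_clusters` and `species_counts` are made up for this illustration and do not come from any BLAST tool.

```python
from collections import defaultdict

# Reciprocal best hit pairs from the example, as (species, protein) tuples.
PAIRS = [
    (("species1", "A"), ("species2", "B")),
    (("species1", "A"), ("species3", "C")),
    (("species1", "D"), ("species4", "E")),
    (("species2", "B"), ("species3", "F")),
    (("species2", "G"), ("species4", "E")),
    (("species3", "F"), ("species4", "E")),
]


def connected_clusters(pairs):
    """Group proteins into connected components with a small union-find."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root,
        # shortening the path along the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Every pair merges the components of its two proteins.
    for left, right in pairs:
        parent[find(left)] = find(right)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def species_counts(cluster):
    """Count how many proteins each species contributes to one cluster."""
    counts = defaultdict(int)
    for species, _protein in cluster:
        counts[species] += 1
    return dict(counts)


clusters = connected_clusters(PAIRS)
for cluster in clusters:
    proteins = sorted(protein for _species, protein in cluster)
    print(proteins, species_counts(cluster))
```

On the example this gives a single cluster with all seven proteins, in which species1, species2 and species3 each contribute two proteins (A and D, B and G, C and F) and species4 contributes one. One link is enough to merge two groups, so I suspect this is how the clusters grow so large as more species are added.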