Question: BLAST unigenes against a set of protein sequences
Hello guys, I have more than 10,000 de novo assembled unigenes from RNA-seq. I blasted them against 95 protein sequences from another insect species and got 454 BLAST results. Their similarity ranges from about 20% to 100%, and the e-values range from 8.00E-06 down to 0. To select the probable homologs from these BLAST results, what cut-offs should I use for similarity and e-value?
This question comes up often, and there is no defined cutoff that designates a homolog. Homology is all-or-nothing: a gene is a homolog or it is not. Similarity, on the other hand, is a matter of degree and is expressed in %. A sequence can still be homologous at low % similarity if the two species are evolutionarily far apart. If your insect species are closely related then 20% similarity may be low, but if they are not then 20% could still be an important data point.

As you are well aware, BLAST E-values depend on the size of the database, which in this case is very small. Was there a reason to select only those 95 proteins? The 454 genes that gave a BLAST result are all similar to some extent to your target gene set, and you would need to examine the entire lot to see if you can remove some redundancy.
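
To make the database-size point concrete, here is a rough Python sketch of how the same alignment gets a very different E-value in a small versus a large database. It uses the textbook approximation E = m * n * 2^(-S'), where m is the query length, n is the total database length in residues and S' is the bit score. BLAST itself works with corrected "effective" lengths, so these numbers will not match real output exactly. The query length, the average protein length, the size of the large database and the bit score are all made-up values for illustration, not taken from your data.

```python
def evalue_from_bitscore(bit_score, query_len, db_len):
    """Approximate E = m * n * 2**(-S').

    Ignores the effective-length corrections that BLAST applies,
    so treat the result as an order-of-magnitude estimate only.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def rescale_evalue(evalue, old_db_len, new_db_len):
    """Rescale an E-value to another database size (E is linear in n)."""
    return evalue * new_db_len / old_db_len


# All values below are hypothetical, chosen only for illustration.
query_len = 300          # assumed query length in residues
small_db = 95 * 400      # 95 proteins at an assumed average of 400 aa
large_db = 50_000_000    # assumed size of a large protein database
bit_score = 40.0         # assumed bit score of one hit

e_small = evalue_from_bitscore(bit_score, query_len, small_db)
e_large = evalue_from_bitscore(bit_score, query_len, large_db)

print(f"E-value against 95 proteins:   {e_small:.2e}")
print(f"E-value against large database: {e_large:.2e}")
print(f"Fold difference:                {e_large / e_small:.0f}x")
```

The takeaway is that an e-value cutoff borrowed from searches against a large database is much more permissive when applied to a 95-protein database, because the same bit score gives a far smaller e-value there. If you want e-values that are comparable across searches, BLAST+ has a `-dbsize` option for setting the effective database length, or you can compare bit scores directly since they do not depend on database size.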

