I have assembled three transcriptomes of a non-model plant and have been writing up a report. Initially, I queried the unigenes (blastx) and the proteins they code for (blastp) against the entire collection of PlantTFDB protein sequences, using an E-value cutoff of 1e-5.
Upon analyzing the unigene blastx and blastp hits, I realized that I was getting far too many members of each of the 58 transcription factor families. For example, ~3000 unigenes were annotated as bHLH in one of my assemblies; however, according to the PlantTFDB species summary for this family (http://planttfdb.cbi.pku.edu.cn/family.php?fam=bHLH), the highest number of bHLH genes identified in any one species is 559 (Panicum virgatum).
I have since filtered the blastx and blastp results at an E-value of 1e-50 (in hindsight, 1e-5 was far too lenient) and >35% identity. This reduced the number of bHLH-annotated unigenes to ~1000, but I suspect this is still an overestimate.
I have also been able to generate percent hit coverage stats for the blast results and was thinking that I could similarly filter the results to keep only hits above some percent hit coverage threshold.
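To make the filtering concrete, here is a minimal sketch of how the three criteria (E-value, percent identity, percent hit coverage) could be applied together. It assumes the BLAST results are in tabular format produced with `-outfmt "6 std slen"`, i.e. the 12 standard columns plus the subject length as a 13th column, and it takes hit coverage to mean the aligned span on the subject divided by the subject length. The function name `filter_hits`, the example IDs, and the 50% coverage default are hypothetical placeholders, not a recommended threshold.

```python
import csv
import io

# Column order for -outfmt "6 std slen" (assumed; adjust if your table differs).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore", "slen"]

def filter_hits(handle, max_evalue=1e-50, min_pident=35.0, min_hit_cov=50.0):
    """Return (query, subject, hit coverage %) for hits passing all three filters.

    min_hit_cov=50.0 is an arbitrary placeholder, not a suggested cutoff.
    """
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        sstart, send, slen = int(hit["sstart"]), int(hit["send"]), int(hit["slen"])
        # Percent of the subject (TF protein) covered by this alignment.
        hit_cov = 100.0 * (abs(send - sstart) + 1) / slen
        if (float(hit["evalue"]) <= max_evalue
                and float(hit["pident"]) >= min_pident
                and hit_cov >= min_hit_cov):
            kept.append((hit["qseqid"], hit["sseqid"], round(hit_cov, 1)))
    return kept

# Made-up example rows (tab-separated); IDs and values are illustrative only.
EXAMPLE = (
    "unigene1\tTF_bHLH_001\t62.5\t200\t75\t0\t1\t600\t11\t210\t1e-80\t250\t400\n"
    "unigene2\tTF_bHLH_001\t40.0\t60\t36\t0\t1\t180\t1\t60\t1e-60\t200\t400\n"
    "unigene3\tTF_bHLH_001\t30.0\t300\t210\t0\t1\t900\t1\t300\t1e-70\t220\t400\n"
)
kept = filter_hits(io.StringIO(EXAMPLE))
```

On a real results file, the same function could be called as `filter_hits(open("blastx_results.tsv"))`, where the filename is a placeholder. Note that this treats each alignment row separately, so a unigene with several partial alignments to the same protein is judged on each alignment alone.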
Any suggestions on an alternative approach or a percent hit coverage threshold to filter with would be much appreciated.