Hello all,
I have downloaded transcription factor (TF) sequences of Tribolium castaneum from two databases: DBD, the Transcription factor prediction database (http://www.transcriptionfactor.org/index.cgi?Home), and a database of metazoan transcription factors and maternal factors (http://www.bioinformatics.org/regulator/page.php). I got ~620 sequences from the former and 519 sequences from the latter. I then BLASTed the sequences of the two files against each other, and around 70% of the sequences have a similarity higher than 75%. So I think a fair number of protein sequences in the two files represent the same transcription factors.
I want to use these Tribolium castaneum TFs as a reference for my insect transcriptome data, so I want to remove the redundant sequences and keep only the unique TF sequences. I tried cd-hit-est, but it did not cluster any of the sequences. My current plan is to BLAST all the sequences against the NCBI nr database and then delete the duplicates according to their annotations.
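To make the goal concrete, here is a minimal Python sketch of the simplest part of this clean-up: merging the two FASTA files and collapsing sequences that are exactly identical. The toy records and the names `parse_fasta` and `merge_unique` are invented for illustration and are not taken from either database. It only catches identical sequences, so near-identical entries would still need clustering or BLAST.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_unique(*record_sets: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each distinct sequence to the headers that carry it.

    Sequences are compared case-insensitively and a trailing stop
    symbol ('*') is ignored; this normalisation is an assumption.
    """
    seen: Dict[str, List[str]] = {}
    for records in record_sets:
        for header, seq in records:
            key = seq.upper().rstrip("*")
            seen.setdefault(key, []).append(header)
    return seen


# Toy data standing in for the two downloaded files (hypothetical records).
dbd_fasta = ">dbd_1\nMKT\nAAL\n>dbd_2\nMSSPQ\n"
reg_fasta = ">reg_1\nmktaal*\n>reg_2\nMGGHH\n"

unique = merge_unique(
    parse_fasta(dbd_fasta.splitlines()),
    parse_fasta(reg_fasta.splitlines()),
)

# Write one representative per distinct sequence, listing the merged IDs.
for seq, headers in unique.items():
    print(">" + headers[0] + " merged=" + ",".join(headers))
    print(seq)
```

With real data, the two strings would be replaced by open file handles for the downloaded FASTA files.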
My question is: can I do better than this? If so, how? Any suggestions would be appreciated.