Question: How to identify identical protein sequences in one FASTA file
3.7 years ago
China/Urumqi/Xinjiang Academy of Animal Sciences
Kurban170 wrote:

Hello all,

I have downloaded transcription factor (TF) sequences of Tribolium castaneum from two databases: DBD (Transcription factor prediction database) and a database of metazoan transcription factors and maternal factors. I got ~620 sequences from the former and 519 sequences from the latter. I then BLASTed the sequences of these two files, and around 70% of the sequences have a similarity higher than 75%. So I think a certain number of protein sequences in the two files represent the same transcription factors.

I want to use these Tribolium castaneum TFs as a reference for my insect transcriptome data, so I want to remove the redundant sequences and keep only the unique TF sequences. I used cd-hit-est, but it did not cluster any of these sequences. Now I am going to BLAST all the sequences against the NCBI nr database and then delete the duplicates according to their annotations.

My question is: can I do better than this? If so, how? Could you please give me some suggestions?


If you have protein sequences, you don't want cd-hit-est, which is for nucleotide sequences. cd-hit should work.
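
As a quick first pass before any clustering, exactly identical sequences can be collapsed with a few lines of Python. This is only a minimal sketch and not what cd-hit does: it catches sequences that are identical letter for letter (ignoring case and a trailing `*` stop symbol) and will miss near-identical ones. The record names and sequences below are made-up toy data, and the function names are just for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def collapse_identical(records):
    """Map each distinct sequence to the list of headers that share it.

    Sequences are compared case-insensitively and a trailing '*' is ignored.
    """
    groups = {}
    for header, seq in records:
        key = seq.upper().rstrip("*")
        groups.setdefault(key, []).append(header)
    return groups


# Hypothetical toy input standing in for the two merged database files.
example = """>dbd_TF1
MKTAYIAK
QRQISFVK
>tfdb_TF1
mktayiakqrqisfvk*
>dbd_TF2
MSLNPEEQ
""".splitlines()

groups = collapse_identical(read_fasta(example))

# Keep the first header of each group as the representative.
for seq, headers in groups.items():
    print(f">{headers[0]} duplicates={headers[1:]}")
    print(seq)
```

For a real file, pass an open file handle to `read_fasta` instead of the toy list. Whatever remains after this step can then go through cd-hit to merge sequences that are similar but not identical.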

Neilfws
3.7 years ago
United States
muppetleague10 wrote:

Have you looked at UCLUST?

You can specify the identity threshold with -id and a decimal between 0 and 1 (such as -id 0.9).

The filename specified in the -centroids option will contain the unique sequences. 
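
A rough sketch of how this could be scripted from Python follows. It assumes UCLUST is run through a `usearch` executable with the `-cluster_fast` command, which is how I understand recent USEARCH versions expose it; older releases use a different command line, so check your version's help first. The file names are hypothetical placeholders.

```python
import os
import shutil
import subprocess


def build_uclust_command(input_fasta, centroids_fasta, identity=0.9, exe="usearch"):
    """Build the argument list for a UCLUST run (assumed -cluster_fast syntax)."""
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be a decimal between 0 and 1")
    return [
        exe, "-cluster_fast", input_fasta,
        "-id", str(identity),            # identity threshold, e.g. 0.9
        "-centroids", centroids_fasta,   # one representative per cluster
    ]


# Hypothetical file names, for illustration only.
cmd = build_uclust_command("tribolium_tfs.fasta", "tribolium_tfs_nr.fasta", 0.9)
print(" ".join(cmd))

# Run only if the executable and the input file are actually present.
if shutil.which(cmd[0]) and os.path.exists(cmd[2]):
    subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if the file names contain spaces.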

