Question: How to identify the same protein sequences in one FASTA file
3.7 years ago by
China / Urumqi / Xinjiang Academy of Animal Sciences
Kurban170 wrote:

Hello all,

I have downloaded transcription factor (TF) sequences of Tribolium castaneum from two databases: DBD: Transcription factor prediction database, and a database of metazoan transcription factors and maternal factors. I got ~620 sequences from the former and 519 sequences from the latter. I then ran BLAST on the sequences from these two files, and around 70% of the sequences have a similarity higher than 75%. So I think a certain number of protein sequences in the two files represent the same transcription factors.

I want to use these Tribolium castaneum TFs as a reference for my insect transcriptome data, so I want to remove the redundant sequences and keep only the unique TF sequences. I used cd-hit-est, but it did not cluster any of these sequences. Now I am going to BLAST all the sequences against the NCBI nr database and then delete the duplicated ones according to their annotations.

My question is: can I do better than this? If so, how? Could you please give me some suggestions?



If you have protein sequences, you don't want cd-hit-est which is for nucleotide sequences. cd-hit should work.
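
As a rough illustration, a protein-level cd-hit run could be wrapped like this. It is only a sketch: the file names and the 0.9 threshold in the usage note are placeholders rather than values from this thread, and it assumes the binary is installed under the name cd-hit. The options -i, -o, -c and -n are cd-hit's input, output, identity threshold and word size, and the word-size mapping follows the ranges suggested in the cd-hit user guide.

```python
import shutil
import subprocess


def build_cdhit_command(fasta_in, fasta_out, identity=0.9):
    """Return the argument list for a protein-level cd-hit run."""
    if not 0.4 <= identity <= 1.0:
        raise ValueError("protein cd-hit expects an identity between 0.4 and 1.0")
    # Word sizes suggested by the cd-hit user guide for each identity range.
    if identity >= 0.7:
        word_size = 5
    elif identity >= 0.6:
        word_size = 4
    elif identity >= 0.5:
        word_size = 3
    else:
        word_size = 2
    return [
        "cd-hit",
        "-i", fasta_in,        # input protein FASTA
        "-o", fasta_out,       # representative (non-redundant) sequences
        "-c", str(identity),   # sequence identity threshold
        "-n", str(word_size),  # word size matching the threshold
    ]


def run_cdhit(fasta_in, fasta_out, identity=0.9):
    """Run cd-hit if it is on PATH (assumed binary name: cd-hit)."""
    if shutil.which("cd-hit") is None:
        raise FileNotFoundError("cd-hit was not found on PATH")
    subprocess.run(build_cdhit_command(fasta_in, fasta_out, identity), check=True)
```

Calling run_cdhit("tfs_merged.fasta", "tfs_nr.fasta", 0.9) on a file that holds the sequences from both databases would write one representative per cluster to tfs_nr.fasta, with cluster membership in tfs_nr.fasta.clstr. Both file names are hypothetical.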

written 3.7 years ago by Neilfws48k
3.7 years ago by
United States
muppetleague10 wrote:

Have you looked at UCLUST?

You can specify the identity threshold with -id and a decimal between 0 and 1 (such as -id 0.9)

The filename specified in the -centroids option will contain the unique sequences. 
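
A minimal sketch of how such a run might be scripted is shown below. It assumes UCLUST is invoked through the usearch binary with the -cluster_fast command, which accepts the -id and -centroids options mentioned above; the binary name and all file names are assumptions, so adjust them to your installation. The small counting helper is only there to compare the number of sequences before and after clustering.

```python
import shutil
import subprocess


def build_uclust_command(fasta_in, centroids_out, identity=0.9):
    """Return the argument list for a usearch -cluster_fast run."""
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be a decimal between 0 and 1")
    return [
        "usearch",
        "-cluster_fast", fasta_in,    # input FASTA to cluster
        "-id", str(identity),         # identity threshold, e.g. 0.9
        "-centroids", centroids_out,  # one representative per cluster
    ]


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def run_uclust(fasta_in, centroids_out, identity=0.9):
    """Cluster and report how many sequences remain (assumes usearch on PATH)."""
    if shutil.which("usearch") is None:
        raise FileNotFoundError("usearch was not found on PATH")
    subprocess.run(build_uclust_command(fasta_in, centroids_out, identity), check=True)
    return count_fasta_records(fasta_in), count_fasta_records(centroids_out)
```

For example, run_uclust("tfs_merged.fasta", "tfs_centroids.fasta", 0.9) would return the record counts of the input and of the centroids file, which gives a quick check of how many redundant sequences were collapsed.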
