How to identify identical protein sequences in one FASTA file
8.8 years ago
Kurban ▴ 230

Hello all, I have downloaded transcription factor (TF) sequences of Tribolium castaneum from two databases: DBD, the transcription factor prediction database (http://www.transcriptionfactor.org/index.cgi?Home), and a database of metazoan transcription factors and maternal factors (http://www.bioinformatics.org/regulator/page.php). I got ~620 sequences from the former and 519 sequences from the latter. I then BLASTed the sequences from the two files, and around 70% of the sequences have similarity higher than 75%. So I think a certain number of protein sequences in the two files represent the same transcription factors. I want to use these Tribolium castaneum TFs as a reference for my insect transcriptome data, so I want to remove the redundant sequences and keep only the unique TF sequences. I used cd-hit-est, and it did not cluster any of these sequences. Now I am going to BLAST all the sequences against the NCBI nr database and then delete the duplicates according to their annotations.

My question is: can I do better than this? If so, how? Could you please give me some suggestions?

Thanks

fasta protien-sequences • 2.1k views

If you have protein sequences, you don't want cd-hit-est, which is for nucleotide sequences. cd-hit should work.
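
Before clustering, it may also help to drop sequences that are exactly identical between the two downloads, since those need no alignment at all. Here is a minimal Python sketch of that idea, assuming the two FASTA files have simply been concatenated. The function names (`read_fasta`, `drop_exact_duplicates`) and the toy records are made up for illustration and do not come from either database.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records):
    """Keep the first record seen for each distinct sequence."""
    seen = set()
    unique = []
    for header, seq in records:
        # Compare case-insensitively and ignore a trailing stop symbol.
        key = seq.upper().rstrip("*")
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Toy input (hypothetical); real input would be the two downloads combined.
example = """>dbd_1
MKTAYIAK
QRQISFVK
>regulator_7
MKTAYIAKQRQISFVK
>regulator_8
MSLLTEVE
""".splitlines()

unique_records = drop_exact_duplicates(read_fasta(example))
for header, seq in unique_records:
    print(f">{header}\n{seq}")
```

This only catches exact matches. Near-identical entries (isoforms, slightly different gene models) still need a clustering tool such as cd-hit with an identity threshold.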

8.8 years ago
muppetleague ▴ 10

Have you looked at UCLUST? http://www.drive5.com/usearch/manual/cmd_cluster_fast.html

You can specify the identity threshold with -id and a decimal between 0 and 1 (such as -id 0.9)

The filename specified in the -centroids option will contain the unique sequences.
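
If you want to script this, the call could be assembled roughly as below. This is only a sketch: it assumes the binary is named `usearch` and is on your PATH, the file names are placeholders, and the helper `cluster_fast_command` is a hypothetical name, not part of USEARCH. The only options used are the ones mentioned above (`-cluster_fast`, `-id`, `-centroids`).

```python
import shlex


def cluster_fast_command(fasta_in, centroids_out, identity=0.9, exe="usearch"):
    """Build the argument list for a usearch -cluster_fast run.

    fasta_in, centroids_out and exe are placeholders chosen for this example.
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be a decimal between 0 and 1")
    return [
        exe,
        "-cluster_fast", fasta_in,    # input FASTA to cluster
        "-id", str(identity),         # identity threshold, e.g. 0.9
        "-centroids", centroids_out,  # one representative per cluster
    ]


cmd = cluster_fast_command("tribolium_tfs.fasta", "tribolium_tfs_nr.fasta", identity=0.9)
print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` once the input file exists.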
