Question

How to identify identical protein sequences in one FASTA file


8.8 years ago

Kurban ▴ 230

Hello all, I have downloaded transcription factor (TF) sequences of Tribolium castaneum from two databases: DBD, the transcription factor prediction database (http://www.transcriptionfactor.org/index.cgi?Home), and a database of metazoan transcription factors and maternal factors (http://www.bioinformatics.org/regulator/page.php). I got ~620 sequences from the former and 519 sequences from the latter. I then ran BLAST on the sequences of these two files, and around 70% of the sequences have a similarity higher than 75%, so I think a certain number of protein sequences in the two files represent the same transcription factors. I want to use these Tribolium castaneum TFs as a reference for my insect transcriptome data, so I want to remove the redundant sequences and keep only the unique TF sequences. I used cd-hit-est, and it did not cluster any of these sequences. Now I am going to BLAST all the sequences against the NCBI nr database and then delete the duplicates according to their annotations.

My question is: can I do better than this? If so, how? Could you please give me some suggestions?

Thanks

fasta protien-sequences • 2.1k views

Updated 16 months ago by Ram 43k • written 8.8 years ago by Kurban ▴ 230


If you have protein sequences, you don't want cd-hit-est, which is for nucleotide sequences. cd-hit, the protein version, should work.
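As a rough illustration of the cd-hit suggestion, here is a minimal Python sketch that only assembles the command line; it does not run the tool. The file names and the 0.9 identity threshold are hypothetical placeholders, and the word-size (-n) choice follows the ranges given in the CD-HIT user guide for protein clustering, so please check it against your installed version.

```python
import shlex


def cdhit_word_size(identity: float) -> int:
    """Pick the -n word size suggested by the CD-HIT guide for a protein identity threshold."""
    if not 0.4 <= identity <= 1.0:
        raise ValueError("cd-hit protein identity thresholds range from 0.4 to 1.0")
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    return 2


def build_cdhit_command(fasta_in: str, fasta_out: str, identity: float = 0.9) -> list:
    """Return the argument list for a protein cd-hit run (not cd-hit-est)."""
    return [
        "cd-hit",
        "-i", fasta_in,             # input FASTA with the combined protein sequences
        "-o", fasta_out,            # output FASTA with one representative per cluster
        "-c", str(identity),        # sequence identity threshold
        "-n", str(cdhit_word_size(identity)),
    ]


# Hypothetical file names, used only for illustration.
cmd = build_cdhit_command("tribolium_tfs_combined.fasta", "tribolium_tfs_nr.fasta", identity=0.9)
print(shlex.join(cmd))
# To actually run it (assumes cd-hit is installed and on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

The two downloaded files would need to be concatenated into one FASTA first. cd-hit also writes a .clstr file next to the output, which lists the members of each cluster.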

8.8 years ago by Neilfws 49k

Ram · Answer 1 · 2015-07-06


8.8 years ago

muppetleague ▴ 10

Have you looked at UCLUST? http://www.drive5.com/usearch/manual/cmd_cluster_fast.html

You can specify the identity threshold with -id followed by a decimal between 0 and 1 (such as -id 0.9).

The filename specified in the -centroids option will contain the unique sequences.
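To make the two options above concrete, here is a small Python sketch that assembles a cluster_fast command line and counts FASTA records so the input and centroid files can be compared afterwards. The file names are hypothetical, the helper names are made up for this example, and it assumes the usearch binary is installed under that name.

```python
import shlex


def build_cluster_fast_command(fasta_in: str, centroids_out: str, identity: float = 0.9) -> list:
    """Return the argument list for a usearch -cluster_fast run."""
    if not 0.0 < identity <= 1.0:
        raise ValueError("-id expects a decimal between 0 and 1")
    return [
        "usearch",
        "-cluster_fast", fasta_in,    # input FASTA to cluster
        "-id", str(identity),         # identity threshold
        "-centroids", centroids_out,  # FASTA of cluster representatives
    ]


def count_fasta_records(lines) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Hypothetical file names, used only for illustration.
cmd = build_cluster_fast_command("tribolium_tfs_combined.fasta", "tribolium_tfs_centroids.fasta")
print(shlex.join(cmd))
# After running the command, compare record counts, for example:
#   with open("tribolium_tfs_centroids.fasta") as handle:
#       print(count_fasta_records(handle))
```

Comparing the record count of the centroids file with that of the input gives a quick idea of how many sequences were merged at the chosen threshold.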

Updated 16 months ago by Ram 43k • written 8.8 years ago by muppetleague ▴ 10