Question

Removing Sequences With More Than 90% Identity In A Protein Fasta File

0

11.0 years ago

Pappu ★ 2.1k

I want to remove sequences that share >90% sequence identity, keeping the longer sequence in each case. Is there a tool for that?

python • 4.4k views

updated 11.0 years ago by Frédéric Mahé ★ 3.2k • written 11.0 years ago by Pappu ★ 2.1k

0

Say A and B have 90% identity and B is longer; B and C have 90% identity and C is longer. Do you want to remove both A and B?

11.0 years ago by lh3 33k

0

Exactly, I want to remove all the subsets of sequences with >90% identity.

11.0 years ago by Pappu ★ 2.1k

1

I was not clear: in the example above, A and C do not have 90% identity. Because B has been thrown away, you may think A should be kept, as it is not within 90% identity of any other chosen sequence. Do you still want to remove A? If so, that is single-linkage clustering or, equivalently, finding connected components in a graph. You can find the algorithm on Wikipedia and in many other places. It is pretty simple and should be achievable in <50 lines of Perl.

11.0 years ago by lh3 33k
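
For illustration, the connected-components idea described above could look roughly like this in Python (the thread's tag) rather than Perl. This is only a sketch: it assumes the pairwise percent identities have already been computed by some other tool (for example an all-vs-all search), and the function names, the input layout, and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def find(parent, x):
    """Return the root of x's component, shortening the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def longest_per_component(seqs, pairs, threshold=90.0):
    """Keep only the longest sequence of each connected component.

    seqs:  {name: sequence}
    pairs: iterable of (name_a, name_b, percent_identity), assumed precomputed
    Two sequences end up in the same component if they are linked, directly
    or through intermediates, by identities above the threshold.
    """
    parent = {name: name for name in seqs}
    for a, b, identity in pairs:
        if identity > threshold:
            parent[find(parent, a)] = find(parent, b)

    components = defaultdict(list)
    for name in seqs:
        components[find(parent, name)].append(name)

    # max() returns the first longest member, so ties resolve by input order.
    return {max(members, key=lambda n: len(seqs[n]))
            for members in components.values()}


# Hypothetical data mirroring the A/B/C example: A~B and B~C, but not A~C.
seqs = {"A": "MKV", "B": "MKVL", "C": "MKVLA", "D": "GGGG"}
pairs = [("A", "B", 95.0), ("B", "C", 92.0), ("A", "C", 80.0)]
kept = longest_per_component(seqs, pairs)
print(sorted(kept))
```

With this toy input, A is dropped even though it is below the threshold against C, because it is connected to C through B. That is the single-linkage behaviour discussed in the reply.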

score 4 · Answer 1 · 2013-05-06

4

11.0 years ago

Ido Tamir 5.2k

I just used usearch for nucleotide sequences and I think it's quite good. It does single-linkage clustering to centroids and reports the centroids and the clusters. It is not open source, however.
CD-HIT does similar things, but I have not used it.
I also tried to do the same with vmatch (also not open source), but the clustering was not good.

11.0 years ago by Ido Tamir 5.2k
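
As a rough sketch of the CD-HIT route mentioned above, the snippet below builds a CD-HIT command line from Python. The flags `-i`, `-o`, `-c` and `-n` are CD-HIT's standard options; the helper name and the file names are hypothetical. Note that CD-HIT clusters greedily around the longest sequences and keeps each cluster's representative, which is not the same as strict single-linkage clustering, so results may differ from a connected-components approach.

```python
import subprocess


def build_cdhit_command(fasta_in, fasta_out, identity=0.9):
    """Build the argument list for a CD-HIT run at the given identity threshold."""
    return [
        "cd-hit",
        "-i", fasta_in,        # input protein FASTA
        "-o", fasta_out,       # output FASTA of cluster representatives
        "-c", str(identity),   # identity threshold as a fraction, not a percent
        "-n", "5",             # word size; 5 is the setting for thresholds of 0.7 and above
    ]


# Hypothetical file names.
cmd = build_cdhit_command("proteins.fasta", "proteins_nr90.fasta")
print(" ".join(cmd))

# Uncomment to actually run it; this requires cd-hit to be installed and on PATH.
# subprocess.run(cmd, check=True)
```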

0

usearch is freely available if you have an academic email address, and I second Ido Tamir's recommendation. I have used the uclust operation for this.

11.0 years ago by Josh Herr 5.8k

score 4 · Answer 2 · 2013-05-06

4

11.0 years ago

Frédéric Mahé ★ 3.2k

T-coffee has a command just for that:

t_coffee -other_pg seq_reformat -in sproteases_large.fasta -action +trim _seq_%%90_