Removing Sequences With More Than 90% Identity In A Protein Fasta File
11.0 years ago
Pappu ★ 2.1k

I want to remove the sequences which have >90% sequence identity keeping the larger sequence. I am wondering if there is any tool for that.

python • 4.4k views

Say A and B have 90% identity and B is longer; B and C have 90% identity and C is longer. Do you want to remove both A and B?


Exactly, I want to remove all the subsets of sequences with >90% identity.


I was not clear: in the example above, A and C do not have 90% identity. Because B has been thrown away, you may think A should be kept, as it is not within 90% identity of any other chosen sequence. Do you still want to remove A? If so, that is single-linkage clustering or, equivalently, finding connected components in a graph. You can find the algorithm on Wikipedia and in many other places. It is pretty simple and should be achievable in <50 lines of Perl.
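A minimal sketch of the connected-components idea, in Python rather than Perl. It assumes the pairs of sequences with >90% identity have already been computed by some alignment tool; the function name `remove_redundant` and the input layout are hypothetical, chosen only for illustration. It uses a small union-find structure and keeps the longest sequence of each component.

```python
def remove_redundant(lengths, similar_pairs):
    """Keep the longest sequence of each connected component.

    lengths: dict mapping sequence ID -> sequence length
    similar_pairs: iterable of (id_a, id_b) pairs with >90% identity,
        assumed to be precomputed elsewhere (hypothetical input format)
    """
    # Every sequence starts as the root of its own component.
    parent = {seq_id: seq_id for seq_id in lengths}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of every similar pair (single linkage).
    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Pick the longest member of each component; ties keep the first seen.
    best = {}
    for seq_id, length in lengths.items():
        root = find(seq_id)
        if root not in best or length > lengths[best[root]]:
            best[root] = seq_id
    return sorted(best.values())


# The A/B/C chain from the example, plus an unrelated sequence D.
lengths = {"A": 100, "B": 120, "C": 150, "D": 80}
pairs = [("A", "B"), ("B", "C")]
kept = remove_redundant(lengths, pairs)
print(kept)
```

Here A, B and C fall into one component even though A and C are not directly similar, so only C (the longest) and the unrelated D are kept.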

11.0 years ago
Ido Tamir 5.2k
  • I just used usearch for nucleotide sequences and I think it's quite good. It does single-linkage clustering to centroids and reports the centroids and the clusters. It's not open source, however.
  • CD-HIT does similar things, but I have not used it.
  • I also tried to do the same with vmatch (also not open source), but the clustering was not good.
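To illustrate what centroid-style clustering means in general, here is a rough Python sketch of a greedy, longest-first pass. It is not the actual implementation of usearch, CD-HIT or vmatch. The names `greedy_representatives` and `toy_identity` are hypothetical, and `toy_identity` is only a placeholder that compares ungapped positions; a real run would plug in an alignment-based identity.

```python
def toy_identity(a, b):
    """Placeholder identity: fraction of matching positions over the
    shorter sequence, with no alignment. For illustration only."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / shorter


def greedy_representatives(seqs, identity=toy_identity, threshold=0.9):
    """Return IDs of representative sequences, longest first.

    seqs: dict mapping sequence ID -> sequence string
    identity: callable(seq_a, seq_b) -> float in [0, 1] (assumed interface)
    """
    reps = []
    # Visit longest sequences first so each cluster keeps its longest member;
    # ties are broken by ID to make the result deterministic.
    for seq_id in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        # Keep the sequence only if no existing representative exceeds
        # the identity threshold against it.
        if all(identity(seqs[seq_id], seqs[r]) <= threshold for r in reps):
            reps.append(seq_id)
    return reps


seqs = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIAKQRQISF",
    "C": "GGGGSGGGGS",
}
reps = greedy_representatives(seqs)
print(reps)
```

Unlike a connected-components approach, each sequence is compared only against the representatives already chosen, so a sequence that is similar to a discarded one but not to any representative would be kept.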

usearch is freely available if you have an academic email address, and I second Ido Tamir's recommendation. I have used the uclust operation for this.

11.0 years ago

T-Coffee has a command just for that:

t_coffee -other_pg seq_reformat -in sproteases_large.fasta -action +trim _seq_%%90_

t_coffee's option removes one of a pair, not both.
