Question: Removing Sequences With More Than 90% Identity In A Protein Fasta File
0
4.2 years ago by
Pappu1.7k
Pappu1.7k wrote:

I want to remove sequences that share >90% sequence identity with another sequence, keeping the longer sequence in each case. Is there a tool for that?

python • 1.5k views
ADD COMMENTlink modified 4.2 years ago by Frédéric Mahé2.6k • written 4.2 years ago by Pappu1.7k

Say A and B have 90% identity and B is longer; B and C have 90% identity and C is longer. Do you want to remove both A and B?

ADD REPLYlink written 4.2 years ago by lh330k

Exactly, I want to remove all the subsets of sequences with >90% identity.

ADD REPLYlink written 4.2 years ago by Pappu1.7k
1

I was not clear: in the example above, A and C do not have 90% identity. Because B has been thrown away, you may think A should be kept, as it is not within 90% identity of any other chosen sequence. Do you still want to remove A? If you do, that is single-linkage clustering or, equivalently, finding connected components in a graph. You can find the algorithm on wiki and many other places. It is pretty simple and should be achievable in <50 lines of Perl.
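The connected-components approach described above can be sketched in Python with a small union-find. This is only an illustration under stated assumptions: the pairs with >90% identity are assumed to have been computed already by some alignment tool, and the function name `longest_per_component` and the toy lengths are hypothetical.

```python
def longest_per_component(lengths, similar_pairs):
    """Keep the longest sequence of every connected component.

    lengths: dict mapping sequence id -> sequence length
    similar_pairs: iterable of (id_a, id_b) pairs with >90% identity
                   (assumed to be precomputed by an external tool)
    """
    parent = {seq_id: seq_id for seq_id in lengths}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of every similar pair (single linkage).
    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # For each component, remember the id of its longest member.
    best = {}
    for seq_id, length in lengths.items():
        root = find(seq_id)
        if root not in best or length > lengths[best[root]]:
            best[root] = seq_id
    return set(best.values())


# Toy version of the A/B/C example: A~B and B~C, but A and C are not similar.
lengths = {"A": 100, "B": 120, "C": 150, "D": 80}
pairs = [("A", "B"), ("B", "C")]
print(sorted(longest_per_component(lengths, pairs)))  # ['C', 'D']
```

Here A is removed even though it is not similar to C, because A, B and C end up in the same component.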

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by lh330k
4
4.2 years ago by
Ido Tamir4.7k
Austria
Ido Tamir4.7k wrote:
  • I just used usearch for nucleotide sequences and I think it's quite good. It does single linkage clustering to centroids and reports the centroids and the clusters. It's not open source, however.
  • CD-Hit does similar things, but I have not used it.
  • I also tried to do the same with vmatch (also not open source), but the clustering was not good.
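To illustrate the general idea of clustering to representative sequences while keeping the longer one, here is a minimal greedy sketch in Python. It is not the actual implementation of usearch, CD-Hit or vmatch. The names `greedy_nonredundant` and `toy_identity` are hypothetical, and `toy_identity` is a crude stand-in that compares positions without any alignment; a real run would plug in identities from a proper aligner.

```python
def greedy_nonredundant(seqs, identity, threshold=0.9):
    """Greedy representative selection, longest sequences first.

    seqs: dict mapping sequence id -> sequence string
    identity: callable(seq_a, seq_b) -> fraction in [0, 1]
    Returns the ids kept, in the order they were chosen.
    """
    representatives = []
    # Visit sequences from longest to shortest so the longer one wins.
    for seq_id in sorted(seqs, key=lambda s: len(seqs[s]), reverse=True):
        # Keep a sequence only if it is not too similar to any kept one.
        if all(identity(seqs[seq_id], seqs[rep]) <= threshold
               for rep in representatives):
            representatives.append(seq_id)
    return representatives


def toy_identity(a, b):
    """Hypothetical stand-in: matching positions over the shorter length."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


seqs = {
    "long": "MKTAYIAKQRQISFVKSHFSRQ",
    "short": "MKTAYIAKQRQISFVKSHF",   # prefix of "long"
    "other": "GSHMLEDPVAGTWQNNRRLC",
}
print(greedy_nonredundant(seqs, toy_identity))  # ['long', 'other']
```

Unlike connected components, this greedy scheme only compares each sequence against the representatives already kept, so a sequence that is similar only to a discarded one can still survive.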
ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by Ido Tamir4.7k

usearch is freely available if you have an academic email address, and I second Ido Tamir's recommendation. I have used the uclust operation for this.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Josh Herr5.4k
4
4.2 years ago by
Kaiserslautern, Germany
Frédéric Mahé2.6k wrote:

T-coffee has a command just for that:

t_coffee -other_pg seq_reformat -in sproteases_large.fasta -action +trim _seq_%%90_
ADD COMMENTlink written 4.2 years ago by Frédéric Mahé2.6k

t_coffee's option removes one of a pair, not both.

ADD REPLYlink written 2.3 years ago by ddofer30