Question: Removing sequences that are part of a longer sequence in a multi-FASTA file
4 months ago, Louis Kok wrote:

Hi. I want to remove every sequence that is a partial copy of a longer sequence in a multi-FASTA file. For example, say I have the three sequences below:




seq2 is contained exactly within seq1. After removing such partial (duplicate) sequences, I expect to end up with the following multi-FASTA file:



All the answers I could find only cover the removal of exact duplicates. Is there a tool or script that can do this? Thanks in advance.


You can try CD-HIT with the sequence identity threshold (the -c option) set to 1. It clusters all sequences that are identical and returns the longest one from each cluster.

written 4 months ago by Chirag Parsania
4 months ago, RamRS (Baylor College of Medicine, Houston, TX) wrote:

I think the simplest way would be to write a custom script using Biopython. You could build a dict with the identifiers as keys and the sequences as values, then test whether each sequence, starting from the shortest, is a substring of any longer sequence. The sequences that are not contained in any other are the ones to keep and write to an output file.
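
A minimal sketch of that idea follows. It is not the answerer's own script. To stay self-contained it uses a small hand-written FASTA parser instead of Biopython (with Biopython installed, SeqIO.parse(path, "fasta") could supply the records instead). It also walks the sequences from longest to shortest rather than shortest first, which should give the same result while only comparing each sequence against the ones already kept. The function names read_fasta, remove_contained and write_fasta are made up for this example. It assumes a case-insensitive, same-strand comparison (reverse complements are not checked), and it is quadratic in the number of sequences, so it is only meant for modestly sized files.

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def remove_contained(records):
    """Drop records whose sequence is a substring of a kept sequence.

    Records are visited longest first, so each sequence is compared only
    with sequences of equal or greater length. Exact duplicates collapse
    to the first occurrence. Survivors are returned in input order.
    """
    # sorted() is stable, so equal-length records keep their input order.
    order = sorted(range(len(records)),
                   key=lambda i: len(records[i][1]),
                   reverse=True)
    kept_indices = []
    kept_seqs = []
    for i in order:
        seq = records[i][1].upper()  # assumption: case-insensitive match
        if any(seq in longer for longer in kept_seqs):
            continue  # contained in (or identical to) a kept sequence
        kept_indices.append(i)
        kept_seqs.append(seq)
    return [records[i] for i in sorted(kept_indices)]


def write_fasta(records, path, width=60):
    """Write (header, sequence) tuples as FASTA, wrapping sequence lines."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(">" + header + "\n")
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
```

To use it, something like write_fasta(remove_contained(read_fasta("input.fasta")), "output.fasta") should work, where both file names are placeholders.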
