Question: Removal of sequences that are part of a longer sequence in a multifasta file
4 months ago by
Louis Kok20
Singapore
Louis Kok20 wrote:

Hi. I wish to remove any sequence that is part of (i.e. an exact substring of) a longer sequence in a multifasta file. For example, let's say I have the three sequences below:

>seq1
ACGACGATCGT**ACTAGCATCGAGCGTAC**TACGTAGCGCGT

>seq2
**ACTAGCATCGAGCGTAC**

>seq3
AGCAGCGTACGTGACTACGACGATCTACGTATCTAGCTCGTACACT

seq2 is exactly contained in seq1. So after removing such partial (duplicate) sequences, I expect to have the following multifasta file:

>seq1
ACGACGATCGTACTAGCATCGAGCGTACTACGTAGCGCGT

>seq3
AGCAGCGTACGTGACTACGACGATCTACGTATCTAGCTCGTACACT

All the answers I managed to find deal with removing exact duplicates only. Is there any tool or script that can do this? Thanks in advance.


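You can try the program CD-HIT with the sequence identity threshold set to 1 (option -c 1.0; for nucleotide sequences like these, use cd-hit-est from the same package). It will cluster identical sequences together and return the longest one from each cluster.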

written 4 months ago by Chirag Parsania
4 months ago by
RamRS30k
Baylor College of Medicine, Houston, TX
RamRS30k wrote:

I think the most straightforward way would be to write a custom script using BioPython. You could create a dict with the identifiers as keys and the sequences as values, then test whether each sequence, starting from the shortest one, is a substring of any longer sequence. Sequences that are not contained in any other can then be kept and saved to an output file.
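A minimal sketch of that idea follows. It is not a tested pipeline, and it rests on some assumptions: it uses a small hand-written FASTA reader instead of BioPython so that it runs without extra packages (Bio.SeqIO.parse(handle, "fasta") could fill the dict instead), the function names read_fasta and remove_contained are made up for illustration, matching is exact and case-sensitive, and reverse complements are not considered. When two sequences are identical, only the later one in the file is kept.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict {identifier: sequence}."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word after '>' as the identifier.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def remove_contained(records):
    """Return only the records whose sequence is not a substring of
    another sequence of equal or greater length."""
    # Sort from shortest to longest; each sequence is then compared
    # only against the sequences that come after it in this order.
    ordered = sorted(records.items(), key=lambda item: len(item[1]))
    kept = set()
    for i, (name, seq) in enumerate(ordered):
        if not any(seq in other for _, other in ordered[i + 1:]):
            kept.add(name)
    # Preserve the original file order in the result.
    return {n: s for n, s in records.items() if n in kept}


# Example with the three sequences from the question.
example = """\
>seq1
ACGACGATCGTACTAGCATCGAGCGTACTACGTAGCGCGT
>seq2
ACTAGCATCGAGCGTAC
>seq3
AGCAGCGTACGTGACTACGACGATCTACGTATCTAGCTCGTACACT
"""

filtered = remove_contained(read_fasta(example.splitlines()))
for name, seq in filtered.items():
    print(f">{name}\n{seq}")  # prints seq1 and seq3; seq2 is dropped
```

To run it on a real file, pass an open file handle to read_fasta in place of example.splitlines() and write the printed records to an output file. The comparison is quadratic in the number of sequences, so for very large files a clustering tool is likely to be the more practical route.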
