Removal of sequences that are part of a longer sequence in a multifasta file
3.8 years ago
Louis Kok ▴ 30

Hi. I wish to remove any sequence that is fully contained in a longer sequence in a multifasta file. For example, let's say I have the three sequences below:

>seq1
ACGACGATCGT**ACTAGCATCGAGCGTAC**TACGTAGCGCGT

>seq2
**ACTAGCATCGAGCGTAC**

>seq3
AGCAGCGTACGTGACTACGACGATCTACGTATCTAGCTCGTACACT

seq2 is an exact substring of seq1. After removing such partial (duplicate) sequences, I expect to end up with the following multifasta file:

>seq1
ACGACGATCGTACTAGCATCGAGCGTACTACGTAGCGCGT

>seq3
AGCAGCGTACGTGACTACGACGATCTACGTATCTAGCTCGTACACT

All the answers I managed to find only cover removal of exact duplicates. Is there a tool or script that can do this? Thanks in advance.


You can try CD-HIT (cd-hit-est for nucleotide sequences) with the sequence identity threshold set to 1. It should cluster each sequence with any longer sequence it matches at 100% identity and return the longest sequence of each cluster.

3.8 years ago
Ram 43k

I think the simplest way would be to write a custom script using BioPython. You could create a dict with the identifiers as keys and the sequences as values, then test whether each sequence, starting from the shortest one, is a substring of any longer sequence. Sequences that are not contained in any other are the ones to keep and write to an output file.
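
As a rough sketch of that idea (an illustration, not a polished script): the code below uses only the Python standard library so it runs without extra dependencies, but the small `read_fasta` parser could be swapped for Biopython's `Bio.SeqIO.parse(handle, "fasta")`. The function names `read_fasta` and `drop_contained` are made up for this example. It assumes matches on the same strand only (reverse complements are not checked), compares case-insensitively, and works from the longest sequence down rather than from the shortest up, so each sequence only has to be checked against the ones already kept.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        yield name, "".join(chunks)


def drop_contained(records):
    """Return the records whose sequence is not a substring of another kept one.

    Exact duplicates count as contained, so only the first copy survives.
    The original input order is preserved in the result.
    """
    records = list(records)
    # Visit sequences from longest to shortest (stable for equal lengths).
    by_length = sorted(range(len(records)),
                       key=lambda i: len(records[i][1]), reverse=True)
    kept = []  # indices of records retained so far
    for i in by_length:
        seq = records[i][1].upper()
        if not any(seq in records[j][1].upper() for j in kept):
            kept.append(i)
    return [records[i] for i in sorted(kept)]


# The three sequences from the question, as an in-memory stand-in for a file.
example = StringIO(""">seq1
ACGACGATCGTACTAGCATCGAGCGTACTACGTAGCGCGT
>seq2
ACTAGCATCGAGCGTAC
>seq3
AGCAGCGTACGTGACTACGACGATCTACGTATCTAGCTCGTACACT
""")

result = drop_contained(read_fasta(example))
for name, seq in result:
    print(f">{name}\n{seq}")  # prints seq1 and seq3; seq2 is dropped
```

For a real file, replace the `StringIO` object with `open("input.fasta")` (a placeholder file name). Every sequence is compared against all kept ones, so the run time grows roughly quadratically with the number of sequences; that should be fine for a few thousand records but will likely be slow for very large files.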
