3.2 years ago
wang9ting
Is there a tool for removing redundant sequences when "N" stands for A, C, G, or T?
For example:
>seq1
ACNTACNT
>seq2
ACGTACGT
>seq3
ACGTACCT
>seq4
ACTTACAT
>seq5
TCGTACTT
>seq6
TCGTACTT
...
seq2, seq3, and seq4 should be removed because they are redundant with seq1: each of them matches seq1 once its "N"s are read as any base. seq6 should be removed because it is an exact duplicate of seq5.
Ideally the tool would handle all IUPAC ambiguity codes, but for now I'm only interested in sequences containing "N" and A, C, G, T.
Thanks!
Have you tried searching online? This question has been addressed numerous times.
Thanks for reaching out. Yes, I have tried searching, but could not find a good solution. Pierre Lindenbaum's code works well.
Ah, you have "N"s - sorry I didn't see that before.
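For reference, the rule described in the question can be sketched in a few lines of Python. This is not Pierre Lindenbaum's code, only a minimal illustration under some assumptions: sequences are uppercase, only "N" is treated as ambiguous, sequences are compared only when they have the same length, and the function names `covers` and `remove_redundant` are made up for this example. It compares every pair, so it is only practical for small inputs.

```python
def covers(pattern, seq):
    """Return True if pattern matches seq, reading 'N' in pattern as any base."""
    return len(pattern) == len(seq) and all(
        p == "N" or p == s for p, s in zip(pattern, seq)
    )


def remove_redundant(records):
    """Drop every record that is covered by another record.

    records is a list of (name, sequence) tuples. Among exact duplicates,
    the first occurrence is kept. Input order is preserved.
    """
    kept = []
    for i, (name, seq) in enumerate(records):
        redundant = any(
            # A different covering sequence is strictly more general;
            # an identical one only wins if it appeared earlier.
            covers(other, seq) and (other != seq or j < i)
            for j, (_, other) in enumerate(records)
            if j != i
        )
        if not redundant:
            kept.append((name, seq))
    return kept


# The example from the question.
records = [
    ("seq1", "ACNTACNT"),
    ("seq2", "ACGTACGT"),
    ("seq3", "ACGTACCT"),
    ("seq4", "ACTTACAT"),
    ("seq5", "TCGTACTT"),
    ("seq6", "TCGTACTT"),
]
kept_names = [name for name, _ in remove_redundant(records)]
print(kept_names)  # prints ['seq1', 'seq5']
```

Extending this to the other IUPAC codes would mean replacing the `p == "N"` test with a lookup of the set of bases each code stands for.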