Remove redundant nucleotide sequences in a FASTA file containing ambiguous N
3.2 years ago
wang9ting • 0

Is there a tool for removing redundant sequences when "N" can stand for any of A, C, G, or T?

For example:

>seq1
ACNTACNT
>seq2
ACGTACGT
>seq3
ACGTACCT
>seq4
ACTTACAT
>seq5
TCGTACTT
>seq6
TCGTACTT
...

seq2, seq3, and seq4 should be removed because each of them matches seq1 when its N positions are treated as wildcards. seq6 should be removed because it is an exact duplicate of seq5.

Ideally the tool would handle all IUPAC ambiguity codes, but for now I only need it to handle "N" alongside A, C, G, and T.
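To make the rule concrete, here is a minimal Java sketch of one way to read it. It is only an illustration under some assumptions: sequences are uppercase, only "N" is a wildcard, a sequence can only be redundant to another of the same length, and the records are already loaded into a map (FASTA parsing is left out). The class and method names (`DedupAmbiguous`, `covers`, `keep`) are made up for this example. It compares every pair, so it is quadratic in the number of sequences.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupAmbiguous {

    /** True if 'general' matches 'specific' at every position, with N in 'general' acting as a wildcard. */
    public static boolean covers(String general, String specific) {
        if (general.length() != specific.length()) return false;
        for (int i = 0; i < general.length(); i++) {
            char g = general.charAt(i);
            if (g != 'N' && g != specific.charAt(i)) return false;
        }
        return true;
    }

    /** Returns the names of the records to keep, in input order. */
    public static List<String> keep(Map<String, String> records) {
        List<String> names = new ArrayList<>(records.keySet());
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            String si = records.get(names.get(i));
            boolean redundant = false;
            for (int j = 0; j < names.size() && !redundant; j++) {
                if (i == j) continue;
                String sj = records.get(names.get(j));
                // sj makes si redundant if it covers si and is either strictly
                // more general, or identical to si but earlier in the input.
                if (covers(sj, si) && (!covers(si, sj) || j < i)) redundant = true;
            }
            if (!redundant) kept.add(names.get(i));
        }
        return kept;
    }

    public static void main(String[] args) {
        // The example records from the question, in file order.
        Map<String, String> records = new LinkedHashMap<>();
        records.put("seq1", "ACNTACNT");
        records.put("seq2", "ACGTACGT");
        records.put("seq3", "ACGTACCT");
        records.put("seq4", "ACTTACAT");
        records.put("seq5", "TCGTACTT");
        records.put("seq6", "TCGTACTT");
        for (String name : keep(records)) {
            System.out.println(">" + name);
            System.out.println(records.get(name));
        }
    }
}
```

On the six example records this keeps seq1 and seq5. Extending it to the other IUPAC codes would mean replacing the `g != 'N'` check with a lookup of which bases each code allows.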

Thanks!

FASTA redundant • 1.3k views

Have you tried searching online? This question has been addressed numerous times.


Thanks for reaching out. Yes, I have tried searching, but could not find a good solution. Pierre Lindenbaum's code works well.


Ah, you have Ns - sorry I didn't see that before.

3.2 years ago

My Java solution:


Many thanks! It works well.
