I have a FASTA sequence file that contains some duplicate sequences. I want to remove all the duplicates from the file and store the duplicate sequences in another file.
Please advise how this can be done.
It's not clear from your post: do you want to find duplicate sequences or duplicate sequence identifiers? In other words, which of the two lines do you want to check for duplicates in the set below:
I want to find duplicate sequence identifiers.
OK, so you want to remove any duplicated sequence identifiers and their corresponding sequence information from the FASTA file. Then you want to output those duplicated identifiers to a separate file. Each sequence identifier would only be shown one time, regardless of how many times it's duplicated in the FASTA data. Is that correct?
Yes, absolutely right.
$ cat *.fa* | grep '^>' | sort | uniq -d
This prints every header line that occurs more than once across the input files, one time each. You can then use this list to extract the duplicate sequences from the file with any of the many FASTA manipulation tools available.
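If you would rather not pull in a separate tool, the whole job can probably be done with a two-pass awk script. The following is only a sketch under a few assumptions: the function name `split_fasta_duplicates` and the output file names are made up for illustration, the "identifier" is taken to be the entire header line (as in the `uniq -d` command above), and every copy of a duplicated identifier is dropped rather than keeping the first one.

```shell
# Hypothetical helper (name and arguments are illustrative, not from any tool).
#   $1: input FASTA file
#   $2: output FASTA holding only records whose header occurs exactly once
#   $3: output list of duplicated headers, each written one time
# Assumption: the whole header line is the identifier.
# The output paths must differ from the input path.
split_fasta_duplicates() {
    # Create/truncate both outputs so they exist even if nothing is written.
    : > "$2"
    : > "$3"
    awk -v keep="$2" -v dups="$3" '
        # Pass 1 (first read of the file): count each header line.
        NR == FNR { if (/^>/) count[$0]++; next }

        # Pass 2: at each header, decide whether this record is kept.
        /^>/ {
            write = (count[$0] == 1)
            # Report a duplicated header only the first time it is seen.
            if (!write && !seen[$0]++) print $0 > dups
        }

        # Header and sequence lines of kept records go to the unique file.
        write { print > keep }
    ' "$1" "$1"
}
```

It would be called as `split_fasta_duplicates input.fa unique.fa duplicate_ids.txt`. If you want to keep the first occurrence of each duplicated identifier instead of removing all copies, the pass-2 test would need to change; and if only the first word of the header should count as the identifier, compare `$1` instead of `$0`.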
After getting the unique identifiers, here is what to do.