unique sequence IDs from fasta file
8.8 years ago
tcf.hcdg ▴ 70

Dear all,

I have a FASTA file that contains some duplicate sequences. I want to remove all the duplicates from the file, and I also want to store the duplicates in a separate file.

Please advise how this can be done.

Thanks

fasta grep • 7.5k views

It's not clear from your post: do you want to find duplicate sequences or duplicate sequence identifiers? In other words, which of the two lines below do you want to check for duplicates?

>GeneHeader
AAGTCAGCTGATGCTACGAC

I want to find duplicate sequence identifiers.


OK, so you want to remove any duplicated sequence identifiers and their corresponding sequence information from the FASTA file. Then you want to output those duplicated identifiers to a separate file. Each sequence identifier would only be shown one time, regardless of how many times it's duplicated in the FASTA data. Is that correct?


Yes, absolutely right.

8.8 years ago
kloetzl ★ 1.1k
$ cat *.fa* | grep '^>' | sort | uniq -d

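This prints every header line that occurs more than once (`uniq -d` shows one copy of each repeated line). Note that it compares whole header lines, so two records that share an ID but have different descriptions will not be reported. You can then use this list to extract the duplicate sequences from the file with one of the many FASTA manipulation tools available.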
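If you would rather stay with standard tools, here is a minimal sketch of the whole task as discussed in the comments above. It makes some assumptions that are not from the thread: the function name `dedup_fasta_ids` and the three file arguments are made up for illustration, the identifier is taken to be the header text up to the first whitespace, and every record whose ID is duplicated is dropped from the cleaned file (no copy is kept).

```shell
# Hypothetical helper (name and arguments are assumptions for illustration).
# Usage: dedup_fasta_ids input.fa unique.fa duplicate_ids.txt
dedup_fasta_ids() {
    # Pass 1: write each identifier that occurs more than once, one per line.
    # The identifier is the first whitespace-delimited word after '>'.
    awk '/^>/ { print substr($1, 2) }' "$1" | sort | uniq -d > "$3"

    # Pass 2: copy only the records whose identifier is NOT in that list.
    awk '
        FILENAME == ARGV[1] { dup[$1] = 1; next }        # load duplicated IDs
        /^>/ { keep = !(substr($1, 2) in dup) }          # decide per record
        keep                                             # print kept lines
    ' "$3" "$1" > "$2"
}
```

Called as `dedup_fasta_ids in.fa unique.fa duplicate_ids.txt`, this should leave the non-duplicated records in `unique.fa` and list each duplicated ID once in `duplicate_ids.txt`. If you want to keep the first copy of each duplicated record instead, the second pass would need to track which IDs it has already seen.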


After getting uniq identifiers, here is what to do.
