Question

unique sequence IDs from fasta file

0

9.3 years ago

tcf.hcdg ▴ 70

Dear all,

I have a FASTA file that contains some duplicate sequences. I want to remove all the duplicates from the file and store the duplicate sequences in a separate file.

Please advise how this can be done.

Thanks

fasta grep • 7.8k views

ADD COMMENT • link updated 2.0 years ago by Ram 44k • written 9.3 years ago by tcf.hcdg ▴ 70

0

It's not clear from your post: do you want to find duplicate sequences or duplicate sequence identifiers? In other words, which of the two lines below do you want to check for duplicates?

>GeneHeader
AAGTCAGCTGATGCTACGAC

ADD REPLY • link updated 2.0 years ago by Ram 44k • written 9.3 years ago by Dan D 7.4k

0

I want to find duplicate sequence identifiers.

ADD REPLY • link 9.3 years ago by tcf.hcdg ▴ 70

0

OK, so you want to remove any duplicated sequence identifiers and their corresponding sequence information from the FASTA file. Then you want to output those duplicated identifiers to a separate file. Each sequence identifier would only be shown one time, regardless of how many times it's duplicated in the FASTA data. Is that correct?
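The behaviour described above could be sketched in Python roughly as follows. This is only one reading of it, under two assumptions the thread does not settle: the identifier is taken to be the first whitespace-delimited token of the header line, and the first record seen for each identifier is the one kept. The names `parse_fasta` and `split_duplicate_ids` are hypothetical.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_duplicate_ids(lines):
    """Return (kept_records, duplicated_ids).

    Assumptions: the ID is the first whitespace-delimited token of the
    header, and the first record seen for each ID is the one kept.
    """
    seen = set()
    kept = []
    duplicated = []  # each duplicated ID appears once, in order of discovery
    for header, seq in parse_fasta(lines):
        tokens = header.split()
        seq_id = tokens[0] if tokens else ""
        if seq_id in seen:
            if seq_id not in duplicated:
                duplicated.append(seq_id)
        else:
            seen.add(seq_id)
            kept.append((header, seq))
    return kept, duplicated
```

Passing an open file handle as `lines` should work, since the parser only iterates over it line by line. If every copy of a duplicated identifier should be dropped rather than keeping the first, the logic would need a second pass.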

ADD REPLY • link updated 2.0 years ago by Ram 44k • written 9.3 years ago by Dan D 7.4k

0

Yes, absolutely right.

ADD REPLY • link 9.3 years ago by tcf.hcdg ▴ 70

Ram · Answer 1 · 2015-07-27

3

9.3 years ago

kloetzl ★ 1.1k

$ cat *.fa* | grep '^>' | sort | uniq -d

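This prints every header line that occurs more than once, each listed a single time. Note that whole header lines are compared, so two headers with the same ID but different descriptions will not be reported. You can then use this list to extract the duplicate sequences from the file with one of the many FASTA manipulation tools available.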
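If no dedicated tool is at hand, the extraction step can be sketched in a few lines of Python. This is a minimal sketch, not part of the original answer: the function name `split_by_headers` is hypothetical, and it assumes the header list is exactly what the pipeline above printed (full header lines, including the leading `>`). It also assumes that every record carrying a listed header, including the first copy, should go to the duplicates output.

```python
def split_by_headers(fasta_lines, duplicate_headers):
    """Split FASTA lines into (unique_lines, duplicate_lines).

    duplicate_headers: full header lines (with the leading '>'),
    e.g. the output of `grep '^>' | sort | uniq -d`.
    """
    wanted = {h.rstrip("\r\n") for h in duplicate_headers if h.strip()}
    unique, duplicate = [], []
    target = unique  # lines before the first header, if any, stay here
    for line in fasta_lines:
        if line.startswith(">"):
            # A header decides where this record and its sequence lines go.
            is_dup = line.rstrip("\r\n") in wanted
            target = duplicate if is_dup else unique
        target.append(line)
    return unique, duplicate
```

Both arguments can be open file handles, and the two returned lists can be written out with `writelines()` to produce the deduplicated file and the file of duplicates.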

ADD COMMENT • link updated 2.0 years ago by Ram 44k • written 9.3 years ago by kloetzl ★ 1.1k

0

After getting uniq identifiers, here is what to do.

ADD REPLY • link updated 2.0 years ago by Ram 44k • written 9.3 years ago by venu 7.1k