Question: How to make sure there are no duplicate sequences in a FASTA file?
seta wrote:

Hi all,

I was wondering if there is a command to check whether a FASTA file contains any duplicate sequences. If so, please share your helpful commands.

Thanks

written 2.8 years ago by seta

Already asked and answered here and here and here and probably a few others.

written 2.8 years ago by Jean-Karim Heriche

lmgtfy (let me google that for you), here.

written 2.8 years ago by Irsan

Thanks for the perfect help. I know how to remove duplicate sequences, but before doing that, I just want to make sure there are duplicates in the first place.

written 2.8 years ago by seta

If there are no duplicate sequences and you use a duplicate remover, the resulting file will be the same as the input, so why worry about it?

written 2.8 years ago by Irsan
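To act on this suggestion, one option is to run a duplicate remover and compare record counts before and after (a sketch; it assumes seqkit is installed, that its rmdup -s subcommand dedupes by sequence, and a hypothetical input file tmp.fa):

seqkit rmdup -s tmp.fa > dedup.fa   # also logs how many duplicated records were removed
grep -c ">" tmp.fa dedup.fa         # identical counts mean nothing was removed

If the two counts match, the file had no duplicate sequences to begin with.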

Just to save time, because I'm working on an ordinary laptop and dealing with a large FASTA file.

written 2.8 years ago by seta

It's far more productive to just use a duplicate remover than to write a duplicate detector, so I doubt anyone has written one. Maybe samtools faidx can help.

written 2.8 years ago by Ram
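To expand on the faidx idea: the .fai index lists one record per line with the sequence ID in column 1, so it can reveal duplicate IDs, though not duplicate sequences stored under different IDs (a sketch; assumes samtools is on PATH and the same hypothetical tmp.fa):

samtools faidx tmp.fa                 # recent samtools versions refuse to index duplicate names, which is itself a signal
cut -f1 tmp.fa.fai | sort | uniq -d   # if indexing succeeded, prints any repeated IDs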

Well, Dedupe will simply detect duplicate sequences and not remove them if you don't specify an output file :)

written 2.8 years ago by Brian Bushnell
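For reference, a minimal Dedupe invocation along those lines might look like this (a sketch; assumes BBTools' dedupe.sh is on PATH and the hypothetical input tmp.fa):

dedupe.sh in=tmp.fa                # detect only: duplicate statistics go to stderr, no file is written
dedupe.sh in=tmp.fa out=nodup.fa   # detect and remove in one pass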

Alternative wrote:

1) If your duplicated sequences have the same ID, the following will give the count per ID (note that uniq only collapses adjacent lines, so the headers must be sorted first):

grep ">" tmp.fa | sort | uniq -c

2) To get only the IDs of duplicated records (again assuming duplicates have identical IDs):

grep ">" tmp.fa | sort | uniq -d
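A related quick check is to compare the total number of headers with the number of unique ones; if the two counts differ, duplicate IDs exist (a sketch over the same tmp.fa):

grep -c ">" tmp.fa                  # total records
grep ">" tmp.fa | sort -u | wc -l   # unique IDs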

3) Now, if you want to be on the safe side and check the sequences themselves, in case you are not sure that duplicated sequences also carry duplicated IDs, you can use the following awk statement (adjust the output as you like, e.g. by printing only the counts, or only counts > 1):

awk 'BEGIN{ORS="\n";FS="\n";RS=">"}NR>1{REC[substr($0,index($0,$2))]++} END {for(i in REC){print REC[i],i}}' tmp.fa
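Note that the awk statement above compares the raw sequence block, so identical sequences wrapped at different line widths would not be flagged as duplicates. A variant that first joins each record's sequence onto a single line avoids that (a sketch over the same tmp.fa):

awk '/^>/{if(seq)print seq; seq=""; next}{seq=seq $0}END{if(seq)print seq}' tmp.fa | sort | uniq -d

Any line it prints is a sequence that occurs more than once, regardless of wrapping or ID.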

Hope this helps,

P.

written 2.8 years ago by Alternative