Question: How to make sure there are no duplicate sequences in a FASTA file?
seta wrote:

Hi all,

I was wondering if there is a command to check whether a FASTA file contains any duplicate sequences. If so, please share your helpful commands.

Thanks

written 2.8 years ago by seta

Already asked and answered here and here and here and probably a few others.

written 2.8 years ago by Jean-Karim Heriche

lmgtfy (let me google that for you), here.

written 2.8 years ago by Irsan

Thanks for the perfect help. I know how to remove duplicate sequences, but before doing that, I just want to make sure there are duplicates in the first place.

written 2.8 years ago by seta

If there are no duplicate sequences and you use a duplicate remover, the resulting file will be the same as the input, so why worry about it?

written 2.8 years ago by Irsan
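To act on this suggestion, one option is to run a duplicate remover and compare record counts before and after (a sketch; it assumes seqkit is installed, that its rmdup -s subcommand dedupes by sequence, and a hypothetical input file tmp.fa):

seqkit rmdup -s tmp.fa > dedup.fa   # also logs how many duplicated records were removed
grep -c ">" tmp.fa dedup.fa         # identical counts mean nothing was removed

If the two counts match, the file had no duplicate sequences to begin with.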

Just to save time, because I'm working on an ordinary laptop and dealing with a large FASTA file.

written 2.8 years ago by seta

It's far more productive to just use a duplicate remover than to write a duplicate detector, so I doubt anyone has written one. Maybe samtools faidx can help.

written 2.8 years ago by Ram
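To expand on the faidx idea: the .fai index lists one record per line with the sequence ID in column 1, so it can reveal duplicate IDs, though not duplicate sequences stored under different IDs (a sketch; assumes samtools is on PATH and the same hypothetical tmp.fa):

samtools faidx tmp.fa                 # recent samtools versions refuse to index duplicate names, which is itself a signal
cut -f1 tmp.fa.fai | sort | uniq -d   # if indexing succeeded, prints any repeated IDs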

Well, Dedupe will simply detect duplicate sequences and not remove them if you don't specify an output file :)

written 2.8 years ago by Brian Bushnell
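For reference, a minimal Dedupe invocation along those lines might look like this (a sketch; assumes BBTools' dedupe.sh is on PATH and the hypothetical input tmp.fa):

dedupe.sh in=tmp.fa                # detect only: duplicate statistics go to stderr, no file is written
dedupe.sh in=tmp.fa out=nodup.fa   # detect and remove in one pass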

Alternative wrote:

1) If your duplicated sequences have the same ID, the following will give the count per ID (note that uniq only collapses adjacent lines, so the headers must be sorted first):

grep ">" tmp.fa | sort | uniq -c

2) To get only the IDs of duplicated records (again assuming duplicates have identical IDs):

grep ">" tmp.fa | sort | uniq -d
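A related quick check is to compare the total number of headers with the number of unique ones; if the two counts differ, duplicate IDs exist (a sketch over the same tmp.fa):

grep -c ">" tmp.fa                  # total records
grep ">" tmp.fa | sort -u | wc -l   # unique IDs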

3) Now, if you want to be on the safe side and check the sequences themselves, in case you are not sure that duplicated sequences also carry duplicated IDs, you can use the following awk statement (adjust the output as you like, e.g. by printing only the counts, or only counts > 1):

awk 'BEGIN{ORS="\n";FS="\n";RS=">"}NR>1{REC[substr($0,index($0,$2))]++} END {for(i in REC){print REC[i],i}}' tmp.fa
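Note that the awk statement above compares the raw sequence block, so identical sequences wrapped at different line widths would not be flagged as duplicates. A variant that first joins each record's sequence onto a single line avoids that (a sketch over the same tmp.fa):

awk '/^>/{if(seq)print seq; seq=""; next}{seq=seq $0}END{if(seq)print seq}' tmp.fa | sort | uniq -d

Any line it prints is a sequence that occurs more than once, regardless of wrapping or ID.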

Hope this helps,

P.

written 2.8 years ago by Alternative