Question

Remove Duplicate Reads From Fasta File

0

11.8 years ago

deepthithomaskannan ▴ 380

Hi all,

I want to remove duplicate reads from my FASTA file. I tried fastx_collapser, but it failed because my reads contain lowercase letters and hyphens.

Please help.

Thanks, D.

fasta read • 11k views

ADD COMMENT • link updated 7.4 years ago by Eslam Samir ▴ 110 • written 11.8 years ago by deepthithomaskannan ▴ 380

0

duplicate of:

How to remove the same sequences in the FASTA files?

ADD REPLY • link 11.8 years ago by Pierre Lindenbaum 163k

1

It's like everybody wants to remove duplicates here!

ADD REPLY • link 11.8 years ago by Manu Prestat 4.1k

score 3 · Answer 1 · 2012-09-28

3

11.8 years ago

Manu Prestat 4.1k

Try the sequniq tool from the GenomeTools suite:

gt sequniq -o output.fasta input.fasta

ADD COMMENT • link 11.8 years ago by Manu Prestat 4.1k

0

I tried this command. Could you please explain how it should be applied?

ADD REPLY • link 9.1 years ago by Kumar ▴ 170

score 1 · Answer 2 · 2012-09-28

1

11.8 years ago

Rm 8.3k

Try CD-HIT or UCLUST.

You can remove unwanted hyphens and convert to uppercase using sed (the \U case conversion is a GNU sed extension):

echo FaSta-TEst | sed 's/-//g; s/.*/\U&/'

ADD COMMENT • link 11.8 years ago by Rm 8.3k
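The one-liner above works on a single string. On a real FASTA file the header lines (those starting with ">") should probably be left alone, since read names often contain hyphens and mixed case. A minimal sketch of that idea using awk, where the function name normalize_fasta and the file names are placeholders chosen for illustration, not part of any tool mentioned in this thread:

```shell
# Hypothetical helper (name is an assumption): reads FASTA on stdin,
# writes normalized FASTA on stdout.
normalize_fasta() {
    awk '
        # Header line: print it unchanged.
        /^>/ { print; next }
        # Sequence line: delete hyphens, then convert to uppercase.
        { gsub(/-/, ""); print toupper($0) }
    '
}

# Example usage (placeholder file names):
# normalize_fasta < input.fasta > normalized.fasta
```

The normalized file can then be passed to a collapsing tool that rejects lowercase letters and hyphens. Note that this discards the gap characters, which is only appropriate if they carry no information you need later.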

0

Or just tr: echo FaSta-TEst | tr -d - | tr 'a-z' 'A-Z'

ADD REPLY • link 11.8 years ago by Ketil 4.1k
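If installing an extra tool is not an option, the normalization and the duplicate removal can be combined in one awk pass. The following is only a sketch under stated assumptions: two reads count as duplicates when their sequences are identical after removing hyphens and uppercasing, the first record seen is the one kept, and the function name dedup_fasta is made up for this example. It holds one key per distinct sequence in memory, so very large files may need a dedicated tool instead.

```shell
# Hypothetical helper (name is an assumption): reads FASTA on stdin and
# prints the first record for each distinct sequence.
dedup_fasta() {
    awk '
        # Print the buffered record if its normalized sequence is new.
        function emit(   key) {
            if (header == "") return      # nothing buffered yet
            key = toupper(seq)            # comparison key: uppercase ...
            gsub(/-/, "", key)            # ... with hyphens removed
            if (!(key in seen)) {
                seen[key] = 1
                print header
                print seq                 # original text, joined on one line
            }
        }
        # New header: flush the previous record, start a new one.
        /^>/ { emit(); header = $0; seq = ""; next }
        # Sequence line: append, so multi-line records are handled.
        { seq = seq $0 }
        # Flush the last record at end of input.
        END { emit() }
    '
}

# Example usage (placeholder file names):
# dedup_fasta < input.fasta > unique.fasta
```

Only the comparison key is normalized here. The kept records retain their original case and hyphens, with wrapped sequences written on a single line.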

score 0 · Answer 3 · 2017-02-18

Here is my free program on GitHub, Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator)

It is a very fast program and it can deal with:

Nucleotide sequences
Protein sequences

It runs on these operating systems:

Windows
Mac
Linux

It also works for:

Fasta format
Fastq format

Best Regards