Question

Given two FASTA files (DNA), remove duplicated sequences (80%-90% similar)

7.7 years ago

malik.yousef ▴ 20

Hello

Given two FASTA files (DNA), how can I remove duplicated sequences (those that are 80%-90% similar), or keep just one of them, in the first file? Which tools should I use, and how do I perform that?

Best, Malik

blast sequence • 2.6k views

ADD COMMENT • link updated 7.7 years ago by GenoMax 141k • written 7.7 years ago by malik.yousef ▴ 20

score 2 · Answer 1 · 2016-07-29

7.7 years ago

shenwei356 8.4k

You can try seqkit:

seqkit rmdup --by-seq --ignore-case --md5 file1.fasta file2.fasta > clean.fasta

It's very fast!

ADD COMMENT • link 7.7 years ago by shenwei356 8.4k
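To illustrate what exact, case-insensitive deduplication by sequence across two files amounts to, here is a minimal Python sketch. It is not SeqKit's implementation; the helper names `read_fasta` and `dedupe_exact` and the toy records are made up for this example. It only removes identical sequences, not ones that are 80%-90% similar.

```python
import hashlib


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_exact(*record_streams):
    """Keep the first record seen for each distinct sequence, ignoring case."""
    seen = set()
    kept = []
    for records in record_streams:
        for header, seq in records:
            # Store a fixed-size digest instead of the full sequence to save memory.
            digest = hashlib.md5(seq.upper().encode()).digest()
            if digest not in seen:
                seen.add(digest)
                kept.append((header, seq))
    return kept


# Toy input standing in for file1.fasta and file2.fasta (hypothetical data).
file1 = [">a", "ACGT", "ACGT", ">b", "TTTT"]
file2 = [">c", "acgtacgt", ">d", "GGGG"]

unique = dedupe_exact(read_fasta(file1), read_fasta(file2))
for header, seq in unique:
    print(f">{header}\n{seq}")
```

Record `c` is dropped because its sequence matches `a` once case is ignored. With real files, `open("file1.fasta")` can be passed to `read_fasta` in place of the lists.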

Thanks for your reply. I can't run it, as I'm using Cygwin and getting the following error: -bash: ./seqkit: cannot execute binary file: Exec format error

ADD REPLY • link 7.7 years ago by malik.yousef ▴ 20

You can run BBMap on a PC. It is pure Java, so no Cygwin is needed.

ADD REPLY • link 7.7 years ago by GenoMax 141k

You can download the Windows version, which has no dependencies:

seqkit_windows_386.exe.tar.gz or seqkit_windows_amd64.exe.tar.gz

ADD REPLY • link 7.7 years ago by shenwei356 8.4k

OK, I have it on Windows and it works. Still, I would prefer to run it in Cygwin. What should I do?

ADD REPLY • link 7.7 years ago by malik.yousef ▴ 20

Well, it seems that Go (golang) cannot compile Cygwin executables. Linux, Windows, and Mac OS X are all supported, but not Cygwin :(

ADD REPLY • link 7.7 years ago by shenwei356 8.4k

OK, so SeqKit rmdup removes duplicated sequences. How do I remove sequences with a similarity of, let's say, 90% and above?

ADD REPLY • link 7.7 years ago by malik.yousef ▴ 20

Try USEARCH, VSEARCH, CD-HIT, or other clustering software.

ADD REPLY • link 7.7 years ago by shenwei356 8.4k
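The general idea behind removing near-duplicates at a threshold such as 90% can be sketched as a greedy pass: process sequences longest first, and keep a sequence only if it is not too similar to any representative kept so far. The Python below is a naive illustration under loud assumptions, not how USEARCH, VSEARCH or CD-HIT are implemented: the names `identity` and `greedy_cluster` are invented here, and `difflib.SequenceMatcher.ratio()` is used as a stand-in similarity score rather than a true alignment-based percent identity. It compares every sequence against every representative, so it is only practical for small inputs.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Stand-in similarity in [0, 1]; NOT an alignment-based percent identity."""
    return SequenceMatcher(None, a.upper(), b.upper(), autojunk=False).ratio()


def greedy_cluster(records, threshold=0.9):
    """Return one representative (name, sequence) per cluster, longest first."""
    representatives = []
    # Longest sequences are considered first so they become the representatives.
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        is_redundant = any(
            identity(seq, rep_seq) >= threshold for _, rep_seq in representatives
        )
        if not is_redundant:
            representatives.append((name, seq))
    return representatives


# Toy records (hypothetical data): r2 differs from r1 at one position out of 20.
records = [
    ("r1", "ACGTACGTACGTACGTACGT"),
    ("r2", "ACGTACGTACGTACGTACGA"),
    ("r3", "TTTTTGGGGGCCCCCAAAAA"),
]

representatives = greedy_cluster(records, threshold=0.9)
for name, seq in representatives:
    print(f">{name}\n{seq}")
```

Here `r2` is dropped because its score against `r1` is 0.95, while `r3` is kept. For real datasets, the dedicated clustering tools named above are the better choice, since they use alignment and fast prefilters.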

score 0 · Answer 2 · 2016-07-29

Use dedupe.sh from BBMap. It can be as simple as: dedupe.sh in=<file or stdin> out=<file or stdout>

Description: Accepts one or more files containing sets of sequences (reads or scaffolds). Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity. Can also find overlapping sequences and group them into clusters.