Given two FASTA files (DNA): remove duplicated sequences (80-90% similar or more)
7.7 years ago
malik.yousef ▴ 20

Hello

Given two FASTA files (DNA), how can I remove duplicated sequences (those that are 80-90% similar or more), or keep only one of them, the one in the first file? Which tools should I use, and how do I perform that?

Best Malik

blast sequence • 2.6k views
7.7 years ago

You can try seqkit:

seqkit rmdup --by-seq --ignore-case --md5 file1.fasta file2.fasta > clean.fasta

It's very fast!
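
For anyone who cannot run seqkit, the same idea (exact, case-insensitive duplicate removal keyed on an MD5 digest of each sequence) can be sketched in plain Python. This is only an illustration, not seqkit's implementation: the function names `read_fasta` and `remove_exact_duplicates` and the tiny in-memory "files" are assumptions made up for the example. Records from the first file come first, so on a clash the copy from the first file is the one kept.

```python
import hashlib


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_exact_duplicates(records):
    """Keep the first record seen for each distinct sequence, ignoring case."""
    seen = set()
    kept = []
    for header, seq in records:
        # Store a fixed-size digest instead of the full sequence to save memory.
        digest = hashlib.md5(seq.upper().encode()).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append((header, seq))
    return kept


# Hypothetical stand-ins for the lines of file1.fasta and file2.fasta.
file1 = [">a", "ACGT", "ACGT", ">b", "GGCC"]
file2 = [">c", "acgtacgt", ">d", "TTTT"]

records = list(read_fasta(file1)) + list(read_fasta(file2))
unique = remove_exact_duplicates(records)
for header, seq in unique:
    print(f">{header}\n{seq}")
```

With real data you would pass open file handles to `read_fasta` instead of the lists above. Note that this only catches identical sequences, not similar ones.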

Thanks for your reply. I can't run it, as I'm using Cygwin and get the following error: -bash: ./seqkit: cannot execute binary file: Exec format error

You can run BBMap on a PC. It is pure Java, so no Cygwin is needed.

You can download the Windows version. It has no dependencies:

seqkit_windows_386.exe.tar.gz or seqkit_windows_amd64.exe.tar.gz

OK, I have it on Windows and it works. Still, I would prefer to run it in Cygwin. What should I do?

Well, it seems that Go cannot compile Cygwin executables. Linux, Windows, and Mac OS X are all supported, but not Cygwin :(

OK, so seqkit rmdup removes duplicated sequences. How can I remove sequences with a similarity of, let's say, 90% and above?

Try USEARCH, VSEARCH, CD-HIT, or other clustering software.
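
To show what "remove everything at 90% similarity or above" means in practice, here is a rough Python sketch of a greedy filter. It is not how USEARCH, VSEARCH, or CD-HIT work internally: it uses `difflib.SequenceMatcher.ratio()` from the standard library as a crude stand-in for alignment identity, compares every sequence against every kept one (quadratic time), and the names `identity` and `remove_near_duplicates` plus the example records are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Approximate similarity in [0, 1]; a stand-in for alignment identity."""
    return SequenceMatcher(None, a.upper(), b.upper(), autojunk=False).ratio()


def remove_near_duplicates(records, threshold=0.9):
    """Greedy filter: keep a record only if its similarity to every
    already-kept record is below `threshold`. Earlier records win."""
    kept = []
    for header, seq in records:
        if all(identity(seq, kept_seq) < threshold for _, kept_seq in kept):
            kept.append((header, seq))
    return kept


# Hypothetical records: s2 differs from s1 by a single base.
records = [
    ("s1", "ATGGCGTACGTTAGCCTAGA"),
    ("s2", "ATGGCGTACGTTAGCCTAGT"),
    ("s3", "CCCCCCCCCCCCCCCCCCCC"),
]

kept = remove_near_duplicates(records, threshold=0.9)
for header, seq in kept:
    print(f">{header}\n{seq}")
```

If the records of the first file are listed before those of the second, the copy from the first file is the one that survives. For real datasets the dedicated clustering tools are the better choice, since they are built to scale.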

7.7 years ago
GenoMax 141k

dedupe.sh from BBMap. It can be as simple as:

dedupe.sh in=<file or stdin> out=<file or stdout>

Description: Accepts one or more files containing sets of sequences (reads or scaffolds). Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity. Can also find overlapping sequences and group them into clusters.
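
As an illustration of the "subsequences" case in that description, the following Python sketch drops any sequence that is fully contained in a longer kept sequence. It is a simplified assumption of the idea, not dedupe.sh's algorithm: it ignores reverse complements, runs in quadratic time, and the name `remove_contained` and the example records are invented for the example.

```python
def remove_contained(records):
    """Drop every sequence that is an exact substring (ignoring case)
    of a sequence that has already been kept."""
    # Visit the longest sequences first so that a container is always
    # seen before anything it contains. sorted() is stable, so records
    # of equal length keep their input order.
    ordered = sorted(records, key=lambda rec: len(rec[1]), reverse=True)
    kept = []
    for header, seq in ordered:
        upper = seq.upper()
        if not any(upper in kept_seq.upper() for _, kept_seq in kept):
            kept.append((header, seq))
    return kept


# Hypothetical records: "sub" is contained in "long".
records = [
    ("sub", "GTTG"),
    ("long", "ACGTTGCA"),
    ("other", "AAAA"),
]

survivors = remove_contained(records)
for header, seq in survivors:
    print(f">{header}\n{seq}")
```

Exact duplicates are handled too, because a sequence is a substring of an identical one, so only the first copy is kept.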
