Question: Given two FASTA files (DNA), remove duplicated sequences (80-90% similar)
malik.yousef wrote, 2.7 years ago:

Hello,

Given two FASTA files (DNA), how can I remove duplicated sequences (those that are 80-90% similar), or keep only one copy of each, the one from the first file? Which tools should I use, and how do I do that?

Best,
Malik

blast sequence • 968 views
shenwei356 (China) wrote, 2.7 years ago:

You can try seqkit:

seqkit rmdup --by-seq --ignore-case --md5 file1.fasta file2.fasta > clean.fasta

It's very fast!
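
For readers who want to see what exact-match deduplication amounts to, here is a minimal Python sketch. It is an illustration only, not seqkit's implementation: the function names read_fasta and remove_exact_duplicates are made up for this example, and the idea of storing an MD5 digest instead of the full sequence is an assumption about why a flag like --md5 is useful (less memory per stored sequence). Records are processed in input order, so feeding the first file first means its copy is the one that is kept.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_exact_duplicates(records: Iterable[Record]) -> Iterator[Record]:
    """Keep the first record seen for each distinct sequence, ignoring case."""
    seen = set()
    for header, seq in records:
        # Store a fixed-size digest rather than the sequence itself.
        digest = hashlib.md5(seq.upper().encode("ascii")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield header, seq
```

With real files, one could pass itertools.chain(read_fasta(open("file1.fasta")), read_fasta(open("file2.fasta"))) to remove_exact_duplicates and write the surviving records back out. This only catches identical sequences; near-duplicates need a clustering tool.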


Thanks for your reply. I can't run it because I'm using Cygwin, and I get the following error: -bash: ./seqkit: cannot execute binary file: Exec format error

Reply written 2.7 years ago by malik.yousef

You can run BBMap on a PC. It is pure Java, so no Cygwin is needed.

Reply written 2.7 years ago by genomax

You can download the Windows version; it has no dependencies:

seqkit_windows_386.exe.tar.gz or seqkit_windows_amd64.exe.tar.gz

Reply written 2.7 years ago (modified 2.7 years ago) by shenwei356

OK, I have it running on Windows and it works. Still, I would prefer to run it in Cygwin. What should I do?

Reply written 2.7 years ago by malik.yousef

Well, it seems that Go cannot compile executables for Cygwin. Linux, Windows and Mac OS X are supported, but not Cygwin :(

Reply written 2.7 years ago by shenwei356

OK, so seqkit rmdup removes duplicated sequences. How can I remove sequences with a similarity of, let's say, 90% and above?

Reply written 2.7 years ago by malik.yousef

Try USEARCH, VSEARCH, CD-HIT or other clustering software.

Reply written 2.7 years ago (modified 2.7 years ago) by shenwei356
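
To make the idea of similarity-based deduplication concrete, here is a small Python sketch of greedy clustering at a threshold. It is a toy under stated assumptions, not how USEARCH, VSEARCH or CD-HIT work internally: identity is approximated as 1 minus the edit distance divided by the length of the longer sequence, whereas real tools use their own alignment-based definitions and fast filters. The function names are hypothetical, and the quadratic cost makes this suitable only for small inputs.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Crude identity (an assumption of this sketch): 1 - distance / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a.upper(), b.upper()) / longest


def greedy_dedupe(records: Iterable[Record], threshold: float = 0.9) -> List[Record]:
    """Keep a record only if it is below `threshold` identity to every kept record."""
    kept: List[Record] = []
    for header, seq in records:  # input order: earlier records win
        if all(identity(seq, rep) < threshold for _, rep in kept):
            kept.append((header, seq))
    return kept
```

Because records are visited in input order, listing the first file's records first keeps its copy of each near-duplicate group. A common variant sorts the records longest first so that the longest member represents each cluster.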
genomax (United States) wrote, 2.7 years ago:

Use dedupe.sh from BBMap. It can be as simple as:

dedupe.sh in=<file or stdin> out=<file or stdout>

Description: Accepts one or more files containing sets of sequences (reads or scaffolds). Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity. Can also find overlapping sequences and group them into clusters.
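
The "subsequences" case in that description can be illustrated with a short Python sketch. This is only a naive illustration of containment removal, not Dedupe's algorithm: remove_contained is a hypothetical name, matching is exact and case-insensitive, reverse complements are not considered, and the pairwise comparison is quadratic in the number of sequences.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def remove_contained(records: Iterable[Record]) -> List[Record]:
    """Drop records whose sequence is an exact substring of a kept sequence.

    Exact duplicates are a special case: the first occurrence is kept.
    Survivors are returned in their original input order.
    """
    records = list(records)
    # Visit longest first so a containing sequence is seen before what it contains.
    # sorted() is stable, so equal-length records keep their input order.
    order = sorted(range(len(records)), key=lambda i: len(records[i][1]), reverse=True)
    kept_seqs: List[str] = []
    kept_idx: List[int] = []
    for i in order:
        query = records[i][1].upper()
        if any(query in rep for rep in kept_seqs):
            continue
        kept_seqs.append(query)
        kept_idx.append(i)
    return [records[i] for i in sorted(kept_idx)]
```

For example, given the sequences ACGT and TTACGTAA, only TTACGTAA survives, because ACGT occurs inside it.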
