Removing redundant sequences
8.9 years ago
kutubjoy • 0

I have a FASTA file containing many protein sequences with identifiers. How can I remove the redundant sequences using HMMscan?

sequence alignment blast • 5.2k views

Why use HMMscan to remove redundant sequences? There are a gazillion deduplication tools out there - just search the forum for FASTA deduplication.
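For the simplest case, exact duplicates, a few lines of Python are enough. This is only a minimal sketch and not any particular tool from the forum: it assumes the whole file fits in memory, treats two records as redundant only when their sequences are identical (ignoring case), and keeps the first identifier it sees. The function names `read_fasta` and `deduplicate` and the file name in the usage note are made up for illustration.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def deduplicate(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            yield header, seq


# Small in-memory example; records "a" and "b" carry the same sequence.
example = io.StringIO(">a\nMKV\nLL\n>b\nmkvll\n>c\nMKVA\n")
for header, seq in deduplicate(read_fasta(example)):
    print(f">{header}\n{seq}")
```

To run it on a real file, replace the `io.StringIO` example with something like `open("proteins.fasta")` (a hypothetical file name). Near-identical sequences are not caught by this approach; that needs a clustering tool.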


Hello guys,

I have more than 70,000 protein sequences from 65 animal species; most of them are TFs, so some of them might be homologous. I would like to use CD-HIT to remove the redundant ones. But which similarity threshold should I use?

Any suggestions?
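Nobody in the thread settles on a threshold, so the sketch below does not recommend one. It only shows how a chosen identity cutoff maps onto a CD-HIT command line, using the pairing of `-c` (identity) and `-n` (word size) described in the CD-HIT user guide for protein clustering. The helper names and the file names are hypothetical, and the 0.9 in the example is just a placeholder value.

```python
from typing import List


def cdhit_word_size(identity: float) -> int:
    """Word size (-n) paired with the identity cutoff (-c) for protein cd-hit."""
    if not 0.4 <= identity <= 1.0:
        # Plain cd-hit clusters proteins only down to about 40% identity.
        raise ValueError("identity must be between 0.4 and 1.0")
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    return 2


def cdhit_command(fasta_in: str, fasta_out: str, identity: float) -> List[str]:
    """Build (but do not run) a cd-hit command for the given cutoff."""
    return [
        "cd-hit",
        "-i", fasta_in,    # input FASTA
        "-o", fasta_out,   # output FASTA of cluster representatives
        "-c", str(identity),
        "-n", str(cdhit_word_size(identity)),
    ]


# Hypothetical file names; 0.9 is only an example cutoff.
print(" ".join(cdhit_command("tfs.fasta", "tfs_nr.fasta", 0.9)))
```

The list can be handed to `subprocess.run(..., check=True)` on a machine where `cd-hit` is installed. Trying a couple of cutoffs and comparing the resulting cluster counts is one way to see how aggressive each setting is on a given dataset.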


Here is my free program on GitHub, Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator).

It is a very fast program and it can deal with:

  1. Nucleotide sequences
  2. Protein sequences

It runs on the following operating systems:

  1. Windows
  2. Mac
  3. Linux

It also works for:

  1. FASTA format
  2. FASTQ format

Best Regards


I see that you've created a Tool type post for your tool. Please do not spam threads with ads for your tool.

8.8 years ago
nterhoeven ▴ 120

I wrote a Perl script to do that. It is called remove_duplicates, and you can find it here:

https://github.com/nterhoeven/sequence_processing

8.8 years ago
h.mon 35k

As Ram suggested, there are several tools to do this; one is CD-HIT.


This is an old question :)


Not by my standards; I have even replied to 3-4 year old questions ;-)
