I have a FASTA file containing many protein sequences with identifiers. How can I remove the redundant sequences using HMMscan?
Question: Removing redundant sequences
kutubjoy • 0 wrote:
modified 4.0 years ago by Eslam Samir • 100 • written 5.7 years ago by kutubjoy • 0
nterhoeven • 120 wrote:
I wrote a Perl script to do that. It is called remove_duplicates and you can find it here:
Kurban • 190 wrote:
Hello guys. I have more than 70,000 protein sequences from 65 animal species, most of them TFs, so some of them might be homologous. I would like to use CD-HIT to remove the redundant ones. Which similarity threshold should I use? Any suggestions?
Eslam Samir • 100 wrote:
Here is my free program on GitHub, Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator).
It is a very fast program and it can deal with:
- Nucleotide sequences
- Protein sequences
It runs on the following operating systems:
- Windows
- Mac
- Linux
It also works with:
- FASTA format
- FASTQ format
Best Regards
Why use HMMscan to remove redundant sequences? There are a gazillion deduplication tools out there - just search the forum for FASTA deduplication.
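If the redundancy is just exact duplicates, the idea fits in a few lines of Python. This is only a minimal sketch and not the code of any tool mentioned in this thread. The function names `read_fasta` and `dedupe` are made up for illustration, and it assumes that "redundant" means an identical sequence (ignoring case), keeping the first identifier seen. Removing sequences that are merely similar needs a clustering tool instead.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def dedupe(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Keep only the first record for each distinct sequence (assumption: exact match, case-insensitive)."""
    seen = set()
    for header, seq in records:
        key = seq.upper()
        if key in seen:
            continue  # identical sequence already emitted under another identifier
        seen.add(key)
        yield header, seq
```

To use it, open the FASTA file, pass the file handle to `read_fasta`, feed the result to `dedupe`, and write each surviving pair back out as `>header` followed by the sequence. Memory use grows with the number of distinct sequences, which should be fine for tens of thousands of proteins.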