I have a plasmid sequence database (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/) that is heavily redundant. I have been trying to remove the redundancy and obtain a set of representative sequences using cd-hit-est (http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide) as follows:
cd-hit-est -i fastadb -o outfilename -c 0.95 -n 9
This produces two files: one listing the clusters and one containing the representative sequences. Now to my problem: removing the redundancy from the database does not seem to work. Two sequences that are 100% identical over 100% of their length (they have the same length) end up in different clusters instead of the same one. I have checked their similarity by aligning them with BLAST, and as stated above, the sequences are identical.
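In case it helps anyone reproduce this, here is a minimal Python sketch of one way to look up which cluster each sequence was assigned to. It assumes the usual .clstr layout (a ">Cluster N" header followed by member lines such as "0	1500nt, >SEQ_ID... *"); the function name parse_clstr and the sequence IDs in the example are purely illustrative and not part of cd-hit.

```python
import re

# Assumed member-line layout: "0<TAB>1500nt, >SEQ_ID... *" or
# "1<TAB>1400nt, >SEQ_ID... at +/98.00%".
# The ID is whatever sits between ">" and the trailing "...".
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Return {sequence_id: cluster_number} from the lines of a .clstr file."""
    membership = {}
    cluster = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            cluster = int(line.split()[1])
        else:
            match = MEMBER_RE.search(line)
            if match and cluster is not None:
                membership[match.group(1)] = cluster
    return membership


if __name__ == "__main__":
    # Tiny in-memory example with made-up IDs; for real data, pass an open
    # file handle instead, e.g. open("outfilename.clstr").
    sample = [
        ">Cluster 0",
        "0\t1500nt, >seqA... *",
        ">Cluster 1",
        "0\t1500nt, >seqB... *",
    ]
    clusters = parse_clstr(sample)
    print(clusters.get("seqA"), clusters.get("seqB"))
```

If the two identical sequences come back with different cluster numbers, that reproduces the behaviour described above. As far as I understand the user guide, the IDs written to the .clstr file can be truncated depending on the -d option, so the lookup keys may need to be shortened accordingly.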
Does anyone know what the problem here might be? Am I missing something?
Thanks in advance!