Question: Remove redundancy from GenBank plasmid database using cd-hit-est
wanderingstefan wrote (2.5 years ago):

Hey all,

I have a plasmid sequence database (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/) that is heavily redundant. I have been trying to remove the redundancy and obtain a set of representative sequences using cd-hit-est (http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide) as follows:

cd-hit-est -i fastadb -o outfilename -c 0.95 -n 9

This produces one file containing the clusters and another containing the representative sequences. My problem is that removing the redundancy does not seem to work: two sequences that are 100% identical over 100% of their length (they have the same length) end up in different clusters instead of the same one. I checked their similarity by aligning them with BLAST and, as stated above, the sequences are identical.

Does anyone know what the problem here might be? Am I missing something?

Thanks in advance!

alignment next-gen sequence • 910 views
5heikki wrote (2.5 years ago):

The problem is that you did not bother to check what the default options are.

   -g   1 or 0, default 0
    by cd-hit's default algorithm, a sequence is clustered to the first 
    cluster that meet the threshold (fast cluster). If set to 1, the program
    will cluster it into the most similar cluster that meet the threshold
    (accurate but slow mode)
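
To make the difference between the two modes concrete, here is a toy sketch in Python. It is not cd-hit's implementation: the `identity` measure (fraction of matching positions) and the `assign` helper are hypothetical and only illustrate "first cluster that meets the threshold" versus "most similar cluster that meets the threshold".

```python
def identity(a, b):
    """Toy similarity: fraction of positions that match (not cd-hit's measure)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def assign(seq, representatives, threshold, accurate=False):
    """Return the index of the cluster that seq joins, or None for a new cluster.

    accurate=False: first representative meeting the threshold (idea of -g 0).
    accurate=True:  best representative meeting the threshold (idea of -g 1).
    """
    hits = [
        (identity(seq, rep), i)
        for i, rep in enumerate(representatives)
        if identity(seq, rep) >= threshold
    ]
    if not hits:
        return None
    if accurate:
        # Highest identity wins; ties go to the earliest cluster.
        return max(hits, key=lambda h: (h[0], -h[1]))[1]
    return hits[0][1]


reps = ["AAAAAAAAAA", "AAAAAAAACC"]
query = "AAAAAAAACG"
print(assign(query, reps, 0.75))                 # fast mode
print(assign(query, reps, 0.75, accurate=True))  # accurate mode
```

In both modes a sequence only joins a cluster whose representative meets the threshold, so switching modes changes which qualifying cluster is chosen, not whether a pair qualifies at all.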

Hey, thanks for your answer. However, running it as cd-hit-est -i fastadb -o outfilename -c 0.95 -n 9 -g 1 does not resolve my problem. My clustering file still looks like this:

>Cluster 39
0   6222nt, >gi|410475454|ref|NC... *
>Cluster 40
0   6211nt, >gi|387504713|ref|NC... at +/98.10%
1   6222nt, >gi|41056918|ref|NC_... *
2   6222nt, >gi|118480566|ref|NC... at +/98.09%
>Cluster 41
0   6222nt, >gi|844749291|ref|NZ... *

The sequences that are 6222 bases long are at least 99% similar over their whole length, but still end up in different clusters.
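
For anyone who wants to inspect such a file programmatically, here is a minimal Python sketch that parses the cluster listing above. It assumes the line layout shown in this post; `parse_clstr` and `representatives_by_length` are hypothetical helper names, not part of cd-hit.

```python
import re
from collections import defaultdict

# Member line as shown in the post, e.g. "0   6222nt, >gi|41...|ref|NC... *"
MEMBER = re.compile(r"^(\d+)\s+(\d+)(?:nt|aa), >(.+?)\.\.\.\s+(\*|at\s+(.+))$")


def parse_clstr(lines):
    """Map cluster number -> list of member records."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
            continue
        m = MEMBER.match(line)
        if m is None or current is None:
            raise ValueError(f"unexpected line: {line!r}")
        clusters[current].append({
            "length": int(m.group(2)),
            "name": m.group(3),
            "representative": m.group(4) == "*",
            "identity": m.group(5),  # e.g. "+/98.10%", None for the representative
        })
    return clusters


def representatives_by_length(clusters):
    """Group cluster representatives that share exactly the same length."""
    by_len = defaultdict(list)
    for cid, members in clusters.items():
        for member in members:
            if member["representative"]:
                by_len[member["length"]].append((cid, member["name"]))
    return {n: reps for n, reps in by_len.items() if len(reps) > 1}


example = """>Cluster 39
0   6222nt, >gi|410475454|ref|NC... *
>Cluster 40
0   6211nt, >gi|387504713|ref|NC... at +/98.10%
1   6222nt, >gi|41056918|ref|NC_... *
2   6222nt, >gi|118480566|ref|NC... at +/98.09%
>Cluster 41
0   6222nt, >gi|844749291|ref|NZ... *
""".splitlines()

clusters = parse_clstr(example)
print(representatives_by_length(clusters))
```

Representatives with identical lengths in separate clusters are only a hint that something deserves a closer look; equal length by itself does not show that the sequences are related.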

written 2.5 years ago by wanderingstefan

Of those sequences, only the members of cluster 40 are within 95% similarity under cd-hit-est's default alignment coverage cutoffs.

Let's have a look with blastn:

blastn -query 410475454.fna -subject 844749291.fna -outfmt 6
gi|410475454|ref|NC_019040.1|   gi|844749291|ref|NZ_CP006639.1| 100.000 4693    0       0       1530    6222    1       4693    0.0     8667
gi|410475454|ref|NC_019040.1|   gi|844749291|ref|NZ_CP006639.1| 99.935  1529    1       0       1       1529    4694    6222    0.0     2819

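The sequences are indeed very similar. However, their linear representations begin at completely different locations! Query positions 1530-6222 match subject positions 1-4693, and query positions 1-1529 match subject positions 4694-6222, so one record is the other rotated by 1529 bases. I don't think any clustering algorithm considers circular topology as an option.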
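
The rotation can also be read off the two tabular hits with a few lines of Python. This is a rough sketch under stated assumptions: both hits are on the plus strand, ungapped, and non-overlapping, and the field positions follow the default `-outfmt 6` column order. The helper names (`rotate`, `rotation_offset`, `parse_outfmt6`, `rotation_from_hits`) are made up for illustration.

```python
def rotate(seq, k):
    """Re-linearize a circular sequence so it starts at 0-based index k."""
    if not seq:
        return seq
    k %= len(seq)
    return seq[k:] + seq[:k]


def rotation_offset(a, b):
    """Return k with rotate(a, k) == b, or -1 (exact matches only)."""
    if len(a) != len(b):
        return -1
    return (a + a).find(b)


def parse_outfmt6(line):
    """Pick the coordinate fields out of one default -outfmt 6 line."""
    f = line.split()
    return {"qstart": int(f[6]), "qend": int(f[7]),
            "sstart": int(f[8]), "send": int(f[9])}


def rotation_from_hits(hits, length):
    """Offset implied by plus-strand hits that together cover the whole query.

    Returns None if the hits disagree, include a minus-strand hit,
    or do not add up to the full sequence length.
    """
    if any(h["sstart"] > h["send"] for h in hits):
        return None
    offsets = {(h["qstart"] - h["sstart"]) % length for h in hits}
    covered = sum(h["qend"] - h["qstart"] + 1 for h in hits)
    if len(offsets) == 1 and covered == length:
        return offsets.pop()
    return None


blast_lines = [
    "gi|410475454|ref|NC_019040.1| gi|844749291|ref|NZ_CP006639.1| "
    "100.000 4693 0 0 1530 6222 1 4693 0.0 8667",
    "gi|410475454|ref|NC_019040.1| gi|844749291|ref|NZ_CP006639.1| "
    "99.935 1529 1 0 1 1529 4694 6222 0.0 2819",
]
hits = [parse_outfmt6(line) for line in blast_lines]
print(rotation_from_hits(hits, 6222))
print(rotation_offset("ACGTTG", "TTGACG"))
```

`rotation_offset` only detects exact rotations, so it would miss this pair because of the single mismatch in the second hit; the coordinate-based check tolerates that. One possible workaround, untested here, would be to re-linearize records with `rotate` so that near-identical plasmids share a start position before running cd-hit-est again.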

written 2.5 years ago by 5heikki

Ah, now I see! I had only looked at the graphical output of BLAST and was not aware that the slash in the middle of the sequence marked the beginning of the alignment. That explains why they are in different clusters. Thank you for your answer!

written 2.5 years ago by wanderingstefan