Question: Clustering large genomes with psi-cd-hit giving a different similarity value to blast
gravatar for ang.mccann
2.4 years ago by
ang.mccann0 wrote:

Hi, I am trying to cluster bacterial genomes with psi-cd-hit but I have some results I can't understand. I run the command as follows

./ -i Genomes.fasta -o Genomes.fasta.k.out. -c 0.8 -G 1 -g 1 -prog blastn -circle 1

However, when I check the clustered output file the following is a snippet from my results

>Cluster 3
0       991702aa, >NC_018077.1... *
1       991702aa, >CP002058.1.. at 0.0/991702aa/100%
2       977322aa, >NZ_CP007589.1.. at 0.0/672875:123109:1407:1041:348:978:346:249:90:52aa/81.82%

So sequences 0 and 1 are identical, which is correct, and sequences 0 and 2 are 82% similar according to psi-cd-hit. However, based on blast they are about 98% identical. Here is a snippet of the blast output

blastn -query a.fasta  -subject b.fasta  -outfmt 6

 NC_018077.1    NZ_CP007589.1   99.99   672875  31  34  181458  854328  181375  854215  0   1.24E+06
 NC_018077.1    NZ_CP007589.1   99.99   181393  4   8   1   181393  1   181385  0   3.35E+05
 NC_018077.1    NZ_CP007589.1   99.98   123109  11  6   868600  991702  854216  977322  0   2.27E+05
 NC_018077.1    NZ_CP007589.1   99.69   4905    10  5   708933  713835  715477  720378  0   8970
 NC_018077.1    NZ_CP007589.1   99.65   4906    10  6   715583  720483  708827  713730  0   8959

So, although I didn't see it in the manual I believe the numbers after each genome in the psi-cd-hit cluster output are the segments of sites that pass the -c threshold i.e.

>NZ_CP007589.1.. at 0.0/672875:123109:1407:1041:348:978:346:249:90:52aa/81.82%`

means a block of 672875 passed, followed by a block of 123109 sites etc and if I sum these sites and divide by the genomes length I get 82%. When I look to the blast output it also shows a block of length 672875 at 99% identity and a block of 123109 but the block of length 181393 is not present in the psi-cd-hit output and may be accounting for the lower score (82% by psi-cd-hit). Am I missing something. I have seen a similar question on this Remove redundancy from GenBank plasmid database using cd-hit-est but this was using cd-hit-est and I am using the -circle option with psi-cd-hit Can anyone help Thanks

clustering blast genome • 1.2k views
ADD COMMENTlink modified 2.4 years ago by h.mon29k • written 2.4 years ago by ang.mccann0

Hi, I am trying to cluster bacterial genomes with psi-cd-hit but I have some results I can't understand.

Are you sure about what you are doing? From the documentation:

  • psi-cd-hit Cluster proteins at <40% cutoff

Ok, sorry, just read the docs more carefully, and it is indeed an envisaged use case.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by h.mon29k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1876 users visited in the last hour