Hi, I am trying to cluster bacterial genomes with psi-cd-hit but I have some results I can't understand. I run the command as follows
./psi-cd-hit.pl -i Genomes.fasta -o Genomes.fasta.k.out. -c 0.8 -G 1 -g 1 -prog blastn -circle 1
However, when I check the clustered output file the following is a snippet from my results
>Cluster 3
0 991702aa, >NC_018077.1... *
1 991702aa, >CP002058.1.. at 0.0/991702aa/100%
2 977322aa, >NZ_CP007589.1.. at 0.0/672875:123109:1407:1041:348:978:346:249:90:52aa/81.82%
So sequences 0 and 1 are identical, which is correct, and sequences 0 and 2 are 82% similar according to psi-cd-hit. However, based on blast they are about 98% identical. Here is a snippet of the blast output
blastn -query a.fasta -subject b.fasta -outfmt 6
NC_018077.1 NZ_CP007589.1 99.99 672875 31 34 181458 854328 181375 854215 0 1.24E+06
NC_018077.1 NZ_CP007589.1 99.99 181393 4 8 1 181393 1 181385 0 3.35E+05
NC_018077.1 NZ_CP007589.1 99.98 123109 11 6 868600 991702 854216 977322 0 2.27E+05
NC_018077.1 NZ_CP007589.1 99.69 4905 10 5 708933 713835 715477 720378 0 8970
NC_018077.1 NZ_CP007589.1 99.65 4906 10 6 715583 720483 708827 713730 0 8959
So, although I didn't see it in the manual I believe the numbers after each genome in the psi-cd-hit cluster output are the segments of sites that pass the -c threshold i.e.
>NZ_CP007589.1.. at 0.0/672875:123109:1407:1041:348:978:346:249:90:52aa/81.82%`
means a block of 672875 passed, followed by a block of 123109 sites etc and if I sum these sites and divide by the genomes length I get 82%. When I look to the blast output it also shows a block of length 672875 at 99% identity and a block of 123109 but the block of length 181393 is not present in the psi-cd-hit output and may be accounting for the lower score (82% by psi-cd-hit). Am I missing something. I have seen a similar question on this Remove redundancy from GenBank plasmid database using cd-hit-est but this was using cd-hit-est and I am using the -circle option with psi-cd-hit Can anyone help Thanks
Are you sure about what you are doing? From the documentation:
Ok, sorry, just read the docs more carefully, and it is indeed an envisaged use case.