Question: Comparing a WG sequenced isolate against other isolates of the same genus in public databases
7 days ago
Our team recently sequenced the whole genome (WG) of a bacterial isolate and performed de novo assembly on it. I am now planning to do the following:

1) Compare the WG sequenced bacterial isolate to other publicly available Escherichia sequences, or to any other sequences deposited in NCBI. I am thinking of using BLAST with a custom database: build a database of all Escherichia isolates and BLAST my WG sequenced isolate against it. Is this approach correct?

2) I received a WG sequenced isolate of a similar strain from our collaborator. I would like to compare our team's WG sequenced isolate (isolate A) with our collaborator's isolate B and identify SNPs between the two. I have used GATK for SNP identification in human samples, but what methods/tools are used for bacterial sequences?

I am open to any other recommendations for tasks 1 and 2 above. Any suggestions are welcome.

Have you looked at Mauve? You may also want to look at aligners designed for chromosome-sized sequences, such as LAST or LASTZ. BLAST is a local aligner and will not be entirely appropriate here.

For #2 you could use the variant caller from the BBMap suite after aligning to a reference, since bacterial genomes are simple haploid genomes.

Yes, I checked it. According to their website, it is currently unsuitable for datasets with more than 50 bacterial genomes, and I suspect task 1 will involve more than 50.

For task 2, I need to compare isolate A and isolate B and identify variants. Do you mean I need to treat one of the isolates as the reference?

It is useful to include this type of information (>50 genomes) in the original question. If you are doing smaller comparisons, Mauve would still give you a bird's-eye view of overall rearrangements across these strains (which should be largely similar).

If you want to compare the SNPs present in the strains, you will need to use a particular strain as the reference (which can be one in GenBank) and compare the rest to it. You can then compare the resulting VCF files.
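To illustrate the last step, here is a minimal sketch of comparing the SNP calls from two VCF files made against the same reference. It assumes plain tab-separated VCF records; the function names `read_snps` and `compare_snps` and the example records are hypothetical, not part of any tool mentioned in this thread. For real data, a dedicated tool is likely a better choice than hand-rolled parsing.

```python
def read_snps(vcf_lines):
    """Return the set of (chrom, pos, ref, alt) SNPs found in VCF text lines."""
    bases = {"A", "C", "G", "T"}
    snps = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        # Skip blank lines and header lines (## metadata and the #CHROM line).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue
        chrom, pos, _id, ref, alts = fields[:5]
        # ALT may list several alleles separated by commas.
        for alt in alts.split(","):
            # Keep single-base substitutions only; indels are ignored here.
            if ref.upper() in bases and alt.upper() in bases:
                snps.add((chrom, int(pos), ref.upper(), alt.upper()))
    return snps


def compare_snps(snps_a, snps_b):
    """Split two SNP sets into shared and isolate-specific calls."""
    return {
        "shared": snps_a & snps_b,
        "only_a": snps_a - snps_b,
        "only_b": snps_b - snps_a,
    }


# Made-up example records; with real files, pass open("isolate_A.vcf") instead.
header = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
isolate_a_vcf = header + [
    "chr\t100\t.\tA\tG\t50\tPASS\t.",
    "chr\t250\t.\tC\tT\t50\tPASS\t.",
    "chr\t300\t.\tGA\tG\t50\tPASS\t.",  # deletion, not counted as a SNP
]
isolate_b_vcf = header + [
    "chr\t100\t.\tA\tG\t50\tPASS\t.",
    "chr\t400\t.\tT\tC\t50\tPASS\t.",
]

result = compare_snps(read_snps(isolate_a_vcf), read_snps(isolate_b_vcf))
for group, calls in result.items():
    print(group, sorted(calls))
```

SNPs in `only_a` or `only_b` are the candidate differences between isolate A and isolate B; SNPs in `shared` are differences both isolates have relative to the chosen reference.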

Mash/sourmash mentioned below would also be good programs to try.

8 hours ago
If you're interested in simply comparing overall genome similarity, Mash (using minhash to compare genomic content) has become a gold standard of sorts for what you're describing. Mauve would also be appropriate if you want information that an alignment could give you (i.e. genome rearrangement), but it will be less computationally efficient and perhaps overkill.
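For intuition about what a minhash comparison does, here is a small sketch of the idea: hash every k-mer, keep the smallest hashes as a "sketch" of each genome, and estimate the Jaccard similarity from the two sketches. This is an illustration only, not Mash's implementation. It uses SHA-1 from the standard library as a stand-in hash and does not canonicalize k-mers across strands; the function names and default parameters are assumptions made for the example.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=21, size=1000):
    """Bottom-`size` sketch: the smallest k-mer hashes, in sorted order."""
    hashes = {
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=1000):
    """Estimate Jaccard similarity from two bottom-`size` sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    # Take the smallest hashes of the union, then count those present in both.
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(jaccard, k=21):
    """Convert a Jaccard estimate j to a distance: -ln(2j / (1 + j)) / k."""
    if jaccard <= 0:
        return 1.0
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


# Toy sequences sharing 2 of their 4 distinct 3-mers.
seq_a, seq_b = "ACGTT", "ACGTA"
j = jaccard_estimate(sketch(seq_a, k=3, size=10), sketch(seq_b, k=3, size=10), size=10)
print("Jaccard estimate:", j)
print("Distance:", mash_distance(j, k=3))
```

Because only a fixed number of hashes is kept per genome, comparing many genomes stays cheap, which is why this approach scales to collections far larger than 50 isolates.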

For SNP calling, you could alternatively use IGV, which calls SNPs by comparing to reference genomes for a given isolate.

Anvi'o may also be helpful for visualizing your results (or for the underlying analysis itself).

