I am doing my thesis on comparative whole-genome analysis of Anaplasma marginale. I am comparing my whole-genome sequence of Anaplasma marginale with 24 other strains from NCBI. Following a standard bioinformatics pipeline, I assembled my genome with SPAdes. Since the assembly was fragmented, I used RagTag to scaffold the contigs. My genome coverage was 23.4x.
For the comparative analysis, I calculated Average Nucleotide Identity (ANI) using both JSpecies and FastANI. In both cases, the ANI against the Anaplasma marginale strains was around 85–86%, whereas the species threshold reported in the literature is usually 95–96%.
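To make the comparison with the threshold concrete, here is a minimal sketch of how FastANI results could be checked against the 95% lower bound. It assumes FastANI's default tab-separated output (query, reference, ANI, mapped fragments, total query fragments); the file names and fragment counts in the example line are made up for illustration, and only the roughly 85–86% ANI reflects what I actually observed.

```python
import csv
import io

# Lower bound of the commonly cited 95-96% species threshold (assumption:
# the stricter or looser end can be passed in via `threshold`).
SPECIES_ANI_THRESHOLD = 95.0


def parse_fastani(handle):
    """Parse FastANI tab-separated output into a list of dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        query, reference, ani, mapped, total = fields[:5]
        rows.append({
            "query": query,
            "reference": reference,
            "ani": float(ani),
            # Fraction of query fragments that mapped to the reference;
            # a low value means the ANI rests on only part of the genome.
            "aligned_fraction": int(mapped) / int(total),
        })
    return rows


def same_species(row, threshold=SPECIES_ANI_THRESHOLD):
    """Return True when the ANI meets the species threshold."""
    return row["ani"] >= threshold


if __name__ == "__main__":
    # Hypothetical example line; names and fragment counts are invented.
    example = "my_strain.fasta\tmarginale_ref.fasta\t85.6\t310\t390\n"
    for row in parse_fastani(io.StringIO(example)):
        verdict = "same species" if same_species(row) else "below threshold"
        print(row["reference"], row["ani"], verdict)
```

Running the same check against an Anaplasma ovis reference as well would show which species the genome sits closer to, and the aligned fraction indicates how much of the genome each ANI value is based on.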
When I built a phylogenetic tree, my strain was placed very far from the cluster of the 24 marginale strains, too distant to belong to it. For tree construction, I used progressiveMauve for multiple sequence alignment and IQ-TREE for tree inference. In all cases, my sequence predominantly matched Anaplasma ovis.
However, I also built a tree with KBase's automated phylogenetic tree software using the same FASTA file. Instead of including only the 24 Anaplasma marginale strains, KBase drew on a broader database of related Anaplasma organisms, and there my sequence clustered with one of the Anaplasma marginale strains.
Now I am really confused about whether KBase is reliable. Can I trust the KBase result when all my Linux-based tools show significant similarity with Anaplasma ovis rather than Anaplasma marginale? I am confused about how I can be sure whether my sequence is actually Anaplasma marginale or not.
How many contigs did you end up with? You should add more coverage, especially long-read coverage (if you only used short reads), to get a better assembly.
Did you try aligning your reads to a RefSeq genome? Did you get coverage (at least a few reads) across the entire genome, or were there regions with no reads aligned?
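One way to answer the coverage question is to list the reference regions that have no reads. The sketch below assumes you have already aligned the reads and produced per-position depth with `samtools depth -a` (three tab-separated columns: contig, 1-based position, depth). The function name and the input file name `depth.tsv` are hypothetical.

```python
import io


def uncovered_regions(handle, min_depth=1):
    """Return (contig, start, end) intervals where depth < min_depth.

    Expects `samtools depth -a` style lines: contig, position, depth.
    Positions are 1-based and intervals are inclusive.
    """
    regions = []
    current = None  # open interval as [contig, start, end]
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        contig, pos, depth = parts[0], int(parts[1]), int(parts[2])
        if depth < min_depth:
            if current and current[0] == contig and current[2] + 1 == pos:
                current[2] = pos  # extend the open interval
            else:
                if current:
                    regions.append(tuple(current))
                current = [contig, pos, pos]  # start a new interval
        elif current:
            regions.append(tuple(current))  # covered position closes it
            current = None
    if current:
        regions.append(tuple(current))
    return regions


if __name__ == "__main__":
    # Hypothetical file written by: samtools depth -a aln.bam > depth.tsv
    with open("depth.tsv") as fh:
        for contig, start, end in uncovered_regions(fh):
            print(f"{contig}\t{start}\t{end}\t{end - start + 1} bp uncovered")
```

Long uncovered stretches against an A. marginale reference, together with better coverage against an A. ovis reference, would support the ANI and tree results rather than the KBase placement.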