How can I confidently determine if my genome sequence is truly Anaplasma marginale?
5 weeks ago

I am doing my thesis on comparative whole-genome analysis of Anaplasma marginale. I am comparing my whole-genome sequence of Anaplasma marginale with 24 other strains from NCBI. Using a general bioinformatics pipeline, I assembled my genome with SPAdes. Since the assembly was fragmented, I used RagTag to scaffold the contigs. My genome coverage was 23.4x.

When I performed the comparative analysis, I calculated Average Nucleotide Identity (ANI) using both JSpecies and FastANI. In both cases, the ANI value against the Anaplasma marginale strains was around 85–86%, whereas the species threshold reported in the literature is usually around 95–96%.
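
To make the comparison against the threshold explicit, here is a minimal sketch. The function name `classify_ani` and the default cutoff of 95.0 are assumptions chosen for illustration (the cutoff is the lower end of the 95–96% range mentioned above), and the two test values simply bracket the 85–86% range reported above. It is not a substitute for a proper taxonomic assessment.

```python
def classify_ani(ani_percent, species_cutoff=95.0):
    """Give a rough reading of a pairwise ANI value (in percent).

    species_cutoff is an assumed default taken from the commonly quoted
    95-96% species boundary; adjust it to match the literature you cite.
    """
    if not 0.0 <= ani_percent <= 100.0:
        raise ValueError("ANI must be a percentage between 0 and 100")
    if ani_percent >= species_cutoff:
        return "same species (at or above cutoff)"
    return "different species (below cutoff)"


# Illustrative values bracketing the 85-86% range described in the post.
for ani in (85.0, 86.0):
    print(f"ANI {ani:.1f}% -> {classify_ani(ani)}")
```

Both values fall roughly ten percentage points below the cutoff, so the gap is much larger than the uncertainty in the threshold itself.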

When I created a phylogenetic tree, my strain was located very far from the cluster of the 24 marginale strains, too distant to look like part of it. For phylogenetic tree construction, I used progressiveMauve for multiple sequence alignment and IQ-TREE for tree generation. In all cases, my sequence predominantly matched Anaplasma ovis.

However, when I used the KBase automated software to build a phylogenetic tree from the same FASTA file, the result was different. Instead of including only the 24 Anaplasma marginale strains, KBase drew on a broader database of related Anaplasma organisms, and my sequence clustered with one of the Anaplasma marginale strains.

Now I am really confused about whether KBase is reliable. Can I trust the KBase result when all my Linux-based tools show significant similarity with Anaplasma ovis rather than Anaplasma marginale? I am confused about how I can be sure whether my sequence is actually Anaplasma marginale or not.

anaplasma genomics marginale omics • 538 views

Since the assembly was fragmented, I used RagTag to scaffold the contigs.

How many contigs did you end up with? You should add more coverage, especially long-read coverage (if you only used short reads), to get a better assembly.
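
If it helps to answer the contig question, a rough sketch like the following counts contigs and computes N50 from a FASTA file. The helper names `contig_lengths` and `n50` and the tiny inline example are made up for illustration; they are not part of SPAdes or RagTag.

```python
def contig_lengths(fasta_lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = 0
    seen_header = False
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                lengths.append(current)  # close the previous record
            seen_header = True
            current = 0
        elif seen_header:
            current += len(line)
    if seen_header:
        lengths.append(current)  # close the last record
    return lengths


def n50(lengths):
    """Length of the contig at which half of the total assembly is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy input; for a real assembly pass an open file handle instead.
example = """>contig_1
ACGTACGTAC
>contig_2
ACGTAC
>contig_3
ACGT
""".splitlines()

lengths = contig_lengths(example)
print("contigs:", len(lengths), "total bp:", sum(lengths), "N50:", n50(lengths))
```

Running it on the contigs before scaffolding and again on the RagTag output shows how much of the contiguity comes from the reference rather than from your own reads.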

My genome coverage was 23.4x.

Did you try aligning your data to the RefSeq genome? Did you get coverage (at least a few reads) across the entire genome, or were there areas with no reads aligned?
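
One way to check this, sketched under assumptions: if the reads are mapped to the reference and per-base depth is written out with `samtools depth -a` (three tab-separated columns: reference name, position, depth), the fraction of covered positions and the uncovered stretches can be summarised as below. The function name `coverage_summary` and the toy depth lines are hypothetical.

```python
def coverage_summary(depth_lines, min_depth=1):
    """Summarise per-base depth lines of the form 'ref<TAB>pos<TAB>depth'.

    Returns (breadth, gaps): breadth is the fraction of listed positions
    with depth >= min_depth, and gaps is a list of (ref, start, end)
    intervals of consecutive positions below min_depth.
    """
    covered = 0
    total = 0
    gaps = []
    gap = None  # currently open gap as [ref, start, end]
    for line in depth_lines:
        if not line.strip():
            continue
        ref, pos, depth = line.rstrip("\n").split("\t")
        pos, depth = int(pos), int(depth)
        total += 1
        if depth >= min_depth:
            covered += 1
            if gap is not None:
                gaps.append(tuple(gap))
                gap = None
        elif gap is not None and gap[0] == ref and gap[2] == pos - 1:
            gap[2] = pos  # extend the open gap by one position
        else:
            if gap is not None:
                gaps.append(tuple(gap))
            gap = [ref, pos, pos]
    if gap is not None:
        gaps.append(tuple(gap))
    breadth = covered / total if total else 0.0
    return breadth, gaps


# Toy depth table, not real data.
toy = ["ref\t1\t5", "ref\t2\t0", "ref\t3\t0", "ref\t4\t7", "ref\t5\t0"]
breadth, gaps = coverage_summary(toy)
print(f"breadth: {breadth:.2f}", "gaps:", gaps)
```

Large uncovered intervals against an A. marginale reference would be one more hint that the reads come from something else.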

5 weeks ago
michael.ante ★ 4.0k

Hi,

The low(er) ANI might be a result of the 23x coverage, which seems a bit low to me.

There are a couple of tools you can also use to verify your species. For example, Kraken2 and Centrifuge work at the read level to assign taxonomy, and sourmash can be used to classify your assembly.
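
As a rough sketch of how to read the read-level result, the snippet below pulls the species-level rows out of a Kraken2 report. It assumes the standard six-column tab-separated report (percentage, clade reads, direct reads, rank code, taxid, indented name) and would need adjusting if the report was written with extra minimizer columns. The function name `top_species`, the taxids, the species labels, and the read counts are all placeholders, not real results.

```python
def top_species(report_lines, n=3):
    """Return up to n species-level rows as (name, clade_reads, percent).

    Assumes the standard six-column Kraken2 report layout; lines with
    fewer than six tab-separated fields are skipped.
    """
    rows = []
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        percent, clade_reads, _direct, rank, _taxid, name = fields[:6]
        if rank == "S":  # species-level entries only
            rows.append((name.strip(), int(clade_reads), float(percent)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


# Placeholder report lines for illustration only.
report = [
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 90.00\t900\t5\tR\t1\troot",
    " 25.00\t250\t250\tS\t222\t        Species B",
    " 60.00\t600\t600\tS\t111\t        Species A",
]

for name, reads, percent in top_species(report):
    print(f"{name}: {reads} reads ({percent:.1f}%)")
```

If most reads land on a single species, that is a stronger statement than any assembly-based comparison, because it does not depend on how well the scaffolding worked.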

Additionally, you can extract the 16S sequence of your assembly and run your phylogenetic analysis on that.

5 weeks ago

I would keep it simple and just BLAST contigs/genes against the general BLAST databases at NCBI/EBI. You will soon work out what you have from the hit distributions and top hits.
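
To look at the hit distribution without reading the table by eye, something like this sketch tallies the best hit per contig. It assumes the default 12-column tabular BLAST output (`-outfmt 6`), where column 1 is the query, column 2 the subject, and column 12 the bit score. The function name `best_hit_counts` and the subject IDs in the example are placeholders.

```python
from collections import Counter


def best_hit_counts(blast_lines):
    """Count how often each subject is the top hit (by bit score) of a query.

    Assumes default 12-column tabular BLAST output; comment lines that
    start with '#' and blank lines are ignored.
    """
    best = {}  # query id -> (bit score, subject id)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return Counter(subject for _score, subject in best.values())


# Placeholder hits for illustration only.
hits = [
    "contig_1\tsubjA\t99.1\t500\t4\t0\t1\t500\t1\t500\t1e-200\t900",
    "contig_1\tsubjB\t86.0\t480\t60\t2\t1\t480\t1\t480\t1e-120\t450",
    "contig_2\tsubjA\t98.5\t300\t4\t0\t1\t300\t10\t309\t1e-150\t540",
    "contig_3\tsubjB\t97.0\t200\t6\t0\t1\t200\t5\t204\t1e-90\t330",
]

for subject, count in best_hit_counts(hits).most_common():
    print(subject, count)
```

Subject IDs in this format are accessions, so you would still need to map them to organism names to see which species dominates.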

85% is certainly very distant, though, so I suspect you've sequenced another genus at least.

I am also a fan of sourmash for quick and easy classification, as michael suggests.
