I am trying to determine the species/subspecies identity of a (possibly bacterial) sample. I ran the following sendsketch command:
where the input file (contigs.fasta) was previously assembled from the raw paired-end reads using SPAdes. The results of the above command can be viewed here. The sendsketch website states that WKID is "the column that tells you the actual sequence similarity (disregarding size)." I therefore believe I should focus on H. macacae, since it had the highest WKID value (along with KID = 0.05%, ANI = 96.6%, and Contam = 2.5%).
1) Are the output values here sound or a cause for alarm? Is the WKID (38.9%) or KID (0.05%) value too low? Or are other values questionable?
2) If the values are sound, I would like to check the mapping rate (using BWA) of the sample against the H. macacae genome, so that I can use it as a reference genome for further steps. However, I am stuck on choosing the appropriate sequence to download. The taxID reported by sendsketch was 398626, and there appear to be 46 sequences available for it on NCBI (here). How should I go about determining the best H. macacae sequence to map my sample to?
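To make the mapping-rate check concrete, here is a rough sketch of what I have in mind. It assumes the reads have already been aligned with `bwa mem` and summarized with `samtools flagstat`; the function name `mapping_rate` and the example numbers are hypothetical and only illustrate the calculation, they are not real results from my sample.

```python
import re


def mapping_rate(flagstat_text: str) -> float:
    """Return the percentage of QC-passed reads that mapped,
    parsed from the text printed by `samtools flagstat`."""
    total = mapped = None
    for line in flagstat_text.splitlines():
        # flagstat lines look like "<qc_passed> + <qc_failed> <description>"
        m = re.match(r"\s*(\d+) \+ (\d+) (.*)", line)
        if not m:
            continue
        passed, desc = int(m.group(1)), m.group(3)
        if desc.startswith("in total"):
            total = passed
        elif desc.startswith("mapped ("):
            # "primary mapped (" lines do not match, by design
            mapped = passed
    if total is None or mapped is None:
        raise ValueError("could not find 'in total' and 'mapped' lines")
    if total == 0:
        return 0.0
    return 100.0 * mapped / total


# Illustrative flagstat-style text only (made-up numbers).
example = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
850 + 0 mapped (85.00% : N/A)
"""

print(f"{mapping_rate(example):.2f}%")
```

Note that the "in total" count in flagstat output includes secondary and supplementary alignments, so this percentage is per alignment record rather than strictly per read.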
Side note: I have a similar problem with H. mastomyrinus, which had the second-highest WKID value from sendsketch. There are 17 sequences available on NCBI here; some are 16S and some are 23S. How do I know which sequence to map against in this scenario?