This is something of a follow-up to previous posts (most recently C: How to determine which NCBI sequence to map against (multiple sequences for sing ). I was recently given paired-end .fastq files (each read is about 150 bases long). The wet-lab researchers believe the sample is an isolated strain of bacteria, likely Helicobacter (though this may be questionable). It was taken from a dolphin stomach.
My job is to determine the identity (closest species etc.) of these samples. I have been attempting this for weeks and am stuck. Below are a few approaches I have tried:
(1) I created contigs using SPAdes and blasted a few of them. Some seemed to align to Helicobacter cetorum MIT 00-7128, so I downloaded that genome as a reference and aligned the sample to it using BWA. The mapping rate was only 13%. I also noticed that some of the shorter contigs blasted to Mus musculus/Homo sapiens.
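For reference, a mapping rate like the one above can be recomputed from an aligner's SAM output by checking the FLAG bits of each record. Below is a minimal sketch in Python, assuming uncompressed SAM text as input. The function name `mapping_rate` and the demo records are made up for illustration and are not real data.

```python
def mapping_rate(sam_lines):
    """Return the fraction of primary read records that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.rstrip("\n").split("\t")[1])
        # Ignore secondary (0x100) and supplementary (0x800) records
        # so that each read is counted only once.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        # Bit 0x4 set means the read is unmapped.
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


# Hypothetical demo: one mapped pair (flags 99/147), one unmapped pair
# (flags 77/141), and one supplementary record that is ignored.
demo = [
    "@SQ\tSN:ref\tLN:1000",
    "r1\t99\tref\t1\t60\t150M\t=\t200\t350\t*\t*",
    "r1\t147\tref\t200\t60\t150M\t=\t1\t-350\t*\t*",
    "r1\t2048\tref\t500\t60\t50M100H\t*\t0\t0\t*\t*",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
demo_rate = mapping_rate(demo)
print(f"mapping rate: {demo_rate:.1%}")
```

With a real file, the same function can be given an open file handle instead of a list. `samtools flagstat` produces a comparable summary directly.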
I thought I should do a more sophisticated approach than blasting contigs. So, I tried (2), (3), (4) below:
(2) Downloaded all “Complete genomes” of the “Helicobacter” genus from NCBI (n=221) and used sourmash to compute k-mer signatures to determine the relatedness of the genomes. The highest similarity was 43.5%, to Helicobacter pylori P12, but dozens of other Helicobacter strains had almost the same percentage.
(3) Ran Kraken on Galaxy against a database of "plasmids", "viruses", and "bacteria". Only 14% was classified as bacteria (and even less as virus/plasmid), with the largest share (5.66%) assigned to Helicobacter pylori.
(4) Ran sendsketch on the contigs. It reported WKID = 38.9%; KID = 0.05%; ANI = 96.6%; Contam = 2.5% for H. macacae (with similar numbers for H. mastomyrinus) and WKID = 1.1%; KID = 0.85%; ANI = 84.8%; Contam = 1.7% for H. cetorum MIT 00-7128.
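To make the k-mer percentages from sourmash and sendsketch easier to reason about, here is a minimal Python sketch of the two measures that k-mer tools generally build on: Jaccard similarity and containment. It assumes full k-mer sets on toy sequences. The real tools work on subsampled (MinHash-style) sketches with their own parameters, so this illustrates the idea and is not their implementation. All function names and sequences here are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k=21):
    """Set of canonical k-mers (the smaller of a k-mer and its reverse complement)."""
    seq = seq.upper()
    out = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip k-mers containing ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            out.add(min(kmer, revcomp(kmer)))
    return out


def jaccard(a, b):
    """Shared k-mers divided by all k-mers seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, ref):
    """Fraction of the query's k-mers that are also found in the reference."""
    return len(query & ref) / len(query) if query else 0.0


# Toy example: a fragment taken from a longer "genome".
genome = "ATGGCGTACGTTAGCCTGAAC"
fragment = genome[3:15]
frag_k = kmers(fragment, k=5)
genome_k = kmers(genome, k=5)
print("jaccard:", jaccard(frag_k, genome_k))
print("containment:", containment(frag_k, genome_k))
```

The two measures can differ a lot when the inputs differ in size or when only part of one input is shared with the other, which may be one reason different tools appear to disagree.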
I thought the numbers above seemed low (please let me know if you think otherwise). They also do not seem consistent (H. pylori shows up most in sourmash and Kraken, while H. macacae shows up most in sendsketch), although that probably does not matter given how low these values are anyway.
In my latest post (linked above), a user suggested taking small selections of reads (~20-25) and blasting them. I did this for both the first 25 bases and the last 25 bases in the file. BlastN on the first 25 bases gave Helicobacter cetorum MIT 00-7128 with the lowest e-value (3e-39); megablast on the first 25 bases returned a "no significant similarity found" message; BlastN on the last 25 bases gave Helicobacter felis ATCC 49179 with the lowest e-value (1e-25); and megablast on the last 25 bases gave Helicobacter felis ATCC 49179 with the lowest e-value (3e-19). I am not sure whether there are certain parameters I should use (such as blastN versus megablast) or whether these results seem consistent, especially because the highest WKID value from sendsketch was for a different Helicobacter species.
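Pulling a small selection of reads out of a FASTQ file and converting it to FASTA for BLAST can be sketched as follows. This is a minimal Python example, assuming plain 4-line FASTQ records. It shows both taking the first few reads and, as one alternative that is less tied to file order, a seeded random sample. The function names and the toy records are made up for illustration.

```python
import io
import random


def read_fastq(handle):
    """Yield (name, sequence) tuples from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline()
        if not header:
            break                      # end of file
        if not header.strip():
            continue                   # tolerate stray blank lines
        seq = handle.readline().strip()
        handle.readline()              # '+' separator line
        handle.readline()              # quality line (may itself start with '@')
        yield header.strip()[1:], seq


def reservoir_sample(records, n, seed=0):
    """Uniform random sample of up to n records in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < n:
            sample.append(rec)
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < n:
                sample[j] = rec
    return sample


def to_fasta(records):
    """Format (name, sequence) tuples as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Toy FASTQ content standing in for a real file.
fq = (
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nGGCC\n+\nIIII\n"
    "@r3\nTTAA\n+\nIIII\n"
    "@r4\nCATG\n+\nIIII\n"
)
records = list(read_fastq(io.StringIO(fq)))
first_two = records[:2]
picked = reservoir_sample(records, 2, seed=1)
print(to_fasta(first_two), end="")
```

For a real run, `io.StringIO(fq)` would be replaced by an open file handle, and the FASTA text could be written to a file and submitted to BLAST.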
I do not have much experience at all with this type of analysis. Over these weeks, I feel like I have been working in circles and throwing different software at this sample. My main findings so far: 1) There may be contamination, 2) Mapping/alignment/similarity scores seem low, 3) Different software points to different Helicobacter species (albeit with low scores in every case).
The biologists are pressing me to tell them what species their sample is, and I have been unable to feel confident about giving an answer. My worries are that the sample might not be bacteria (maybe archaea), might be too contaminated, might not be "isolated", etc.
For anyone with experience, what would you recommend to someone in my position to confidently determine what species this is (if that even seems possible at this point)? Specifically:
How should I determine if contamination should be removed and how should I do so safely (if needed)?
When does one feel confident they can report the species back to biologists? Are my numbers above indeed too low and conflicting?
How can I determine that this is an isolated bacterium and not something else, such as archaea, pure contamination, or multiple species?
What other approaches should I take to determine the species? I am especially interested in ideas that add something new to what I have already tried, i.e. that are not simply throwing another piece of software similar to the ones above at it :oD
I apologize for the long post (wanted to be clear about what I have tried). Thank you for sharing your ideas.
In case you're still stuck with this problem (I'm assuming not), I would be happy to guide you further or collaborate with you on this! Cheers