Hopefully someone can help with an issue I'm having. Anyone interested in directly helping would be VERY appreciated!!!
I have Shotgun Sequencing samples from 6 separate hosts, 4 human hosts, 2 dog hosts, all with similar species. These are all high host samples.
- Host 1 human - I have Cerebral Spinal Fluid from nose, and whole blood. Sequenced on Novaseq 6000, Lab 1. Whole Blood sequenced on Aviti from 2 labs.
- Host 2 human - Whole Blood, Aviti
- Host 3 human - Whole Blood, Aviti
- Host 4 human - Whole Blood, Aviti
- Host 5 dog - Whole Blood, Aviti
- Host 6 dog - Whole Blood, Aviti
There are numerous bacteria, which are less problematic as most databases have these species.
The two species I am having issues with are Naegleria Fowleri (Naegleria Fowleri strain Karachi NF001), and Plasmodium Ovale. I'm fairly confident in the presence of the Naegleria Fowleri species, I can pull reads directly and run on the Online NCBI Blast with 100% ID, I can also assemble them into long contigs. I have tried just doing de novo assembly with Metaspades, but I think the issue is that because Human and Naegleria Fowleri have reasonably high similarity that the error correction tends to correct the Naegleria Fowleri reads into human reads. Same issue with Host Removal, is that there is too much similarity and it removes too many Naegleria Fowleri reads. I'm not 100% sure on the Plasmodium Ovale reads, as reads that map to that genome appear to look like Naegleri Fowleri NF001 on BLAST. Unfortunately part of the issue on Plasmodium Ovale (one of the human malaria), is that those genomes are NOT represented with NCBI BLAST, nor NCBI CORE-nt database. They are not present in any of the Kraken2 databases as well.
It has taken a very long time just figuring out the Naegleri Fowleri NF001, as it gets misidentified by professional online platforms. On my initial samples for Host 1 (CSF & Blood), they were initially being identified as Mycobacterium Leprae in the range of 50,000 reads by two platforms Kmer based and KAIJU protein based. Yet I could only actually map a few reads. KAIJU also identified the Plasmodium Ovale (mappable). Then with a database update KAIJU started identifying the reads as Mycobacterium Tuberculosis (could only map a few reads). It wasn't until one of the platforms had the option to download reads that were classified, so I could blast them. It appears that both platforms were only hitting on a very small region of the Naegleria Fowleri Karachi NF001, which matched portions of (plasmodium Ovale, Mycobacterium Leprae, Mycobacterium Tuberculosis strain Oman).
I know there are some tools out there that can map all of these together and competitively map them, and split the output, but I can only work with the tools available on Galaxy.org
I had tried mapping the Naegleria Fowleri and assembling, but since the mapping also pulls host reads, it doesn't give a clean assembly.
I hope someone can help!!!
Thanks!!
That requirement is likely going to make this very difficult to accomplish, there is no other way to say this.
bbsplit.sh
from BBMap suite is the tool (that you may have read about) that allows splitting of datasets into multiple bins but you will need to find a way to run it on the command line. BBSplit syntax for generating builds for the reference genome and how to call different builds. has the syntax to use.Yes, I have read about that tool, as it appeared to be what I was looking for. I tried installing the BBMap on my Windows computer, but couldn't figure out how to make it run.
I've been trying to do host removal, which I would lose some reads, but I am having issues of still having host reads remaining. I've tried HISAT2 using the HG38 as well as the newer T2T, same with Bowtie2, with adapters and lightly trimmed of course. I was hoping to try host removal with BBMap on Galaxy, but for some reason they don't have it configured to actually do host removal due to memory issues, only works on smaller genomes.
What are your thoughts of first Mapping my target reads at say 95% or higher first, then doing host removal of the remaining reads, and then adding them back in after host removal??
I have thought of trying to Map at 95% or so, then using BBMerge and Tadpole to error correct the reads, then use BBMap to pull reads that map at 99% or 100%, then assemble. But I'm not sure if that practically makes sense??
What is the main aim here? To recover reads that are NOT host.
Using some % cutoff is not a great strategy and reads will map by chance because the sequence matches even if they are coming from non-host genome.
Ideally yes, recover reads that are NOT host. Yet the Naegleri Fowleri has high similarity to human, so when I assemble the reads I don't have a clear answer if the contig is Naegleri Fowleri or Host.
I assume it's HIGHLY unlikely to pull reads that blast 100% to Naegleri Fowleri Karachi NF001?? For example, after host removal, I will map separately to Naegleri Fowleri, Tuberculosis Oman strain, Plasmodium Ovale. Then randomly pull reads from them and online BLAST them, with most of them coming back as either human or Naegleri Fowleri Karachi NF001 at or near 100%
I actually might be overthinking this, since the two species that I'm having issues with aren't represented in a usable database, and only the Naegleri Fowleri NF001 is represented in the Online version of NCBI Blast, that's what I've been using.
I switched to host removal using the HG38 Alt Masked reference, which is supposed to be better for for microbial species. That might help?
So rather than using the online Blast on my contigs, what is the best way to confirm if I have good matching contigs for a species??? I had read somewhere about using Minimap2 to map the contigs to the reference, and that the mapped reads are assumed to be a positive match??? Then I could confirm % by blasting against the reference on Galaxy.org ?? Then taking the contigs that match the reference and using Minimap2 and mapping back to the reads to get a count on how many reads map to the contigs???
The problem that I'm having is that because these two species aren't in the databases, that they are being identified as various other species due to similar regions in the genome. As I mentioned above, I know for sure that the Naegleri Fowleri NF001 strain is being misidentified as (Tuberculosis Oman strain, Myco Leprae, and possibly Plasmodium Ovale).
So what I am trying to do is as follows.
That does not logically seem correct. It is a much smaller genome when compared to human. BLAST is not the correct tool when dealing with small reads since you are going to be looking at local alignments, which may extend only a few bases.
Naegleris fowleri genome here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_008403515.1/ (it may be a different strain for extracting matching reads it should be enough). Plasmodium ovale genome is here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_900088485.1/
Because your starting material likely had very low amounts of non-host DNA, you are going to end up with correspondingly low amounts of non-host reads. You may want to use
bbduk.sh
from BBMap suite in filter mode to extract reads that match to these two genomes and leave others behind. It is possible that you will pick up a few host reads because of chance sequence similarity but that is a risk you will have to accept considering low amounts of non-host material.For the Naegleri Fowleri Karachi NF001, which is the one identified by BLAST, it's not on the NCBI genomes. It took me a long time to find, the actual reference sequence was at the European Nucleotide Archive. NCBI only had the MANY unassembled contigs. As for mapping, the 8 reference sequences on NCBI do not map very well, but the Naegleri Fowleri Karachi NF001 maps very well. It's a newer sequence I think from 2021 for which the researchers said was significantly different from all of the other Naegleri Fowleri strains. For the Plasmodium Ovale, that species was redone and published in early 2024 as POW222. So I have the two best references to work with.
I actually tried using bbduk to filter the reads. I used the settings recommended by the author, which recommended using a 0.5 cutoff K31, so that at least 50% of the bases covered the matching Kmers. This matched a HUGE number of reads, so I adjusted to a setting of 0.95, so that at least 95% of the bases matched the reference kmers. Top of my head, that STILL put me at about 500,000 reads!!!!! YES, they would assemble to long contigs that matched the reference via BLAST at near 100 % match with many contigs in the 5,000-8,000 bases size with either Human or NF001 nearly the same %.
You can see my dilemma, the numbers don't make sense to me. Likewise there are very high levels of bacteria present. I know for Naegleri Fowleri, it's not generally found in blood, as it enters through nose and goes to the brain, so they test the CSF.
BLAST should only be used for qualitative purposes for NGS data.
When you started this experiment were you certain that you will find non-host reads in the samples that were taken. It is possible that your data has no (or very few) non host reads. If that is true, no bioinformatic juggling is going to find data that is not there. Perhaps that is the reason the numbers don't make sense.
I am also puzzled by this from original post. If the sample taken was whole blood I assume it was collected aseptically and processed to make libraries. There should be no bacterial contamination in such a case. Only host/parasite should be present.
Yes, the samples were collected correctly aseptically, via nurse with correct methodology. Subject 1 there are 4 different samples (shotgun), using 3 different labs, all samples are similar.
As for the original experiment, there was suspicion of atypical infection, the assumption was bacterial. Subject 1 got what appeared to be a post surgical infection, from sinus surgery with CSF leak, that was ignored by doctors. Not long after, apparently the hospital had what they thought was a Klebsiella Pneumonia outbreak, that at least 35 people got sick and about 10 people died over numerous months (which sounds strange). Not an atypical bacteria, and generally treatable with antibiotics. Suspicion was from contaminated endoscopes.
I actually started with 16s samples, a LOT of bacteria present, to include Klebsiella. Then I moved onto shotgun sequencing, as I actually thought Mycobacterium which is difficult with 16s unless it's targeted for Mycobacteria. I only stumbled onto the Naegleria Fowleri, as one of the commercial bioinformatics providers identified it as Mycobacterium Leprae, yet I couldn't really find Leprae by mapping. I got the reads they identified as Leprae and ran them on BLAST, if I only ran it against Mycobacterium, they were identified as Mycobacterium Tuberculosis matching the Oman strain. Yet, if I just ran it against the whole NCBI database, it was Naegleria Fowleri NF001. But I couldn't actually map that reference until I could find it, once I found the reference, it mapped a LOT of reads, even after host removal.
So Naegleria Fowleri is highly deadly about 97% fatal in 2 weeks, and considered NOT contagious. Found in CSF for those infected. Mostly from swimming in warm water regions where it enters through the nose. Yet evidence on this strain, that it may be spread by other means. So Patient 1, with the sinus surgery, with continued CSF from nose as surgery didn't didn't fix it. Has had meningitis symptoms for 2.5 years, with often sneezing. So there is the source, sneezing contaminated CSF from nose. Spouse of Patient 1 also tests positive, 3 of their 5 dogs have died since, two of the dogs test positive with one still living. Parents of Patient 1 also with progressive health problems in this time period, ALSO test positive. I suspect it's spread MUCH wider than this, but I can't do much with notification to health officials until I can confirm what my suspicion is, you can see the problem?
I really don't have anyone helping with this whatsoever. The entire dataset and reference files I have on Galaxy.org, so they are easily available.
I understand the issues using BLAST. So if I do host removal, and assemble with Metaspades, how do I confirm Naegleri and Quantify??
Thanks!
If you're interested enough to look at the read files yourself, I would be happy to provide them. I keep running into things that I just can't make sense of.
If possible I suggest that you find someone who can easily access your entire dataset locally and would be willing to help out.