Host Removal Issues Human/Dog with samples containing Eukaryotic species
1
0
Entering edit mode
21 days ago
Jon • 0

Hopefully someone can help with an issue I'm having. Anyone interested in directly helping would be VERY appreciated!!!

I have Shotgun Sequencing samples from 6 separate hosts, 4 human hosts, 2 dog hosts, all with similar species. These are all high host samples.

  • Host 1 human - I have Cerebral Spinal Fluid from nose, and whole blood. Sequenced on Novaseq 6000, Lab 1. Whole Blood sequenced on Aviti from 2 labs.
  • Host 2 human - Whole Blood, Aviti
  • Host 3 human - Whole Blood, Aviti
  • Host 4 human - Whole Blood, Aviti
  • Host 5 dog - Whole Blood, Aviti
  • Host 6 dog - Whole Blood, Aviti

There are numerous bacteria, which are less problematic as most databases have these species.

The two species I am having issues with are Naegleria Fowleri (Naegleria Fowleri strain Karachi NF001), and Plasmodium Ovale. I'm fairly confident in the presence of the Naegleria Fowleri species, I can pull reads directly and run on the Online NCBI Blast with 100% ID, I can also assemble them into long contigs. I have tried just doing de novo assembly with Metaspades, but I think the issue is that because Human and Naegleria Fowleri have reasonably high similarity that the error correction tends to correct the Naegleria Fowleri reads into human reads. Same issue with Host Removal, is that there is too much similarity and it removes too many Naegleria Fowleri reads. I'm not 100% sure on the Plasmodium Ovale reads, as reads that map to that genome appear to look like Naegleri Fowleri NF001 on BLAST. Unfortunately part of the issue on Plasmodium Ovale (one of the human malaria), is that those genomes are NOT represented with NCBI BLAST, nor NCBI CORE-nt database. They are not present in any of the Kraken2 databases as well.

It has taken a very long time just figuring out the Naegleri Fowleri NF001, as it gets misidentified by professional online platforms. On my initial samples for Host 1 (CSF & Blood), they were initially being identified as Mycobacterium Leprae in the range of 50,000 reads by two platforms Kmer based and KAIJU protein based. Yet I could only actually map a few reads. KAIJU also identified the Plasmodium Ovale (mappable). Then with a database update KAIJU started identifying the reads as Mycobacterium Tuberculosis (could only map a few reads). It wasn't until one of the platforms had the option to download reads that were classified, so I could blast them. It appears that both platforms were only hitting on a very small region of the Naegleria Fowleri Karachi NF001, which matched portions of (plasmodium Ovale, Mycobacterium Leprae, Mycobacterium Tuberculosis strain Oman).

I know there are some tools out there that can map all of these together and competitively map them, and split the output, but I can only work with the tools available on Galaxy.org

I had tried mapping the Naegleria Fowleri and assembling, but since the mapping also pulls host reads, it doesn't give a clean assembly.

I hope someone can help!!!

Thanks!!

Mapping Eukarotic Host • 921 views
ADD COMMENT
0
Entering edit mode

I can only work with the tools available on Galaxy.org

That requirement is likely going to make this very difficult to accomplish, there is no other way to say this.

bbsplit.sh from BBMap suite is the tool (that you may have read about) that allows splitting of datasets into multiple bins but you will need to find a way to run it on the command line. BBSplit syntax for generating builds for the reference genome and how to call different builds. has the syntax to use.

ADD REPLY
0
Entering edit mode

Yes, I have read about that tool, as it appeared to be what I was looking for. I tried installing the BBMap on my Windows computer, but couldn't figure out how to make it run.

I've been trying to do host removal, which I would lose some reads, but I am having issues of still having host reads remaining. I've tried HISAT2 using the HG38 as well as the newer T2T, same with Bowtie2, with adapters and lightly trimmed of course. I was hoping to try host removal with BBMap on Galaxy, but for some reason they don't have it configured to actually do host removal due to memory issues, only works on smaller genomes.

ADD REPLY
0
Entering edit mode

What are your thoughts of first Mapping my target reads at say 95% or higher first, then doing host removal of the remaining reads, and then adding them back in after host removal??

I have thought of trying to Map at 95% or so, then using BBMerge and Tadpole to error correct the reads, then use BBMap to pull reads that map at 99% or 100%, then assemble. But I'm not sure if that practically makes sense??

ADD REPLY
0
Entering edit mode

What is the main aim here? To recover reads that are NOT host.

Using some % cutoff is not a great strategy and reads will map by chance because the sequence matches even if they are coming from non-host genome.

ADD REPLY
0
Entering edit mode

Ideally yes, recover reads that are NOT host. Yet the Naegleri Fowleri has high similarity to human, so when I assemble the reads I don't have a clear answer if the contig is Naegleri Fowleri or Host.

I assume it's HIGHLY unlikely to pull reads that blast 100% to Naegleri Fowleri Karachi NF001?? For example, after host removal, I will map separately to Naegleri Fowleri, Tuberculosis Oman strain, Plasmodium Ovale. Then randomly pull reads from them and online BLAST them, with most of them coming back as either human or Naegleri Fowleri Karachi NF001 at or near 100%

I actually might be overthinking this, since the two species that I'm having issues with aren't represented in a usable database, and only the Naegleri Fowleri NF001 is represented in the Online version of NCBI Blast, that's what I've been using.

I switched to host removal using the HG38 Alt Masked reference, which is supposed to be better for for microbial species. That might help?

So rather than using the online Blast on my contigs, what is the best way to confirm if I have good matching contigs for a species??? I had read somewhere about using Minimap2 to map the contigs to the reference, and that the mapped reads are assumed to be a positive match??? Then I could confirm % by blasting against the reference on Galaxy.org ?? Then taking the contigs that match the reference and using Minimap2 and mapping back to the reads to get a count on how many reads map to the contigs???

The problem that I'm having is that because these two species aren't in the databases, that they are being identified as various other species due to similar regions in the genome. As I mentioned above, I know for sure that the Naegleri Fowleri NF001 strain is being misidentified as (Tuberculosis Oman strain, Myco Leprae, and possibly Plasmodium Ovale).

So what I am trying to do is as follows.

  1. Confirm that the species are present in the samples via assembly
  2. Once I know they are there, I can pull the reads that map to these species from the remaining reads, so that I can quantify what else is present.
ADD REPLY
0
Entering edit mode

Yet the Naegleri Fowleri has high similarity to human

That does not logically seem correct. It is a much smaller genome when compared to human. BLAST is not the correct tool when dealing with small reads since you are going to be looking at local alignments, which may extend only a few bases.

The problem that I'm having is that because these two species aren't in the databases,

Naegleris fowleri genome here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_008403515.1/ (it may be a different strain for extracting matching reads it should be enough). Plasmodium ovale genome is here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_900088485.1/

Because your starting material likely had very low amounts of non-host DNA, you are going to end up with correspondingly low amounts of non-host reads. You may want to use bbduk.sh from BBMap suite in filter mode to extract reads that match to these two genomes and leave others behind. It is possible that you will pick up a few host reads because of chance sequence similarity but that is a risk you will have to accept considering low amounts of non-host material.

ADD REPLY
0
Entering edit mode

For the Naegleri Fowleri Karachi NF001, which is the one identified by BLAST, it's not on the NCBI genomes. It took me a long time to find, the actual reference sequence was at the European Nucleotide Archive. NCBI only had the MANY unassembled contigs. As for mapping, the 8 reference sequences on NCBI do not map very well, but the Naegleri Fowleri Karachi NF001 maps very well. It's a newer sequence I think from 2021 for which the researchers said was significantly different from all of the other Naegleri Fowleri strains. For the Plasmodium Ovale, that species was redone and published in early 2024 as POW222. So I have the two best references to work with.

I actually tried using bbduk to filter the reads. I used the settings recommended by the author, which recommended using a 0.5 cutoff K31, so that at least 50% of the bases covered the matching Kmers. This matched a HUGE number of reads, so I adjusted to a setting of 0.95, so that at least 95% of the bases matched the reference kmers. Top of my head, that STILL put me at about 500,000 reads!!!!! YES, they would assemble to long contigs that matched the reference via BLAST at near 100 % match with many contigs in the 5,000-8,000 bases size with either Human or NF001 nearly the same %.

You can see my dilemma, the numbers don't make sense to me. Likewise there are very high levels of bacteria present. I know for Naegleri Fowleri, it's not generally found in blood, as it enters through nose and goes to the brain, so they test the CSF.

ADD REPLY
0
Entering edit mode

Naegleri Fowleri Karachi NF001, which is the one identified by BLAST

BLAST should only be used for qualitative purposes for NGS data.

I know for Naegleri Fowleri, it's not generally found in blood, as it enters through nose and goes to the brain, so they test the CSF.

When you started this experiment were you certain that you will find non-host reads in the samples that were taken. It is possible that your data has no (or very few) non host reads. If that is true, no bioinformatic juggling is going to find data that is not there. Perhaps that is the reason the numbers don't make sense.

There are numerous bacteria,

I am also puzzled by this from original post. If the sample taken was whole blood I assume it was collected aseptically and processed to make libraries. There should be no bacterial contamination in such a case. Only host/parasite should be present.

ADD REPLY
0
Entering edit mode

Yes, the samples were collected correctly aseptically, via nurse with correct methodology. Subject 1 there are 4 different samples (shotgun), using 3 different labs, all samples are similar.

As for the original experiment, there was suspicion of atypical infection, the assumption was bacterial. Subject 1 got what appeared to be a post surgical infection, from sinus surgery with CSF leak, that was ignored by doctors. Not long after, apparently the hospital had what they thought was a Klebsiella Pneumonia outbreak, that at least 35 people got sick and about 10 people died over numerous months (which sounds strange). Not an atypical bacteria, and generally treatable with antibiotics. Suspicion was from contaminated endoscopes.

I actually started with 16s samples, a LOT of bacteria present, to include Klebsiella. Then I moved onto shotgun sequencing, as I actually thought Mycobacterium which is difficult with 16s unless it's targeted for Mycobacteria. I only stumbled onto the Naegleria Fowleri, as one of the commercial bioinformatics providers identified it as Mycobacterium Leprae, yet I couldn't really find Leprae by mapping. I got the reads they identified as Leprae and ran them on BLAST, if I only ran it against Mycobacterium, they were identified as Mycobacterium Tuberculosis matching the Oman strain. Yet, if I just ran it against the whole NCBI database, it was Naegleria Fowleri NF001. But I couldn't actually map that reference until I could find it, once I found the reference, it mapped a LOT of reads, even after host removal.

So Naegleria Fowleri is highly deadly about 97% fatal in 2 weeks, and considered NOT contagious. Found in CSF for those infected. Mostly from swimming in warm water regions where it enters through the nose. Yet evidence on this strain, that it may be spread by other means. So Patient 1, with the sinus surgery, with continued CSF from nose as surgery didn't didn't fix it. Has had meningitis symptoms for 2.5 years, with often sneezing. So there is the source, sneezing contaminated CSF from nose. Spouse of Patient 1 also tests positive, 3 of their 5 dogs have died since, two of the dogs test positive with one still living. Parents of Patient 1 also with progressive health problems in this time period, ALSO test positive. I suspect it's spread MUCH wider than this, but I can't do much with notification to health officials until I can confirm what my suspicion is, you can see the problem?

I really don't have anyone helping with this whatsoever. The entire dataset and reference files I have on Galaxy.org, so they are easily available.

I understand the issues using BLAST. So if I do host removal, and assemble with Metaspades, how do I confirm Naegleri and Quantify??

Thanks!

ADD REPLY
0
Entering edit mode

If you're interested enough to look at the read files yourself, I would be happy to provide them. I keep running into things that I just can't make sense of.

ADD REPLY
0
Entering edit mode

If possible I suggest that you find someone who can easily access your entire dataset locally and would be willing to help out.

ADD REPLY
0
Entering edit mode
20 days ago
Mensur Dlakic ★ 29k

It is impossible that N. fowleri has high identity to humans, or else it would be classified as a primate rather than an amoeba. Even organisms that are much closer to humans than amoebas, say amphibians or reptiles, don't have 90+% global identity to humans. Something is fundamentally wrong in how the analysis is done. For example, why not request 98-99% identity during mapping and see where that gets you?

It is an educated guess that t-SNE binning, or any other type of tetranucleotide-based binning, would readily separate human vs. other contigs. Once that is achieved - the advantage here is that it doesn't require any reference - a mapping based on high coverage and identity should do the trick. Even though I have not specifically separated human and amoeba reads, I have done enough similar things to know that this is doable. It certainly should not be a struggle, especially with reference sequences available.

ADD COMMENT
0
Entering edit mode

I can't argue with you, I'm just having problems wrapping my brain around the issue. I have wondered if the reference that NCBI BLAST is using is not good, I think there is like 3000+ contigs in the NCBI reference database . I found it confusing as well, as my first attempts is a contig of about 7000bp that I ran on BLAST and the results were nearly identical for the NF001 strain and Human. I chopped the NF001 reference into 50, 75, 100, 150bp and mapped them against the human HG38 reference, and only about 0.004% mapped to human.

I only recently found the assembled reference on the European Nucleotide Archive.

For binning, I had thought that I read that the binning is for bacterial genomes???

So are you suggesting (I assume after host removal Hg38), that I directly map the reads at 98-99%, or are you suggesting that I assemble with Metaspades and then map the contigs?? I can map fastq reads with BBMap based on percentage, but I don't know how to map fasta contigs by percentage, I'm only working on the Galaxy site. I think I read somewhere that I can map contigs with Minimap2 which was stated as 95%+

I'm currently reprocessing the reads by first mapping against the Hg38 Masked reference. Then running on Metaspades with just error correction, and then going to do a second set of host removal. Then assemble with Metaspades, yet again.

ADD REPLY
0
Entering edit mode

Binning doesn't separate any particular organisms. It separates DNA contigs based on tetranucleotide frequencies. Eukaryotes use different tetranucleotide frequencies than bacteria, and that's true within eukaryotic phyla as well. Haven't tried specifically to bin human and amoeba contigs, but it is a safe guess they would separate without a problem.

I was suggesting that you map the reads directly to a reference sequence at a 98-99% identity. Alternatively, you can bin the existing contigs, which I expect to clearly separate human vs. other sequences. Then you can map to those bins as well.

I don't know how to do any of this on Galaxy, and you don't seem to have access to tools outside of Galaxy. I applaud your willingness to write these detailed explanations, but unfortunately that doesn't mean they are clear. As already suggested, you should try to find someone locally who knows these things, or at least someone who owes you a favor or has a personal interest in helping you. It is unrealistic to expect that random strangers on the internet would guide you step-by-step through all these issues.

ADD REPLY

Login before adding your answer.

Traffic: 1276 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6