Which reference: hg38 or T2T? Or both?
Asked 9 weeks ago by Jon

I have a question about the host reference used. I know that nearly all public reference-based analyses used hg38 or older builds. The newest build, T2T, is supposed to be nearly complete and is recommended by many sites.

So if I take my human sample and remove reads matching T2T, that should give me a host-free set of reads?

My confusion is this: if I map the host-removed reads to a reference and then check them with NCBI BLAST, I may get one or several hits where human matches better than the reference I mapped to. I often see, say, human at 99% while the reference I mapped to is at 98.5%; many times they match identically at 99%+. My assumption is that since BLAST uses hg38, human hits in BLAST should be ignored? Just go with the T2T host-removed reads?

I have previously chopped up my reference sequence into 50, 75, 100, and 150 bp fragments and mapped them against hg38 and T2T. About 5,000 fragments mapped to hg38 and about 10,000 mapped to T2T.

Tags: Host Removal
Comment (Jon):

Hi GenoMax, I saw what you posted but couldn't find how to respond to your comment. I understand the point you are making that something needs to be done for treatment, though it's not as simple as that. First, I had to confirm Naegleria fowleri was present, which I did. Then I needed to separate those reads from the rest of the samples. I have 16S samples that also show a high bacterial load on top of the Naegleria fowleri; if it's present, these bacterial species need to be treated as well, with the correct antibiotics. I already know that portions of this Naegleria fowleri genome match numerous other bacterial and other genomes, and I have read that Naegleria fowleri also appears to show up in 16S results. Other parasites such as malaria are also promoters of bacteremia, as the immune system is muted, allowing bacteria to proliferate. My assumption is that if the Naegleria fowleri is treated, the immune system may suddenly kick back in, so they must be treated together.

Comment (GenoMax):

After thinking about your post for some time I had posted a comment, which I then removed. On second thought I did not think it was related to bioinformatics, but looks like you did read it.

> I have 16S samples that also show a high bacterial load on top of the Naegleria fowleri; if it's present, these bacterial species need to be treated as well, with the correct antibiotics.

What does this mean? You have 16S amplicons independently prepared from the same samples? What does "high bacterial load" mean?

What number (and % of reads) are remaining after all the steps you have gone through? In the case of NGS experiments there tends to be a certain (small) % of reads remaining that can't be easily explained.

> My assumption is that if the Naegleria fowleri is treated, the immune system may suddenly kick back in, so they must be treated together.

Going to make an exception and say the following: are you a physician, able to make this decision as part of the treatment? If not, the person in that role should be making this decision, right?

Comment (Jon):

Yes, I read it, I read everything I can.

I don't have 16S on all of the samples that I have NGS for, but I have 2 that went through several different 16S analyses. Before I went to the NGS samples, I did a fair number of 16S assays: V1-V2, V1-V3, V3-V4, V7, and PacBio full-length. As for "high bacterial load": in the whole-blood samples, the 16S results showed high levels of pathogenic bacteria as well as numerous nitrogen-fixing bacteria. Off the top of my head, for clinical blood samples I believe the threshold is 100 reads/M for a positive result, so we are talking well in excess of that. But I know some of the hits in the 16S samples are actually Naegleria fowleri Karachi-NF001; for example, Plasmodium ovale was showing up in some of the samples, as well as some Mycobacterium, both of which align with portions of the NF001 genome. I'm going to relook at the 16S samples after I finish the NGS samples.

Previously I was using several of the online bioinformatics platforms, with the assumption that if they identified various species, I should be able to map the same species to the sample. The problem was that I wasn't able to reproduce, via mapping and assembly, what they were identifying. I suspect that if I run the reads through the online platforms after removing the NF001 genome, they will show whatever bacteria are present.

Obviously a doctor will make the final decisions. First, what I'm finding hasn't been reported before (considered not contagious, yet I have 4 humans and 2 dogs with the same NF001, suggesting it's contagious between humans and animals; death is 97% within 2 weeks, yet subject #1 is at 2.5 years; whole blood is not tested as the consensus is that counts are too low in blood to be statistically valid for diagnosis, but the counts I'm finding are massive).

As for the bacteria issue, I have never seen anything reported regarding Naegleria fowleri/bacteria co-infection. But amoebas do have similarities, and since most people die so fast, they probably didn't look too far. The Plasmodium species behind malaria are known to cause bacteremia, as the parasite essentially shuts down the immune system, allowing the bacteria to proliferate. Though it's also not been documented, it's highly unlikely that you could do a blood culture for bacteria in the presence of Naegleria fowleri, as their primary food source is bacteria, and Naegleria fowleri is by far the dominant species.

I'm just doing the actual work to determine what is present. It's highly unlikely to find a doctor to do anything with this, as it's well beyond most doctors' knowledge, even infectious disease doctors, since there are only a handful of cases in the USA.

The CDC is the main provider for Naegleria fowleri analysis, for which I believe the standard is PCR targeting the ITS1/5.8S/ITS2 region. But I'm unsure if that test is valid for this strain; I'm assuming they have a standard reference which in general would show near-100% coverage at a near-100% match. There was a utility I found online that provided a blastn link for the PCR target matched against all of the Naegleria fowleri strains; for the NF001 strain I believe it was 66% coverage at 100% match, showing this strain is completely missing one of the targets. I doubt the 66% coverage would give a positive result, but I could be wrong.

In essence, to your question of whether this is related to bioinformatics: it sure looks like it to me.

I haven't quite gotten to my remaining reads yet. Rather than using more aggressive mapping to pull the Naegleria fowleri reads, I'm concentrating only on reads matching NF001 perfectly. Anything other than perfect has a high chance of being human, and then the assembler tries to assemble NF001 and human together. So my unknown reads are those not mapping perfectly to NF001, hg38, or T2T, and not mapped to T2T with HISAT2 at default settings. I'm running the unknown reads through MetaSPAdes with error correction, and then going to separate out the perfect reads again. This should give me a dataset nearly clean of NF001 with most of the human reads removed.
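The "perfect matches only" idea can be made concrete by filtering aligner output on the edit-distance tag. A minimal sketch in Python, assuming SAM-formatted output with an NM tag (field layout follows the SAM spec; a real pipeline would use samtools or BBMap's built-in filters instead):

```python
# Keep only alignments with zero edits and a CIGAR that is a single
# full-length match (no soft clipping, no indels).
def is_perfect(sam_line: str) -> bool:
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11 or fields[2] == "*":
        return False  # unmapped or malformed record
    cigar = fields[5]
    pure_match = cigar.endswith("M") and cigar[:-1].isdigit()  # e.g. "150M"
    return pure_match and "NM:i:0" in fields[11:]
```

Reads failing a filter like this would fall into the "unknown" pool described above.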

Answer by GenoMax, 9 weeks ago:

> So if I take my human sample and remove reads matching T2T, that should give me a host-free set of reads?

For the purpose of host decontamination it does not matter much which genome build you use. NGS aligners are stochastic, and it is possible that even after an initial round of removal, a small number of reads in the "cleaned" data may match in a second round against a different build, e.g. hg38.

Using tetranucleotide frequencies (as suggested by Mensur Dlakic in a previous thread) may be a better option to separate host sequences.
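As an illustration of the tetranucleotide idea (a sketch of the general technique, not Mensur Dlakic's exact method): each sequence is reduced to a 256-dimensional 4-mer frequency vector, and host vs. microbial sequences can then be separated by clustering in that space.

```python
from itertools import product

# All 256 tetranucleotides over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {k: i for i, k in enumerate(KMERS)}

def tetra_freq(seq: str) -> list:
    """Normalized tetranucleotide frequency vector; windows with N are skipped."""
    counts = [0] * 256
    total = 0
    seq = seq.upper()
    for i in range(len(seq) - 3):
        j = INDEX.get(seq[i:i + 4])
        if j is not None:
            counts[j] += 1
            total += 1
    return [c / total for c in counts] if total else counts
```

Vectors like these are what binners cluster; two sequences from the same genome tend to sit close together even without any alignment.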

> I have previously chopped up my reference sequence into 50, 75, 100, and 150 bp fragments and mapped them against hg38 and T2T. About 5,000 fragments mapped to hg38 and about 10,000 mapped to T2T.

What reference are you describing, and what is its relevance to the choice of human genome build?

We have discussed this in a past thread, but BLAST does local alignments and may show hits that do not span the full length of the read. So please use BLAST only for qualitative/quick assessments. There is a version of BLAST specifically meant for short reads, called Magic-BLAST (LINK). You could use that.
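To see why a local hit can mislead, note that BLAST's percent identity is computed over the aligned region only, not the whole read. A small illustrative sketch (helper names are hypothetical; coordinates are 1-based inclusive, as BLAST reports them):

```python
def query_coverage(qstart: int, qend: int, qlen: int) -> float:
    """Fraction of the read covered by the local alignment."""
    return (qend - qstart + 1) / qlen

def whole_read_identity(identity: float, qstart: int, qend: int, qlen: int) -> float:
    """Identity rescaled over the full read, counting unaligned bases as mismatches."""
    return identity * query_coverage(qstart, qend, qlen)

# A "99% identity" hit covering only 60 bp of a 150 bp read is, over the
# whole read, well under 40% identical.
```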

Comment (Jon):

The reference I'm having issues with is Naegleria fowleri Karachi-NF001, as this is the one that shows up in my samples. Everything I have read about this strain says it's significantly different from all of the other strains.

At the moment, I am just working on the reads that mapped with HISAT2 at default settings to the hg38 & T2T references. After some error correction (overlap correction via BBMerge and Tadpole) and remapping to NF001 and T2T, there is a considerable number of reads left over that map to NF001 and not T2T. I have assembled some of them with MetaSPAdes, which gave me good contigs around 4,000 bp that BLAST identified at 100%, so my ASSUMPTION is that they are NF001 reads? I then plan to incorporate these with the host-removed reads.

Yes, I understand that BLAST does local alignments, but if I get 100% coverage with a couple of mismatched bases, at say 98-100% identity, those are the ones I'm looking at.

So does Magic-BLAST work differently from a global aligner like HISAT2 or BBMap?

Comment (Jon):

[screenshot of a BLAST results table]

This is one of the examples I'm talking about. This is after error correction, using reads that don't match T2T.

Comment (GenoMax):

> but if I get 100% coverage with a couple of mismatched bases, at say 98-100% identity, those are the ones I'm looking at.

It is not clear which database you are searching against, but 3 of the 4 "hits" are to "human" sequences according to this table.

So either those three sequences identified as "human" are incorrectly labeled, or the NF001 sequence has (human?) sequence contamination in it.

If there were truly a ~4 kb stretch of similar sequence in the human and fowleri genomes, these reads should have been removed during the initial decontamination phase.

Comment (GenoMax):

> I have assembled some of them with MetaSPAdes, which gave me good contigs around 4,000 bp that BLAST identified at 100%, so my ASSUMPTION is that they are NF001 reads?

That is the working hypothesis. You will need to design an independent experiment (e.g. long-range PCR or Nanopore sequencing) to show that the fragments you assembled exist in the original sample, since you plan to use this result for a diagnostic purpose (if I recall right).

> I then plan to incorporate these with the host-removed reads.

Not sure what that means.

Comment (Jon):

Those were BLAST results using the default NCBI core_nt database, which apparently includes human sequences submitted nearly 20-25 years ago, and some newer.

Portions of the Naegleria fowleri Karachi-NF001 genome indeed match human sequences, but I'm not sure how much, and it depends on the human reference used (see below).

I was looking at my BLAST results again. The host reads were already removed with either the hg38 or T2T reference. It's my understanding that newer references correct errors in previously assembled reference sequences, so T2T should technically replace the 2013 hg38, since hg38 wasn't complete and had numerous errors?

Along those lines, I was looking at the accession numbers for the human-identified hits; these are all VERY old, mostly prior to hg38. Should these technically be ignored, as I assume T2T or even hg38 should include these same regions?

You previously said it shouldn't matter which reference, hg38 or T2T. So for one of my read sets, using only the reads that aligned to both hg38 & T2T for host removal, I used Trimmomatic with Q20, trailing Q15, min length 35, after which I used fastp to error-correct only the overlap, leaving the reads paired. Then, using BBDuk against the NF001 reference with k=31 and 98% of bases matching, I get 886,583 reads that match NF001 at nearly 98%.

Using HISAT2 with default parameters for host removal from those 886,583 reads, you can see the difference between the two human references:

hg38 - retained 112,651 reads
T2T - retained 19,146 reads
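For scale, the two retention numbers work out as follows (figures taken from the post; this just expresses them as percentages):

```python
# Of the 886,583 NF001-matching reads, how many survive host removal
# against each human genome build?
total = 886_583
retained = {"hg38": 112_651, "T2T": 19_146}
pct = {ref: 100 * n / total for ref, n in retained.items()}
# Host removal against hg38 retains ~12.7% of the reads;
# against T2T only ~2.2% - a roughly six-fold difference.
```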

Since the NF001 reference was, I believe, assembled in 2021, if they used any human reference it would have been hg38, so my thought is that hg38 would be the appropriate one? Or T2T, because it's newer?

Comment (GenoMax):

> Along those lines, I was looking at the accession numbers for the human-identified hits; these are all VERY old, mostly prior to hg38. Should these technically be ignored, as I assume T2T or even hg38 should include these same regions?

GenBank is an archival database. What that means is that once a sequence gets submitted, it will stay around unless the original submitter asks for its removal and/or NCBI staff determine that the sequence should be removed based on experimental and/or other information available to them. The human sequence accession record https://www.ncbi.nlm.nih.gov/nuccore/BX294002.19/ has actually been modified at least 19 times (it is at version 19 now), so there is evidence over time that this is a real sequence. It also aligns to the T2T assembly with BLAT (see below).

[BLAT alignment screenshot]

Let us say that you did everything right and the data you have shows similarity to the amoeba. The working hypothesis now is that the samples contain DNA from the amoeba, which may be leading to the manifestations seen. One way to prove the observation is real is to show that the fragments you assembled from DNA libraries actually exist in the original DNA sample (that has not been treated/fragmented). One way to do that is by using nanopore sequencing and/or a separate experimental method like long-range PCR.

Answer by cmdcolin, 9 weeks ago:

There are existing specialized tools for host read removal such as

https://github.com/bede/hostile (publication https://academic.oup.com/bioinformatics/article/39/12/btad728/7457481)

https://github.com/bede/deacon (publication https://www.biorxiv.org/content/10.1101/2025.06.09.658732v1)

You may want to review the methods used.

They do use the T2T reference in their pipeline (quote "A custom human reference genome was built from the T2T-CHM13v2.0 human genome assembly (Nurk et al. 2022) and human leukocyte antigen (HLA) sequences")

Comment (GenoMax):

While this tool looks interesting, things are not as straightforward as they sound, based on the main question in this thread (please refer to Host Removal Issues Human/Dog with samples containing Eukaryotic species and other threads by the OP for more background on the experiment).

Comment (cmdcolin):

Thanks for the link. This is good background on what the OP is 'truly' asking here. The finding of N. fowleri would probably be very serious... for example, this patient went into a coma and died: https://academic.oup.com/jtm/article/29/4/taab172/6420899

In any case, their aim seems to be recovering eukaryotic microbes like N. fowleri from human and dog samples. The "hostile" tool linked above doesn't say a lot about eukaryotic microbes, but I feel it would be worth trying it out anyway.

Comment (GenoMax):

This is an interesting study. Assuming the OP has done everything right, there is a reasonable hypothesis here. The OP needs to prove, by independent means, that the fragments assembled are actually present in the original sample. Conclusive proof of the infection will require intervention from relevant medical personnel, additional sampling and/or medical imaging, all well beyond the scope of bioinformatics (and this forum).

Comment (Jon):

I have read that eukaryotes are the most difficult to classify, and I can now see why: they have high similarity to human DNA. Amoebas such as Naegleria fowleri are reported to share about 30% similarity with humans. This isn't contamination; it's part of the genome, so aggressive host removal would remove amoeba DNA because it's similar, but not identical, to human. Most classification pipelines do aggressive human host removal and would wipe out a significant number of amoeba reads. Hostile's host removal process is aggressive, designed to remove the highest amount of human(-like) DNA.

Like I said previously, just THIS subset I'm working on consists of the reads classified as hg38 & T2T; I still have the other subset of host-removed reads. I think the issue here initially was that I minimally trimmed the reads, so there would be a higher number of potential errors; I believe I initially just trimmed to Q15 with the assumption that assembly would correct minor errors. So these reads I trimmed to Q20, trailing Q15, and overlap-corrected mergeable reads with fastp. I believe HISAT2 classified the reads as human just because of the higher number of allowed errors in the reads.

I'm now trying to assemble reads after T2T host removal with HISAT2, and these align much better. My longest contig is 5,112 bp at 100% coverage and 99.29% identity; the matching sequence is 5,173 bp, so it nearly assembled the entire sequence. I noticed this with quite a few of the contigs: nearly the entire sequence assembled at a 99%+ match.

This still leaves me with the issue of how many of my contigs match the reference at 98% identity or better. Binning likely wouldn't work: given the high similarity to human, they would all likely go into the same bin. I'm assuming some alignment program, but none that I know of will filter by percentage, except BBMap, which doesn't take FASTA.

Comment (GenoMax):

BBMap will take FASTA input. With ~4-5 kb contigs you should switch to using minimap2 (LINK) for alignments. You can simply use the fowleri genome you have as the reference.
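minimap2 does not filter by identity itself, but an identity estimate is easy to compute from its PAF output (column 10 is the number of matching bases, column 11 the alignment block length). A sketch, assuming contig-vs-genome PAF lines:

```python
# Approximate per-alignment identity = matching bases / alignment block length
# (PAF columns 10 and 11; a gap-compressed identity would need the de:f tag).
def paf_identity(line: str):
    fields = line.rstrip("\n").split("\t")
    matches, block = int(fields[9]), int(fields[10])
    return fields[0], matches / block

def contigs_at_identity(paf_lines, min_id=0.98):
    """Names of contigs whose best alignment reaches min_id."""
    best = {}
    for line in paf_lines:
        name, ident = paf_identity(line)
        best[name] = max(best.get(name, 0.0), ident)
    return sorted(n for n, i in best.items() if i >= min_id)
```

Run `minimap2 ref.fa contigs.fa > aln.paf` and feed the lines of `aln.paf` to `contigs_at_identity` to answer the "how many contigs are at >=98%?" question directly.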

That said, you already have a hypothesis. Are you simply trying to convince yourself of the results by doing these additional manipulations?

Comment (Jon):

I can only use tools that are available on the UseGalaxy Europe site. BBMap there will only take FASTQ.

Does minimap2 filter alignments at a minimum identity percentage, or is there just an assumption that wherever a contig maps is correct? I have heard it can be used, but I didn't see anything about a minimum percentage. I have also heard that I can use minimap2 to map reads back to the contigs in order to estimate abundance.

Overall, yes, I want to have complete confidence in my results. The last thing I want to do is present my findings and not be correct. It also proved my hypothesis that host removal was actually removing NF001 reads.

Like I said, it's been complicated. The identifications varied depending on which online service I ran my reads through, because NF001 is not in their databases. One site marked them as Mycobacterium leprae, one as Mycobacterium tuberculosis (Oman strain), one as Plasmodium ovale, and Kraken2 with the core_nt database gave a wide range of species against the chopped-up reference sequence. The NF001 reference actually does match these various species, but generally only over a small region for each. I won't know if Plasmodium ovale is actually present until I nail down the NF001 reads. Once I have these two species nailed down, I will have an online platform do the analysis without those reads.

One question about minimap2: my issue with it was that my mapped reads contained both pairs and singletons. I assume I can't use mixed pairs and singletons as input for MetaSPAdes?

Comment (Jon):

I have actually been working on this for about 2 years (not on all samples); I only stumbled onto the Naegleria fowleri NF001 reference about 4 months ago. It was the first reference I found that actually mapped well to all of the samples, and it fit with the fact that all of the other species I was trying to chase down showed only minimal mapping. I have started and restarted from raw reads numerous times.

Comment (Jon):

I think I have things figured out; let me know if this makes practical sense to you. With the Naegleria fowleri Karachi-NF001 reference, about 1% maps to the human T2T reference, so if I am only using reads that map perfectly to the Naegleria fowleri reference, there will be some extra reads that match T2T, but they won't affect assembly.

So I used Bowtie2 with settings that would only give me reads that mapped perfectly to the reference. I initially mapped my reads trimmed to Q15, first to my reference, then to T2T. Then I trimmed again to Q20 and repeated the mapping.

This should give me all reads that map perfectly (100%) to my Naegleria fowleri reference. The problem I was having before is that I was trying to assemble with SPAdes or MetaSPAdes, but I think due to the high number of repeats in the reference, assembly wasn't consistently very good. So this time I assembled with MIRA, using my Naegleria fowleri reference to guide the assembly. The reference contains 3,477 sequences, and MIRA gave me back 3,477 contigs at or near 100%. Most likely the minor inconsistency is due to trimming, or to single reads that I hadn't included in the assembly. Since the assembly was nearly complete at nearly 100%, I can use the 420,000 reads as my read count (this one was from the CSF rhinorrhea sample); my blood samples are well over 1.5 million reads.

Since I can't reliably separate Naegleria fowleri reads from human except at a 100% match, I don't plan to use the imperfect ones. Though my remaining reads likely contain Naegleria fowleri reads at less than 100% match, I need to remove them to analyze for bacterial reads. My assumption is that I can use HISAT2 and map against both T2T and the Naegleria fowleri reference to give me clean reads for de novo assembly?
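The two-reference subtraction described above can be sketched in miniature. This is only an illustration of the logic, with exact shared-k-mer membership standing in for a real aligner like HISAT2, and all sequences made up:

```python
# Remove any read sharing a k-mer with either reference; what survives is the
# "clean" pool for de novo assembly. Reads shorter than k are kept by default.
def kmers(seq: str, k: int = 21) -> set:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def subtract(reads, references, k: int = 21):
    ref_kmers = set()
    for ref in references:
        ref_kmers |= kmers(ref, k)
    return [r for r in reads if not (kmers(r, k) & ref_kmers)]
```

With real data this would be two HISAT2 passes (or one pass against a combined T2T + NF001 index), keeping only the unmapped reads.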
