How to determine species of paired-end .fastq files (low mapping rates and inconsistent findings)
4
3
Entering edit mode
3.2 years ago
LRStar ▴ 200

This is somewhat of a follow-up from previous posts (most recently C: How to determine which NCBI sequence to map against (multiple sequences for sing ). I was recently given a paired-end .fastq files (where each read has about 150 bases). The wet-lab researchers believe it is from an isolated strain of bacteria likely Helicobacter (this may be questionable). It was taken from dolphin stomach.

My job is to determine the identity (closest species etc.) of these samples. I have been attempting this for weeks and am stuck. Below are a few approaches I have tried:

1. I created contigs using SPAdes. I blasted a few of the contigs. Some seemed to align to Helicobacter cetorum MIT 00-7128. So, I downloaded that as a reference and aligned the sample to it using BWA. It had only 13% mapping rate. I also noticed that some contigs (shorter ones) blasted to Mus musculus/ Homo sapiens.

I thought I should do a more sophisticated approach than blasting contigs. So, I tried (2), (3), (4) below:

2. Downloaded all “Complete genomes” of “Helicobacter” genus from NCBI (n=221). Used sourmash to compute k-mer sequences to determine relatedness of genomes. The highest similarity was 43.5% to Helicobacter pylori P12, but there were dozens of similarities with almost the same percentage to other Helicobacter strains.

3. Ran Kraken on Galaxy against a database of "plasmids", "viruses", and "bacteria". It had only 14% bacteria classified (and even less virus/plasmid), with most (5.66%) mapping to Helicobacter pylori.

4. Ran sendsketch on the contigs. It had WKID = 38.9%; KID = 0.05%; ANI = 96.6%; Contam = 2.5% for H. macacae (and similar numbers for H. mastomyrinus) and WKID = 1.1%; KID = 0.85%; ANI = 84.8%; Contam = 1.7% for H. cetorum MIT 00-7128.

I thought the numbers above seemed low (please let me know if you think otherwise). They also do not seem consistent (with H. pylori showing up more in sourmash and Kraken and H. macacae showing up most in sendsketch) - but still that probably does not matter given how low these values are anyway.

5. In my latest post (linked above), a user suggested taking small selections of reads (~20-25) and blasting. I did this for both the first 25 bases and the last 25 bases in the file. BlastN on the first 25 bases had Helicobacter cetorum MIT 00-7128 with lowest e-value (3e-39), megablast on the first 25 bases had "no significant similarity found message", BlastN on the last 25 bases had Helicobacter felis ATCC 49179 with lowest e-value (1e-25), and megablast on the last 25 bases had Helicobacter felis ATCC 49179 with lowest e-value (3e-19). I am not sure if there are certain parameters I should use (such as blastN versus megablast) and whether theses results seem consistent - especially because the highest WKID value from sendsketch was a different Helicobacter species.

I do not have much experience at all with this type of analysis. Over the weeks, I feel like I am working in circles and throwing different software at this sample. My main findings so far: 1) There may be contamination, 2) Mapping/alignment/similarity scores seem low, 3) Different software can point (despite being low in value) to different Helicobacter species.

The biologists press me to tell them what species their sample is and I have been unable to feel confident about giving an answer. My worries are that it could be that it is not bacteria (maybe archaea), is too contaminated, is not "isolated", etc.

For anyone with experience, what would you recommend to someone in my position to confidently determine what species this is (if that even seems possible at this point)? Specifically:

1. How should I determine if contamination should be removed and how should I do so safely (if needed)?

2. When does one feel confident they can report the species back to biologists? Are my numbers above indeed too low and conflicting?

3. How can I determine this is isolated bacteria? And not something else like archaea or pure contamination or multiple species?

4. What other approaches should I take to determine the species (especially interested to hear ideas that add something new to what I have already tried, i.e. are not simply throwing another software similar to the ones above at it) :oD .

I apologize for the long post (wanted to be clear about what I have tried). Thank you for sharing your ideas.

bacteria kraken spades blast sendsketch • 3.7k views
0
Entering edit mode

In case you're still stuck with this problem (I'm assuming not), I would be happy to guide you further or collaborate with you on this! Cheers

4
Entering edit mode
3.2 years ago
GenoMax 122k

I will only address a few things since there is so much to unpack.

It was taken from dolphin stomach.

Since we are starting there, there is no way to know what may have been in there to begin with. Idea that there would be only one species and that one would be able to identify it with certainty is very naive.

Some seemed to align to Helicobacter cetorum MIT 00-7128. So, I downloaded that as a reference and aligned the sample to it using BWA. It had only 13% mapping rate.

It is possible that there is some H. cetorum in this sample (13% if you want to go with your number). And then there could be tens of other things. It is also not surprising that you find Helicobacter spp. as prominent member. It is consistent with human stomach microbiome, which shows a similar trend (see paper linked below).

I also noticed that some contigs (shorter ones) blasted to Mus musculus/ Homo sapiens.

This could simply be contamination since a dolphin is unlikely to have those in their stomachs. It is possible that you may have some dolphin DNA that was in your sample (eukaryotic DNA) that is similar to human/mouse.

I did this for both the first 25 bases and the last 25 bases in the file.

I was the one who had suggested that. I hope that actually means 25 reads and not bases. If you took 25 bases from each read then please go back and re-do the blast analysis.

The biologists press me to tell them what species their sample is and I have been unable to feel confident about giving an answer.

Bottom line is this sample is unlikely to contain only one bacterial species since it was collected from an environment which could harbor many different bacteria. See this example paper for human gut microbiome. There is also a review on human stomach microbiome which finds more than one genera.

Also remember that you could be looking at data from species of bacteria/archaea that have never been seen before (dolphin stomach is pretty exotic habitat) so there may be no absolute identification possible at species level (unless they sample more dolphin stomach contents and you find similar sequences there).

You should perhaps do metagenomic data analysis to get an overall picture of what is in your data.

1
Entering edit mode

@genomax Thank you for your points. Just a few responses: 1) My advisor recently reiterated the scientists believe they successfully "isolated" one strain. I am not sure how reliable that is with wet-lab technology these days. They referred to a paper investigating possibly similar strains in dolphins (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0083177). 2) I apologize. You are right. I looked at 25 reads not 25 base pairs. 3) I look forward to reading those papers.

2
Entering edit mode

the scientists believe they successfully "isolated" one strain.

Your results seem to indicate otherwise though it is not possible to verify unless we can access the data.

Paper you linked does support one of your results. The predominant (if you can call it that) organism in your data can indeed be Helicobacter cetorum.

0
Entering edit mode

@genomax (and others). Thanks for your input. I do have the following thoughts to move forward and would be grateful to hear ideas for any of them (sorry this is a lot of questions). 1) Do you have recommendations for thoroughly removing "contamination" to see if mapping rates increase? 2) Are there any simple tools to check if this may be archaea instead of bacteria (or did my above analysis likely cover that)? 3) From a bioinformaticians perspective, does my analysis seem thorough? Would it be worthwhile to also do protein analysis instead of nucleotide analysis or would this likely lead to the same result? 4) They also request me to recreate Table 2 in the paper I linked in my just-posted comment (the Table shows number of genes in the sample, %coding in the sample, presence of specific pathogenic genes in the sample, etc.) Would recreating such a Table be impossible if I am unable to identify the species in the first place (And/or if this is indeed a metagenomic dataset like you suggested looking into)?

1
Entering edit mode

Do you have recommendations for thoroughly removing "contamination" to see if mapping rates increase?

Once you get away from the idea that there is/can be only one species in your data the rest would not be contamination but just other members of the microbiota that you still need to identify.

Would it be worthwhile to also do protein analysis instead of nucleotide analysis or would this likely lead to the same result?

Depending on how much time/effort you are willing to spend on this you could. You could also use something like DIAMOND to blast the entire read set against NCBI nr or RefSeq_bacteria. But before you do that follow some of the other suggestions in answer posted by @ctseto to look at the metagenomic composition of the sample rather than focusing on species level identification.

0
Entering edit mode

Thanks @genomax. This is a lot of useful information and I have been parsing through the advice from everyone. As a newbie, I may be missing this general idea. But you suggest looking "at the metagenomic composition of the sample rather than focusing on species level identification". Isn't my approach with Kraken looking at the "metagenomic composition" (i.e. what percent of the sample belongs to what genus/species)? If Kraken is not looking at "metagenomic composition", then I am just trying to get a picture of what the output from a "metagenomic composition" would look like. Sorry again if I may be missing a big picture. Thank you.

2
Entering edit mode

I don't know if you downloaded the pre-made kraken database or made your own but pre-made kraken database is not an exhaustive compilation of all known bacterial species. Kraken is mainly a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences so as such your results are going to depend on what is in your database. On other hand programs like MetaPhlAn2, can do profiling of composition of microbial communities.

0
Entering edit mode

@genomax Thanks for your advice. I ran the sample through MetaPhlAn2 (first on forward reads R1.fastq and then on backward reads R2.fastq). In both cases, the output (profile.txt) returned back as "100% unknown". I then ran the same protocol on another sample provided by the same group of biologists (they said this was a standard isolated dataset), and it came back as "100% Helicobacter_cetorum". Is it reasonable for me to think that their sample of interest (the unknown one) is either such a new strain (or group of strains) that is maps to no clade-specific marker genes? Even if this were a diverse metagenomic sample from the dolphin stomach (i.e. not "isolated" as they say), shouldn't I at least expect the MetaPhlAn2 results to the have some species listed at different percentages (rather than 100% unknown)? Thank you again for sharing your ideas and expertise.

2
Entering edit mode

Based on your SPAdes data and alignments we know that there is some H. cetorum there. You have done your due diligence and have tried most things others have recommended here. Since you (and your collaborators) lack expertise in this area you may want to find a metagenomic expert (or collaborator) who can directly access your data/results to provide further guidance.

I hate to mention this but some of the data could simply be sequencing/library artifacts. The sequence may look great and have good Q-scores but they may not be real.

1
Entering edit mode

I hate to mention this but some of the data could simply be sequencing/library artifacts. The sequence may look great and have good Q-scores but they may not be real.

how does one detect that? Detecting adapters and the like based on the sequences from the Illumina letter? kmer distributions that look too uniform? BBDuk time?

1
Entering edit mode

I only mentioned that as a remote possibility. My assumption is that @LRStar cleaned up the sequence data before doing any analysis, so we should not need to worry about adapters.

I am mainly thinking of those contigs that seemingly have no "hits" to any data in Genbank. By now we have sequenced enough organisms that there should be something in the databases that is somewhat similar. If this analysis has not been done (that was one of the reasons to ask for the stats below) then @LRStar needs to check on that. Reads that assembled into such contigs would be artifact candidates.

1
Entering edit mode

If it was metagenome I might think it was host which would be unclassified in a bacterial | virus | plasmid kraken database...

1
Entering edit mode

@ctseto Thanks for sharing the thoughts. I did try blasting 20-25 reads from the unclassified from Kraken and most of it seems to be most related to Helicobacter (although some returned with "no significant similarities found"). Since blast can only look at small pieces at a time, I find it still hard to get a picture of what this data is. Perhaps I can try to map these reads directly to several reference genomes (like dolphin/human) and see how much of these unclassified reads map?

1
Entering edit mode

That might help, then visualize using Mauve, which might be best for this purpose; or generating a mummer plot to see how things line up. A Mauve plot will help you determine if there's a ton of repeat content (which might point at the assembler failing to resolve a tangle)

1
Entering edit mode

Can we get some stats (should have asked this before) on the data?

1. How many reads (and their length) in your original dataset? Did you scan/trim your data before doing any other analysis?
2. How many contigs did you get from SPAdes? Did you use metaSPAdes option?
3. How many contigs showed "hits" to known sequences?
1
Entering edit mode

@genomax Thanks, yes.

1) There were ~240,000 reads, each with a length of 150 base pairs. I did not scan/trim the data but I did run a FASTQ check. I am uploading the read 1 FASTQ output here and read 2 output here. The one item that failed is the "per base sequence content". It has been on the back of my mind and sorry I did not mention it earlier. According to the FASTQ outputs, it seems the per base sequence content may be slightly off at the beginning AND end of the sequence, and some resources, suggest seeing this on BOTH ends of the reads may be due to aggressive trimming. Do you think this may be the case and is enough concern for me to ask the wet lab biologists if they had trimmed this?

2) I got 39 contigs from SPAdes (ranging in length from 78 to 498771 with a median of 277) and (ranging in coverage from 0.415 to 771 with a median of 20.918). It seems I did not run metaSPAdes (I used command: spades.py -k 21,33,55,77 --careful -1 R1.fastq -2 R2.fastq -o SpadesOutput).

I can also return and try metaSPAdes option (it seems this considers the sample as metagenomic)?

3) I can definitely check this out too. Earlier, I had blasted several of the contigs to get a general feel (which is when I noticed smaller contigs sometimes mapped to mouse/human). What would you recommend a user to do to check "hits" for known sequences. Some of my contigs are very long and they could result in "no known sequence similarity" in BLAST (after searching 60 minutes). Still, do you recommend simply uploading one-by-one each contig into BLAST (and using nc/nr with megablast and/or blastn)?

Sorry again for a long post. Thank you for all your suggestions.

2
Entering edit mode

There were ~240,000 reads, each with a length of 150 base pairs

That is not a lot of data for a metagenomic WGS study. If we consider a bacterial genome to be ~4.5Mb you have roughly ~8x coverage (in terms of just base pairs) if there was a single genome in your data.

I did not scan/trim the data but I did run a FASTQ check.

It is possible that your data was clean but for any de novo data analysis it is best to scan/trim your data to confirm absence of extraneous sequence. I suggest you use bbduk.sh from BBMap suite. FastQC does not look at every read in your dataset but does a sampling based analysis. That is normally ok for alignment based analyses. Having extraneous sequence in your data can lead to odd results if you are assembling the data.

is enough concern for me to ask the wet lab biologists if they had trimmed this

If all of your reads were 150 bp long then the data likely has not been trimmed. Do you know if data was selected in some way? 240K reads is low for a WGS study.

Contigs below 500 bp are probably not very informative and should be filtered out from further analysis. You should definitely re-run SPAdes with the meta option on (on trimmed data) and see what you get.

Are you able to setup BLAST+ locally and run? Did all blast searches finish or you gave up after 60 min assuming there was no result? The largest contigs can be split in smaller chunks for purpose of blast search, if you want to speed things up a bit.

Finally if you have the compute resources available it may be worth running DIAMOND (on scanned/trimmed data) against nr to see what you get. Since you only have 240K reads that should not be too bad. You will need a machine with ~60-90G of free RAM to run DIAMOND.

1
Entering edit mode

Thank you so much @genomax again for your suggestions. Sorry for late reply (traveling with inconsistent internet).

1) Communication with wet lab biologists has been hard (language/culture differences). I will inform them that the reads may be low and ask how they selected their data. One (possibly) good sign is they gave me 4 samples and the only one that I can make sense out of so far (it maps to H. cetorum MIT 00-7128 in numerous software) actually had the lowest reads (240K). So maybe low read counts may not be a problem. The largest has about 480K.

2) I was running BLAST searches on the contigs (about 40 per sample) one by one. For the longest contigs, BLAST searches would finish but sometimes give me an error message "No significant similarity found". Moreover, different contigs would map (somewhat) differently than other contigs. I was left feeling unsure how to interpret the results because I ended up with 40 BLAST results (one for each contig). That is in contrast to other software that give one final result for the entire sample at once (easier to interpret).

I was not aware of BLAST+ locally (as you can see, I am not too experienced!) But that seems like a great idea to overcome size restrictions. I will look into it!

Conceptually, I am curious about the idea of splitting up large contigs into smaller chunks for BLAST. For some reason, I would have thought doing so would jeopardize information and lead to strange results. But if that is standard practice, I will change my thinking on it.

3) Thanks for your suggestions about DIAMOND. I do not have the resources locally, but I tried this on Galaxy. I received an output: "Reported 0 pairwise alignments, 0 HSSPs. 0 queries aligned." for all samples (including the one that usually aligns well to H. cetorum MIT 00-7128). So, I am going to open that in a different forum to figure it out.

Side note: My advisor seems to think these samples probably belong to the genus Helicobacter but that, because that genus is so genetically diverse, possible new species (if that is indeed what these samples are) will not map well to anything. The samples are from Japan, which he thinks may be less represented in the databases. He thinks if I redid the analysis and set default cutoffs less stringently, I would find that they all map well to Helicobacter (he had seen such phenomenon working with CLC Genomics Workbench.

2
Entering edit mode

I am curious about the idea of splitting up large contigs into smaller chunks for BLAST. For some reason, I would have thought doing so would jeopardize information and lead to strange results.

Since your aim is to try and identify what is in your data this should not be a major problem as far as identification goes. Splitting was for reducing the query size, so jobs would finish quicker.

I added a comment to your DIAMOND thread. You can check that.

He thinks if I redid the analysis and set default cutoffs less stringently, I would find that they all map well to Helicobacter

Your aim should be to let your data speak for itself, rather than trying to get a result you want from it.

1
Entering edit mode

Thank you so much @genomax. I agree with your sentiment about letting the data speak for itself. My advisor is wetlab by background and does not work with command-line tools. He just noticed on CLC Genomics Workbench that less stringent parameters made the data map better to Helicobacter. His thinking seemed to be that perhaps these command-line software I am using (usually at default parameters) might have strict parameters that are better-suited for eukaryotes, whereas bacteria strains (especially Helicobacter and maybe even especially so from exotic dolphin data from Japan) could be much more diverse/unclassified. So, I think he thought maybe our data itself was actually calling for less stringent parameters. I feel uncomfortable softening parameters without first finding a reason to do so from people who are indeed experienced in command-line software and bioinformatics in general (like you and others on this forum). So, for now, I am mostly still using default parameters unless it seems to do so otherwise from vignettes or suggestions on this forum too.

1
Entering edit mode

I would not describe Metaphlan2 as a tool suited for eukaryotes. Kraken2 is just a kmer lookup database, and the fact that you need to use separate databases makes me think you'd be better off running this in command line as well, perhaps after all the other jobs. It doesn't use too much memory, but this depends on available computing power. In this case, using MaxiKraken or another database may be more helpful than trying to integrate the results of bacteria/plasmid/viruses kraken databases (You are using Kraken1 or Kraken2?)

0
Entering edit mode

Thanks @ctseto. I am embarrassed to say I am not entirely sure whether I am using Kraken1 or Kraken2. I have been using Kraken on Galaxy Version 1.2.4. I was unable to install Kraken locally but will very likely try again.

0
Entering edit mode

I'm guessing Kraken1, as you mention bacteria | virus | plasmid, which I vaguely remember are the databases used in Kraken1.

0
Entering edit mode

disregard this post, superfluous.

0
Entering edit mode
1. We're not saying that there isn't Helico, but fixating prematurely on "all map well to Helico" when it may not all be 100% Helico (or even a single strain of Helico!) could be a problem. One possibility is multiple strains of helico that got stitched together, which from Metaphlan, might not be resolvable beyond genus level, which doesn't help you very much. If you can't tease them apart, simply noting that Helicobacter is in dolphin stomachs isn't going to be as awesome/publishable as genome announcements.

You may have more than one strain, or it is one strain that is at a nexus between more than one strain, such that the databases are catching bits and pieces of multiple strains in the reference. Being careful to distinguish between case one (multiple) and case two (one with many pieces) is going to be useful here, and may well be questions that are asked at peer review.

I suspect you'll probably have to re-run this assembly job after cleanup. I'm hoping that it'll help with the artifacts. I do love spades to tears (it was my second assembler after velvet), though lately I've been impressed with megahit, which runs rather quickly and with less memory footprint than SPAdes. However the general consensus is that spades is slower and more thorough.

1
Entering edit mode

Kraken2 should be decent enough at parsing out Archaea from Bacteria; though its database might not cover all marine life: might be good to keep tabs on the unclassified column.

As for Table 2, getting genome size, GC content and the like doesn't require a priori knowledge of the organism. As for gene annotations, it means you'll have to use more unsupervised methods, like running blastx/diamond against the ORFs after running an orfcaller, versus mapping gene snippets of known genes of interest from a genus/species of interest against your isolated genome.

It may well be contaminated with multiple things...something like checkm which assesses counts of housekeeping genes and the like may give you some insight into whether or not some parts are overly redundant (which may indicate contamination).

3
Entering edit mode
3.2 years ago
ctseto ▴ 310

I created contigs using SPAdes. I blasted a few of the contigs. Some seemed to align to Helicobacter cetorum MIT 00-7128. So, I downloaded that as a reference and aligned the sample to it using BWA. It had only 13% mapping rate. I also noticed that some contigs (shorter ones) blasted to Mus musculus/ Homo sapiens.

"aligned the sample" meaning you mapped the PE reads to the reference?

How should I determine if contamination should be removed and how should I do so safely (if needed)?

If you want to be really sure about contamination, if they sequenced negative controls you could compare kmer overlap of your negative controls to your samples.

When does one feel confident they can report the species back to biologists? Are my numbers above indeed too low and conflicting?

Do they want a microbiologist-style "is bug X here or not" / "which Genus X did you find" or a description of the stomach microbiome? If they have an a priori hypothesis they wish to test, you could report that after mapping reads to that reference genome that you didn't get complete overlap, which could be a overlap issue or a different species of same genus.

If the hypothesis is "which genus may be in my sample", using your kraken2 .report it should give some percentages for given samples (note that there is mini kraken, standard kraken and maxikraken databases floating about, which will affect results). Digging through your kraken .out it'll give the C/U (classified/unclassified), taxonomy, nt, and the kmer composition. You can check the output for all the Helicobacters you're interested in and check out how many kmers in a contig were used to call it, which can be a useful confidence measure. If a contig is say 0:100 Helicobacter_number_here:1, then 100 kmers were unclassified and it called it based on 1 kmer, which might not be as exciting as something with a high number of kmers.

How can I determine this is isolated bacteria? And not something else like archaea or pure contamination or multiple species?

Multiple related species might be a tricky one to parse out with WGS alone. If the assembly graph is tangly it'll complicate things. Confidence in species calls might increase if the wetlab folks attempted to culture (selective/differential), then isolate, grow up and sequence, versus MAGs derived from a metagenome. Your kraken report should give some insights into community composition. Running metaphlan2 on the paired end reads may give a different picture.

Another option for contamination is something like VizBin, which will let you bin things out by hand. It's a bit more...manual...than I'd like, but for a reasonably uncomplicated sample like stomach it may be helpful. You can "gate" (to borrow the term from flow cytometry) contigs in a 2d tSNE grid, export them, and then figure out their taxonomies and the like. And if all your helicobacters are clustered together in the same blob, you can seperate them visually from the rest of the sample and carry out your annotation that way. If you downselect too soon (eg, only the reads/contigs that map to a given Helicobacter) you might be throwing too much out too soon.

What other approaches should I take to determine the species (especially interested to hear ideas that add something new to what I have already tried, i.e. are not simply throwing another software similar to the ones above at it) :oD .

While you're at it, running any of the metagenome binning pipelines may be useful. Bin integrity could be checked with checkm, et al. Then individual bins assessed for function/genome architecture/taxonomy. an'vio is a pretty useful turnkey program to do most of these functions, and I'm retooling towards using an'vio myself.

An additional resource to the segata paper is https://astrobiomike.github.io/genomics/ .

1
Entering edit mode

Lastly, don't feel bad if your "isolated genome" (which may well be a MAG) has a bunch of contamination, this is a known problem in the field, you wouldn't be the first and won't be the last. Make a good effort to pull the stuff out, though. Mick Watson had an interesting post on the subject of reference genomes being somewhat problematic (, which does make me feel a little...ill when I wonder about whether or not this propagated into my Kraken2 database files! Mick Watson's approach would be recapitulated in his lab's MAGpy (https://github.com/WatsonLab/MAGpy).

If they really want to be sure about the assembly graph and minimizing assembly errors/artifacts before calling genome size, gene calling and the such (but not addressing multiple helicobacter strains or just multiple isolated strains in general), then there's long read (Nanopore or PacBio); and the former could probably be done much closer to the dolphin stomach if that's something that interests your advisor or collab.

1
Entering edit mode

@ctseto. Thank you for your very helpful answer. It is overwhelming how much I do not know about this subject area yet but your response helps familiarize me with several possible tools and perspectives. A couple notes. (1) Yes, when I said "aligned the sample", I meant I mapped the PE reads to the reference. (2) They did not have a prior hypothesis. They seem convinced they isolated and cultured a single strain (and they wonder if it could be a new strain). As a result, they want me to reproduce something like Table 2 in this paper. You seem to indicate that I might still be able to reproduce this Table using non-supervised methods (which I hope to eventually try). (3) Thanks for your suggestions looking into the Kraken output. The results for this sample can be viewed here. It seems to mostly consist of H. pylori and H. cetorum (although 85% is "unclassified"). I am really not sure what to think of that (other than what you suggested earlier that the database may not have a lot of marine life). When I look at the Kraken output in the format you suggested, the number of kmers seems low in many cases, as an example for H.pylori with an ID of 210 (0:120 A:31 0:47 210:2 0:72). (4) I can inform them about long read (Nanopore and PacBio) as you suggested. (5) I look forward to possibly trying MetaPhlAn, VizBin, and MAGpy. Can I ask at a high level: What in your opinion is the additional benefit/new perspective with using MetaPhlAn on the data (as opposed to Kraken)? My current belief is that they both help determine the abundance of different species in a sample but they use different measurements to obtain those estimations (with Kraken using k-mer matches and MetaPhlAn using gene markers)? Are results between the two usually consistent (just trying to get a general idea)? Thank you so much again for your helpful inputs.

1
Entering edit mode

@ctseto. Thanks for your advice. I ran the sample through MetaPhlAn (first on forward reads and then on backward reads). Both cases, it returned back as "100% unknown"! I then ran a similar sample by the same group (they said this was a standard dataset), and it came back as "100% Helicobacter_cetorum". This really makes me think this sample is either really new organisms or very messy (although if it were messy, I guess MetaPhlAn would just return back numerous species).

1
Entering edit mode

blast chunks of your kraken-unclassified contigs with BLAST

1
Entering edit mode

Metaphlan estimates from species specific markers, which is probably better at relating back to relative abundance. As for Kraken, its percentages are % of contigs (omitting PE reads and coverage per contig), so they are different measures of abundance.

1
Entering edit mode

In re Kraken; consider alternate databases (this will require local deployment)

https://lomanlab.github.io/mockcommunity/mc_databases.html

3
Entering edit mode
3.2 years ago
Mensur Dlakic ★ 21k

Here is an idea that hasn't been mentioned yet: try to find small ribosomal subunit rRNAs in your sample. That can be done with SSU-ALIGN, RNAmmer or Barrnap. Assuming you identified some 16S/18S rRNA sequences, their placement within a larger SSU trees of prokaryotes and eukaryotes should give you a good idea about what is present in your sample.

Separately, you did a good job with what you have tried so far. I would say that your results do indicate the presence of Helicobacter-like microbe, but it is tough to say how prevalent it is.

1
Entering edit mode

Thank you @MensurDlakic. I appreciate your advice to look into small rRNAs and will read more about the software you suggested. I also appreciate your confirmation that in your opinion my analysis has been thorough so far. With your experience, you seem to believe that my results indicate the presence of Helicobacter-like microbes. I am trying to educate myself why you (and others) do not seem alarmed with the high rate of "unclassified" reads from Kraken (85.96%), as can be seen here. I am curious to know your viewpoint. When interpreting these Kraken results, most species seem to belong to H. pylori (5.66%) and H. cetorum (4.81%). So, even though these percentages are low (5.66% and 4.81%), should I still consider this strong evidence that the majority of this sample is composed of these species (since they had the highest percentages, despite being low?)

2
Entering edit mode

I am trying to educate myself why you (and others) do not seem alarmed with the high rate of "unclassified" reads from Kraken (85.96%), as can be seen here.

I don't know enough about dolphins to be certain, but my guess is that their microbiota have not been well characterized. If so, it could be that microbes from dolphin stomach are so unique that most of their sequences do not match presently known microbial sequences. By the way, how current is the Kraken database?

So, even though these percentages are low (5.66% and 4.81%), should I still consider this strong evidence that the majority of this sample is composed of these species (since they had the highest percentages, despite being low?)

I would not make that assumption. In the most extreme case, it is possible that the whole unclassified fraction (85%) belongs to one species or couple of related species that we have not seen before. The only thing that seems clear at this point is that Helicobacter is most abundant with regard to known bacterial species.

0
Entering edit mode

Thank you @MensureDlakic. I see your thinking better now. The Kraken database that I used was on Galaxy version 1.2.4 (the latest). I just tried to figure out how current that Kraken database is (I ran it on the three available Kraken databases on Galaxy "plasmids", "viruses", and "bacteria") - but I could not figure it out. I can look into it.

1
Entering edit mode

One strong possibility in that case is that the unclassifies are Eukarya. Have you tried throwing those contigs into blast (eg, blast against ncbi nr/nt) to see what comes up?

0
Entering edit mode

Thanks again for your help @ctseto. That is a good idea. As in point (5) of my original post, I had previously blasted 20-25 reads as suggested by @genomax. These unclassified reads (as they constitute 85% of the data) will likely have overlap. However, I tried it again by blasting the first 25 reads, last 25 reads, and middle 25 reads. In each case, I used nr/nt as you suggested and used megablast, discontinuous megabalst, and blastn. Using megablast, all three sections resulted in "No significant similarity found". Using discontinuous megablast, two sequences showed Helicobacter cetorum and pylori strains and one showed Legionella longbeachae strain (but with e-value 8.9 and percent identity 75.86%). Using blastn, two sequences again showed Helicobacter cetorum and pylori strains and one showed Cucumis melo genomic chromosome (but with e-value 0.0001 and percent identity 81.03%). I am not sure if my approach is recommended and am happy to try other approaches.

1
Entering edit mode

This makes me think some kind of Helicobacter that doesn't have markers for Metaphlan to pick up. But you'll want to convince yourself that there isn't some kind of bizarre de novo assembly graph tangle here. And if not, then it's some weird helicobacter rearrangement.

Kick off a binning job and assess the contigs that go into the bins and see what their estimated functions are. If you get a lot of redundancy it might well be assembly artifacts.

Pivoting back to the limited scraps of bench microbiology (I was ugrad micro before switching to bioinformatics), did you actually streak it out yourselves (assuming this is the appropriate way to culture Helico) or did you just sequence what was given to you?

1
Entering edit mode

@ctseto Thank you for your comment. It might take me a while to get to performing a binning job (using VizBin I think with what you suggested earlier), but I look forward to trying it too. Conceptually, if I see redundancy and there are assembly artifacts, as you state, would there be a program to remove these artifacts (or would I just do that by hand)? Also, thank you for your question. I have zero wet lab skills (and apparently very little computational skills, ha) for this topic. I only received the .fastq files from the wetlab scientist group.

1
Entering edit mode

A complex path could easily be a tangly genome with lots of recombination or multiple related strains, and teasing those apart with just paired end is tricky; but you can still generate draft genomes and try to annotate.

It would be tempting to just focus on the Helicobacters, but the "unknowns" are also interesting.

I am unsure if you want to keep using this dataset, or if you want to pass your NGS data through host removal, then bbduk/pre-cleaning and then re-do the assembly.

0
Entering edit mode

Also @ctseto. I have seen contigs (although it seemed to be the shorter ones) blasted to Mus musculus/ Homo sapiens. (I mentioned this in point (1) of my original post). So, there may be some eukarya (although possibly only contamination). Not sure if there is an easy but effective method to check for how much eukarya there may be, whether contamination or not.

1
Entering edit mode

You could probably try mapping reads to a host genome. I see that Baylor HGSC was working on a dolphin genome project, though I'm not sure how confident I am in it, and estimate based on PE read numbers.

Then you passthrough all the reads that did not map (if memory serves, f13 + f64/f128 for read one and read two) to create two new fastqs, which you then subject to a standard cleanup pipeline and then metagenomic de novo assembly.

I know Todd Treagen has pushed for MetaCompass (http://metacompass.cbcb.umd.edu/) ; which attempts to "guide" the assembly based on a database of references, although this method does carry its own risks, and I remain fond of the blind assembly, despite the intrinsic possibility of errors.

0
Entering edit mode

Thank you @ctseto. I was unaware of the Baylor HGSC dolphin genome project and can look into using it as a host genome. I am sorry to ask, but could you kindly expand on "f13 + f64/f128 for read one and read two"? I tried Googling those values but remain a bit clueless. Thank you very much again for your support here.

1
Entering edit mode

It's a sam flag: https://broadinstitute.github.io/picard/explain-flags.html

read paired (0x1)
mate unmapped (0x8)


is 13. You add 64 and 128 to 13 for your flags if you want to specifically pull read one or read 2.

In this case you map reads to dolphin and you want the reads that are paired (R1 and R2 in the mapping) and f13 specifically pulls reads where R1 and R2 did not map to host.

0
Entering edit mode

So cool. I have never done that procedure before and will likely start doing so especially with these datasets that seem like they need thorough cleaning.

0
Entering edit mode

@ctseto. I am attempting to map the reads to a host genome (and also run QUAST on my metaspades contigs using the host genome as a reference). This might be a very basic question, but I am unsure how to download an appropriate (at least starting point) reference genome for dolphin.

1) I tried typing "dolphin" into RefSeq, which resulted in thousands of output. Even after filtering for only "animals" and only "mammals", there were hundreds of output and many were still strains of organisms from dolphin (i.e. not the dolphin reference genome itself).

2) I tried looking through papers (like here) and looking at their NCBI accession (in this case GCA_001922835.1) and searching on NCBI RefSeq but got no results.

3) I went to the Baylor HGSC dolphin genome project that you kindly linked. I clicked on "Draft genome assembly at NCBI", which took me here and states it "contains no sequence data".

I guess I am wondering if finding a reference genome usually requires lots of digging. If so, I will keep looking. However, if there is something about my approach that is inefficient (due to my lack of experience) and you happen to have a recommendation on how to more intelligently :) search for reference genome, I would greatly appreciate it! Thank you so very much again.

2
Entering edit mode

Sequence data for dolphin is available here. There are 500K+ contigs. Click on the Download tab and then get the fasta format files.

I think you are putting in a lot of effort for a really small amount of data that may be of questionable provenance (you don't know how the biologists got that data, if it was pre-processed in any way and why there is only a small amount). Of 240K reads you have how many align to H. centorum and how may align to nothing?

1
Entering edit mode

@genomax Thank you for your comments. I used BWA to align each sample to the reference genome (H. cetorum MIT 00-7128). Then, I used Qualimap to assess the alignment. The first sample had 85% mapping rate, but the other samples (which all seem similar to each other and are the ones I am struggling with) had only 13% mapping rate. I hope these are the values you are asking about, but if not, I can do other approaches.

Side note: I reran SPAdes using the --meta option. The results remain similar to when I had ran SPAdes without the --meta option (for QUAST on the contigs, sendsketch, and Kraken). Sendsketch does show much higher WKID for H. cetorum MIT 00-7128, whereas the other three samples had high WKID for H. macacae. However, Kraken showed highest species being H. cetorum for all four samples (although the rate was 73% for the first sample and only 20-40% for the last three samples). Dramatically, QUAST had 78.3% genome fraction for first sample and ~1% for the last three samples when aligned to MIT 00-7128.

1
Entering edit mode

Those were values I was asking about. 85% alignment rate is pretty good. I would still find out what is different about samples 2-4. There has to be something more to them than what you know at present.

0
Entering edit mode
2.9 years ago
jbnrodriguez ▴ 30

As @MensurDlakic suggested, finding rRNA genes would be key. One easy way to do this would be to use phyloFlash on your library reads (works even with raw reads; link for the tool here: https://hrgv.github.io/phyloFlash/). Also, take a look into the GC-coverage plots of your assembly with each blob on your plot being taxonomically annotated using the NCBI nt database. You can conveniently generate such a plot by using gbtools (manual here: https://github.com/kbseah/genome-bin-tools/wiki/0.-Introduction). If there are contigs representing eukaryotes in your assembly, they will appear at the bottom end (meaning low coverage) of your GC-coverage plot; bacteria will be represented by higher coverage contigs that may (or may not) also differ in their GC-contents. If there are multiple bacterial taxa detected, they will appear as differentially colored blobs with a distinct taxonomy, GC- and coverage-spread on the plot. Using barrnap along with gbtools, you can additionally see on the plot the SSU rRNA harboring contigs (along with their taxonomic affiliation). I tend to not that heavily rely on read-level metagenomic classifiers such as Kraken2 or metaphlan2 (their results need to be always taken with a pinch of salt!). As a final step, I would run CheckM on your assembly to estimate completeness/contamination. Hope this helps!