Question: Kraken failing to map reads when reads are merged.
0
gravatar for dylan.lawrence
20 months ago by
dylan.lawrence10 wrote:

I am attempting to analyze a stool sample of unknown composition for contents using Kraken. The data are paired end Illumina reads. I performed the following for my analysis:

Using bbtools I merged and deduped the reads:

bbmerge-auto.sh in1=R1.fastq.gz in2=R2.fastq.gz out=merged.fastq.gz outa=adapters.fa
dedupe.sh in=merged.fastq.gz out=deduped.fastq.gz

Following this I ran the merged files through kraken:

kraken --preload deduped.fastq.gz > merged.kraken 
kraken-translate --mpa-format merged.kraken > merged.labels

The above kraken commands are just an example; I ran a second run of kraken without merging the reads:

kraken --preload R1.fastq.gz R2.fastq.gz > unmerged.kraken 
kraken-translate --mpa-format unmerged.kraken > unmerged.labels

The length of the merged.fastq.gz file is 375700 lines or 93925 and Kraken was able to map 3930 reads or ~4.2%

The length of the separate paired end files is 582004 or 291002 total reads and Kraken mapped 150567 or ~51.7%

What could explain why the merged reads are mapping so poorly and the un-merged paired-end reads are not?

bbtools kraken • 932 views
ADD COMMENTlink modified 19 months ago by Federico Lopez20 • written 20 months ago by dylan.lawrence10

You should take a few of the reads and see what is happening with the merging process. Were these reads designed to merge? Have then been pre-scanned/trimmed before merging? Have you tried to use plain bbmerge.sh rather than the auto?

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax71k

So auto, from what I read, is just for the memory part. These are paired end reads 150bp each with a 250bp insert which means they are overlapping.

ADD REPLYlink modified 20 months ago • written 20 months ago by dylan.lawrence10

It appears that contrary to your expectation the insert sizes must not be what you expect. I suggest that you first try estimating that using one of the two methods from BBMap: C: Target fragment size versus final insert size

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax71k

This is the result of running the memory intensive analysis:

Input:                          291002 reads            43650300 bases.
Unique Kmers:                   1085658
Load Time:                      44.879 seconds.

Total time: 61.960 seconds.

Pairs:                  145501
Joined:                 128227          88.128%
Ambiguous:              15118           10.390%
No Solution:            2156            1.482%
Too Short:              0               0.000%
Fully Extended:         7523            2.585%
Partly Extended:        86911           29.866%
Not Extended:           196568          67.549%

Avg Insert:             127.7
Standard Deviation:     56.1
Mode:                   83

Insert range:           35 - 447
90th percentile:        209
75th percentile:        160
50th percentile:        116
25th percentile:        84
10th percentile:        66

That looks like a good merge to me, am I missing something?

ADD REPLYlink written 20 months ago by dylan.lawrence10

This does look reasonable and in line with what one would expect. Then why are you missing so many reads in the merged file in stats you had posted above?

The length of the merged.fastq.gz file is 375700 lines or 93925 

The length of the separate paired end files is 582004 or 291002 total reads
ADD REPLYlink modified 20 months ago • written 20 months ago by genomax71k

It appears my initial merge was using a less stringent merge, I've just redone the merge using that same script and got 512908 lines in the fastq file or 128227 reads.

I have now rerun the new merge2 file in kraken and that mapped only 1.93% of the merged reads. I'm really confused.

ADD REPLYlink written 20 months ago by dylan.lawrence10

Have you trimmed the data after merge to remove adapter sequences that may be present? Run bbduk.sh on the merged data.

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax71k

I followed the standard adapter trimming code and it trimmed very few reads and after trimming still only mapped 1.93% of reads.

ADD REPLYlink written 20 months ago by dylan.lawrence10

Based on the metrics above bbmerge.sh appears to be doing its job right. Have you taken a few merged reads and then blasted them at NCBI to see what you get? Are they mapping properly?

BBMap suite does have some tools to assign taxonomy. Let me see if I can find the guide for that.

Edit: Take a look at /bbmap/docs/guides/TaxonomyGuide.txt included in BBMap distribution.

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax71k

I've redone kraken with trimmed and untrimmed reads and the mapping percentage is still very low if I trim, however when I analyze the results, 86% of the untrimmed reads map to "root" which is not even domain specific. Meanwhile 100% of the trimmed reads map to Bacteria, while this is only 1.93% of the total reads, it seems better than the untrimmed result.

It may be that many of the unmapped reads are shorter due to trimming and so kraken cannot find good matches, but they may be useful for an ultimate assembly and variant checking.

ADD REPLYlink written 20 months ago by dylan.lawrence10
1

You could also try DIAMOND against NCBI nr (if you have enough compute resources available).

Are you only analyzing a subset of reads or you did not get a lot of data?

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax71k

This is the full set of reads, we got between 150,000 and 250,000 demultiplexed reads. This was spike-in data and it is mostly just a "what's there" type of thing. We are actually trying to do FACS on bacteria and wanted to test a computational method for making sure we have single bacteria, but the data I am working with was done using a dilution method, so there's likely several bacteria present.

I will definitely try DIAMOND to see how it performs.

ADD REPLYlink written 20 months ago by dylan.lawrence10

If these are from spike-ins of known references (e.g. ones with genomes present) I would think you should be seeing pretty high recovery w/ just R1, let alone assembled pairs.

It's definitely worth cross-checking w/ DIAMOND though; this plus taxonomic annotation (and maybe using MEGAN to get a quick visualization of results) gives a pretty good overall picture on what is present.

ADD REPLYlink written 19 months ago by Chris Fields2.1k

A few questions:

1) What are the read lengths? 2) What is the expected insert size?

You are getting a pretty substantial drop in sequences when merging, which makes me wonder whether those that are merging are in poor quality regions.

ADD REPLYlink written 20 months ago by Chris Fields2.1k

Read lengths are 150bp and the expected insert is 250bp.

ADD REPLYlink written 20 months ago by dylan.lawrence10
0
gravatar for Federico Lopez
19 months ago by
Federico Lopez20 wrote:

Any particular reason you are interested in classifying merged paired-end reads? If you use Kraken's --paired option, read pairs will be concatenated and classified together. According to the Kraken manual, concatenating the pairs together raises "sensitivity by about 3 percentage points over classifying the sequences as single-end reads". You can also use the option --check-names in combination with --paired to verify that the names in each read pair match.

Also, DIAMOND is excellent for searching with assembled contigs, but there are perhaps more appropriate and faster options if you are interested in classifying reads based on exact matches. For example, Kaiju, k-SLAM, or Centrifuge. Here is a recent review of read classification methods from Steven Salzberg's lab: https://www.ncbi.nlm.nih.gov/pubmed/29028872

ADD COMMENTlink written 19 months ago by Federico Lopez20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 665 users visited in the last hour