Question

Large Discrepancy Between Short Read Mapping and Kmer Analysis

0

Entering edit mode

5.8 years ago

nickeener ▴ 60

Hi all, I'm working on a project that involves trying to develop a method for identifying whether a specific sequence (specifically looking at viral genomes) is contained within short-read sequencing data or not. I'm using this tool that creates a searchable index of all kmers that occur more than twice (to account for sequencing errors) from a number of fastq files which can then be queried with the sequence of interest and returns, for each fastq, the proportion of query kmers that also appear in that sequencing run. This is much faster than performing short-read mapping with something like BWA or bowtie so could be quite useful.

The problem is, for some datasets, the results make no sense. The below table is a result from querying a set of 8 RNAseq runs from human T cells, 4 that were infected with HIV and 4 controls, with the HIV1 ref seq (NC_001802.1). The k value used here was 20 (other k values have been tested with little improvement). Bowtie mapping was used to confirm whether the viral sequence was present or not.

Run_Number Infection_Status Kmer_Match_Proportion Bowtie_Mapping_Proportion

SRR2648303  Infected    0.9811207   0.40983

SRR2648299  Infected    0.9781254   0.42032

SRR2648305  Infected    0.9738461   0.35961

SRR2648301  Infected    0.9732676   0.35042

SRR2648293  Control     0.8603503   0.00032

SRR2648294  Control     0.8415871   0.00036

SRR2648297  Control     0.8244479   0.00029

SRR2648296  Control     0.8077031   0.00019

As you can see, the infected runs all have very high kmer match proportions and correspondingly high bowtie mapping proportions (number of mapped reads/total number of reads) confirming that those runs contained HIV. The control runs however also had fairly high kmer match proportions (though distinct from the infected runs) but had significantly lower bowtie mapping proportions, indicating that those runs did not contain HIV. So if there was no HIV in those runs, how do they have 80-86% of HIV's kmers?

short-read mapping Kmer Counting • 1.2k views

ADD COMMENT • link updated 5.8 years ago by GenoMax 152k • written 5.8 years ago by nickeener ▴ 60

0

Entering edit mode

I'm using this tool that creates a searchable index of all kmers

Which tool is that?

ADD REPLY • link 5.8 years ago by GenoMax 152k

0

Entering edit mode

HowDeSBT (https://github.com/medvedevgroup/HowDeSBT)

ADD REPLY • link 5.8 years ago by nickeener ▴ 60