Question: Large Discrepancy Between Short Read Mapping and Kmer Analysis
0
gravatar for nickeener
5 weeks ago by
nickeener0
nickeener0 wrote:

Hi all, I'm working on a project that involves trying to develop a method for identifying whether a specific sequence (specifically looking at viral genomes) is contained within short-read sequencing data or not. I'm using this tool that creates a searchable index of all kmers that occur more than twice (to account for sequencing errors) from a number of fastq files which can then be queried with the sequence of interest and returns, for each fastq, the proportion of query kmers that also appear in that sequencing run. This is much faster than performing short-read mapping with something like BWA or bowtie so could be quite useful.

The problem is, for some datasets, the results make no sense. The below table is a result from querying a set of 8 RNAseq runs from human T cells, 4 that were infected with HIV and 4 controls, with the HIV1 ref seq (NC_001802.1). The k value used here was 20 (other k values have been tested with little improvement). Bowtie mapping was used to confirm whether the viral sequence was present or not.

Run_Number Infection_Status Kmer_Match_Proportion Bowtie_Mapping_Proportion

SRR2648303  Infected    0.9811207   0.40983

SRR2648299  Infected    0.9781254   0.42032

SRR2648305  Infected    0.9738461   0.35961

SRR2648301  Infected    0.9732676   0.35042

SRR2648293  Control     0.8603503   0.00032

SRR2648294  Control     0.8415871   0.00036

SRR2648297  Control     0.8244479   0.00029

SRR2648296  Control     0.8077031   0.00019

As you can see, the infected runs all have very high kmer match proportions and correspondingly high bowtie mapping proportions (number of mapped reads/total number of reads) confirming that those runs contained HIV. The control runs however also had fairly high kmer match proportions (though distinct from the infected runs) but had significantly lower bowtie mapping proportions, indicating that those runs did not contain HIV. So if there was no HIV in those runs, how do they have 80-86% of HIV's kmers?

ADD COMMENTlink modified 5 weeks ago by genomax73k • written 5 weeks ago by nickeener0

I'm using this tool that creates a searchable index of all kmers

Which tool is that?

ADD REPLYlink written 5 weeks ago by genomax73k

HowDeSBT (https://github.com/medvedevgroup/HowDeSBT)

ADD REPLYlink written 5 weeks ago by nickeener0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1734 users visited in the last hour