Hi all, I'm working on a project that involves trying to develop a method for identifying whether a specific sequence (specifically looking at viral genomes) is contained within short-read sequencing data or not. I'm using this tool that creates a searchable index of all kmers that occur more than twice (to account for sequencing errors) from a number of fastq files which can then be queried with the sequence of interest and returns, for each fastq, the proportion of query kmers that also appear in that sequencing run. This is much faster than performing short-read mapping with something like BWA or bowtie so could be quite useful.
The problem is, for some datasets, the results make no sense. The below table is a result from querying a set of 8 RNAseq runs from human T cells, 4 that were infected with HIV and 4 controls, with the HIV1 ref seq (NC_001802.1). The k value used here was 20 (other k values have been tested with little improvement). Bowtie mapping was used to confirm whether the viral sequence was present or not.
Run_Number Infection_Status Kmer_Match_Proportion Bowtie_Mapping_Proportion SRR2648303 Infected 0.9811207 0.40983 SRR2648299 Infected 0.9781254 0.42032 SRR2648305 Infected 0.9738461 0.35961 SRR2648301 Infected 0.9732676 0.35042 SRR2648293 Control 0.8603503 0.00032 SRR2648294 Control 0.8415871 0.00036 SRR2648297 Control 0.8244479 0.00029 SRR2648296 Control 0.8077031 0.00019
As you can see, the infected runs all have very high kmer match proportions and correspondingly high bowtie mapping proportions (number of mapped reads/total number of reads) confirming that those runs contained HIV. The control runs however also had fairly high kmer match proportions (though distinct from the infected runs) but had significantly lower bowtie mapping proportions, indicating that those runs did not contain HIV. So if there was no HIV in those runs, how do they have 80-86% of HIV's kmers?