I have a set of subject sequences (25 to 40 bp long). My query sequences are 200-300 bp long. I am trying to find the locations of the short subject sequences within the long query sequences. This is the inverse of the usual alignment problem, where short reads are aligned to a long reference sequence. I have tried creating a BLAST database from the short subject sequences (25 to 40 bp) and querying it with the long (200-300 bp) sequences. I am getting good results, but the sensitivity is not that great. I have tried bowtie2 and bwa-mem, but the results are even worse. Does anyone know how to solve this problem other than doing a global-local alignment of every subject sequence against every query sequence? Any help is appreciated.
I am getting good results, but the sensitivity is not that great.
How do you know this? How dissimilar do your sequences need to be before you consider a hit to be a false positive?
I have a set of subject sequences (25 to 40 bp long). My query sequences are 200-300 bp long. I have tried bowtie2 and bwa-mem, but the results are even worse.
You can swap your query and subjects and align the 25-40 bp sequences against a 'reference' of 200-300 bp sequences. Short read aligners generally perform better when the shorter sequence is aligned to the longer one, as they penalise partial matches. You'll need to explicitly enable multi-mapping alignment, as you definitely want to report all alignment positions of your 25-40 bp sequences.
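If it helps to check what an aligner is missing, here is a rough brute-force sketch of the same idea: slide each short sequence along a long one and report every placement, on both strands, with at most a few mismatches. This is not what BLAST, bowtie2 or bwa-mem do internally. It is ungapped (substitutions only), the names `find_hits` and `max_mismatches` are made up for illustration, and the default threshold of 2 is an arbitrary assumption. At these sequence lengths it should be fast enough to serve as a baseline for judging sensitivity.

```python
from typing import List, Tuple

# Translation table for complementing DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_hits(short: str, long: str, max_mismatches: int = 2) -> List[Tuple[int, str, int]]:
    """Report every ungapped placement of `short` inside `long`.

    Returns a list of (start, strand, mismatches) tuples, where `start` is the
    0-based position in `long`. Plus-strand hits are listed before minus-strand
    hits. Indels are NOT handled; this is a Hamming-distance scan only.
    """
    long = long.upper()
    hits = []
    for strand, probe in (("+", short.upper()), ("-", reverse_complement(short))):
        n = len(probe)
        for start in range(len(long) - n + 1):
            mismatches = 0
            for a, b in zip(probe, long[start:start + n]):
                if a != b:
                    mismatches += 1
                    if mismatches > max_mismatches:
                        break  # too many differences, give up on this window
            else:
                # Loop finished without break: the window is within the threshold.
                hits.append((start, strand, mismatches))
    return hits


if __name__ == "__main__":
    # Toy data: one exact copy, one copy with a single substitution,
    # and one reverse-complemented copy.
    short_seq = "GATTACACCT"
    long_seq = "CC" + "GATTACACCT" + "GGGG" + "GATTAGACCT" + "GGGG" + "AGGTGTAATC" + "CC"
    for hit in find_hits(short_seq, long_seq, max_mismatches=1):
        print(hit)
```

Note that a short sequence that equals its own reverse complement will be reported once per strand at the same position, so you may want to deduplicate those.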