Hello,
I am looking for advice on aligning error prone NGS reads to a limited number of templates.
I have sequencing data from a 2 x 250bp NovaSeq run (350M reads in each direction) and am interested in a 26bp region near the end of R2. This 26bp region corresponds to my CRISPRi oligos which I need to identify.
I have managed to trim the R2 sequence to the 26bp region I am interested in using cutadapt. The mean Phred score for this region is around 25 according to FastQC. See example:
I have 280 possible sequences that this region could correspond to (my CRISPR oligo library).
I have attempted to use bowtie 1 for alignment, using the oligo library as my “index” but only ~30% of reads mapped to oligos using this method. (I chose bowtie 1 as bowtie 2 is optimised for reads of >50bp and my reads are 26bp.)
I want to leverage the small size of my index (only 280 possibilities) to improve the % of mapped reads.
I have tried to do this manually by calculating the Levenshtein distance between every read and all 280 guides, filtering for the top 2 matches and manually choosing thresholds for an acceptable best call. My final filters are a Levenshtein distance of no more than 6 for the best match and a difference in Levenshtein distances between the best and second best read of at least 4. These were determined empirically by examining a small subset of my data. This process is very slow and the thresholds unvalidated. However, the advantage is that it allows me to accept alignments with many errors, as long as it is sufficiently better than the second-best match.
I am looking for advice on whether any validated tools or pipelines are available to find the best possible match of a limited number of templates to an error prone region. Or for suggestions on how to improve the above process.
Thank you in advance for your help!
Kristy
Then cut these bases out before you align. You can do that with
bbduk.sh
like this:Adjust trimleft number to keep the 26 bases you want.
You can then easily extend this to do the alignments as well
Thank you for this suggestion GenoMax!
I introduced stagger to the beginning of my reads so fixed length trimming will not work in this case. But the sequence based trimming from cutadapt seems to be working very well.
I have not tried bbmap for alignment, that you for the suggestion. I will have a go and see how it performs.