How can I allow for ambiguous matching to a reference sequence? I have a 4-base barcode preceding my sequences which I need to preserve.
This is my pipeline:
bwa index amp.fa
samtools faidx amp.fa
bwa mem amp.fa file_R1.fastq file_R2.fastq > file.sam
samtools view -bS file.sam > file.bam
samtools sort file.bam > file.sorted.bam
samtools index file.sorted.bam
I then read the sorted BAM file using R with scanBam from Rsamtools and work with it there. Mostly just because I'm a lot more comfortable working with R.
The "amp.fa" file looks like this:
I'd hoped that the Ns would mean any reads aligning to "ATGCATGCATGCATGCATGCATGCATGC" would have the 4 preceding bases align to "NNNN", so I'd be able to see what they are.
Can anyone suggest an alternative way to do this? Or a tweak to allow the capture of any sequence preceding position 1 of the known sequence?
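To illustrate the kind of capture meant here, below is a rough sketch of one possible fallback that skips the aligner altogether. It assumes the barcode is always the first 4 bases of each read and that the FASTQ uses the standard 4 lines per record. The file names toy_R1.fastq and barcodes.tsv and the two reads are made up for the example and stand in for file_R1.fastq.

```shell
# Toy FASTQ standing in for file_R1.fastq (hypothetical reads, not real data)
cat > toy_R1.fastq <<'EOF'
@read1
ACGTATGCATGCATGC
+
IIIIIIIIIIIIIIII
@read2
TTAGATGCATGCATGC
+
IIIIIIIIIIIIIIII
EOF

# Assumption: 4 lines per record; line 1 is the read name, line 2 the sequence.
# Write the read name (without the leading "@") and the first 4 bases,
# tab-separated, one row per read.
awk 'NR % 4 == 1 { name = substr($1, 2) }
     NR % 4 == 2 { print name "\t" substr($0, 1, 4) }' toy_R1.fastq > barcodes.tsv

cat barcodes.tsv
```

A table like this could presumably be matched back to the alignments by read name (the qname field returned by scanBam) once in R, but that only holds if the barcode really does sit at a fixed offset in every read.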
Many thanks in advance