How do I get info about which base matches to the reference genome from a bam file? Specifically, is it possible to do this in Rsamtools? Using the code below (where 'bam' and 'bai' have already been defined as the bam and bai file paths) I can use pileup get a data frame that has the counts for each base at each position in the specified range:
chr <- "5"
start <- 112164646
end <- 112164666
p_param <- PileupParam(distinguish_nucleotides=TRUE,
distinguish_strands=FALSE,
min_base_quality=10)
sbp <- ScanBamParam(which=GRanges(as.character(chr), IRanges(IRanges(start, end))))
res <- pileup(bam, index = bai, scanBamParam=sbp, pileupParam=p_param)
And what I get looks like this:
seqnames pos nucleotide count which_label
5 112164646 G 244 5:112164646-112164666
5 112164646 T 54 5:112164646-112164666
5 112164647 A 298 5:112164646-112164666
5 112164648 A 297 5:112164646-112164666
Is there any way to get it to tell me whether a base matches the reference base at each locus? The Rdocumentation link (https://www.rdocumentation.org/packages/Rsamtools/versions/1.24.0/topics/pileup) seems to suggest the possibility ("The ‘=’ nucleotide code in the SEQ field (to mean ‘identical to reference genome’) is supported by pileup, such that a match with the reference genome is counted separately in the results if distinguish_nucleotides is TRUE."). Thanks for any assistance.