I have been given a pipeline to work on and I'm trying to use the position of reads from a BAM file to label the reads later. The pipeline uses R to extract information from the BAM file such as QNAME, RNAME, POS, QWIDTH etc. It uses the scanBam() package to extract the information, which it adds to a data.frame for use in GenomicRanges to find overlaps between the BAM file and a GTF file. For some reason when I extract the position in this way (my code is the bit where it creates "endPos", "positions" and "nodeName") the position I get in the output is about 100,000 bp off. The aim is to be able to use the nodeName to link straight to Ensembl. I'm not overly familiar with scanBam and R, is there anyway it could be extracting the position wrong? Or is it more likely to be an issue with the actual BAM file? I feel some sort of transformation is happening because the POS column in the BAM file has values such as 1,429, while the actual gene position is 63,358,333 (I assume one is the position on the contig and the other is on the chromosome?). I ran bamtobed from bedtools to get the positions and got different positions again, I'm not quite sure what I'm doing wrong.
Any help would be greatly appreciated.
The is the R code used to create the Genomic Ranges GRanges matrix:
what <- c("qname","rname","strand","pos","qwidth","seq")
flag <- scanBamFlag(isUnmappedQuery=FALSE)
param <- ScanBamParam(what=what, flag=flag)
bamHITS <- scanBam(bamFile,param=param)[[1]]
cat("Collating data...\n")
sequences <- as.data.frame(bamHITS$seq)
endPos <- (bamHITS$pos + bamHITS$qwidth)
positions <- paste(bamHITS$pos,endPos, sep="-")
nodeName <- paste(bamHITS$rname, positions, bamHITS$qname, sep=":")
input <- data.frame(nodeName,bamHITS$rname,bamHITS$strand,bamHITS$pos,bamHITS$qwidth,stringsAsFactors=FALSE)
input$Sequence <- as.character(sequences[,1])
colnames(input) <- c("Read","Seq","Strand","Start","Width","Sequence")
input <- input[input$Seq %in% names(chrLens),]
cat("Running GRanges...\n")
GRbam <- GRanges(
seqnames = Rle(factor(input$Seq,levels=names(chrLens))),
ranges = IRanges(start=input$Start, width=input$Width),
strand = Rle(factor(input$Strand)),
seqlengths = chrLens,
name = input$Read,
sequence = input$Sequence
)