Position Extraction from BAM file incorrect (using scanBam in R)
0
1
Entering edit mode
6.7 years ago

I have been given a pipeline to work on and I'm trying to use the position of reads from a BAM file to label the reads later. The pipeline uses R to extract information from the BAM file such as QNAME, RNAME, POS, QWIDTH etc. It uses the scanBam() package to extract the information, which it adds to a data.frame for use in GenomicRanges to find overlaps between the BAM file and a GTF file. For some reason when I extract the position in this way (my code is the bit where it creates "endPos", "positions" and "nodeName") the position I get in the output is about 100,000 bp off. The aim is to be able to use the nodeName to link straight to Ensembl. I'm not overly familiar with scanBam and R, is there anyway it could be extracting the position wrong? Or is it more likely to be an issue with the actual BAM file? I feel some sort of transformation is happening because the POS column in the BAM file has values such as 1,429, while the actual gene position is 63,358,333 (I assume one is the position on the contig and the other is on the chromosome?). I ran bamtobed from bedtools to get the positions and got different positions again, I'm not quite sure what I'm doing wrong.

Any help would be greatly appreciated.

The is the R code used to create the Genomic Ranges GRanges matrix:

what <- c("qname","rname","strand","pos","qwidth","seq")
flag <- scanBamFlag(isUnmappedQuery=FALSE)
param <- ScanBamParam(what=what, flag=flag)
bamHITS <- scanBam(bamFile,param=param)[[1]]

cat("Collating data...\n")
sequences <- as.data.frame(bamHITS$seq) 

endPos <- (bamHITS$pos + bamHITS$qwidth)
positions <- paste(bamHITS$pos,endPos, sep="-")
nodeName <- paste(bamHITS$rname, positions, bamHITS$qname, sep=":")

input <- data.frame(nodeName,bamHITS$rname,bamHITS$strand,bamHITS$pos,bamHITS$qwidth,stringsAsFactors=FALSE)
input$Sequence <- as.character(sequences[,1]) 
colnames(input) <- c("Read","Seq","Strand","Start","Width","Sequence")

input <- input[input$Seq %in% names(chrLens),]

cat("Running GRanges...\n")
GRbam  <-   GRanges(
         seqnames   = Rle(factor(input$Seq,levels=names(chrLens))),
         ranges     = IRanges(start=input$Start, width=input$Width),
         strand     = Rle(factor(input$Strand)),
         seqlengths = chrLens,
         name       = input$Read,
     sequence   = input$Sequence
      )
R scanBam GenomicRanges • 2.3k views
ADD COMMENT

Login before adding your answer.

Traffic: 1877 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6