If you've got the RAM to load your entire BAM file into memory, it might be worth seeing whether data.table is up to the task.
Here's an example that groups reads by qname, using the data that comes with the Rsamtools package:
library(data.table)  ## make sure you have data.table >= 1.8.0
library(Rsamtools)
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
res0 <- scanBam(fl)[[1]]  ## scanBam always returns a list-of-lists
dt <- data.table(qname=res0$qname, rname=res0$rname, strand=res0$strand,
                 start=res0$pos, end=res0$pos + res0$qwidth - 1L,
                 key="qname")
Note the value for key. If you are hurting for memory, tailor your call to scanBam so that it returns only the columns from the BAM file that you need.
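For instance, one way to restrict the returned columns is Rsamtools' ScanBamParam with its what argument (a sketch, requesting just the fields used in the data.table above):

```r
library(Rsamtools)

## Only pull the five fields we actually use, instead of the whole record
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what=c("qname", "rname", "strand", "pos", "qwidth"))
res0 <- scanBam(fl, param=param)[[1]]
```

The legal field names are the ones returned by scanBamWhat().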
Now, to see how many places each read maps to, you can create a summary table like so:
mmap <- dt[, list(N=.N), by="qname"]
mmap
                    qname N
 [1,] B7_589:1:101:825:28 2
[2,] B7_589:1:110:543:934 2
[3,] B7_589:1:122:337:968 2
[4,] B7_589:1:122:77:789 2
[5,] B7_589:1:168:69:249 2
[6,] B7_589:1:29:529:379 2
[7,] B7_589:2:30:644:942 2
[8,] B7_589:2:73:730:487 2
[9,] B7_589:2:9:49:661 2
[10,] B7_589:3:71:478:175 2
First 10 rows of 1699 printed.
.N is a variable that tells you how many rows are in the group you are currently working in. The N column therefore tells you how many times each qname appears in the BAM file you just loaded.
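As a quick self-contained illustration of .N (toy data, not the BAM example):

```r
library(data.table)

## Toy table: group "a" has two rows, group "b" has one
toy <- data.table(id=c("a", "a", "b"), val=1:3)
counts <- toy[, list(N=.N), by="id"]
## counts$N is c(2L, 1L): .N is the row count within each id group
```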
There is a small learning curve to data.table, so be sure to read the FAQ and the other docs. You would use it in much the same situations as base::tapply (among others), and in cases like this one (lots of rows in a data.frame) you'll find it much more performant, in both time and memory usage, than the equivalent base approaches.
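For comparison, the same per-qname count in base R would look roughly like this (a sketch on a toy vector, not the BAM data):

```r
## Base-R equivalent of dt[, list(N=.N), by="qname"]:
## count occurrences of each name with tapply
qname <- c("r1", "r1", "r2", "r2", "r2", "r3")
counts <- tapply(qname, qname, length)
## counts is a named vector: r1 = 2, r2 = 3, r3 = 1
```

It gives the same answer, but on millions of rows the keyed data.table grouping is typically much faster.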