Question: Fast Implementation Of Table(Table()) For Huge Amount Of Elements In R?
gravatar for Ge
7.9 years ago by
Ge80 wrote:

I want to get the BAM multi-matching information. For instance, with Rsamtools package,

param = ScanBamParam(what="qname")
bamFlag(param) = scanBamFlag(isUnmappedQuery=FALSE)
bamReads = scanBam(bamFile, param=param)[[1]]$qname

Now I get the bamReads (the IDs of transcripts) in this BAM file.

Then I want to get the information of multi-matching by doing table(table(bamReads)). However, this step is really slow as bamReads can contain millions of IDs from BAM file.

Do you have any fast implement of doing this table(table()) step? Any C interface or other tools?


R • 1.7k views
ADD COMMENTlink modified 7.9 years ago by Jeremy Leipzig19k • written 7.9 years ago by Ge80

Why not just:

samtools view yours.bam | cut -f1 | sort | uniq -c > output.txt

R is great, but it is not the right one for this task.

ADD REPLYlink written 7.9 years ago by lh332k

Thanks! It works and much much faster!

ADD REPLYlink written 7.9 years ago by Ge80

Doing it by hand with apply() and as.matrix would be my first hint. And maybe you can speed it up using the aaply() function (parallel apply) from the "plyr" package.

ADD REPLYlink written 7.9 years ago by Mdeng510
gravatar for Steve Lianoglou
7.9 years ago by
Steve Lianoglou5.0k
Steve Lianoglou5.0k wrote:

If you've got the RAM to load your entire BAM file into memory, perhaps you might try to see if data.table is up to the task.

Here's an example that you might use to group reads by qname using the data that comes with the Rsamtools package:

library(data.table) ## make sure you have 1.8.0
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
res0 <- scanBam(fl)[[1]] # always list-of-lists
dt <- data.table(qname=res0$qname, rname=res0$rname, strand=res0$strand, 
                 start=res0$pos, end=res0$pos + res0$qwidth - 1L,

Note the value for key. If you are hurting for memory, tailor your call to scanBam to only return the columns from the BAM file that you want.

Now, to see how many places each read maps to, you can create a summary table like so:

mmap <- dt[, list(N=.N), by="qname"]
                     qname N
 [1,]  B7_589:1:101:825:28 2
 [2,] B7_589:1:110:543:934 2
 [3,] B7_589:1:122:337:968 2
 [4,]  B7_589:1:122:77:789 2
 [5,]  B7_589:1:168:69:249 2
 [6,]  B7_589:1:29:529:379 2
 [7,]  B7_589:2:30:644:942 2
 [8,]  B7_589:2:73:730:487 2
 [9,]    B7_589:2:9:49:661 2
[10,]  B7_589:3:71:478:175 2
First 10 rows of 1699 printed.

.N is a variable that tells you how many rows are in the group you are currently working in. The N column tells you how many times the qname appears in the BAM file you just loaded.

There is a small learning curve to learn how to use data.table, so make sure to read the FAQ and other docs. You would basically use it in similar situations you would use plyr::ddply or base::tapply (among others), and in situations like you have here (lots of rows in a data.frame), you'll see that it will be super performant (in time and memory usage) as compared to equivalent approaches.

ADD COMMENTlink modified 7.9 years ago • written 7.9 years ago by Steve Lianoglou5.0k
gravatar for Leonor Palmeira
7.9 years ago by
Leonor Palmeira3.7k
Liège, Belgium
Leonor Palmeira3.7k wrote:

Personally, I wouldn't do this in R but using Python. R is great for many things, but basically not for computation involving huge loops.

Probably not optimal, but try this (on a file containing all your bamReads:

for i in f.readlines():
    if d.has_key(tmp):

for i in d.keys():
    if n.has_key(d[i]):

ADD COMMENTlink modified 7.9 years ago • written 7.9 years ago by Leonor Palmeira3.7k

It's true that loops will, for the most part, really kill you in R, but there are still ways to do this without using loops as long as you have enough RAM to load your data in to memory.

ADD REPLYlink written 7.9 years ago by Steve Lianoglou5.0k

can you provide some python code to count his qName instances?

ADD REPLYlink written 7.9 years ago by Jeremy Leipzig19k

I've edited my asnwer to provide some Python code for this.

ADD REPLYlink written 7.9 years ago by Leonor Palmeira3.7k
gravatar for Jeremy Leipzig
7.9 years ago by
Philadelphia, PA
Jeremy Leipzig19k wrote:

This comes down to the problem of tabulating a character vector (as opposed to a factor). R is going to make this a factor for itself even if you don't.

> system.time(bamReadTabulation<-table(bamReads$qname))
   user  system elapsed 
 93.101   0.022  93.123

is equivalent to:

> system.time(bamReadFactor<-factor(bamReads$qname))
   user  system elapsed 
 91.605   0.027  91.628 
> system.time(bamReadTabulation<-table(bamReadFactor))
   user  system elapsed 
  3.820   0.009   3.827

I don't know of anyway of asking scanBam to treat qname as a factor or Rle. Even readBamGappedReads with use.names=TRUE returns a character vector.

I guess it's not a big priority for most people since multiple hits are the kind of thing that gets swept under the rug.

ADD COMMENTlink modified 7.9 years ago • written 7.9 years ago by Jeremy Leipzig19k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1887 users visited in the last hour