Question: Bioconductor For Large Bam File Analysis
7.8 years ago by
United States
Sakti360 wrote:


I have seen on several sites that people have developed many Bioc packages for analyzing NGS data in R. However, one disadvantage is that almost all of these packages require loading the data into memory, which makes the whole analysis quite inefficient.

Suppose all I have is a big 65 GB BAM file from the re-sequencing of a human. If I only want to work with chromosome 1, is there any way to feed only that part into the Bioc packages (e.g. GenomicRanges, GenomeGraphs, etc.) without having to load the whole 65 GB file?

I'd love to know how to do it!!


7.8 years ago by
Steve Lianoglou5.0k wrote:

You definitely do not have to load the entire BAM file into R if you just want the reads from one chromosome.

The Rsamtools package lets you do this by properly configuring the which parameter in a call to ScanBamParam, with a subsequent call to scanBam. See the top of page 2 of its intro vignette. Note that you will need to know how long your chromosome is (so you can put in appropriate start/end coords).
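A minimal sketch of that approach, for illustration only. The helper name read_one_chromosome and the chosen what fields are assumptions made for this example, not part of Rsamtools. It takes the chromosome length from the BAM header, and it assumes the BAM file is sorted and has an index (.bai) next to it, which scanBam needs for region queries.

```r
library(Rsamtools)      # scanBamHeader(), ScanBamParam(), scanBam()
library(GenomicRanges)  # GRanges(); also attaches IRanges

# Hypothetical helper (not an Rsamtools function): read the alignments
# of one reference sequence without loading the rest of the file.
read_one_chromosome <- function(bam_path, chrom) {
  # The header lists every reference sequence with its length.
  targets <- scanBamHeader(bam_path)[[1]]$targets
  stopifnot(chrom %in% names(targets))

  # One range spanning the whole chromosome: 1 .. chromosome length.
  region <- GRanges(chrom, IRanges(1, targets[[chrom]]))

  # 'which' restricts the query to that range; 'what' picks the fields
  # to return (these four are just an example selection).
  param <- ScanBamParam(which = region,
                        what = c("qname", "pos", "qwidth", "strand"))

  # scanBam() returns one list element per range in 'which'.
  scanBam(bam_path, param = param)[[1]]
}
```

Something like reads <- read_one_chromosome("aln.bam", "chr1") would then return a list with one vector per requested field. The chromosome name has to match the name in the BAM header exactly (for example "chr1" versus "1").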

I've been building a package (SeqTools) myself, the result of refactoring code out of different analyses I've been doing, that makes this easier ... it even reads the length of the chromosomes from the header of the BAM file itself.

Essentially, after you install it, it will work as below -- bam.file is the path to the bam file you want to read from:

R> library(SeqTools)
R> reads.1 <- getReadsFromSequence(bam.file, 'chr1')

reads.1 will be a GRanges object with all your reads (and some meta information about them) from chromosome 1.

If you want to install it, you'll have to download or check out that project and install the R/pkg folder into R. You can do that from the command line:

$ cd to/project/base/R
$ R CMD INSTALL pkg

That should go smoothly as long as you have the required dependencies (see the DESCRIPTION file). There's lots of stuff in there and little documentation, so use at your own risk :-)


Steve, that's an awesome reply, I'll try both packages and let you know my results. Thanks for taking the time to share this information!

written 7.8 years ago by Sakti360
7.8 years ago by
United States
Abhi1.5k wrote:

I mostly agree with you that data loading time in R can be frustrating. A couple of things you could do:

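  1. Split the BAM file so that it holds only the chromosome you are interested in: samtools view -u aln.bam <exact_chromosome_name>, where <exact_chromosome_name> is the chromosome whose data you want to pull out. This needs an indexed BAM file, and the output goes to stdout, so redirect it to a new file.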

  2. Kick off R with more memory than it allocates by default. (link)
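The split in step 1 can also be done without leaving R. The sketch below swaps the samtools command line for filterBam from the Rsamtools package; the helper name extract_chromosome and the file names are assumptions for illustration, and the input BAM is assumed to be sorted and indexed.

```r
library(Rsamtools)      # scanBamHeader(), ScanBamParam(), filterBam()
library(GenomicRanges)  # GRanges(); also attaches IRanges

# Hypothetical helper (not an Rsamtools function): write the alignments
# of one chromosome to a new, smaller BAM file.
extract_chromosome <- function(bam_path, chrom, out_path) {
  # Chromosome lengths come from the BAM header.
  targets <- scanBamHeader(bam_path)[[1]]$targets
  stopifnot(chrom %in% names(targets))

  # Keep only records overlapping the full span of the chromosome.
  param <- ScanBamParam(which = GRanges(chrom, IRanges(1, targets[[chrom]])))

  # filterBam() writes the subset to out_path (indexing it by default)
  # and returns the destination file name.
  filterBam(bam_path, out_path, param = param)
}
```

For example, extract_chromosome("aln.bam", "chr1", "aln.chr1.bam") would leave a chromosome 1 file that later steps can read on its own.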

hth, -Abhi


Hey Abhi, thanks. It seems that although R is very powerful, it has its downsides... will try your suggestions!

written 7.8 years ago by Sakti360