Bioconductor For Large Bam File Analysis
2
6
Entering edit mode
12.8 years ago
Sakti ▴ 510

Hello!

I have encountered in several sites that people have developed many Bioc packages for the analysis of NGS data using R language. However, one of the disadvantages is that almost all of these packages require a loading into memory, which makes the whole analysis thing quite inefficient.

Suppose all I have is a big 65 Gig BAM file from the re-sequencing of a human. If I only want to work with chromosome 1, is there any way that I can input only this information into the Bioc packages (i.e. genomic ranges, genomegraph, etc) without having to load the whole 65 Gig file???

I'd love to know how to do it!!

Thanks!

next-gen sequencing r bioconductor bam • 11k views
ADD COMMENT
8
Entering edit mode
12.8 years ago

You definitely do not have to load the entire BAM file into R if you just want the reads from one chromosome.

The Rsamtools package lets you do this by properly configuring the which parameter in a call to ScanBamParam, with a subsequent call to scanBam. See the top of page 2 for its intro vignette. Note that you will need to know how long your chromosome is (so you can put appropriate stop/end coords).

I've been building a package (SeqTools)myself that is the result of refactoring some code out of different analyses I've been doing that makes this easier ... it even reads the length of chromosomes from the header of the bam file itself.

Essentially, after you install it, it will work as below -- bam.file is the path to the bam file you want to read from:

R> library(SeqTools)
R> reads.1 <- getReadsFromSequence(bam.file, 'chr1')

reads.1 will be a GRanges object with all your reads (and some meta information about them) from chromosome 1.

If you want to install it, you'll have to d/l or checkout that project and install the R/pkg folder into R. You can do that from the command line:

$ cd to/project/base/R
$ R CMD INSTALL pkg

That should go smoothly as long as you have (i) the required dependencies (see the DESCRIPTION file), and (ii) Make sure you have the required dependencies .. There's lots of stuff in there and little documentation, so use at your own risk :-)

ADD COMMENT
1
Entering edit mode

Steve, that's an awesome reply, I'll try both packages and let you know my results. Thanks for taking the time to share this information!

ADD REPLY
1
Entering edit mode
12.8 years ago
Abhi ★ 1.6k

I concur with you mostly that data loading time in R could be frustrating at times. Couple of things you could do.

  1. Split the bam file to have only data for the chromosome you are interested in. samtools view -u aln.bam <exact_chromosome_name> for which you want to pull out the data from bam file.

  2. Kick off R with more memory than it allocates by default. (link)

hth, -Abhi

ADD COMMENT
1
Entering edit mode

Hey Abhi thanks, it seems that although R is very powerful it has its downfalls... will try your suggestions!

ADD REPLY

Login before adding your answer.

Traffic: 2362 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6