Question: Loading Large Bed Files Into Bioconductor
Asked by danielcook2013, 6.0 years ago:

Hello -

I am trying to work with some very large BED files of ChIP-seq results obtained from GEO (specifically, this data here). There are a few BED files that I am hoping to visualize alongside the peaks called by MACS. I wanted to use Bioconductor for this, but the BED files (which contain upwards of 24 million aligned reads) cause R to choke. I am just starting with Bioconductor.

What's the best way to visualize the reads together with the peaks called by MACS? Is this something that can be done with Bioconductor?

Thank you!

Tags: bioconductor, chip-seq
Answer by seidel, 6.0 years ago:

When you say "use Bioconductor", what specifically have you tried? The GenomicRanges and rtracklayer packages make this straightforward. Examining peaks could be as simple as:

library(GenomicRanges)
library(rtracklayer)

# import() returns a GRanges directly; the old asRangedData argument
# was deprecated and has since been removed from rtracklayer
reads <- import("yourfile.bed", format = "BED")

cov <- coverage(reads)  # per-chromosome read depth as an Rle

# for a peak on chr1 from 2000:3000, plot the surrounding region
# (the coverage is an Rle, so convert to numeric for plotting)
plot(as.numeric(cov[["chr1"]][1000:4000]), type = "l")

and it should work efficiently with 24 million reads on almost any computer.


How long should it take to read in the BED file? This works with no problem for smaller files, but once they get up to 500 MB or a few GB it chokes. I am on a brand new MacBook Pro with 8 GB of memory.

Reply by danielcook2013, 6.0 years ago

It shouldn't take more than a few minutes in my experience. However, you can check where the choke point is. Do you use top on your Mac? Open a terminal (Terminal.app), type 'top' at the command line, then start your import process. Watching top, you can see how much memory is being consumed and whether you are running out. Type 'q' to quit top.

Reply by seidel, 5.9 years ago

Still no luck getting this to work; all my memory gets eaten up. Granted, some of the files are large (2-3 GB). Am I missing something major here? I see how Bioconductor could be useful, but I don't quite understand how that can be the case if it can't work with large files. None of the tutorials cover this import process. Surely you don't have to load entire files into memory to work with them?

I am compressing and indexing the files with tabix right now; we'll see if that helps. Thanks for your suggestions.

Update: compressing and indexing did not help.

Reply by danielcook2013, 5.9 years ago

You mention "all my memory gets eaten up." Have you examined the relationship between file size and memory? In other words, if you sample 1 million reads into a BED file (head -n 1000000 your.bed > new.bed) and try that, how much memory is used? Then try 2 million: how much is used? This will help you extrapolate the nature of your problem. I have noticed with some genomes that rtracklayer can consume resources and be inefficient (i.e. non-canonical genomes with thousands of contigs). If you're simply reading in a BED file, what happens in the tests above if you just read it in as a data frame? You can always convert it to one of the efficient Bioconductor objects after you have it in the R environment. Bioconductor can work with large files, but different packages and functions are not all guaranteed to work within the same hardware footprint.
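The subsampling experiment above can be sketched as a few shell commands. This is a hedged illustration, not part of the original answer: the file names are placeholders, and a small synthetic BED file stands in for the real 2-3 GB data.

```shell
# Generate a synthetic 10,000-line BED file standing in for the real data
awk 'BEGIN { for (i = 1; i <= 10000; i++)
               printf "chr1\t%d\t%d\tread%d\t0\t+\n", i*100, i*100+50, i }' > your.bed

# Take progressively larger samples; load each into R while watching top
# to see how memory scales with the number of reads
head -n 2000 your.bed > sample_2k.bed
head -n 4000 your.bed > sample_4k.bed
wc -l sample_2k.bed sample_4k.bed
```

If memory grows roughly linearly with line count, you can extrapolate whether the full file will fit in 8 GB before committing to the full import.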

Reply by seidel, 5.9 years ago
Answer by ATpoint, 3.3 years ago:

The question is ~3 years old now, but as sequencing technology keeps improving and producing more and more output per run, here is an update:

You can use the data.table package, which provides the fread() function. It loads a 50 million-row, 6-column BED file in as little as 10 seconds, using no more than 2 GB of RAM on a standard laptop. It clearly outperforms base R's read.table(), which in my experience can take an eternity (if it finishes at all for very large files).

df <- fread("/path/to/file.bed", data.table = FALSE, header = FALSE)

If you want to select certain columns, then use

df <- fread("/path/to/file.bed", data.table = FALSE, header = FALSE, select = c(1, 2, 3, 6))

This would give you only chr, start, end, and strand. You can then convert the data frame into a GRanges object, which lets you use the coverage() function from the GenomicRanges package, as described in one of the answers above.
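That conversion step can be sketched as follows. This is an illustrative addition, not part of the original answer: a throwaway two-line BED file written to a temp path stands in for the real "/path/to/file.bed", and the column names are my own labels.

```r
library(data.table)
library(GenomicRanges)

# Toy BED file standing in for the real data
bed <- tempfile(fileext = ".bed")
writeLines(c("chr1\t100\t150\tr1\t0\t+",
             "chr1\t120\t170\tr2\t0\t-"), bed)

# Read only chr, start, end, strand (BED columns 1-3 and 6)
df <- fread(bed, data.table = FALSE, header = FALSE,
            select = c(1, 2, 3, 6),
            col.names = c("chr", "start", "end", "strand"))

# BED intervals are 0-based, half-open; GRanges is 1-based,
# so shift the start coordinate by 1 when converting
gr <- GRanges(seqnames = df$chr,
              ranges = IRanges(start = df$start + 1, end = df$end),
              strand = df$strand)

cov <- coverage(gr)  # per-chromosome read depth as an Rle
```

The 0-based-to-1-based shift is the one detail that import() handles for you and a manual fread() route does not, so it is worth keeping the comment in place.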

Powered by Biostar version 2.3.0