Question: Effectively search for positions in large bedgraph files using python (or other language if needed)
gravatar for DVA
3.6 years ago by
United States
DVA550 wrote:

I guess it is more of an algorithm question. I have a group of large bedgraph files (shown below, contains every single position of the genome), each is about 20-30Gb in size. I need to look into these files and locate the lines having locations match another file (small, < 100kb) (e.g. file test.txt, see below). How should I do it efficiently please?

I read posts such as: but it is more for a general situation.

I think for my case I should take advantage of the fact that my files are "sorted", and that each time I use a line in "text.txt", to locate a line (line X for example) in bedgraph, anything above lineX is no longer relevant. I have some draft ideas about how to write it out in python, but also would like to hear suggestions from you.

I don't mind calling other languages in my python code. If there are existing software doing similar things, please let me know also. Thank you very much.

#largeFile.bedgraph (0 system)
chrM    0       1       5183
chrM    1       2       5299
chrM    2       3       5344
chrM    3       4       5439
chrM    4       5       5525
chrM    5       6       5579
chr19    3       4       6
chr20    4       5       35

#test.txt (1 system)
chrM    3
chrM    4
chr20   5

I would like to output:

chrM    3       4       5439
chrM    4       5       5525
chr20    4       5       35

because they are the record match test.txt.

python bedgraph • 1.4k views
ADD COMMENTlink modified 3.6 years ago by Alex Reynolds31k • written 3.6 years ago by DVA550
gravatar for Alex Reynolds
3.6 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

Sort your first file with sort-bed:

$ sort-bed --max-mem 2G --tmpdir $PWD largeFile.bedgraph > largeFile.bed

You only need to sort (resort?) once. Also, specifying --tmpdir ensures that you don't fill up /tmp with intermediate data, given the size of your input file.

Convert your second file to a sorted BED file:

$ awk '{ print $0"\t"($1+1); }' test.txt | sort-bed - > test.bed

Do a set operation on the two sorted BED files with bedops:

$ bedops --element-of 1 largeFile.bed test.bed > answer.bed

You could probably do this much faster with bedextract:

$ bedextract largeFile.bed test.bed > answer.bed

This is because largeFile.bed does not appear to contain nested elements, which allows use of bedextract to gain some significant optimizations to be done on a BED search.

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by Alex Reynolds31k

Hi @AlexReynolds Thank you so much for the reply and introducing bedops! I will give a short:)

ADD REPLYlink written 3.6 years ago by DVA550
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1005 users visited in the last hour