Question: Effectively search for positions in large bedgraph files using python (or other language if needed)
DVA530 (United States) wrote, 24 months ago:

I guess it is more of an algorithm question. I have a group of large bedgraph files (shown below; they contain every single position of the genome), each about 20-30 GB in size. I need to look through these files and locate the lines whose positions match those in another, much smaller file (< 100 KB; e.g. test.txt, see below). How can I do this efficiently?

I read posts such as https://stackoverflow.com/questions/6219141/searching-for-a-string-in-a-large-text-file-profiling-various-methods-in-pytho but it addresses the general case.

I think for my case I should take advantage of the fact that my files are sorted: each time I use a line in test.txt to locate a line in the bedgraph (line X, say), everything above line X is no longer relevant. I have some draft ideas about how to write this in Python, but I would also like to hear suggestions from you.

I don't mind calling other languages from my Python code. If existing software does something similar, please let me know as well. Thank you very much.

#largeFile.bedgraph (0 system)
chrM    0       1       5183
chrM    1       2       5299
chrM    2       3       5344
chrM    3       4       5439
chrM    4       5       5525
chrM    5       6       5579
...
chr19    3       4       6
chr20    4       5       35

#test.txt (1 system)
chrM    3
chrM    4
chr20   5

I would like to output:

chrM    3       4       5439
chrM    4       5       5525
chr20    4       5       35

because these are the records that match test.txt.

Alex Reynolds (Seattle, WA USA) wrote, 24 months ago:

Sort your first file with sort-bed:

$ sort-bed --max-mem 2G --tmpdir $PWD largeFile.bedgraph > largeFile.bed

You only need to sort (resort?) once. Also, specifying --tmpdir ensures that you don't fill up /tmp with intermediate data, given the size of your input file.

Convert your second file to a sorted BED file:

$ awk '{ print $0"\t"($2+1); }' test.txt | sort-bed - > test.bed
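If you would rather do this conversion step in Python than awk, a minimal equivalent sketch (it assumes whitespace-separated chrom/position input, and the output still needs to go through sort-bed as above):

```python
def position_to_bed(lines):
    """Turn 'chrom<ws>position' lines into one-base BED intervals
    by appending position + 1 as the end coordinate."""
    for line in lines:
        chrom, pos = line.split()
        yield f"{chrom}\t{pos}\t{int(pos) + 1}"
```

For example, "chrM	3" becomes "chrM	3	4".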

Do a set operation on the two sorted BED files with bedops:

$ bedops --element-of 1 largeFile.bed test.bed > answer.bed

You could probably do this much faster with bedextract:

$ bedextract largeFile.bed test.bed > answer.bed

This is because largeFile.bed does not appear to contain nested elements, which allows bedextract to apply some significant optimizations to the BED search.
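For reference, a "nested" element is one wholly contained inside an earlier element (previous start < start and end < previous end). If you want to check your own file, here is a streaming sketch of that definition, assuming sort-bed order (chromosome, then start, then end); it is not the BEDOPS implementation:

```python
def has_nested_elements(lines):
    """Return True if any element is strictly contained inside an
    earlier element (prev start < start and end < prev end).
    Assumes sort-bed order: chromosome, then start, then end."""
    last_chrom = None
    last_start = 0
    max_end_before = 0  # max end among elements with start < current start
    max_end_same = 0    # max end among elements sharing the current start
    for line in lines:
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if chrom != last_chrom:
            # New chromosome: reset the running maxima.
            last_chrom, last_start = chrom, start
            max_end_before, max_end_same = 0, end
            continue
        if start != last_start:
            # Starts have advanced: fold the equal-start group into
            # the "strictly before" maximum.
            max_end_before = max(max_end_before, max_end_same)
            max_end_same = 0
            last_start = start
        if end < max_end_before:
            return True
        max_end_same = max(max_end_same, end)
    return False
```

A per-base bedgraph like the one in the question (adjacent one-base intervals) can never contain nested elements, which is why bedextract is applicable here.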


Hi @AlexReynolds, thank you so much for the reply and for introducing bedops! I will give it a shot :)

Powered by Biostar version 2.3.0