Question: Extract data from a large bed file
gravatar for el24
4 months ago by
el2410 wrote:

Hi all!

I have a big .bed file, which is about 100GB of data, and I want to extract specific columns from it in a timely manner. I was wondering if there is a useful tool that works best for huge bed files. (or could be bed.gz) Could you please tell me if you know any excellent tool (or Python package) that can help me generate my desire output?

An example of head of the data is:

chr1    10006   10018   M6176_1.02-NR2F1    0.00117 +   sequence=taaccctaaccc
chr1    10006   10020   M6432_1.02-PPARD    0.00034 +   sequence=taaccctaacccta
chr1    10008   10030   M6456_1.02-RREB1    0.00014 -   sequence=GGGTTAGGGTTAGGGTTAGGGT

And imagine I build my output using columns exact value of 1, 2, 3, 5, and 6th columns, and also TF of 4th column. Therefore, for the first line of my data, I like my output to be as follows:

chr1 10006 10018 NR2F1 0.00117 +

I know I can extract it using awk, by this command, but I hope there are faster ways to do it:

cat file.bed | awk '($5<0.01)gsub(/.*-/,"",$4) {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}'

My second question is that if you can give me some recommendations to have criteria for limiting the values on one column, for example, keeping a row when the 5th column being less than 0.00001 values.

Thank you very much!

alignment bed genome • 205 views
ADD COMMENTlink modified 4 months ago by wm470 • written 4 months ago by el2410

Since you are interested in low-level text processing, I'd say that awk or basic python would be the way to go. Higher-level parsers provide useful shortcuts, but usually at the cost of speed.

ADD REPLYlink written 4 months ago by liorglic320
gravatar for wm
4 months ago by
wm470 wrote:

Have a simple test. It turns out, python is a bit faster than awk for your question. (before I have the test, I think AWK is faster).

I generate two files (3-million lines, 200MB; 30-million lines, 2GB) for the test based on your example. and make a few changes: (5-column < 0.01)

The test files:

$ wc -l test_3m_rows.bed test_30m_rows.bed
   3000000 test_3m_rows.bed
  30000000 test_30m_rows.bed

Awk version:

time awk '$5<0.01{split($4,a,"-"); print $1"\t"$2"\t"$3"\t"a[2]"\t"$5"\t"$6}' test_3m_rows.bed > out1.bed

real    0m5.135s
user    0m5.012s
sys     0m0.120s

time awk '$5<0.01{split($4,a,"-"); print $1"\t"$2"\t"$3"\t"a[2]"\t"$5"\t"$6}' test_30m_rows.bed > out2.bed

real    0m53.181s
user    0m49.875s
sys     0m1.244s

python version:

$ time python test_3m_rows.bed > out3.bed

real    0m4.416s
user    0m4.195s
sys     0m0.220s

$ time python test_30m_rows.bed > out4.bed

real    0m46.769s
user    0m43.935s
sys     0m2.396s

The output of each command:

$ wc -l out*.bed
   3000000 out1.bed
  30000000 out2.bed
   3000000 out3.bed
  30000000 out4.bed

Here is the python script:

$ cat
#!/usr/env/bin python3

import sys

with open(sys.argv[1]) as r:
    for line in r:
        fields = line.split('\t')
        if float(fields[4]) >= 0.01:
        fields[3] = fields[3].split('-')[1]
ADD COMMENTlink modified 4 months ago • written 4 months ago by wm470

Thank you very much, this is very helpful!

ADD REPLYlink written 4 months ago by el2410
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1326 users visited in the last hour