Question: Sum column until a certain value is reached then print that line
1
gravatar for josephmccarter1
3.3 years ago by
Italy
josephmccarter170 wrote:

Hi All,

I have a bedgraph file with 4 columns - Chr, Start, End, Value.

I want to go through the 4th column and sum each row in that column until I have reached a certain value then print that row.

so for instance - Sum column until you a reached a value of 100 then print that entire line

I have done a little scripting but I just can't think how to do this. If someone could help me with an AWK command that would do this that would be great. If you ha the time to break down & explain the command a little that would be a bonus!

Thank you

bedgraph awk bioinformatics • 2.2k views
ADD COMMENTlink modified 3.3 years ago by John12k • written 3.3 years ago by josephmccarter170
4
gravatar for Alex Reynolds
3.3 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

You could do the following:

$ awk '\
    BEGIN { \
        s = 0; \
    } \
    { \
        s += $4; \
        if (s >= 100) { \
            print $0; \
            exit; \
        } \
    }' foo.bedgraph

Explanation: The BEGIN block initializes a summation variable called s, before any lines are processed. The main block processes each line, adding the value of the fourth column ($4) to the s variable. It then tests if that variable s is greater than or equal to 100. If not, the block processes the next line. If s is over the threshold, the current line is printed ($0) and awk exits early.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Alex Reynolds28k
1

I would strongly advise a new programmer against using numerical variables that will have their values changed during runtime, without explicit initialization.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Alex Reynolds28k

I never understood the point of the BEGIN block. The above works just as well without it too..

ADD REPLYlink written 3.3 years ago by 5heikki8.4k

If you're writing stuff to be reused, or at least self-documenting, initializing variables seems like a decent practice.

ADD REPLYlink written 3.3 years ago by Alex Reynolds28k

Inside some script/program.. sure. But as one-liner..

ADD REPLYlink written 3.3 years ago by 5heikki8.4k

Perhaps your variable is already set where you need it. The "s" here may start at zero without initializing it. I often have to use BEGIN to set the field separators before reading anything.

ADD REPLYlink written 3.3 years ago by karl.stamm3.5k
2
gravatar for venu
3.3 years ago by
venu6.2k
Germany
venu6.2k wrote:

I think it would be easy if you create a new column with sum result of each row and then take out the required row

awk '{s+=$4}{print $1"\t"$2"\t"$3"\t"$4"\t"s}' input.txt | awk '$5==100 {print $1"\t"$2"\t"$3"\t"$4}'
ADD COMMENTlink written 3.3 years ago by venu6.2k
0
gravatar for John
3.3 years ago by
John12k
Germany
John12k wrote:

In python, because I think you can understand how it works immediately :)

with open('/path/to/some.file') as whatever:
    counter = 0
    for line in whatever:
        Chr, Start, End, Value = line.split('\t') # \t is for tab.

        counter += int(Value) # int() to turn the text of value into an actual number.
        if counter > 100:
            print line
            counter = 0
ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by John12k
1

I have very little python experience but I decided to give this approach a go.

I am getting numerous errors "ValueError: too many values to unpack" "ValueError: invalid literal for int() with base 10: 'Value\t\n'"

When i look at "whatever" , the "Chr, Start, End, Value" seems to be in as "nan"

Does the bedgraph file need to be in a certain format?

ADD REPLYlink written 3.2 years ago by josephmccarter170

Hahaha, awesome :D This is such a nice example of why little scripts like the ones in this thread are best served with Python rather than awk. You said your data had 4 columns - actually your data has 5 columns, because your row ends with 'Value\t\n' - and none of the awk script would have noticed because they all say "take column X and..." rather than the python's much safer "take the 4 columns and...".

I'm sure one could write safe awk - its just that I very rarely see it here on Biostars because people often try to write the shortest, least-safe-as-possible awk scripts, to get the most upvotes, hehe.

So anyway, you got the first error of Value\t\n because there were 5 columns to unpack into 4 variable names. I guess you then hacked the code a bit to get it to work, and in the process almost did, except now the last column includes the \tab and \newline characters. The error about invalid literal for int() with base 10 is because we were expecting something in the 4th column that looks like a number, but instead got "Value\t\n". Why is Value\t\n there in the first place? Probably because you added a header manually, which also explains the trailing tab? So the solution now is to delete the tab at the end of your header line (or if its on all the other lines, tell python to expect 5 columns) and tell python that the first line has to be skipped, because it is a header. You'd do that by on the second line putting next(whatever), then the for line in whatever will start on line 2.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by John12k
1

Yes - I did try hack it and I also changed the format of the file.bedgraph a little. Above is the code I used and the output from that code. and first 11 lines file.bedgraph is also on display

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by josephmccarter170

Ok, so that number in the fourth column - and this is really my fault because I should have known when you said bedgraph - is a float (decimal number), not an int(), so where it says "int(" in the code put "float(" so the math is right, otherwise you'll be rounding down to the nearest whole number.

Also, it looks like your delimiting on commas now, not tabs, so change the split('\t') to split(',') So the code should now look like:

with open('./file.bedgraph','r') as whatever:
    counter = 0
    next(whatever)
    for line in whatever:
        Chr, Start, End, Value = line.split(",")
        counter += float(Value)
        if counter > 127770:
            print line,
            counter = 0
        else: print counter

I added an else: print counter to the bottom which will print the current counter, just for this demo to show its working as you'd expect. It prints for your data:

-4.98
-5.11
0.46
10.88
16.45
27.73
44.71
59.27
63.27
72.12
ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by John12k

Thank you - The script is working nicely!

One thing however, is that once it reaches the 127770 value it continues counting again and finds the next set of intervals that are separated by a value of 127770. e.g

eb2.3chr8,41827920,41827930,319.36 <- my line of interest
eb2.3chr8,41833860,41833870,376.01
eb2.3chr8,41838450,41838460,443.08
eb2.3chr8,41842060,41842070,465.21
eb2.3chr8,41845830,41845840,437.67
eb2.3chr8,41849320,41849330,481.47
eb2.3chr8,41852540,41852550,622.60
eb2.3chr8,41855230,41855240,374.73
eb2.3chr8,41858230,41858240,513.43
eb2.3chr8,41861700,41861710,707.50
eb2.3chr8,41864610,41864620,277.27
eb2.3chr8,41868180,41868190,350.90
eb2.3chr8,41871830,41871840,350.05
eb2.3chr8,41876280,41876290,289.40
eb2.3chr8,41881100,41881110,273.99
eb2.3chr8,41885450,41885460,253.01
eb2.3chr8,41890010,41890020,404.27
eb2.3chr8,41896450,41896460,172.96
eb2.3chr8,41901160,41901170,228.03
ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by josephmccarter170

Oh, I thought that's what you wanted? :) You need to delete a line from the script for it to keep the counter going and not reset. I think if you can understand what each of those 9 lines of code are doing Joseph, you can figure out the answer to your last question on your own :) Chances are you already have after posting this since you almost got the data in the right format before ... :P

ADD REPLYlink written 3.2 years ago by John12k
1

Thank you John - yeah what I done was replaced counter = 0 at the bottom of the script with break and that worked.

ADD REPLYlink written 3.2 years ago by josephmccarter170
1

Safer Python would probably send the output of line.split() to an array or list to dereference by index, a problem that awk gets around by explicitly dereferencing by index. In addition, it is often a good idea to wrap Python calls that throw common errors/exceptions (such as the int() cast) in try..except blocks.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Alex Reynolds28k

Agree, there's always safer code one should be using :) But errors aren't such a bad thing in Python, since the stack trace is often quite useful in figuring out the bug. Compared to a JS stack trace anyway, hahah. The really bad thing here was that int() cast would have gone totally undetected if he hadn't pasted his data. That's a bit naughty... for these sorts of posts I should throw in a check for that sort of thing.

ADD REPLYlink written 3.2 years ago by John12k
with open('/file.bedgraph') as whatever:
counter = 0
next(whatever)
for line in whatever:
    Chr, Start, End, Value = line.split("\t") # \t is for tab.

    counter += int(Value) # int() to turn the text of value into an actual number.
    if counter > 127770:
        print line
        counter = 0

---OUTPUT---
'Chr,Start,End,Value\n'


Traceback (most recent call last):
  File "<pyshell#259>", line 5, in <module>
    Chr, Start, End, Value = line.split("\t") # \t is for tab.
ValueError: need more than 1 value to unpack
ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by josephmccarter170
1
file.bedgraph 

Chr,Start,End,Value
eb2.3chr8,41806600,41806660,-4.98
eb2.3chr8,41806660,41806670,-0.13
eb2.3chr8,41806670,41806700,5.57
eb2.3chr8,41806700,41806720,10.42
eb2.3chr8,41806720,41806730,5.57
eb2.3chr8,41806730,41806750,11.28
eb2.3chr8,41806750,41806760,16.98
eb2.3chr8,41806760,41806770,14.56
eb2.3chr8,41806770,41806790,4.00
eb2.3chr8,41806790,41806830,8.85
ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by josephmccarter170
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1693 users visited in the last hour