How do I replace a value used in script with a range of values in a file?
8.0 years ago

I recently enquired about an AWK script to keep summing a column until it reaches a certain value then print that line.

The awk script I got from Alex Reynolds was very helpful and is shown below:

$ awk '\
BEGIN { \
    s = 0; \
} \
{ \
    s += $4; \
    if (s >= 100) { \
        print $0; \
        exit; \
    } \
}' chr1.bedgraph

I would use this on chr1.bedgraph

chr1.bedgraph
chr1 1000  2000  25
chr1 2000  3000  50
chr1 3000  4000  25
chr1 4000  5000  30

And the awk script would print the line where the running sum of $4 reaches 100:

chr1 3000 4000 25

I now want to replace "100" in the line "if(s >=100)" with every value in values.txt (apart from $1)

values.txt 
sample1_chr1 200  50  90
sample2_chr1 300  60  40
sample3_chr1 400  20  40

So the script would essentially use the numbers in values.txt, taking $2, $3 and $4 from line 1, then moving on to line 2, line 3 and so on.

The output would be the line from chr1.bedgraph where the running sum reaches 200, then below that the line for 50, then the line for 90. After that it would print the lines where the sum reaches 300, 60 and 40.

Any thoughts? I'm not a very experienced programmer and I have been trying to do this for a while now.

Many thanks


Once it gets slightly more complicated (like your question now), I guess it is time to move away from awk to something like Python. It's not completely clear what you are trying to accomplish, or for what purpose.


You are right. I feel I am pushing the limits of awk and it's probably time to move over to Python. My apologies for not being clearer; the example I showed wasn't very well put together. I am basically obtaining "median" values over large peak domains in ChIP-seq, with the intention of showing movement of these genetic loci between individuals (sample1, sample2, sample3). I obtain a total read count across the domain and extract the point where the median read count value lies. Think of it as a centre-of-gravity value (the point at which 50% of the reads lie). I then want to find the 5% and 95% values. My values.txt file contains 4 columns: sample_chr, 5%, 50%, 95%. If I had a script that would use my 5%, 50% and 95% read count values to scan my predefined domains (like the awk script was doing), it would make the processing a lot faster. Is this a little more clear?

Note: the values.txt file shown in my question does not contain the real values, which probably made things more confusing.
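One possible reading of the description above, as a minimal sketch: instead of looking the thresholds up in values.txt, compute them from the domain's own total. The function name `domain_quantiles` is made up for illustration, and the sketch assumes the file holds a single domain with the read count in column 4. The file is read twice: once to get the total, once to find where the running sum crosses 5%, 50% and 95% of it.

```shell
# Hypothetical helper: report the lines at which the running sum of column 4
# first reaches 5%, 50% and 95% of the column total.
domain_quantiles() {
    # $1 = bedgraph file covering one domain (passed to awk twice)
    awk '
        NR == FNR { total += $4; next }      # first pass: total read count
        FNR == 1  { n = split("0.05 0.50 0.95", q, " "); k = 1 }
        {
            s += $4                          # second pass: running sum
            # one line may satisfy several fractions, hence the loop
            while (k <= n && s >= q[k] * total) {
                print q[k], $0
                k++
            }
        }
    ' "$1" "$1"
}
```

With the four-line chr1.bedgraph from the question (total 130), `domain_quantiles chr1.bedgraph` reports the first line for 0.05, the second for 0.50 and the fourth for 0.95.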

8.0 years ago
tomc ▴ 90

You can pass your script an arbitrary variable, say 'LIMIT', with

script.awk -v "LIMIT=73" chr1.bedgraph

then inside the script use

if(s >=LIMIT)
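Put together, that could look like the sketch below. The wrapper name `first_line_reaching` is only for illustration; the awk body is the script from the question with the hard-coded 100 replaced by the LIMIT variable.

```shell
# Hypothetical helper: print the first line at which the running sum of
# column 4 reaches the limit, then stop reading.
first_line_reaching() {
    # $1 = limit, $2 = bedgraph file
    awk -v LIMIT="$1" '
        { s += $4 }                    # running total of column 4
        s >= LIMIT { print; exit }     # first line that reaches LIMIT
    ' "$2"
}
```

For the chr1.bedgraph in the question, `first_line_reaching 100 chr1.bedgraph` prints the `chr1 3000 4000 25` line. If the total never reaches the limit, nothing is printed.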

To get the sequence of limits from your second file (values.txt),
filter out the first column (this assumes the file is tab-separated; otherwise specify the delimiter with -d)

cut -f 2,3,4 values.txt

As @decosterwouter says, it is not clear what you are hoping to achieve, how the values would be applied as limits, or what range of your first file each limit would apply to.

To apply each of the values to the whole of the first bedgraph file:

for v in $(cut -f 2,3,4 values.txt); do script.awk -v "LIMIT=${v}" chr1.bedgraph; done

Hope that gets you closer, but if you are doing something more complicated, please explain it better and be ready to consider Python or another language of your choice.
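A slightly fuller sketch of the same loop, in case it helps. It assumes values.txt may be separated by spaces or tabs (awk splits on any run of whitespace, which `cut -f` does not), and it labels each hit with the sample name and the limit used. The name `scan_all_limits` and the `NA` placeholder are made up for illustration.

```shell
# Hypothetical helper: for every limit in columns 2..N of the values file,
# print the sample name, the limit, and the first bedgraph line whose
# running column-4 sum reaches that limit (NA if it is never reached).
scan_all_limits() {
    # $1 = values file, $2 = bedgraph file
    awk '{ for (i = 2; i <= NF; i++) print $1, $i }' "$1" |
    while read -r sample limit; do
        hit=$(awk -v LIMIT="$limit" '{ s += $4 } s >= LIMIT { print; exit }' "$2")
        printf '%s\t%s\t%s\n' "$sample" "$limit" "${hit:-NA}"
    done
}
```

Called as `scan_all_limits values.txt chr1.bedgraph`, this rescans the bedgraph once per limit, which is fine for small files but may be slow for large ones. With the example files in the question, the limits 200, 300 and 400 come out as `NA` because the column total is only 130.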
