Question: Merge series and recognize end of it - AWK
0
2.3 years ago by
Korsocius150
Korsocius150 wrote:

Dear all,

could you help me please merge rows by coordinates in column \$2. There are series of coordinates growing by one. I wan to output f.e. : 1st row merge to 4th row 9079811-9079814 and after that there is no series so merge it to another row etc.. for 3rd column in input I would like to count average.

I wrote some script, but this script merge all rows from first coordinate to last coordinate. no condition to series.

`````` awk -F'\t' -v OFS="\t" '{print \$2,\$4,\$3,\$1}' input | awk '!x[\$2]{x[\$2]=\$1}y[\$2]<\$1{y[\$2]=\$1;}x[\$2]>\$1{x[\$2]=\$1} {sum+=\$3} END{for(i in y)print \$1,x[i],y[i],sum/NR,i}' | sort -V -k1,1 > output
``````

INPUT:

``````chr12   9079811 29  A2M
chr12   9079812 29  A2M
chr12   9079813 29  A2M
chr12   9079814 28  A2M
chr12   9091202 5   A2M
chr12   9091203 5   A2M
chr12   9091204 5   A2M
chr12   9091390 15  A2M
chr12   9091391 15  A2M
chr12   9091392 13  A2M
``````

OUTPUT:

``````chr12  9079811  9079814 28.75 A2M
chr12  9091202  9091204 5     A2M
chr12  9091390  9091392 14.3  A2M
``````
awk merge • 741 views
modified 2.3 years ago by Alex Reynolds30k • written 2.3 years ago by Korsocius150

Why do you _need_ to use `awk`? Python is better suited, as your programming will be more readable.

I think it won't be a problem to do it in python, but..I am beginner in this language :-(

All the more reason to use it - how else would you learn?

1
2.3 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

You may want to use standard set and map operation toolkits, like BEDOPS.

BEDOPS will do a lot of the heavy lifting with set and map operations and reduce the likelihood of errors. It will almost definitely be faster than other tools or scripting languages.

First, convert the input to a sorted BED file, using `awk` and BEDOPS `sort-bed`:

``````\$ awk -vOFS="\t" '{ print \$1, \$2, (\$2+1), \$4, \$3; }' input.txt | sort-bed - > input.bed
``````

Note that we reverse the order of score and ID fields. This is so that the resulting BED file follows UCSC convention, where the ID is in the fourth column, and the score data are in the fifth column.

Merge the intervals with BEDOPS `bedops --merge`:

``````\$ bedops --merge input.bed > merge.bed
``````

Map the input BED intervals to the merged intervals, using the `--mean` and `--echo-map-id-uniq` operators with BEDOPS `bedmap` to get back your answer:

``````\$ bedmap --echo --mean --echo-map-id-uniq --delim '\t' merge.bed input.bed > answer.bed
``````

Each line of `answer.bed` contains the merged interval, the mean of signal from `input.bed` over the merged interval, and the unique ID or IDs from `input.bed` over the merged interval.

The `bedmap` tool has lots of other score operators. So while you may want to use `--mean` here, you could also use `--median` or `--min`, `--max`, `--sum`, `--stdev` etc. depending on what you might need to do with signal down the road.

0
2.3 years ago by
Kevin Blighe59k
Kevin Blighe59k wrote:

Hi Korsocius, please try this solution:

``````awk 'BEGIN {previousPos=9079810; nextPost=0; sum=0; linecount=0; startPos=0} {nextPos=\$2; if (previousPos+1==nextPos) {previousPos=nextPos; sum+=\$3; linecount+=1} else {print \$1"\t"startPos"\t"previousPos"\t"sum/linecount"\t"\$4; previousPos=nextPos; sum=0; sum+=\$3; linecount=1}; if (linecount==1) startPos=\$2} END {print \$1"\t"startPos"\t"previousPos"\t"sum/linecount"\t"\$4} ' input

chr12   9079811 9079814 28.75   A2M
chr12   9091202 9091204 5   A2M
chr12   9091390 9091392 14.3333 A2M
``````