Question: Merge series and recognize end of it - AWK
0
gravatar for Korsocius
2.3 years ago by
Korsocius150
Korsocius150 wrote:

Dear all,

could you help me please merge rows by coordinates in column $2. There are series of coordinates growing by one. I wan to output f.e. : 1st row merge to 4th row 9079811-9079814 and after that there is no series so merge it to another row etc.. for 3rd column in input I would like to count average.

I wrote some script, but this script merge all rows from first coordinate to last coordinate. no condition to series.

 awk -F'\t' -v OFS="\t" '{print $2,$4,$3,$1}' input | awk '!x[$2]{x[$2]=$1}y[$2]<$1{y[$2]=$1;}x[$2]>$1{x[$2]=$1} {sum+=$3} END{for(i in y)print $1,x[i],y[i],sum/NR,i}' | sort -V -k1,1 > output

INPUT:

chr12   9079811 29  A2M
chr12   9079812 29  A2M
chr12   9079813 29  A2M
chr12   9079814 28  A2M
chr12   9091202 5   A2M
chr12   9091203 5   A2M
chr12   9091204 5   A2M
chr12   9091390 15  A2M
chr12   9091391 15  A2M
chr12   9091392 13  A2M

OUTPUT:

chr12  9079811  9079814 28.75 A2M
chr12  9091202  9091204 5     A2M
chr12  9091390  9091392 14.3  A2M
awk merge • 741 views
ADD COMMENTlink modified 2.3 years ago by Alex Reynolds30k • written 2.3 years ago by Korsocius150

Why do you _need_ to use awk? Python is better suited, as your programming will be more readable.

ADD REPLYlink written 2.3 years ago by RamRS27k

I think it won't be a problem to do it in python, but..I am beginner in this language :-(

ADD REPLYlink written 2.3 years ago by Korsocius150

All the more reason to use it - how else would you learn?

ADD REPLYlink written 2.3 years ago by RamRS27k
1
gravatar for Alex Reynolds
2.3 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

You may want to use standard set and map operation toolkits, like BEDOPS.

BEDOPS will do a lot of the heavy lifting with set and map operations and reduce the likelihood of errors. It will almost definitely be faster than other tools or scripting languages.

First, convert the input to a sorted BED file, using awk and BEDOPS sort-bed:

$ awk -vOFS="\t" '{ print $1, $2, ($2+1), $4, $3; }' input.txt | sort-bed - > input.bed

Note that we reverse the order of score and ID fields. This is so that the resulting BED file follows UCSC convention, where the ID is in the fourth column, and the score data are in the fifth column.

Merge the intervals with BEDOPS bedops --merge:

$ bedops --merge input.bed > merge.bed

Map the input BED intervals to the merged intervals, using the --mean and --echo-map-id-uniq operators with BEDOPS bedmap to get back your answer:

$ bedmap --echo --mean --echo-map-id-uniq --delim '\t' merge.bed input.bed > answer.bed

Each line of answer.bed contains the merged interval, the mean of signal from input.bed over the merged interval, and the unique ID or IDs from input.bed over the merged interval.

The bedmap tool has lots of other score operators. So while you may want to use --mean here, you could also use --median or --min, --max, --sum, --stdev etc. depending on what you might need to do with signal down the road.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Alex Reynolds30k
0
gravatar for Kevin Blighe
2.3 years ago by
Kevin Blighe59k
Kevin Blighe59k wrote:

Hi Korsocius, please try this solution:

awk 'BEGIN {previousPos=9079810; nextPost=0; sum=0; linecount=0; startPos=0} {nextPos=$2; if (previousPos+1==nextPos) {previousPos=nextPos; sum+=$3; linecount+=1} else {print $1"\t"startPos"\t"previousPos"\t"sum/linecount"\t"$4; previousPos=nextPos; sum=0; sum+=$3; linecount=1}; if (linecount==1) startPos=$2} END {print $1"\t"startPos"\t"previousPos"\t"sum/linecount"\t"$4} ' input

chr12   9079811 9079814 28.75   A2M
chr12   9091202 9091204 5   A2M
chr12   9091390 9091392 14.3333 A2M
ADD COMMENTlink written 2.3 years ago by Kevin Blighe59k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1265 users visited in the last hour