Question: <Data Munging> Convert Methylation Percentage to Number of Reads
gravatar for wanziyi89
5.3 years ago by
Singapore, Temasek Life Sciences Laboratory
wanziyi8960 wrote:

Dear all, I have a file as shown below:

chrBase        chr       base    strand    coverage    freqC    freqT
chrLG1.6955    chrLG1    6955       R       12         0.00     100.00



I would like to print this as a new file with the following format :


Which will give me

chrLG1 6955 12 12

Usually I will just do

awk'{print$1"/t"$2} input.txt > output.txt

but this time it involves some mathematical conversions. Can anyone here guide me on how to do this conversion in UNIX?

many thanks in advance and I wish you a good Monday!




conversion methylation • 1.1k views
ADD COMMENTlink modified 5.3 years ago by Devon Ryan98k • written 5.3 years ago by wanziyi8960
gravatar for Devon Ryan
5.3 years ago by
Devon Ryan98k
Freiburg, Germany
Devon Ryan98k wrote:

awk doesn't have a round() function, at least not the last time I checked, so it'd be easier in python or perl. I hate perl (though it'd be shorter here!), so here's a couple lines of python:

#!/usr/bin/env python
import sys
import csv

for line in csv.reader(open(sys.argv[1], "r")) :
    frac = float(line[6])
    cov = int(line[4])
    nT = round(cov*frac/100)
    nC = cov-nT
    print("%s\t%s\t%i\t%i" % (line[1], line[2], cov, nT))

Note that the number of digits shown limits how precise this method can be, but there's no way around that.

Edit: I shouldn't code before my first cup of coffee. My original reply mixed 3 programming languages!

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by Devon Ryan98k

Awk has rounding through using printf in place of print, but there's a gotcha for the unweary ... Awk uses C's sprintf function, which by default does 'even' rounding on tie breaks (rounding towards the nearest even number) ...

awk 'BEGIN {printf("%.0f\n", 0.5)}'
awk 'BEGIN {printf("%.0f\n", 1.5)}'

Depending on your C libraries, it's configurable (see here), but in my experience, few systems support this.

However, assuming the calculated proportion must basically fall back to an integer number of reads +- float arithmetic error, this shouldn't make any difference here. So...

tail -n+2 <yourfile> | awk '{printf("%s\t%s\t%s\t%.0f\n", $2, $3, $5, $5*$7/100)}
ADD REPLYlink modified 14 months ago by _r_am32k • written 5.3 years ago by george.ry1.1k

Good to know!

ADD REPLYlink modified 14 months ago by _r_am32k • written 5.3 years ago by Devon Ryan98k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1611 users visited in the last hour