Question: <Data Munging> Convert Methylation Percentage to Number of Reads
0
5.3 years ago by
wanziyi8960
Singapore, Temasek Life Sciences Laboratory
wanziyi8960 wrote:

Dear all, I have a file as shown below:

```chrBase        chr       base    strand    coverage    freqC    freqT
chrLG1.6955    chrLG1    6955       R       12         0.00     100.00```

I would like to print this as a new file with the following format :

`Chr"/t"Position"/t"Coverage"/t"Numbers_of_reads_in_T`

Which will give me

`chrLG1 6955 12 12`

Usually I will just do

`awk'{print\$1"/t"\$2} input.txt > output.txt`

but this time it involves some mathematical conversions. Can anyone here guide me on how to do this conversion in UNIX?

many thanks in advance and I wish you a good Monday!

conversion methylation • 1.1k views
modified 5.3 years ago by Devon Ryan98k • written 5.3 years ago by wanziyi8960
3
5.3 years ago by
Devon Ryan98k
Freiburg, Germany
Devon Ryan98k wrote:

awk doesn't have a round() function, at least not the last time I checked, so it'd be easier in python or perl. I hate perl (though it'd be shorter here!), so here's a couple lines of python:

```#!/usr/bin/env python
import sys
import csv

for line in csv.reader(open(sys.argv[1], "r")) :
frac = float(line[6])
cov = int(line[4])
nT = round(cov*frac/100)
nC = cov-nT
print("%s\t%s\t%i\t%i" % (line[1], line[2], cov, nT))
```

Note that the number of digits shown limits how precise this method can be, but there's no way around that.

Edit: I shouldn't code before my first cup of coffee. My original reply mixed 3 programming languages!

2

Awk has rounding through using printf in place of print, but there's a gotcha for the unweary ... Awk uses C's sprintf function, which by default does 'even' rounding on tie breaks (rounding towards the nearest even number) ...

``````awk 'BEGIN {printf("%.0f\n", 0.5)}'
0
awk 'BEGIN {printf("%.0f\n", 1.5)}'
2
``````

Depending on your C libraries, it's configurable (see here), but in my experience, few systems support this.

However, assuming the calculated proportion must basically fall back to an integer number of reads +- float arithmetic error, this shouldn't make any difference here. So...

``````tail -n+2 <yourfile> | awk '{printf("%s\t%s\t%s\t%.0f\n", \$2, \$3, \$5, \$5*\$7/100)}
``````

Good to know!