Question

Calculatio median in a Fastq file using bioawk/awk

1

Entering edit mode

4.0 years ago

zo • 0

I need to calculate the median in the fastq file, how can I do this with awk or bioawk?

awk median bioawk fastq • 2.2k views

ADD COMMENT • link updated 4.0 years ago by cpad0112 21k • written 4.0 years ago by zo • 0

0

Entering edit mode

median of?

ADD REPLY • link 4.0 years ago by cpad0112 21k

0

Entering edit mode

median length of reads in the fastq file.

ADD REPLY • link 4.0 years ago by zo • 0

0

Entering edit mode

You can use fastqc for that!

ADD REPLY • link 4.0 years ago by brunobsouzaa ▴ 840

score 4 · Accepted Answer · 2021-11-05

Straight up awk soliution would be:

cat *.fq  | awk 'NR % 4 == 0 { print length($1) } ' | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'

I would do it with bioawk and datamash

cat *.fq  | bioawk -c fastx '{ print(length($seq)) }' | datamash median 1

you can also implement a better median in awk as shown here:

https://stackoverflow.com/questions/6166375/median-of-column-with-awk

that works like this:

 cat *.fq  | bioawk -c fastx '{ print(length($seq)) }' | bioawk ' { a[i++]=$1; } \
                       END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'

score 3 · Accepted Answer · 2021-11-05

with seqkit

$ seqkit stats -a test.fq

In output, Q2 is median value. If you want to extract fastq name, and median, try

$ seqkit stats test.fq -aT | cut -f1,10

with awk and st (available in brew, conda):

$ awk 'NR % 4==2 {print length ($1)}' hcc1395_tumor_rep1_r1.fastq |st --summary

 min    q1  median  q3  max
    151 151 151 151 151

$ awk 'NR % 4==2 {print length ($1)}' hcc1395_tumor_rep1_r1.fastq |st --median 
151

Ministat is available in system repos for many GNU-Linux and BSD distros and you can try:

$ awk 'NR % 4==2 {print length ($1)}' hcc1395_tumor_rep1_r1.fastq | ministat -A |tr -s " "| cut -f5 -d " "

Median
151