Entering edit mode
                    4.0 years ago
        zo
        
    
        •
    
    0
    I need to calculate the median in the fastq file, how can I do this with awk or bioawk?
I need to calculate the median in the fastq file, how can I do this with awk or bioawk?
Straight up awk soliution would be:
cat *.fq  | awk 'NR % 4 == 0 { print length($1) } ' | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
I would do it with bioawk and datamash
cat *.fq  | bioawk -c fastx '{ print(length($seq)) }' | datamash median 1
you can also implement a better median in awk as shown here:
that works like this:
 cat *.fq  | bioawk -c fastx '{ print(length($seq)) }' | bioawk ' { a[i++]=$1; } \
                       END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'
with seqkit
$ seqkit stats -a test.fq
In output, Q2 is median value. If you want to extract fastq name, and median, try
$ seqkit stats test.fq -aT | cut -f1,10
with awk and st (available in brew, conda):
$ awk 'NR % 4==2 {print length ($1)}' hcc1395_tumor_rep1_r1.fastq |st --summary
 min    q1  median  q3  max
    151 151 151 151 151
$ awk 'NR % 4==2 {print length ($1)}' hcc1395_tumor_rep1_r1.fastq |st --median 
151
Ministat is available in system repos for many GNU-Linux and BSD distros and you can try:
$ awk 'NR % 4==2 {print length ($1)}' hcc1395_tumor_rep1_r1.fastq | ministat -A |tr -s " "| cut -f5 -d " "
Median
151
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
median of?
median length of reads in the fastq file.
You can use
fastqcfor that!