Tool: Bioawk - FASTA, FASTQ, SAM, BED, GFF Aware Awk Programming
8.9 years ago

Bioawk is an extension to Brian Kernighan's awk, created by Heng Li, that adds support for several common biological data formats, including optionally gzip-compressed BED, GFF, SAM, VCF and FASTA/Q, as well as generic TAB-delimited formats with column names.

Code

The source code can be found on the bioawk GitHub page. Users will need to download it and run make to compile it. The examples below assume that this version of awk is being used.

Documentation

There is a short manual page in the main distribution and a longer HTML-formatted help page.

Examples

Extract unmapped reads without header:

awk -c sam 'and($flag,4)' aln.sam.gz

Extract mapped reads with header:

awk -c sam -H '!and($flag,4)' aln.sam.gz
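
The two commands above test a single bit of the SAM FLAG field: 4 marks an unmapped read, and 16 marks a read aligned to the reverse strand. As a rough sketch of what and($flag,4) checks, the same test can be written with integer arithmetic so that it runs in any POSIX awk. The helper name flag_has is hypothetical and not part of bioawk, and the bit argument is assumed to be a power of two.

```shell
# Hypothetical helper (not part of bioawk): succeed if FLAG has the given bit set.
# Uses integer division and modulo instead of and(), so any POSIX awk can run it.
# Assumes "bit" is a power of two (1, 2, 4, 8, 16, ...).
flag_has() {
    awk -v flag="$1" -v bit="$2" 'BEGIN { exit !(int(flag / bit) % 2) }'
}

flag_has 20 4  && echo "20: unmapped"          # 20 = 16 + 4
flag_has 20 16 && echo "20: reverse strand"
flag_has 99 4  || echo "99: mapped"            # 99 = 64 + 32 + 2 + 1
```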

Reverse complement FASTA:

awk -c fastx '{ print ">"$name;print revcomp($seq) }' seq.fa.gz

Create FASTA from SAM (uses revcomp if FLAG & 16):

samtools view aln.bam | \
    awk -c sam '{ s=$seq; if(and($flag, 16)) {s=revcomp($seq) } print ">"$qname"\n"s}'

Get the GC content (as a fraction, not a percentage) from FASTA:

awk -c fastx '{ print ">"$name; print gc($seq) }' seq.fa.gz

Get the mean Phred quality score from FASTQ:

awk -c fastx '{ print ">"$name; print meanqual($qual) }' seq.fq.gz

Take column name from the first line (where "age" appears in the first line of input.txt):

awk -c header '{ print $age }' input.txt
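
For readers without bioawk at hand, the idea of addressing a column by its header name can be approximated in plain awk. This is only a sketch of the concept, not bioawk's implementation: the helper name col_by_name is hypothetical, the input is assumed to be TAB-delimited, and the helper skips the header line, which may differ from what bioawk itself prints.

```shell
# Hypothetical helper (not part of bioawk): print the column whose header
# matches the given name from TAB-delimited input on stdin.
col_by_name() {
    awk -F'\t' -v name="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) c = i; next }  # find the column index
        c       { print $c }                                               # print it for data rows
    '
}

# Made-up three-line table for illustration.
printf 'name\tage\nann\t31\nbob\t27\n' | col_by_name age
```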

It should be noted that gc($seq) doesn't exclude Ns from the calculation, so ACGTNNNNACGT results in 0.333333.

I wonder what the acceptable answer is for this case. One could ignore the Ns, count each N as 1/4, or, as bioawk does here, count them all in the denominator.
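
As a sketch of the first option (ignoring Ns), the GC fraction can be computed over unambiguous bases only. The helper name gc_no_n is hypothetical and not a bioawk function. It uses plain POSIX awk on a single sequence string, and the same expressions should also work on $seq inside a bioawk program.

```shell
# Hypothetical helper (not part of bioawk): GC fraction over A, C, G, T only.
gc_no_n() {
    printf '%s\n' "$1" | awk '{
        s  = toupper($0)
        gc = gsub(/[GC]/, "", s)   # gsub returns the number of replacements, i.e. the G+C count
        at = gsub(/[AT]/, "", s)   # A+T count; whatever is left in s (N etc.) is ignored
        if (gc + at > 0) printf "%.6f\n", gc / (gc + at)
        else print "NA"            # no unambiguous bases at all
    }'
}

gc_no_n ACGTNNNNACGT   # prints 0.500000, versus 0.333333 from gc($seq)
```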

I was just going to post this. :)

Indeed, the previous discussion made us all realize what a good fit it is for this section.

I used bioawk to calculate the mean quality score of a FASTQ file and it gave me one mean per read. Now, how can I calculate the overall mean quality from bioawk's output?
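
One possible approach, sketched under assumptions: have bioawk print the read length next to each per-read mean, then take a length-weighted average, since a plain average of per-read means would over-weight short reads. The helper name overall_meanqual is hypothetical, and its input is assumed to be one "length mean" pair per line.

```shell
# Hypothetical helper (not part of bioawk): read "length mean" pairs on stdin
# and print the length-weighted mean, i.e. the mean over all bases.
overall_meanqual() {
    awk '{ bases += $1; total += $1 * $2 }
         END { if (bases > 0) printf "%.2f\n", total / bases; else print "NA" }'
}

# With bioawk, the pairs could come from something like:
#   awk -c fastx '{ print length($seq), meanqual($qual) }' seq.fq.gz | overall_meanqual

# Two made-up reads: 100 bases at mean 30 and 50 bases at mean 20.
printf '100 30\n50 20\n' | overall_meanqual   # prints 26.67
```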

Questions need to be asked separately as a new entry and not as a comment or answer to a post.
