Question

Command to count reads in fastq file with last bases

1

Entering edit mode

6.9 years ago

MAPK ★ 2.1k

Can someone please help me count the number of reads in fastq file with 3' terminal bases? For example, I need to get the account of reads ending with A, T, G and C in any input.fastq

fastq readcount • 2.5k views

ADD COMMENT • link updated 6.9 years ago by Eric Lim ★ 2.2k • written 6.9 years ago by MAPK ★ 2.1k

score 3 · Accepted Answer · 2018-07-29

Heng Li's seqtk collaborating with awk (or here mawk which is faster):

seqtk seq -A in.fastq | mawk '$1 ~ ">" {print ">foo";next} {n=split($1,A,""); print A[n]}' | seqtk comp | mawk 'OFS="\t" {sumA+=$3; sumC+=$4; sumG+=$5; sumT+=$6} END {print "A:"sumA,"C:"sumC,"G:"sumG,"T:"sumT}'

It basically creates a new fasta file, all with the dummy name >foo (just that seqtk does not complain about missing headers), and the last base of the sequence, which is then sent to seqtk comp, summarizing the number of occurrences of ATCG in that file. Last part is simply adding each ATCG and printing it.

Still, I encourage you to try and find these kinds of solutions yourself. In my experiences, putting together these little scripts really help to make yourself better with the command line. The solutions are out there in the communities, you only have to pipe the individual pieces together to get your specific output.

score 3 · Accepted Answer · 2018-07-29

3

Entering edit mode

6.9 years ago

Eric Lim ★ 2.2k

I like the seqtk version. You can also use a combination of native commands like sed, rev, and cut to accomplish the same thing and it's easier on the eyes.

sed -n '2~4p' input.fq | rev | cut -c1 | sort | uniq -c

sort | uniq -c is decent for low millions of lines. If your file is large, awk or mawk is generally preferred. You can replace it with awk '{ cts[$0] += 1 } END { for (v in cts) print cts[v], v }' and it'll save you quite a bit of time. If your file is really large, the seqtk version should be the fastest.

To use awk or mawk for sort unique, you can save it as an alias.

alias sortuniq="awk '{ cts[\$0] += 1 } END { for (v in cts) print cts[v], v }'"

ADD COMMENT • link 6.9 years ago by Eric Lim ★ 2.2k

1

Entering edit mode

rev and cut -c1 can be cut, further shortening the code:

 sed -n '2~4p' input.fastq | grep -o '.$' | sort |uniq -c