Can someone please help me count the number of reads in fastq file with 3' terminal bases? For example, I need to get the account of reads ending with A, T, G and C in any input.fastq
Can someone please help me count the number of reads in fastq file with 3' terminal bases? For example, I need to get the account of reads ending with A, T, G and C in any input.fastq
Heng Li's seqtk
collaborating with awk (or here mawk which is faster):
seqtk seq -A in.fastq | mawk '$1 ~ ">" {print ">foo";next} {n=split($1,A,""); print A[n]}' | seqtk comp | mawk 'OFS="\t" {sumA+=$3; sumC+=$4; sumG+=$5; sumT+=$6} END {print "A:"sumA,"C:"sumC,"G:"sumG,"T:"sumT}'
It basically creates a new fasta file, all with the dummy name >foo
(just that seqtk
does not complain about missing headers), and the last base of the sequence, which is then sent to seqtk comp
, summarizing the number of occurrences of ATCG in that file. Last part is simply adding each ATCG and printing it.
Still, I encourage you to try and find these kinds of solutions yourself. In my experiences, putting together these little scripts really help to make yourself better with the command line. The solutions are out there in the communities, you only have to pipe the individual pieces together to get your specific output.
I like the seqtk
version. You can also use a combination of native commands like sed
, rev
, and cut
to accomplish the same thing and it's easier on the eyes.
sed -n '2~4p' input.fq | rev | cut -c1 | sort | uniq -c
sort | uniq -c
is decent for low millions of lines. If your file is large, awk
or mawk
is generally preferred. You can replace it with awk '{ cts[$0] += 1 } END { for (v in cts) print cts[v], v }'
and it'll save you quite a bit of time. If your file is really large, the seqtk
version should be the fastest.
To use awk
or mawk
for sort unique, you can save it as an alias.
alias sortuniq="awk '{ cts[\$0] += 1 } END { for (v in cts) print cts[v], v }'"
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
rev and cut -c1 can be cut, further shortening the code:
Nice, very simplistic. I like it.