Command to count reads in fastq file with last bases
2
1
Entering edit mode
5.7 years ago
MAPK ★ 2.1k

Can someone please help me count the number of reads in fastq file with 3' terminal bases? For example, I need to get the account of reads ending with A, T, G and C in any input.fastq

fastq readcount • 2.0k views
ADD COMMENT
3
Entering edit mode
5.7 years ago
ATpoint 81k

Heng Li's seqtk collaborating with awk (or here mawk which is faster):

seqtk seq -A in.fastq | mawk '$1 ~ ">" {print ">foo";next} {n=split($1,A,""); print A[n]}' | seqtk comp | mawk 'OFS="\t" {sumA+=$3; sumC+=$4; sumG+=$5; sumT+=$6} END {print "A:"sumA,"C:"sumC,"G:"sumG,"T:"sumT}'

It basically creates a new fasta file, all with the dummy name >foo (just that seqtk does not complain about missing headers), and the last base of the sequence, which is then sent to seqtk comp, summarizing the number of occurrences of ATCG in that file. Last part is simply adding each ATCG and printing it.

Still, I encourage you to try and find these kinds of solutions yourself. In my experiences, putting together these little scripts really help to make yourself better with the command line. The solutions are out there in the communities, you only have to pipe the individual pieces together to get your specific output.

ADD COMMENT
3
Entering edit mode
5.7 years ago
Eric Lim ★ 2.1k

I like the seqtk version. You can also use a combination of native commands like sed, rev, and cut to accomplish the same thing and it's easier on the eyes.

sed -n '2~4p' input.fq | rev | cut -c1 | sort | uniq -c

sort | uniq -c is decent for low millions of lines. If your file is large, awk or mawk is generally preferred. You can replace it with awk '{ cts[$0] += 1 } END { for (v in cts) print cts[v], v }' and it'll save you quite a bit of time. If your file is really large, the seqtk version should be the fastest.

To use awk or mawk for sort unique, you can save it as an alias.

alias sortuniq="awk '{ cts[\$0] += 1 } END { for (v in cts) print cts[v], v }'"
ADD COMMENT
1
Entering edit mode

rev and cut -c1 can be cut, further shortening the code:

 sed -n '2~4p' input.fastq | grep -o '.$' | sort |uniq -c
ADD REPLY
0
Entering edit mode

Nice, very simplistic. I like it.

ADD REPLY

Login before adding your answer.

Traffic: 2010 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6