How to count fastq reads
9
13
Entering edit mode
9.5 years ago
Chenglin ▴ 260

I have NGS data in .fastq format. Does anyone know how to count reads? Thank you!

sequence next-gen fastq reads • 128k views
ADD COMMENT
40
Entering edit mode
6.7 years ago
sacha ★ 2.4k

'wc' is faster than awk

#yourfile.fastq  
echo $(cat yourfile.fastq|wc -l)/4|bc

#yourfile.fastq.gz
 echo $(zcat yourfile.fastq.gz|wc -l)/4|bc
ADD COMMENT
0
Entering edit mode

Can you please explain what is the purpose of | bc???

ADD REPLY
2
Entering edit mode

BC is a calculator. Here I divide the number on ligne by 4.

Try "echo 4+2|bc" to understand

ADD REPLY
8
Entering edit mode
9.5 years ago

for fasta files: grep -c "^>" file.fasta

for fastq files: grep -c "^@" file.fastq

for fastq files: awk '{s++}END{print s/4}' file.fastq

ADD COMMENT
0
Entering edit mode

Hmm, that will work for fasta files, not for fastq file!

For fastq files, you can simply count the number of lines and then divide it by 4.

ADD REPLY
0
Entering edit mode

Thank you! Can you explain a little bit, please? why grep -c ">" file.fastq is not working? And why I need to count the number of lines and then divide it by 4?

ADD REPLY
0
Entering edit mode

fasta sequences start with ">", and fastq identifiers start with "@". counting "@" starts is much faster.

ADD REPLY
2
Entering edit mode

Again, there can be a quality score @ that can be starting from the first line, this will throw off your counts if you use grep. Better use the line counts and divide it by 4 (even if it takes some time)

@Chenglin: each fastq read comprises of 4 lines, first line is identifier, second line is the sequence, third line is a blank line (starts with +, may sometime have same description as first line) and the last line is quality for the each base in the second line. So if you count the total number of lines, you get number of reads times 4, so you divide it by 4 and you have the actual number of reads.

ADD REPLY
0
Entering edit mode

that's right. indeed quality scores can contain "@" characters, which may appear at the beginning of the quality line: http://en.wikipedia.org/wiki/FASTQ_format#Format. I've edited my answer to include the awk solution that counts all lines and divides them by 4 at the end.

ADD REPLY
0
Entering edit mode

Hi Jorge,

I do see "@" in the quality line. Which command line I can use, please? I am a beginner and I am a little confused.....Sorry.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLY
0
Entering edit mode

"@" characters may appear in the quality line, so counting "@" starts doesn't work. instead, as previously stated, you may want to count all lines and divide them by 4. this can be accomplished in several ways, and the awk solution on my answer is just one of them.

ADD REPLY
0
Entering edit mode

Thank you very much! I will try.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLY
5
Entering edit mode
9.5 years ago
arnstrm ★ 1.9k

Here is the fancy script in bash:

#!/bin/bash
if [ $# -lt 1 ] ; then
    echo ""
    echo "usage: count_fastq.sh [fastq_file1] <fastq_file2> ..|| *.fastq"
    echo "counts the number of reads in a fastq file"
    echo ""
    exit 0
fi
 
filear=${@};
for i in ${filear[@]}
do
lines=$(wc -l $i|cut -d " " -f 1)
count=$(($lines / 4))
echo -n -e "\t$i : "
echo "$count"  | \
sed -r '
  :L
  s=([0-9]+)([0-9]{3})=\1,\2=
  t L'
done

Save this as a file (like count_fastq.sh ), then

chmod +x count_fastq.sh

After this you can run it on your file as:

./count_fastq.sh your_file.fastq
ADD COMMENT
0
Entering edit mode

Thank you so much, arnstrm. Thank you for sending me the cool scirpt! I am just a beginner. I will try your script.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLY
0
Entering edit mode

Wow, I tried this and it works!

Can I have your interpretation on the script you showed above? If you are too occupied to share, can you kindly give some rough guidance on how to interpret this script?

Thanks in advance!

ADD REPLY
0
Entering edit mode

This script didn't work with me Even after several trials May someone advice how to run it to get the count as I am beginner

ADD REPLY
3
Entering edit mode
6.2 years ago
for i in `ls *.fastq.gz`; do echo $(zcat ${i} | wc -l)/4|bc; done
  • of note: ` are around ls .fastq.gz

Explanation:

For all gzip compressed fastq files, display the number of reads since 4 lines = 1 reads

*Just a good one-liner to see how many reads obtained from something like demultiplexing went

ADD COMMENT
2
Entering edit mode
4.5 years ago
sbwiecko ▴ 20

couldn't we just count the number of separations between the sequence and quality score lines?

grep -c "^+$" ./ref.fastq
ADD COMMENT
1
Entering edit mode

the only very slight problem I see is that very rare 1 base sequences (trimmed perhaps) with "+" quality would be counted twice, but I love the simplicity of this answer.

ADD REPLY
0
Entering edit mode

In some older fastq files, the separator repeats the header line (without the leading @).

ADD REPLY
1
Entering edit mode
9.5 years ago
HG ★ 1.2k
$readnumber= {(wc-l <name of fastq file>) /4 } *2
ADD COMMENT
0
Entering edit mode

Thank you so much, I will try.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLY
1
Entering edit mode
7.0 years ago
Fatima ▴ 1000
LINE_COUNT=$(wc -l filename.fastq)             #Counts the number of lines in fastq file
PART=(${LINE_COUNT//\t/ })                          # Removes the file name from the output and only returns number of lines
READ_COUNT=$((PART[0]/4))                       # Divides the number of lines by 4 which equals the number of reads
ADD COMMENT
0
Entering edit mode
9.5 years ago
CraigM ▴ 90

A non command line approach for fastq files is to use FastQC, a quality control checking program which is part of many workflows to get a general idea of sequence quality.

Number of reads is listed in the summary statistics at the beginning of the report.

ADD COMMENT
0
Entering edit mode

Thank you.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLY
0
Entering edit mode
9.5 years ago
Daniel ★ 4.0k

If you're looking at fastq files for the first time, you will probably want to do more than just counting the number of reads, and for that I would recommend downloading FASTQC.

Really good tool for assessing the state of your sequencing data.

ADD COMMENT

Login before adding your answer.

Traffic: 1053 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6