Question: How to count fastq reads
1
gravatar for Chenglin
3.1 years ago by
Chenglin90
United States
Chenglin90 wrote:

I have NGS data in .fastq format. Does anyone know how to count reads? Thank you!

next-gen fastq sequence reads • 12k views
ADD COMMENTlink modified 3 months ago by sacha1.2k • written 3.1 years ago by Chenglin90
3
gravatar for Jorge Amigo
3.1 years ago by
Jorge Amigo10k
Santiago de Compostela, Spain
Jorge Amigo10k wrote:

for fasta files: grep -c "^>" file.fasta

for fastq files: grep -c "^@" file.fastq

for fastq files: awk '{s++}END{print s/4}' file.fastq

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by Jorge Amigo10k

Hmm, that will work for fasta files, not for fastq file!

For fastq files, you can simply count the number of lines and then divide it by 4.

ADD REPLYlink written 3.1 years ago by arnstrm1.6k

Thank you! Can you explain a littble bit, please? why [grep -c ">" file.fastq] is not working? And why I need to count the number of lines and then divide it by 4?

ADD REPLYlink written 3.1 years ago by Chenglin90

fasta sequences start with ">", and fastq identifiers start with "@". counting "@" starts is much faster.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Jorge Amigo10k
1

Again, there can be a quality score @ that can be starting from the first line, this will throw off your counts if you use grep. Better use the line counts and divide it by 4 (even if it takes some time)

@Chenglin: each fastq read comprises of 4 lines, first line is identifier, second line is the sequence, third line is a blank line (starts with +, may sometime have same description as first line) and the last line is quality for the each base in the second line. So if you count the total number of lines, you get number of reads  times 4, so you divide it by 4 and you have the actual number of reads.

 

 

 

ADD REPLYlink written 3.1 years ago by arnstrm1.6k

that's right. indeed quality scores can contain "@" characters, which may appear at the beginning of the quality line: http://en.wikipedia.org/wiki/FASTQ_format#Format. I've edited my answer to include the awk solution that counts all lines and divides them by 4 at the end.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Jorge Amigo10k

Hi Jorge,

I do see "@" in the quality line. Which command line I can use, please? I am a beginner and I am a little confused.....Sorry.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLYlink written 3.1 years ago by Chenglin90

"@" characters may appear in the quality line, so counting "@" starts doesn't work. instead, as previously stated, you may want to count all lines and divide them by 4. this can be accomplished in several ways, and the awk solution on my answer is just one of them.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Jorge Amigo10k

Thank you very much! I will try.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLYlink written 3.1 years ago by Chenglin90
2
gravatar for arnstrm
3.1 years ago by
arnstrm1.6k
Ames, IA
arnstrm1.6k wrote:

Here is the fancy script in bash:

#!/bin/bash
if [ $# -lt 1 ] ; then
    echo ""
    echo "usage: count_fastq.sh [fastq_file1] <fastq_file2> ..|| *.fastq"
    echo "counts the number of reads in a fastq file"
    echo ""
    exit 0
fi
 
filear=${@};
for i in ${filear[@]}
do
lines=$(wc -l $i|cut -d " " -f 1)
count=$(($lines / 4))
echo -n -e "\t$i : "
echo "$count"  | \
sed -r '
  :L
  s=([0-9]+)([0-9]{3})=\1,\2=
  t L'
done

Save this as a file (like count_fastq.sh ), then

chmod +x count_fastq.sh

After this you can run it on your file as:

./count_fastq.sh your_file.fastq
ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by arnstrm1.6k

Thank you so much, arnstrm. Thank you for sending me the cool scirpt! I am just a beginner. I will try your script.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLYlink written 3.1 years ago by Chenglin90
2
gravatar for sacha
3 months ago by
sacha1.2k
France
sacha1.2k wrote:

'wc' is faster than awk

#yourfile.fastq  
echo $(cat yourfile.fastq|wc -l)/4|bc

#yourfile.fastq.gz
 echo $(zcat yourfile.fastq.gz|wc -l)/4|bc
ADD COMMENTlink written 3 months ago by sacha1.2k
1
gravatar for HG
3.1 years ago by
HG1.1k
Germany
HG1.1k wrote:
$readnumber= {(wc-l <name of fastq file>) /4 } *2
ADD COMMENTlink written 3.1 years ago by HG1.1k

Thank you so much, I will try.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLYlink written 3.1 years ago by Chenglin90
0
gravatar for CraigM
3.1 years ago by
CraigM80
Ireland
CraigM80 wrote:

A non command line approach for fastq files is to use FastQC, a quality control checking program which is part of many workflows to get a general idea of sequence quality.

 

Number of reads is listed in the summary statistics at the beginning of the report.

ADD COMMENTlink written 3.1 years ago by CraigM80

Thank you.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLYlink written 3.1 years ago by Chenglin90
0
gravatar for Daniel
3.1 years ago by
Daniel3.6k
Cardiff University
Daniel3.6k wrote:

If you're looking at fastq files for the first time, you will probably want to do more than just counting the number of reads, and for that I would recommend downloading FASTQC.

Really good tool for assessing the state of your sequencing data.

ADD COMMENTlink written 3.1 years ago by Daniel3.6k
0
gravatar for Afagh
7 months ago by
Afagh70
United states
Afagh70 wrote:
LINE_COUNT=$(wc -l filename.fastq)             #Counts the number of lines in fastq file
PART=(${LINE_COUNT//\t/ })                          # Removes the file name from the output and only returns number of lines
READ_COUNT=$((PART[0]/4))                       # Divides the number of lines by 4 which equals the number of reads
ADD COMMENTlink modified 7 months ago • written 7 months ago by Afagh70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1150 users visited in the last hour