Extract basic info from fastq files in an efficient way?
3
0
Entering edit mode
4.8 years ago
rioualen ▴ 610

Hi,

I'm looking for an easy and efficient way to get basic information from a given fastq file programmatically, like number of sequences, average sequence length, average GC%... I adapted this code but it's actually very slow for big fastq files. I was considering parsing FastQC's results but there must be an easier way? I've googled it for hours but I can't find anything.

Thanks a lot!

EDIT

I forgot to mention I'm trying to put that into a python function, so here's what I've got right now:

def fastq_stats(fastq_file):
    os.system("reformat.sh in=" + fastq_file + " gchist=temp")
    with open('temp', 'r') as f:
        first_line = f.readline()
        gc_rate = float(first_line.split()[1])
    os.system("rm temp")
    read_count = int(sum(1 for line in open(fastq_file))/4)

    return (read_count, gc_rate)

It's a bit ugly imo, but it will do for now. I'm still interested by better options! I'm coding in python and R mostly, and this is part of a pipeline that generates FastQC reports automatically, hence my quest for a ready-to-use FastQC parser.

fastq fastqc • 8.8k views
ADD COMMENT
2
Entering edit mode

To know number of seqs quickly, you can use on Linux machine wc -l fastq_file | awk '{$1/4; print}'

ADD REPLY
4
Entering edit mode
4.8 years ago
GenoMax 117k

reformat.sh from BBMap suite can produce the following summaries. reformat.sh in1=R1.fq.gz in2=R2.fq.gz bhist=filename

bhist=<file>            Base composition histogram by position.
qhist=<file>            Quality histogram by position.
qchist=<file>           Count of bases with each quality value.
aqhist=<file>           Histogram of average read quality.
bqhist=<file>           Quality histogram designed for box plots.
lhist=<file>            Read length histogram.
gchist=<file>           Read GC content histogram.
gcbins=100              Number gchist bins.  Set to 'auto' to use read length.
gcplot=f                Add a graphical representation to the gchist.

It will also write a summary to stderr that you can capture to a file.

Input:                          17037614 reads                  851456901 bases
Output:                         17037614 reads (100.00%)        851456901 bases (100.00%)

Time:                           14.574 seconds.
Reads Processed:      17037k    1169.08k reads/sec
Bases Processed:        851m    58.42m bases/sec

It will be fast :-)

ADD COMMENT
4
Entering edit mode
4.8 years ago

fastx_quality_stats from fastxtoolkit:

$ gzip -dc hcc1395_normal_rep1_r1.fastq.gz | fastx_quality_stats | head
column  count   min max sum mean    Q1  med Q3  IQR lW  rW  A_Count C_Count G_Count T_Count N_Count Max_count
1   331958  2   37  10278936    30.96   32  32  32  0   32  32  28945   104041  174760  24163   49  331958

seqkit stats:

$ time(seqkit stats hcc1395_normal_rep1_r1.fastq.gz)
file                             format  type  num_seqs     sum_len  min_len  avg_len  max_len
hcc1395_normal_rep1_r1.fastq.gz  FASTQ   DNA    331,958  50,125,658      151      151      151

real    0m0.500s
user    0m0.280s
sys 0m0.052s

fastqutils from NGSutils:

    $ fastqutils stats hcc1395_normal_rep1_r1.fastq.gz > ~/Desktop/test.out

Output:

Space:  basespace
Pairing:    Fragmented
Quality scale:  Illumina
Number of reads:    331958
Length distribution
Mean:   151.0
StdDev: 0.0
Min:    151
25 percentile:  151
Median: 151
75 percentile:  151
Max:    151
Total:  331958

Quality distribution
pos mean    stdev   min 25pct   50pct   75pct   max count
1   30.9645678068   3.68287083487   2   32  32  32  37  331958
2   31.4920592364   2.91843166829   2   32  32  32  37  331958
.
.
.
150 35.7938052404   8.68052946923   7   32  37  42  42  331958
151 35.3446369721   8.77621250147   7   32  37  42  42  331958

Average quality string
?@DEEIIJJJJJJJJJJJJJJJJJJIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGHIIIIIIIIIIIIIIIIIIIIIIIIIHIHHHHHHHHHHHHHHGGGGGGGGGGGGGGFFFFFFFFEEEEEEEEDD

size of the file:

$ du -ah hcc1395_normal_rep1_r1.fastq.gz 
26M hcc1395_normal_rep1_r1.fastq.gz
ADD COMMENT
0
Entering edit mode

Looks like what I'm looking for! Gonna dig into that :-)

ADD REPLY
1
Entering edit mode
4.8 years ago
h.mon 34k

MultiQC will parse FastQC files for you, and will generate some terribly good-looking graphics as a bonus.

ADD COMMENT
0
Entering edit mode

I know MultiQC is great, but I'm interested in using these numbers as variables in my pipeline :-)

ADD REPLY
0
Entering edit mode

Under the multiqc_data/ folder (default name), there are tab-separated text files with all the metrics collected. These files are easily parsed with awk / python / perl.

ADD REPLY
0
Entering edit mode

Oh yeah that's right, though it implies to run MultiQC in the first place. Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 1749 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6