Question

Run multiple fastq.gz files with FastQC and yield single report?

4

Entering edit mode

8.7 years ago

pbigbig ▴ 250

Hi,

I would like to run FastQC for multiple fastq.gz files (both R1 and R2 of multiple lanes), but this program run in queueing manner, so that it produces a summary for each file, separately.

I wonder that is there any way to merge DATA of all input fastq.gz files together and somehow make FastQC to see them as a single fastq file, so it will produce ONLY 1 summary report for all my files, is that possible? What I need to do to obtain that?

Sorry I am new in this area, thank you in advance for your valuable help!

Regards,
Phuong

fastQC sequencing quality-control • 25k views

ADD COMMENT • link updated 17 months ago by Ram 43k • written 8.7 years ago by pbigbig ▴ 250

Ram · Answer 1 · 2015-07-26

EDIT: I highly recommend people use MultiQC now for summarising FastQC reports. It's a fantastic tool.

I recently generated FastQC reports for ~100 FASTQ files and did not want to inspect them all by hand. Instead I wrote a small script to parse the module results in the zip file produced by FastQC. It simply takes the result of each module and converts them to integer scores (1=pass, 0=warning, -1=fail). I write these results out as a CSV file and then go on to plot the results as a heat map using R or Python. Something like this might help you if you want to quickly inspect all FastQC results and see which FASTQ files need manual inspection.

#!/usr/bin/env python3

# Import necessary libraries:
import csv
import os
import subprocess
import zipfile

# List modules used by FastQC:
modules = ['Basic_Statistics',
           'Per_base_sequence_quality',
           'Per_tile_sequence_quality',
           'Per_sequence_quality_scores',
           'Per_base_sequence_content',
           'Per_sequence_GC_content',
           'Per_base_N_content',
           'Sequence_Length_Distribution',
           'Sequence_Duplication_Levels',
           'Overrepresented_sequences',
           'Adapter_Content',
           'Kmer_Content']

# Set dict to convert module results to integer scores:
scores = {'pass': 1,
          'warn': 0,
          'fail': -1}

# Get current working directory:
cwd = os.getcwd()

# Get list of '_fastqc.zip' files generated by FastQC:
files = [file for file in os.listdir(cwd) if file.endswith('_fastqc.zip')]

# List to collect module scores for each '_fastqc.zip' file:
all_mod_scores = []

# Read fastqc_data.txt file in each archive:
for file in files:
    archive = zipfile.ZipFile(file, 'r') # open '_fastqc.zip' file
    members = archive.namelist() # return list of archive members
    fname = [member for member in members if 'fastqc_data.txt' in member][0] # find 'fastqc_data.txt' in members
    data = archive.open(fname) # open 'fastqc_data.txt'

    # Get module scores for this file:
    mod_scores = [file]
    for line in data:
        text = line.decode('utf-8') 
        if '>>' in text and '>>END' not in text:
            text = text.lstrip('>>').split()
            module = '_'.join(text[:-1])
            result = text[-1]
            mod_scores.append(scores[result])

    # Append to all module scores list:
    all_mod_scores.append(mod_scores)

    # close all opened files:
    data.close()
    archive.close()

# Write scores out to a CSV file:
with open('all_mod_scores.csv', 'w') as f:
    writer = csv.writer(f)
    for mod_scores in all_mod_scores:
        writer.writerow(mod_scores)
    f.close()

Ram · Answer 2 · 2015-07-26

4

Entering edit mode

8.7 years ago

Ying W ★ 4.2k

This sounds like a really bad idea but if you really want to, I believe you can specify multiple fastq when using fastQC on command line

Reasons why this is a bad idea

different lanes will have different error profiles, how will you know which one failed if you merge them all together?
fastqc only takes a subset of reads, not sure what happens when you specify multiple but you could be sampling fewer reads from one file
you can run fastqc in parallel for every lane (assuming u have the I/O bandwidth) so running multiple lanes together should not take too long

ADD COMMENT • link 8.7 years ago by Ying W ★ 4.2k

0

Entering edit mode

Thank you Ying, I will consider your suggestions.

Given that all my run files are quite similar in error profile (I have check them individually by fastQC, although R2 normally have lower quality than R1), I intended to execute several different scenarios on my whole read set: original (unprocessed), trimmed (by Trimmomatic), merged (by FLASh), and then fastQC them to have an overview comparison of data quality with different processing pipeline (so I would prefer 3 report summaries of 3 scenarios instead of dozen for each). Is my idea feasible?

I will consider your hint regarding the subset sampling of fastQC.

Thanks a lot!

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.7 years ago by pbigbig ▴ 250

Ram · Answer 3 · 2015-07-26

2

Entering edit mode

8.7 years ago

arnstrm ★ 1.8k

I think you have to concatenate all gzipped files and then run fastqc on it. For gzipped files you can just:

cat *fastq.gz >> combined.fq.gz

and then:

fastqc combined.fq.gz

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.7 years ago by arnstrm ★ 1.8k

0

Entering edit mode

Thank you for your help!

ADD REPLY • link 8.7 years ago by pbigbig ▴ 250

Ram · Answer 4 · 2015-07-26

1

Entering edit mode

8.7 years ago

Antonio R. Franco ★ 5.1k

cat file1.fq file2.fq file3.fq file4.fq file5.fq > merged.fq
cat file*.fq > merged.fq

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.7 years ago by Antonio R. Franco ★ 5.1k

0

Entering edit mode

Thanks a lot !

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.7 years ago by pbigbig ▴ 250