Run multiple fastq.gz files with FastQC and yield single report?
4
4
Entering edit mode
8.7 years ago
pbigbig ▴ 250

Hi,

I would like to run FastQC for multiple fastq.gz files (both R1 and R2 of multiple lanes), but this program run in queueing manner, so that it produces a summary for each file, separately.

I wonder that is there any way to merge DATA of all input fastq.gz files together and somehow make FastQC to see them as a single fastq file, so it will produce ONLY 1 summary report for all my files, is that possible? What I need to do to obtain that?

Sorry I am new in this area, thank you in advance for your valuable help!

Regards,
Phuong

fastQC sequencing quality-control • 25k views
ADD COMMENT
16
Entering edit mode
8.7 years ago
James Ashmore ★ 3.4k

EDIT: I highly recommend people use MultiQC now for summarising FastQC reports. It's a fantastic tool.

I recently generated FastQC reports for ~100 FASTQ files and did not want to inspect them all by hand. Instead I wrote a small script to parse the module results in the zip file produced by FastQC. It simply takes the result of each module and converts them to integer scores (1=pass, 0=warning, -1=fail). I write these results out as a CSV file and then go on to plot the results as a heat map using R or Python. Something like this might help you if you want to quickly inspect all FastQC results and see which FASTQ files need manual inspection.

#!/usr/bin/env python3

# Import necessary libraries:
import csv
import os
import subprocess
import zipfile

# List modules used by FastQC:
modules = ['Basic_Statistics',
           'Per_base_sequence_quality',
           'Per_tile_sequence_quality',
           'Per_sequence_quality_scores',
           'Per_base_sequence_content',
           'Per_sequence_GC_content',
           'Per_base_N_content',
           'Sequence_Length_Distribution',
           'Sequence_Duplication_Levels',
           'Overrepresented_sequences',
           'Adapter_Content',
           'Kmer_Content']

# Set dict to convert module results to integer scores:
scores = {'pass': 1,
          'warn': 0,
          'fail': -1}

# Get current working directory:
cwd = os.getcwd()

# Get list of '_fastqc.zip' files generated by FastQC:
files = [file for file in os.listdir(cwd) if file.endswith('_fastqc.zip')]

# List to collect module scores for each '_fastqc.zip' file:
all_mod_scores = []

# Read fastqc_data.txt file in each archive:
for file in files:
    archive = zipfile.ZipFile(file, 'r') # open '_fastqc.zip' file
    members = archive.namelist() # return list of archive members
    fname = [member for member in members if 'fastqc_data.txt' in member][0] # find 'fastqc_data.txt' in members
    data = archive.open(fname) # open 'fastqc_data.txt'

    # Get module scores for this file:
    mod_scores = [file]
    for line in data:
        text = line.decode('utf-8') 
        if '>>' in text and '>>END' not in text:
            text = text.lstrip('>>').split()
            module = '_'.join(text[:-1])
            result = text[-1]
            mod_scores.append(scores[result])

    # Append to all module scores list:
    all_mod_scores.append(mod_scores)

    # close all opened files:
    data.close()
    archive.close()

# Write scores out to a CSV file:
with open('all_mod_scores.csv', 'w') as f:
    writer = csv.writer(f)
    for mod_scores in all_mod_scores:
        writer.writerow(mod_scores)
    f.close()
ADD COMMENT
0
Entering edit mode

Wow, thank you very much, this would be really helpful for me!

ADD REPLY
0
Entering edit mode

Very useful! Thanks!!! I adapted it for my purposes, but there is a small mistake (just a typo) in the code -> line 55 should be: all_mod_scores.

ADD REPLY
1
Entering edit mode

Happy to help, I've corrected the typo, thank you.

ADD REPLY
0
Entering edit mode

This is great! It will be very useful for me too! I have something similar, where I extract all the jpg and put them as thumbnail (HTML), but only for selected metrics. This is very comprehensive! Thanks for sharing

ADD REPLY
0
Entering edit mode

Could you please provide your R or Python script for plotting the cvs file and maybe post a picture.

ADD REPLY
0
Entering edit mode

Hi Ric, I've stopped using this approach - I now use multiqc to summarise my fastqc results.

ADD REPLY
4
Entering edit mode
8.7 years ago
Ying W ★ 4.2k

This sounds like a really bad idea but if you really want to, I believe you can specify multiple fastq when using fastQC on command line

Reasons why this is a bad idea

  • different lanes will have different error profiles, how will you know which one failed if you merge them all together?
  • fastqc only takes a subset of reads, not sure what happens when you specify multiple but you could be sampling fewer reads from one file
  • you can run fastqc in parallel for every lane (assuming u have the I/O bandwidth) so running multiple lanes together should not take too long
ADD COMMENT
0
Entering edit mode

Thank you Ying, I will consider your suggestions.

Given that all my run files are quite similar in error profile (I have check them individually by fastQC, although R2 normally have lower quality than R1), I intended to execute several different scenarios on my whole read set: original (unprocessed), trimmed (by Trimmomatic), merged (by FLASh), and then fastQC them to have an overview comparison of data quality with different processing pipeline (so I would prefer 3 report summaries of 3 scenarios instead of dozen for each). Is my idea feasible?

I will consider your hint regarding the subset sampling of fastQC.

Thanks a lot!

ADD REPLY
2
Entering edit mode
8.7 years ago
arnstrm ★ 1.8k

I think you have to concatenate all gzipped files and then run fastqc on it. For gzipped files you can just:

cat *fastq.gz >> combined.fq.gz

and then:

fastqc combined.fq.gz
ADD COMMENT
0
Entering edit mode

Thank you for your help!

ADD REPLY
1
Entering edit mode
8.7 years ago
cat file1.fq file2.fq file3.fq file4.fq file5.fq > merged.fq
cat file*.fq > merged.fq
ADD COMMENT
0
Entering edit mode

Thanks a lot !

ADD REPLY

Login before adding your answer.

Traffic: 2036 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6