Question: Run multiple fastq.gz files with FastQC and yield single report?
4
gravatar for pbigbig
3.6 years ago by
pbigbig190
United States
pbigbig190 wrote:

Hi,

I would like to run FastQC for multiple fastq.gz files (both R1 and R2 of multiple lanes), but this program run in queueing manner, so that it produces a summary for each file, seperately.

I wonder that is there any way to merge DATA of all input fastq.gz files together and somehow make FastQC to see them as a single fastq file, so it will produce ONLY 1 summary report for all my files, is that possible? What I need to do to obtain that?

Sorry I am new in this area, thank you in advance for your valuable help!

Regards,

Phuong.

ADD COMMENTlink modified 3.6 years ago by James Ashmore2.6k • written 3.6 years ago by pbigbig190
13
gravatar for James Ashmore
3.6 years ago by
James Ashmore2.6k
UK/Edinburgh/MRC Centre for Regenerative Medicine
James Ashmore2.6k wrote:

EDIT: I highly recommend people use MultiQC now for summarising FastQC reports. It's a fantastic tool.

I recently generated FastQC reports for ~100 FASTQ files and did not want to inspect them all by hand. Instead I wrote a small script to parse the module results in the zip file produced by FastQC. It simply takes the result of each module and converts them to integer scores (1=pass, 0=warning, -1=fail). I write these results out as a CSV file and then go on to plot the results as a heat map using R or Python. Something like this might help you if you want to quickly inspect all FastQC results and see which FASTQ files need manual inspection.

#!/usr/bin/env python3

# Import necessary libraries:
import csv
import os
import subprocess
import zipfile

# List modules used by FastQC:
modules = ['Basic_Statistics',
           'Per_base_sequence_quality',
           'Per_tile_sequence_quality',
           'Per_sequence_quality_scores',
           'Per_base_sequence_content',
           'Per_sequence_GC_content',
           'Per_base_N_content',
           'Sequence_Length_Distribution',
           'Sequence_Duplication_Levels',
           'Overrepresented_sequences',
           'Adapter_Content',
           'Kmer_Content']

# Set dict to convert module results to integer scores:
scores = {'pass': 1,
          'warn': 0,
          'fail': -1}

# Get current working directory:
cwd = os.getcwd()

# Get list of '_fastqc.zip' files generated by FastQC:
files = [file for file in os.listdir(cwd) if file.endswith('_fastqc.zip')]

# List to collect module scores for each '_fastqc.zip' file:
all_mod_scores = []

# Read fastqc_data.txt file in each archive:
for file in files:
    archive = zipfile.ZipFile(file, 'r') # open '_fastqc.zip' file
    members = archive.namelist() # return list of archive members
    fname = [member for member in members if 'fastqc_data.txt' in member][0] # find 'fastqc_data.txt' in members
    data = archive.open(fname) # open 'fastqc_data.txt'

    # Get module scores for this file:
    mod_scores = [file]
    for line in data:
        text = line.decode('utf-8') 
        if '>>' in text and '>>END' not in text:
            text = text.lstrip('>>').split()
            module = '_'.join(text[:-1])
            result = text[-1]
            mod_scores.append(scores[result])

    # Append to all module scores list:
    all_mod_scores.append(mod_scores)

    # close all opened files:
    data.close()
    archive.close()

# Write scores out to a CSV file:
with open('all_mod_scores.csv', 'w') as f:
    writer = csv.writer(f)
    for mod_scores in all_mod_scores:
        writer.writerow(mod_scores)
    f.close()
ADD COMMENTlink modified 2.6 years ago • written 3.6 years ago by James Ashmore2.6k

Wow, thank you very much, thís would be really helpful for me!
 

ADD REPLYlink written 3.6 years ago by pbigbig190

Very useful! Thanks!!! I adapted it for my purposes, but there is a small mistake (just a typo) in the code -> line 55 should be: all_mod_scores. 

ADD REPLYlink written 3.5 years ago by kajamilanowska0
1

Happy to help, I've corrected the typo, thank you.

ADD REPLYlink written 3.5 years ago by James Ashmore2.6k

This is great! It will be very useful for me too! I have something similar, where I extract all the jpg and put them as thumbnail (HTML), but only for selected metrics. This is very comprehensive! Thanks for sharing

ADD REPLYlink written 3.5 years ago by arnstrm1.7k

Could you please provide your R or Python script for plotting the cvs file and maybe post a picture.

ADD REPLYlink written 2.2 years ago by Ric190

Hi Ric, I've stopped using this approach - I now use multiqc to summarise my fastqc results.

ADD REPLYlink written 2.2 years ago by James Ashmore2.6k
4
gravatar for Ying W
3.6 years ago by
Ying W3.9k
South San Francisco, CA
Ying W3.9k wrote:

This sounds like a really bad idea but if you really want to, I believe you can specify multiple fastq when using fastQC on command line

Reasons why this is a bad idea

  • different lanes will have different error profiles, how will you know which one failed if you merge them all together?
  • fastqc only takes a subset of reads, not sure what happens when you specify multiple but you could be sampling fewer reads from one file
  • you can run fastqc in parallel for every lane (assuming u have the I/O bandwidth) so running multiple lanes together should not take too long
ADD COMMENTlink written 3.6 years ago by Ying W3.9k

Thank you Ying, I will consider your suggestions.

Given that all my run files are quite similar in error profile (I have check them individually by fastQC, although R2 normally have lower quality than R1), I intended to execute several different scenarios on my whole read set: original (unprocessed), trimmed (by Trimmomatic), merged (by FLASh), and then fastQC them to have an overview comparison of data quality with different processing pipeline (so I would prefer 3 report summaries of 3 scenarios instead of dozen for each). Is my idea feasible?

I will consider your hint regarding the subset sampling of fastQC.

Thanks a lot!

 

ADD REPLYlink written 3.6 years ago by pbigbig190
1
gravatar for Antonio R. Franco
3.6 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:
cat file1.fq file2.fq file3.fq file4.fq file5.fq > merged.fq

cat file*.fq > merged.fq
ADD COMMENTlink written 3.6 years ago by Antonio R. Franco4.0k

Thanks a lot !
 

ADD REPLYlink written 3.6 years ago by pbigbig190
0
gravatar for arnstrm
3.6 years ago by
arnstrm1.7k
Ames, IA
arnstrm1.7k wrote:

I think you have to concatenate all gzipped files and then run fastqc on it. For gzipped files you can just:

cat *fastq.gz >> combined.fq.gz

and then:

fastqc combined.fq.gz

 

ADD COMMENTlink written 3.6 years ago by arnstrm1.7k

Thank you for your help!

ADD REPLYlink written 3.6 years ago by pbigbig190
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1899 users visited in the last hour