I would like to run FastQC on multiple fastq.gz files (both R1 and R2 from multiple lanes), but the program processes them one after another and produces a separate summary for each file.
Is there any way to merge the data from all the input fastq.gz files so that FastQC sees them as a single FASTQ file and produces only one summary report for all my files? If so, what do I need to do?
Sorry, I am new to this area. Thank you in advance for your valuable help!
EDIT: I highly recommend people use MultiQC now for summarising FastQC reports. It's a fantastic tool.
I recently generated FastQC reports for ~100 FASTQ files and did not want to inspect them all by hand. Instead I wrote a small script to parse the module results in the zip file produced by FastQC. It simply takes the result of each module and converts it to an integer score (1=pass, 0=warning, -1=fail). I write these results out as a CSV file and then go on to plot them as a heat map using R or Python. Something like this might help you if you want to quickly inspect all FastQC results and see which FASTQ files need manual inspection.
#!/usr/bin/env python3

# Import necessary libraries:
import csv
import os
import zipfile

# List modules used by FastQC:
modules = ['Basic_Statistics',
           'Per_base_sequence_quality',
           'Per_tile_sequence_quality',
           'Per_sequence_quality_scores',
           'Per_base_sequence_content',
           'Per_sequence_GC_content',
           'Per_base_N_content',
           'Sequence_Length_Distribution',
           'Sequence_Duplication_Levels',
           'Overrepresented_sequences',
           'Adapter_Content',
           'Kmer_Content']

# Set dict to convert module results to integer scores:
scores = {'pass': 1,
          'warn': 0,
          'fail': -1}

# Get current working directory:
cwd = os.getcwd()

# Get list of '_fastqc.zip' files generated by FastQC:
files = [file for file in os.listdir(cwd) if file.endswith('_fastqc.zip')]

# List to collect module scores for each '_fastqc.zip' file:
all_mod_scores = []

# Read fastqc_data.txt file in each archive:
for file in files:
    # Open '_fastqc.zip' file (closed automatically at the end of the block):
    with zipfile.ZipFile(file, 'r') as archive:
        members = archive.namelist()  # return list of archive members
        # Find 'fastqc_data.txt' in members:
        fname = [member for member in members if 'fastqc_data.txt' in member][0]
        # Get module scores for this file:
        mod_scores = [file]
        with archive.open(fname) as data:  # open 'fastqc_data.txt'
            for line in data:
                text = line.decode('utf-8')
                # Module header lines look like '>>Module name<TAB>result':
                if text.startswith('>>') and not text.startswith('>>END'):
                    text = text.lstrip('>').split()
                    module = '_'.join(text[:-1])  # module name (not written out)
                    result = text[-1]
                    mod_scores.append(scores[result])
    # Append to all module scores list:
    all_mod_scores.append(mod_scores)

# Write scores out to a CSV file (the 'with' block closes the file):
with open('all_mod_scores.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for mod_scores in all_mod_scores:
        writer.writerow(mod_scores)
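For anyone who wants a quick overview without plotting, here is a minimal sketch that reads the CSV written by the script and lists the files with failed modules. It assumes the layout described above (file name in the first column, then one integer score per module, no header row); the function name `files_needing_inspection` and the `max_fails` threshold are made up for this example.

```python
import csv
import os


def files_needing_inspection(csv_path, max_fails=0):
    """Return (file name, number of failed modules) for every row that
    has more than max_fails scores equal to -1, worst files first."""
    flagged = []
    with open(csv_path, newline='') as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            name = row[0]
            values = [int(value) for value in row[1:]]
            n_fail = values.count(-1)
            if n_fail > max_fails:
                flagged.append((name, n_fail))
    # Stable sort: ties keep the order they had in the CSV.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == '__main__':
    # 'all_mod_scores.csv' is the file name used by the script above.
    if os.path.exists('all_mod_scores.csv'):
        for name, n_fail in files_needing_inspection('all_mod_scores.csv'):
            print(f'{name}\t{n_fail} failed module(s)')
```

Raising `max_fails` narrows the list to the worst files only.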
This is great! It will be very useful for me too! I have something similar, where I extract all the jpg and put them as thumbnail (HTML), but only for selected metrics. This is very comprehensive! Thanks for sharing
Given that all my run files are quite similar in error profile (I have checked them individually with FastQC, although R2 normally has lower quality than R1), I intend to process my whole read set under several scenarios: original (unprocessed), trimmed (by Trimmomatic), and merged (by FLASh). I would then run FastQC on each to get an overview comparison of data quality across the processing pipelines, so I would prefer 3 summary reports for the 3 scenarios instead of a dozen, one per file. Is my idea feasible?
I will consider your hint regarding the subset sampling in FastQC.
Thanks a lot!
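On getting one report per scenario: the gzip format allows several compressed members back to back, so fastq.gz files can be joined byte for byte without decompressing them, and the result is still a valid gzip file. Below is a minimal sketch of that idea. The function name `concatenate_gzip` and the file names in the usage note are hypothetical, and it is an assumption here that your FastQC version reads every member of such a file, so it is worth checking that the total sequence count in the report matches the sum of the inputs.

```python
import shutil


def concatenate_gzip(inputs, output):
    """Join gzip files by appending their raw bytes to 'output', in order.

    The inputs are not decompressed: each one becomes a separate member
    of a single multi-member gzip file.
    """
    with open(output, 'wb') as out:
        for path in inputs:
            with open(path, 'rb') as src:
                shutil.copyfileobj(src, out)
```

For example, `concatenate_gzip(['lane1_R1.fastq.gz', 'lane2_R1.fastq.gz'], 'all_R1.fastq.gz')` would produce one file per scenario that can then be passed to FastQC. Since R2 normally has lower quality than R1, joining R1 and R2 separately may give a more readable comparison.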
updated 22 months ago by Ram • written 9.1 years ago by pbigbig
Wow, thank you very much, this would be really helpful for me!
Very useful! Thanks!!! I adapted it for my purposes, but there is a small mistake (just a typo) in the code -> line 55 should be:
all_mod_scores
Happy to help, I've corrected the typo, thank you.
Could you please provide your R or Python script for plotting the CSV file, and maybe post a picture?
Hi Ric, I've stopped using this approach. I now use MultiQC to summarise my FastQC results.