Question: Running and Analyzing fastqc on multiple fastq files
4.0 years ago by
United States
ravi.uhdnis150 wrote:

Hi Everyone,

I am working on whole genome sequencing and analysis of human genomes from the Illumina HiSeq platform with about 30X coverage. Each sample (human genome) has about 250-300 fastq.gz files, which I am running through 'fastqc' for quality checking using the following command:

/usr/local/bin/fastqc -t 8 -f fastq -o OUT/ -casava *.gz -noextract

It runs fine and generates an equal number of .zip files, which I unzipped using unzip '*.zip'. So, here I have 2 questions:

1. Can I merge two or more fastq files and then run fastqc on the merged files? If yes, how should I merge those fastq files?

2. I have to manually check 250-300 fastqc folders to assess the quality by opening each .html page. Is there any way to get a summary of the overall quality of the fastq files in a flowcell?

Please let me know your comments. I'll be highly thankful to you. Best, Ravi


rna-seq next-gen genome • 20k views
ADD COMMENTlink modified 4.0 years ago by Madelaine Gogol5.1k • written 4.0 years ago by ravi.uhdnis150
4.0 years ago by
Madelaine Gogol5.1k
Kansas City
Madelaine Gogol5.1k wrote:

We have a script that will run fastqc and generate a summary report with the images from all the fastq files it was run on. You may also find it useful to systematically parse the fastqc_data.txt files from each run and combine the results that way.

The script is here, but may not be the most useful and documented thing ever... Depends on imagemagick to generate thumbnails...
Also uses this script:

ADD COMMENTlink written 4.0 years ago by Madelaine Gogol5.1k
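
For the parsing approach mentioned above, a minimal sketch follows. It reads the summary.txt file that sits next to fastqc_data.txt in each unzipped report (a tab-separated file with status, module name and input file name), rather than fastqc_data.txt itself. The function name summarize_fastqc and the default directory OUT are assumptions for illustration, not part of the script discussed in this thread.

```shell
#!/usr/bin/env bash
# Sketch: count PASS/WARN/FAIL per FastQC module across many unzipped reports.
# Assumes each report lives in <dir>/<name>_fastqc/summary.txt with three
# tab-separated columns: status, module name, input file name.
summarize_fastqc() {
    dir="${1:-OUT}"   # hypothetical directory holding the unzipped reports
    printf 'module\tPASS\tWARN\tFAIL\n'
    cat "$dir"/*_fastqc/summary.txt |
        awk -F'\t' '
            { count[$2 "\t" $1]++; modules[$2] = 1 }
            END {
                for (m in modules)
                    printf "%s\t%d\t%d\t%d\n", m,
                        count[m "\tPASS"], count[m "\tWARN"], count[m "\tFAIL"]
            }' |
        sort
}
```

Calling summarize_fastqc OUT prints one row per module, so a flowcell with 300 reports collapses into a table of a dozen or so lines. To see which files failed a given module, the same summary.txt files can be filtered on the first column.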

Visiting from the future... If anyone is dealing with this issue, they may also want to check out MultiQC.

ADD REPLYlink written 2.6 years ago by Madelaine Gogol5.1k

@Madelaine Gogol Just what I needed!

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by neaptide0

@Madelaine Gogol,

I have 10 fastq.gz files (including R2 reads). I made a sample_name.txt containing the names of the 10 fastq.gz files and ran the script with the following option: -name sample_Name.txt. But I got no output. Could you please help me run your program?

ADD REPLYlink written 3.5 years ago by BioRyder160

I don't usually run it with that option, but it looks like it's expecting sampleName[tab]adapter sequence in the file. That is just to get the names of the sample. You would still have to pass in the fastq files as an argument. like --files *.fastq.gz (if you were in the same directory).

ADD REPLYlink written 3.5 years ago by Madelaine Gogol5.1k

@Madelaine Gogol, we have been looking for a long time for such a nice solution to merge all the fastqc reports into a single html file. However, the script runs only on the first file in the folder and then stops. Do you have any idea why? We are using the following command:

perl '/home/Desktop/' --name '/home/Desktop/fastq/names' --out '/home/Desktop/fastq/fastqc' --files  '/home/Desktop/fastq/*.fastq'
ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by ngsequencing0

Not really... What is the format of the "names" file? Did you make any changes to the fastqc script besides changing the name? Maybe try it with fewer arguments at first to see if it runs that way - like from inside the directory of fastqs with just --files "*.fastq".

ADD REPLYlink written 3.2 years ago by Madelaine Gogol5.1k
4.0 years ago by
Devon Ryan90k
Freiburg, Germany
Devon Ryan90k wrote:

Not only can you merge the fastq files, but your life might be easier if you do. For merging them, a simple cat will suffice. I should note that you don't have to be delivered 300-some-odd files; you can request that whoever is doing the sequencing just give you two files (assuming paired-end) per sample/library (the bcl2fastq program that they use to process the bcl files produced by the sequencer can trivially do this).

If you don't want to wait until all of the files are merged, you can likely just use a named pipe as input to fastqc. Something like:

mkfifo foo.fastq.gz
cat sample_L1_R1_???.fastq.gz > foo.fastq.gz &
fastqc foo.fastq.gz

Given that fastqc is written in Java, I can't guarantee that it'll properly handle block gzipped files like that (the Java gzip library has been broken for years). You can always zcat instead. I should note that the only reason process substitution likely wouldn't work is that fastqc names the output files after the input file name.
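
One caveat with named pipes: opening a FIFO for writing blocks until some process opens it for reading, so the writer has to run in the background (or in a second terminal) before the reader starts. The sketch below shows the pattern end to end with made-up file names, using gzip -dc piped into wc -l as a stand-in reader where fastqc would go. Whether fastqc itself accepts a pipe as input is not verified here.

```shell
# Sketch of the named-pipe pattern with a stand-in reader.
# File names are hypothetical; "gzip -dc | wc -l" takes the place of fastqc.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$workdir/part_001.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n' | gzip > "$workdir/part_002.fastq.gz"

mkfifo "$workdir/merged.fastq.gz"

# The writer runs in the background: without the trailing "&" this line
# would block until a reader opens the pipe, and the next line would never run.
cat "$workdir"/part_???.fastq.gz > "$workdir/merged.fastq.gz" &

# The reader opens the pipe, which unblocks the writer, and drains it.
n_lines=$(gzip -dc < "$workdir/merged.fastq.gz" | wc -l | tr -d ' ')
wait   # make sure the background cat has finished

echo "lines read through the pipe: $n_lines"
rm -r "$workdir"
```

If the writer is started in the foreground, the shell appears to hang with nothing written to the pipe, because no reader ever gets a chance to start.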

For 2., it depends a bit on what you want. The sequencing facility actually has an idea about that already (it's produced by the machine). It's easy enough to just ask them (they can also give you a breakdown of how many reads there are per sample, their average quality (also per sample), etc.). For our internal pipeline, I have a pdf produced with that sort of information, since it's a bit quicker to look first at a single table like that than to trudge through all of the fastqc files. BTW, fastqc also produces an HTML file with the images included. When I QC flowcells before sending results to our local groups, those are what I personally look at; it's quicker than dealing with the zip files.

ADD COMMENTlink written 4.0 years ago by Devon Ryan90k

Thank you very much, Dr Ryan, for the comment. Actually, we run an Illumina HiSeq platform in our lab, and I joined recently to handle and analyze the output data. I am running the bcl2fastq script, but our current version, CASAVA_v1.8.1, doesn't support the option '--fastq-cluster-count 0' to make just one fastq file per sample.

Anyway, I simply concatenated the fastq files using cat (for each lane of each sample of a flowcell, for R1 as well as R2), e.g.

cat  ETH001100_CGATGT_L001_R1_00*  >  ETH001100_CGATGT_L001_R1_1-8.fastq.gz

This way I got 16 fastq files for each sample/flowcell, in total 32 for each sample in both flowcells. Then I ran fastqc on each of these 32*5=160 files of 5 samples. Is this approach correct? Please correct me if I am missing something or doing anything incorrectly. Thank you. Regards, Ravi

ADD REPLYlink written 4.0 years ago by ravi.uhdnis150

At least the most recent 1.8.X supports setting the cluster count to 0 (it's what I use in my pipeline), so you might consider upgrading.

Regarding the concatenation, why not merge the lanes within at least each flow cell as well? If you're only using one library per sample then that'd make sense. You'd then have 4 files per sample (one forward and one reverse per flow cell). You could also just merge them across flow cells. That's an annoying thing to try and automate, but for single projects it's easy enough.
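
As a rough sketch of merging all lanes of one sample within a flow cell directory, assuming CASAVA-style file names like the ETH001100_CGATGT_L001_R1_00* ones quoted in this thread: the function name merge_sample and its three arguments are hypothetical. The point to note is that R1 and R2 must be concatenated in the same lane and chunk order so that read pairs stay in sync. Shell globs expand in sorted order, which takes care of this as long as every R1 chunk has a matching R2 chunk.

```shell
# merge_sample SAMPLE INDIR OUTDIR   (hypothetical helper)
# Concatenates every lane and chunk of one sample into one file per read
# direction. Assumes names of the form
#   <sample>_<index>_L00N_R1_NNN.fastq.gz
merge_sample() {
    sample="$1"; indir="$2"; outdir="$3"
    mkdir -p "$outdir"
    for read in R1 R2; do
        # The glob expands in sorted order, so both read directions are
        # concatenated in the same lane/chunk order.
        cat "$indir"/"$sample"_*_L00?_"$read"_*.fastq.gz \
            > "$outdir/${sample}_${read}.fastq.gz"
    done
}
```

For example, merge_sample ETH001100 flowcell1/ merged/ would leave merged/ETH001100_R1.fastq.gz and merged/ETH001100_R2.fastq.gz (the directory names here are made up). Merging across flow cells would mean one more cat over the per-flow-cell results, again in the same order for R1 and R2.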

ADD REPLYlink written 4.0 years ago by Devon Ryan90k

Thank you for the response, Dr Ryan. I'll ask to upgrade our CASAVA version so that I can use the '--fastq-cluster-count' parameter. Yes, merging the lanes would be very helpful for me, but I am not sure whether the lane information will be required by downstream analysis pipelines/software, so I was keeping the files as they are. If no such information is required, then I'll simply merge them within each flowcell and then merge the two files of the same type from both flowcells, in order to have just 2 files per sample. Thank you, Ravi

ADD REPLYlink written 4.0 years ago by ravi.uhdnis150

Dr Ryan, I have one more question; please give your suggestion.

If I cat like this:

cat  ETH001100_CGATGT_L001_R1_00*  >  ETH001100_CGATGT_L001_R1_1-8.fastq.gz

the output file size is 2.1GB, whereas if I do it like this:

zcat  ETH001100_CGATGT_L001_R1_00*  | gzip > ETH001100_CGATGT_L001_R1_1-8.fastq.gz

the file size is 1.7GB. So, why is there this difference in the final .gz file, and which is the correct way of merging the files?

ADD REPLYlink written 4.0 years ago by ravi.uhdnis150

That's somewhat expected. If you concatenate two smaller files then the resulting file's size will be the sum of its components. If you instead compress the decompressed concatenation then you'll get a somewhat smaller file, since it has more to work with when doing the compression (after all, the larger file has more redundancies than each of the smaller files).
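
Both ways give a valid file: the gzip format allows several compressed members back to back, and decompressing such a file yields the concatenation of the members. A small sketch with made-up reads shows that the two outputs decompress to identical content and differ only in size. gzip -dc is used here in place of zcat for portability.

```shell
# Sketch: compare "cat" of compressed chunks with decompress-and-recompress.
# All file names and reads are made up for the demonstration.
workdir=$(mktemp -d)
for i in 1 2; do
    seq 1 2000 |
        awk -v f="$i" '{ print "@read" f "_" $1; print "ACGTACGTACGTACGT";
                         print "+"; print "IIIIIIIIIIIIIIII" }' |
        gzip > "$workdir/chunk_$i.fastq.gz"
done

# Way 1: concatenate the compressed files directly (fast, no recompression)
cat "$workdir"/chunk_?.fastq.gz > "$workdir/cat.fastq.gz"
# Way 2: decompress, concatenate, and compress again (slower)
gzip -dc "$workdir"/chunk_?.fastq.gz | gzip > "$workdir/recompressed.fastq.gz"

# Compare the decompressed content of both results
sum_cat=$(gzip -dc < "$workdir/cat.fastq.gz" | cksum)
sum_re=$(gzip -dc < "$workdir/recompressed.fastq.gz" | cksum)
if [ "$sum_cat" = "$sum_re" ]; then same_content=yes; else same_content=no; fi
n_lines=$(gzip -dc < "$workdir/cat.fastq.gz" | wc -l | tr -d ' ')

echo "same decompressed content: $same_content ($n_lines lines)"
# Print the compressed sizes in bytes for comparison
wc -c "$workdir/cat.fastq.gz" "$workdir/recompressed.fastq.gz"
rm -r "$workdir"
```

So the choice is a trade-off between speed (plain cat) and disk space (recompression), not a question of correctness.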

ADD REPLYlink written 4.0 years ago by Devon Ryan90k

I am concatenating a large set of data (around 45 GB of fastq files in total). When I follow the script by Dr. Ryan, the process hangs, i.e., nothing gets written to the FIFO pipe file. Is there any fix for this situation?

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by alpha.biostat0