Question: Should I merge FASTQ files from different lanes before doing QC?
2.4 years ago by
Lila M 800
Lila M 800 wrote:

Hi guys, I have a total of 32 RNA-seq samples, paired-end (Illumina). For each sample I have FASTQ files from 4 different lanes, forward and reverse, so in total I have 4 forward and 4 reverse FASTQ files per sample. I was wondering whether it is possible and advisable to merge the 4 forward files and the 4 reverse files and then do the QC analysis with FastQC. Or is it better to trim each FASTQ file independently and then merge?

Many thanks in advance!


rna-seq qc merge • 7.9k views
ADD COMMENTlink modified 16 months ago by blueskypie50 • written 2.4 years ago by Lila M 800

If you already have the files in pieces you could brute-force parallelize trimming/alignment etc. and then merge the BAM files at the end (before sorting/indexing). Otherwise you can cat the R1 files together and the R2 files together (in the same lane order!) to generate a single larger R1 file and a single larger R2 file per sample.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by genomax78k

By lines, do you mean cell lines? Or are those replicates for each sample? Your experimental setup isn't very clear here. Generally, I'd be against merging replicates, especially if you're trying to find differentially expressed genes between your various sample conditions. Most programs use replicates as a way of drastically increasing the statistical power of such analyses.

ADD REPLYlink written 2.4 years ago by jared.andrews075.0k

My guess is that lines should be lanes, as in sequencing lanes.

In that case, merging is fine.

ADD REPLYlink written 2.4 years ago by WouterDeCoster43k

Yes my mistake!! They are lanes (edited in my previous post). Thank you very much :)

ADD REPLYlink written 2.4 years ago by Lila M 800

Oh, that makes much more sense. Yes, I'd agree with WouterDeCoster then: merging the F+R FastQs before QC should be fine.

ADD REPLYlink written 2.4 years ago by jared.andrews075.0k

I didn't mean merge F+R. I meant merge F+F+F+F and R+R+R+R, do the QC on the new F and the new R, and then sort and merge the F+R.

ADD REPLYlink written 2.4 years ago by Lila M 800

By merge, do you mean concatenating technical replicates from the same sample? I would argue you should perform QC on the files separately, to check for possible batch effects, and merge only after making sure no sizable batch effects are present.

Or by "merge" do you mean merging R1+R2 with a program like BBMerge, FLASH or PEAR?

ADD REPLYlink written 2.4 years ago by h.mon29k

By merge I mean cat *R1_.fastq > big_R1.fastq and cat *R2_.fastq > big_R2.fastq, not merging forward and reverse in that step.
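The two cat commands above rely on the shell expanding both globs in the same lane order, so it can be worth checking the result. Below is a minimal sketch of such a check, assuming uncompressed FASTQ files with four lines per record. The function name check_pairs is made up for illustration and is not part of any tool.

```shell
# check_pairs R1.fastq R2.fastq   (hypothetical helper)
# Succeeds only if both files list the same read names in the same order.
check_pairs() {
    # paste puts line N of R1 and line N of R2 side by side, tab-separated.
    paste "$1" "$2" | awk -F '\t' '
        NR % 4 == 1 {                  # header line of each 4-line record
            a = $1; b = $2
            sub("[ /].*", "", a)       # drop " 1:N:0:..." or "/1" suffixes
            sub("[ /].*", "", b)
            if (a != b) {
                printf "record %d: %s vs %s\n", (NR + 3) / 4, a, b
                bad = 1
                exit
            }
        }
        END { exit bad + 0 }'
}
```

It could be used as: check_pairs big_R1.fastq big_R2.fastq && echo "pairs in sync". If one file is longer than the other, the extra records show up as a mismatch against an empty name.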

ADD REPLYlink written 2.4 years ago by Lila M 800
2.4 years ago by
Sheffield, UK
i.sudbery7.0k wrote:

In general we only merge after mapping. There are several reasons for this:

Your QC might pick up a lane-specific problem: i.e. 3 of your 4 lanes might have worked fine, but one might have failed. Even if your QC doesn't pick up anything, the mapping might (after all, % of uniquely mapped reads is the best QC metric for RNAseq; the others just help you work out what went wrong!).

If you merge first and get, say, a 75% mapping rate, you don't know whether that's 25% of reads failing in each lane, or 100% mapping in 3 lanes and 0% in the final one.

Also, in an environment with lots of CPU capacity, mapping 4 small files in parallel is faster than mapping 1 larger file (this doesn't hold if most of your time is spent waiting in an execution queue).

In terms of the validity of the final results though, it probably doesn't matter if you merge first or last.
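To make the 75% example above concrete, here is a small sketch with made-up read counts (they are not from any real run). The helper name lane_rates is hypothetical. It prints the mapping rate per lane and then the pooled rate you would see if the lanes had been merged before mapping.

```shell
# lane_rates: reads "lane total_reads mapped_reads" lines on stdin and
# prints the mapping rate per lane followed by the pooled rate.
lane_rates() {
    awk '{
            printf "%s %.0f%%\n", $1, 100 * $3 / $2
            total += $2; mapped += $3
         }
         END { printf "pooled %.0f%%\n", 100 * mapped / total }'
}

# Made-up counts: three lanes map perfectly, one fails completely.
printf '%s\n' 'L001 1000 1000' 'L002 1000 1000' 'L003 1000 1000' 'L004 1000 0' | lane_rates
```

The pooled line reads 75% here, exactly as it would if every lane had mapped at 75%, which is why the per-lane numbers are worth keeping.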

ADD COMMENTlink written 2.4 years ago by i.sudbery7.0k

So you mean to say that it is OK to merge S1_L001_R1_001.fastq and S1_L002_R1_001.fastq because both are forward? Or should I merge S1_L001_R1_001.fastq (forward) and S1_L001_R2_001.fastq (reverse)? I need your suggestions. I would also like to know about cluster-based analysis of FASTQ files (because it saves time and is computationally efficient, though I may be wrong). Can you suggest some resources for this (ASAP)? Thanks.

ADD REPLYlink written 23 months ago by vivekruhela10

If a sample ran on multiple lanes, e.g. S1 above on L001 and L002, then you can merge those files by concatenating them with cat. You should not merge R1 and R2 files unless the reads are meant to be interleaved (a layout that some, but not many, programs can use).
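A minimal sketch of that per-sample concatenation, assuming uncompressed files named like S1_L001_R1_001.fastq as in the question above. The function name merge_lanes and the output file names are invented for this example.

```shell
# merge_lanes SAMPLE INDIR OUTDIR   (hypothetical helper)
# Concatenates all lanes of one sample, separately for R1 and R2.
merge_lanes() {
    local sample="$1" indir="$2" outdir="$3" mate n1 n2
    mkdir -p "$outdir" || return 1
    for mate in R1 R2; do
        # The glob expands in sorted order (L001, L002, ...), so R1 and R2
        # are concatenated lane by lane in the same order.
        cat "$indir/${sample}"_L*_"${mate}"_001.fastq \
            > "$outdir/${sample}_${mate}.fastq" || return 1
    done
    # Sanity check: both mates must end up with the same number of lines.
    n1=$(awk 'END { print NR }' "$outdir/${sample}_R1.fastq")
    n2=$(awk 'END { print NR }' "$outdir/${sample}_R2.fastq")
    if [ "$n1" -ne "$n2" ]; then
        echo "R1/R2 line counts differ for $sample: $n1 vs $n2" >&2
        return 1
    fi
}
```

For example, merge_lanes S1 raw merged would write merged/S1_R1.fastq and merged/S1_R2.fastq. Gzip-compressed files can generally be concatenated with cat in the same way, since a gzip file may consist of several members, but the line-count check would then need to decompress first.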

ADD REPLYlink modified 23 months ago • written 23 months ago by genomax78k

See genomax's answer for merging. In terms of cluster analysis, probably the easiest way is through one of the workflow systems. We use ruffus, along with an in-house utility layer. We have pre-made pipelines that handle distribution of FASTQ mapping jobs across the cluster. Many people, however, like snakemake, which I believe has support for cluster execution (or even cloud execution) in the more recent versions.

You can also do this manually using batch submission or job arrays. How you would do this depends on the queue manager on your cluster and how it is set up. Batch submission using a bash for loop is probably the easiest. See here for an example using the SGE queue manager. Job arrays are the "proper" way of doing this sort of thing, but are harder to set up. For example, see these pages about job arrays in SGE and SLURM.
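As a rough illustration of the job-array idea, each array task can look up its own input file from a list using the task ID that the scheduler exports. This is only a sketch: the helper name pick_task_file, the list file lanes.txt, the script name map_lane.sh and the command my_mapper are all placeholders, and the right submission options depend on how your cluster is set up.

```shell
# pick_task_file LISTFILE   (hypothetical helper)
# Prints the line of LISTFILE that belongs to the current array task.
# SLURM exports SLURM_ARRAY_TASK_ID and SGE exports SGE_TASK_ID; if neither
# holds a number (e.g. when testing locally), task 1 is assumed.
pick_task_file() {
    local list="$1"
    local task_id="${SLURM_ARRAY_TASK_ID:-${SGE_TASK_ID:-1}}"
    case "$task_id" in
        ''|*[!0-9]*) task_id=1 ;;
    esac
    sed -n "${task_id}p" "$list"
}

# Inside the job script (here called map_lane.sh), something like:
#   fq=$(pick_task_file lanes.txt)
#   my_mapper "$fq" > "${fq%.fastq}.sam"     # my_mapper is a placeholder
#
# Submitting 4 tasks, one per line of lanes.txt (not run here):
#   SLURM: sbatch --array=1-4 map_lane.sh
#   SGE:   qsub -t 1-4 map_lane.sh
#
# The simpler for-loop alternative submits one ordinary job per file:
#   for fq in *.fastq; do qsub map_lane.sh "$fq"; done
```

The list file keeps the mapping between task number and input explicit, which makes it easy to resubmit a single failed task by its number.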

ADD REPLYlink written 23 months ago by i.sudbery7.0k
16 months ago by
United States
blueskypie50 wrote:

For RNAseq, I think whether to merge lanes before or after mapping depends on your objective and on how the mapping program works. For example, tophat may need all the reads to detect splice junctions, in which case lanes should be merged before mapping. But if the mapper maps each read independently, merging after mapping is perhaps the better solution.

ADD COMMENTlink modified 16 months ago • written 16 months ago by blueskypie50

It's a good point, but as far as I'm aware the only mapper that uses information from one read to inform the mapping of others is STAR in 2-pass mode. I'm pretty sure tophat doesn't.

ADD REPLYlink written 16 months ago by i.sudbery7.0k