Question: Should I merge FASTQ files from different lanes before doing QC?
4
19 months ago by
Lila M 470
UK
Lila M 470 wrote:

Hi guys, I have a total of 32 paired-end RNA-seq samples (Illumina). Each sample was sequenced on 4 different lanes, so in total I have 4 forward and 4 reverse FASTQ files for each sample. I was wondering whether it is possible, and advisable, to merge the 4 forward files and the 4 reverse files for each sample and then do the QC analysis with FastQC. Or is it better to trim each FASTQ file independently and then merge?

Many thanks in advance!

Best

rna-seq qc merge • 4.9k views
modified 6 months ago by blueskypie 30 • written 19 months ago by Lila M 470
1

If you already have the files in pieces, you could brute-force parallelize trimming, alignment, etc. and then merge the BAM files at the end (before sorting/indexing). Otherwise, you can cat the R1 files together and the R2 files together (in the same order!) to generate single larger files per sample.
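
For what it's worth, here is a minimal sketch of the cat approach described above. The demo reads and the S1_merged_* output names are made up for illustration; the only real requirement is that the lanes go into the R1 and R2 outputs in the same order, which a sorted glob takes care of when the file names differ only in the lane field.

```shell
#!/bin/sh
set -eu

# Demo data: tiny placeholder FASTQ files for one sample (S1) on two lanes.
# File names follow the Illumina pattern used in this thread; reads are made up.
workdir=$(mktemp -d)
cd "$workdir"
for lane in L001 L002; do
  for read in R1 R2; do
    printf '@frag_%s/%s\nACGT\n+\nIIII\n' "$lane" "$read" \
      > "S1_${lane}_${read}_001.fastq"
  done
done

# The shell expands globs in sorted order, so lanes are concatenated
# L001 then L002 for both R1 and R2, which keeps mates in step.
cat S1_L00?_R1_001.fastq > S1_merged_R1.fastq
cat S1_L00?_R2_001.fastq > S1_merged_R2.fastq

# Sanity check: paired files must contain the same number of records.
r1_reads=$(( $(wc -l < S1_merged_R1.fastq) / 4 ))
r2_reads=$(( $(wc -l < S1_merged_R2.fastq) / 4 ))
echo "R1 reads: $r1_reads, R2 reads: $r2_reads"
[ "$r1_reads" -eq "$r2_reads" ] || { echo "R1/R2 read counts differ" >&2; exit 1; }
```

The same cat works on .fastq.gz files without decompressing them first, since concatenated gzip streams are still valid gzip.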

modified 19 months ago • written 19 months ago by genomax 65k

By lines, do you mean cell lines? Or are those replicates for each sample? Your experimental setup isn't very clear here. Generally, I'd be against merging replicates, especially if you're trying to find differentially expressed genes between your various sample conditions. Most programs use replicates as a way of drastically increasing the statistical power behind such analyses.

written 19 months ago by jared.andrews07 2.2k
1

My guess is that lines should be lanes, as in sequencing lanes.

In that case, merging is fine.

written 19 months ago by WouterDeCoster 38k

Yes my mistake!! They are lanes (edited in my previous post). Thank you very much :)

written 19 months ago by Lila M 470

Oh, that makes much more sense. Yes, I'd agree with WouterDeCoster then; merging the F+R FastQs before QC should be fine.

written 19 months ago by jared.andrews07 2.2k

I didn't mean merging F+R. I meant merging F+F+F+F and R+R+R+R, doing the QC on the new F and the new R, and then sorting and merging the F+R.

written 19 months ago by Lila M 470

By merge, do you mean concatenating technical replicates from the same sample? I would argue you should perform QC on the files separately, to check for possible batch effects, and merge only after making sure no sizable batch effects are present.

Or by "merge" do you mean merging R1+R2 with a program like BBMerge, FLASH or PEAR?
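
To illustrate the per-lane check suggested in this comment: in practice you would run FastQC on each lane's files and compare the reports. As a rough stand-in that needs no extra tools, the sketch below prints the read count and mean read length per lane file with awk. The lane_summary function, the demo reads and the output file name are all hypothetical, and this is nowhere near a full QC.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"
# Demo data: lane 1 has two 8-base reads, lane 2 has two 4-base reads
# (as if that lane had been truncated). Reads are made up.
printf '@a\nACGTACGT\n+\nIIIIIIII\n@b\nACGTACGT\n+\nIIIIIIII\n' > S1_L001_R1_001.fastq
printf '@c\nACGT\n+\nIIII\n@d\nACGT\n+\nIIII\n' > S1_L002_R1_001.fastq

# Print file name, read count and mean read length for each FASTQ file.
# Sequence lines are every 4th line, starting at line 2.
lane_summary() {
  for fq in "$@"; do
    awk -v name="$fq" 'NR % 4 == 2 { n++; bases += length($0) }
      END { printf "%s\t%d\t%.1f\n", name, n, (n ? bases / n : 0) }' "$fq"
  done
}

lane_summary S1_L00?_R1_001.fastq > lane_summary.tsv
cat lane_summary.tsv
```

A lane whose numbers look very different from its siblings is worth a closer look before anything gets concatenated.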

written 19 months ago by h.mon 24k

By merge I mean cat *R1_.fastq > big_R1.fastq and cat *R2_.fastq > big_R2.fastq, not merging forward and reverse in that step.

written 19 months ago by Lila M 470
9
19 months ago by
i.sudbery 4.3k
Sheffield, UK
i.sudbery 4.3k wrote:

In general we only merge after mapping. There are several reasons for this:

Your QC might pick up a lane-specific problem, e.g. 3 of your 4 lanes might have worked fine, but one might have failed. Even if your QC doesn't pick up anything, the mapping might (after all, the percentage of uniquely mapped reads is the best QC metric for RNAseq; the others just help you work out what went wrong!).

If you merge first and get, say, a 75% mapping rate, you don't know whether that's 25% of reads failing in each lane, or 100% mapping for 3 lanes and 0% for the final one.

Also, in an environment with lots of CPU capacity, mapping 4 small files in parallel is faster than mapping 1 larger file (this doesn't hold if most of your time is spent waiting in an execution queue).

In terms of the validity of the final results though, it probably doesn't matter if you merge first or last.
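
The 75% example above can be worked through with a few lines of awk. This is only a sketch with made-up read counts (1000 reads per lane), and mapping_rates is a hypothetical helper; the point is that two very different situations give the same pooled number.

```shell
#!/bin/sh
set -eu

# Each input line: lane, total reads, uniquely mapped reads (made-up counts).
# Prints the per-lane rate and then the pooled rate across all lanes.
mapping_rates() {
  awk '{ printf "%s\t%.0f%%\n", $1, 100 * $3 / $2; total += $2; mapped += $3 }
       END { printf "pooled\t%.0f%%\n", 100 * mapped / total }'
}

echo "Scenario A: every lane loses 25% of its reads"
scenario_a=$(printf 'L001 1000 750\nL002 1000 750\nL003 1000 750\nL004 1000 750\n' | mapping_rates)
echo "$scenario_a"

echo "Scenario B: three lanes map fully, one lane fails"
scenario_b=$(printf 'L001 1000 1000\nL002 1000 1000\nL003 1000 1000\nL004 1000 0\n' | mapping_rates)
echo "$scenario_b"
```

Both scenarios report a pooled rate of 75%, but only the per-lane lines show that L004 failed outright in scenario B.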

written 19 months ago by i.sudbery 4.3k

So do you mean that it is OK to merge S1_L001_R1_001.fastq and S1_L002_R1_001.fastq because both are forward? Or should I merge S1_L001_R1_001.fastq (forward) with S1_L001_R2_001.fastq (reverse)? I need your suggestions. I also want to know about cluster-based analysis of FASTQ files (because it saves time and is computationally efficient, though I may be wrong). Can you suggest some resources for this (ASAP)? Thanks.

written 12 months ago by vivekruhela 10
3

If a sample ran on multiple lanes, e.g. S1 above on L001 and L002, then you can merge those files by cat-ing them together. You should not merge R1 and R2 files unless the reads are being interleaved (which some but not many programs can use).
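
For the interleaved case mentioned above, one possible sketch uses nothing but paste and tr. It assumes R1 and R2 list the reads in the same order and that read headers contain no tab characters; the file names and reads are made up.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"
# Demo data: two read pairs, R1 and R2 in the same order.
printf '@frag1/1\nAAAA\n+\nIIII\n@frag2/1\nCCCC\n+\nIIII\n' > S1_R1.fastq
printf '@frag1/2\nTTTT\n+\nIIII\n@frag2/2\nGGGG\n+\nIIII\n' > S1_R2.fastq

# Join each 4-line record into one tab-separated row, per file.
paste - - - - < S1_R1.fastq > R1.rows
paste - - - - < S1_R2.fastq > R2.rows

# Put each R1 row next to its R2 mate, then split the tabs back into lines.
# This assumes no tab characters occur inside read headers.
paste R1.rows R2.rows | tr '\t' '\n' > S1_interleaved.fastq

cat S1_interleaved.fastq
```

The output alternates records: the first R1 read, its R2 mate, the second R1 read, its R2 mate, and so on. Only use this if your downstream program actually expects interleaved input.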

modified 12 months ago • written 12 months ago by genomax 65k
1

See genomax's answer for merging. In terms of cluster analysis, probably the easiest way is through one of the workflow systems. We use ruffus, along with an in-house utility layer. We have pre-made pipelines that handle distribution of fastq mapping jobs across the cluster. Many people, however, like snakemake, which I believe has support for cluster execution (or even cloud execution) in the more recent versions.

You can also do this manually using batch submission or job arrays. How you would do this depends on the queue manager on your cluster and how it is set up. Batch submission using a bash for loop is probably the easiest. See here for an example using the SGE queue manager. Job arrays are the "proper" way of doing this sort of thing, but are harder to set up. For example, see these pages about job arrays in SGE and SLURM.
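
As a rough idea of what a job array looks like, here is a sketch that runs as a dry run without a scheduler. The task list file, the run_task function and the map_pair command are all hypothetical placeholders. On a real cluster the scheduler provides the task index: SLURM sets SLURM_ARRAY_TASK_ID (after something like sbatch --array=1-4 job.sh) and SGE sets SGE_TASK_ID (after qsub -t 1-4 job.sh).

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# One line per lane-level task: an R1 file and its R2 mate (hypothetical names).
for lane in L001 L002 L003 L004; do
  printf 'S1_%s_R1_001.fastq S1_%s_R2_001.fastq\n' "$lane" "$lane"
done > tasks.txt

# Body of one array task: pick the line of tasks.txt matching the task index.
run_task() {
  task_id=$1
  line=$(sed -n "${task_id}p" tasks.txt)
  r1=${line% *}   # everything before the space
  r2=${line#* }   # everything after the space
  # map_pair is a placeholder for whatever mapper command you use.
  echo "task $task_id: map_pair $r1 $r2"
}

# Dry run standing in for the scheduler, which would start one task per index
# and pass the index in SLURM_ARRAY_TASK_ID or SGE_TASK_ID.
for id in 1 2 3 4; do
  run_task "$id"
done > dry_run.log
cat dry_run.log
```

In a real job script you would call run_task once with the scheduler's index variable instead of looping, and the loop version is more or less what simple batch submission does by hand.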

written 12 months ago by i.sudbery 4.3k
2
6 months ago by
blueskypie 30
United States
blueskypie 30 wrote:

For RNAseq, I think whether to merge lanes before or after mapping depends on your objective and on how the mapping program works. For example, tophat may need all the reads to detect splice junctions, i.e. lanes should be merged before mapping. But if the mapper maps each read independently, perhaps merging after mapping is a better solution.

modified 6 months ago • written 6 months ago by blueskypie 30
3

It's a good point, but as far as I'm aware the only mapper that uses information from one read to inform the mapping of others is STAR in 2-pass mode. I'm pretty sure tophat doesn't.
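
If you do want STAR in 2-pass mode to see all lanes of a sample at once, one option is to pass the lanes as comma-separated lists rather than concatenating files. The sketch below only builds and prints the command, it does not run STAR. The star_index/ path and the S1_ prefix are hypothetical, and the options should be checked against the STAR manual for your version.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"
# Empty placeholder files, only so the globs below have something to match.
for lane in L001 L002; do
  : > "S1_${lane}_R1_001.fastq"
  : > "S1_${lane}_R2_001.fastq"
done

# Build comma-separated lane lists, in the same sorted order for R1 and R2.
r1_list=$(ls S1_L00?_R1_001.fastq | paste -sd, -)
r2_list=$(ls S1_L00?_R2_001.fastq | paste -sd, -)

# Print the command instead of running it; star_index/ is a hypothetical path.
star_cmd="STAR --genomeDir star_index/ --readFilesIn $r1_list $r2_list --twopassMode Basic --outFileNamePrefix S1_"
echo "$star_cmd"
```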

written 6 months ago by i.sudbery 4.3k