Question: Fastq Files From Different Flowcells
hellbio380 wrote:

Hi,

For a single sample, I have several paired-end fastq files from four different flowcells, i.e. fastq files from different lanes of each flowcell. Instead of processing the individual fastq files separately, can I merge all the forward reads (from the different flowcells and lanes) into a single fastq file and all the reverse reads into another fastq file?

Thanks

fastq • 14k views
ADD COMMENTlink modified 6.3 years ago by Pierre Lindenbaum124k • written 6.3 years ago by hellbio380
6.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

Yes, you can (see BruceB's answer), but that's usually a bad idea.

You can process the fastqs in parallel using, for example, make with the option -j (number of parallel tasks), and merge the SAM files later.

ADD COMMENTlink written 6.3 years ago by Pierre Lindenbaum124k

Well, I have fastq files from 8 lanes, i.e. 8 pairs of forward and reverse reads. If we map them individually, we will end up with 8 SAM files that have to be merged, which seems complex. So, would it be wise to concatenate the fastq files and then generate a single SAM/BAM file?

ADD REPLYlink written 6.3 years ago by hellbio380

If time is not a problem, concatenate your FASTQs. If you can align the 8 pairs of fastq files, convert to BAM and sort as 8 jobs in parallel, then you'll get your result faster.

"it becomes so complex...": why? A makefile will solve your problems.
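To make the per-lane route concrete, here is a minimal shell sketch that only prints the commands it would run: one `bwa mem | samtools sort` pipeline per lane, each with its own read group, followed by a single `samtools merge`. The reference `ref.fa`, the `<lane>_R1.fq.gz` / `<lane>_R2.fq.gz` file naming, the sample name and the function name are all assumptions made up for illustration; the scheduling that `make -j` would provide is not part of it.

```shell
# Print (do not run) one alignment pipeline per lane plus a final merge.
# ref.fa and the <lane>_R1.fq.gz / <lane>_R2.fq.gz names are hypothetical.
print_lane_jobs() {
  local sample=$1
  shift
  local bams=""
  local lane
  for lane in "$@"; do
    # A distinct read group per lane keeps lane identity in the merged BAM.
    printf '%s\n' "bwa mem -R '@RG\\tID:${sample}.${lane}\\tSM:${sample}' ref.fa ${lane}_R1.fq.gz ${lane}_R2.fq.gz | samtools sort -o ${lane}.bam -"
    bams="$bams ${lane}.bam"
  done
  # Merging the sorted per-lane BAMs is a single command, whatever the lane count.
  printf '%s\n' "samtools merge ${sample}.bam${bams}"
}

print_lane_jobs sample1 lane1 lane2
```

Each printed pipeline is independent of the others, so the lines before the final merge could be handed to whatever job runner is available and run side by side.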

ADD REPLYlink written 6.3 years ago by Pierre Lindenbaum124k

comment from @notSoJunkDNA ( https://twitter.com/notSoJunkDNA/status/365440417212276736 ) "doesn't apply to all pipelines. Tophat for instance needs all the reads..."

ADD REPLYlink written 6.3 years ago by Pierre Lindenbaum124k

Could you please elaborate on how a makefile will solve the problem? Just curious...

ADD REPLYlink written 6.3 years ago by Sebastian Kurscheid300

With a makefile you can use something like $(foreach FASTQ,1 2 3 4 5 6 7 8,$(eval $(call alignwithbwa,${FASTQ}))) . See http://www.gnu.org/software/make/manual/html_node/Eval-Function.html

ADD REPLYlink written 6.3 years ago by Pierre Lindenbaum124k

Could you please provide a sample makefile that you have been using? A makefile might make life easier in the case of WGS data.

ADD REPLYlink written 5.2 years ago by hellbio380

search github: https://gist.github.com/search?l=makefile&q=mpileup

ADD REPLYlink modified 5.2 years ago • written 5.2 years ago by Pierre Lindenbaum124k

I would also like to mention that my data is paired-end data

ADD REPLYlink written 5.2 years ago by hellbio380

Why?

"that's usually a bad idea."

Is there anything else besides speed and read groups (RG)?

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Medhat8.5k

Is processing each lane separately faster than running bwa on the merged fastq file with the thread option? Which one is faster, strategy A or B? I think those strategies are equivalent.

Strategy A : Makefile

    bwa lane1.fastq
    bwa lane2.fastq 
    bwa lane3.fastq
    bwa lane4.fastq

Strategy B : merged

    bwa all.lane.fastq -t 4

ADD REPLYlink written 2.4 years ago by sacha1.8k
6.3 years ago by
BruceB330
Cambridge, UK
BruceB330 wrote:

Yes, you can. The simplest way of doing this is with 'cat' on the terminal, which concatenates the files you choose into one FASTQ file. Gzipped files can be concatenated directly, without decompressing them first. E.g.:

    cat R1_001.fq.gz R1_002.fq.gz ... R1_n.fq.gz > R1_combined.fq.gz
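A small self-contained check of that approach, as a sketch: it writes tiny made-up FASTQ records into a temporary directory (all file and read names are hypothetical), concatenates the gzipped files with `cat`, and confirms that the result still decompresses and that R1 and R2 hold the same number of records. It assumes the lane files are listed in the same order for both mates, so that paired reads stay at matching positions.

```shell
# Hypothetical demo files; real lane files would come from the sequencer.
workdir=$(mktemp -d)

# Write one tiny gzipped FASTQ record per lane for each mate.
for lane in 1 2; do
  for mate in 1 2; do
    printf '@read_L%s/%s\nACGT\n+\nIIII\n' "$lane" "$mate" \
      | gzip > "$workdir/lane${lane}_R${mate}.fq.gz"
  done
done

# Concatenated gzip members form a valid gzip stream, so no decompression
# is needed. Use the same lane order for R1 and R2.
cat "$workdir/lane1_R1.fq.gz" "$workdir/lane2_R1.fq.gz" > "$workdir/R1_combined.fq.gz"
cat "$workdir/lane1_R2.fq.gz" "$workdir/lane2_R2.fq.gz" > "$workdir/R2_combined.fq.gz"

# Sanity check: both combined files should hold the same number of records.
r1_records=$(( $(gzip -dc "$workdir/R1_combined.fq.gz" | wc -l) / 4 ))
r2_records=$(( $(gzip -dc "$workdir/R2_combined.fq.gz" | wc -l) / 4 ))
echo "R1 records: $r1_records, R2 records: $r2_records"
```

If the two counts differ on real data, the R1 and R2 file lists were probably not the same set of lanes.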

ADD COMMENTlink written 6.3 years ago by BruceB330

So it can be done by concatenating all the forward reads into 1_fastq.gz and all the reverse reads into 2_fastq.gz, and then mapping the paired-end files to a single BAM file.

ADD REPLYlink written 6.3 years ago by hellbio380

Yes, that is exactly what I would do (and have done in the recent past). Once concatenated, you would never know they came from different lanes.

ADD REPLYlink written 6.3 years ago by BruceB330

Not exactly: the lane number is also recorded in the sequence identifier, see http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm

Each entry in a FASTQ file consists of four lines:
• Sequence identifier
• Sequence
• Quality score identifier line (consisting of a +)
• Quality score

Each sequence identifier, the line that precedes the sequence and describes it, needs to be in the following format:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>
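As a rough illustration of how that identifier can be used, the sketch below counts reads per flowcell and lane in a FASTQ stream, which is one way to see which lanes went into a concatenated file. It assumes strict four-line records and the colon-separated layout quoted above; the function name and the example records are made up.

```shell
# Count reads per flowcell and lane in a FASTQ stream on stdin.
# Assumes four lines per record and identifiers laid out as
# @<instrument>:<run number>:<flowcell ID>:<lane>:...
count_reads_per_lane() {
  awk -F: 'NR % 4 == 1 { n[$3 ":" $4]++ }
           END { for (k in n) print k "\t" n[k] }' | sort
}

# Made-up records from two lanes of a hypothetical flowcell "FCX".
printf '%s\n' \
  '@M1:7:FCX:1:1101:100:200 1:N:0:ACGT' 'ACGT' '+' 'IIII' \
  '@M1:7:FCX:2:1101:100:200 1:N:0:ACGT' 'ACGT' '+' 'IIII' \
  '@M1:7:FCX:2:1101:300:400 1:N:0:ACGT' 'ACGT' '+' 'IIII' \
  | count_reads_per_lane
```

On a real gzipped file the input would come from something like `gzip -dc R1_combined.fq.gz | count_reads_per_lane`.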

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by chen1.9k