Question: Automating to convert multiple fastq files into one fastq file
2.0 years ago, zhou_1228 wrote:

I got six fastq files (three forward and three reverse) for every sample in a 96-well plate via NGS. As the first step of SNP calling, I need to merge these six files into two files (one forward and one reverse fastq file) for each sample. I am trying to write a shell script that automatically merges every three forward (or reverse) files into one, for multiple samples. Below is the script I wrote to automatically convert one fastq file to one bwa file for many samples, but I need the script to merge three files into one. Thank you.

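for fq in ~/NGS/*.fastq
do
    echo "working with file $fq"

    base=$(basename "$fq" .fastq)
    echo "base name is $base"

    # $bwa was never defined in the original post; this output name is an assumption
    bwa=~/NGS/$base.bwa

    bwa aln -t 4 GMbwaidx "$fq" > "$bwa"
done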

My six files for one sample look like this:


modified 2.0 years ago by Malcolm.Cook • written 2.0 years ago by zhou_1228

Hello zhou_1228,


and how do you know that these files belong to the same sample? Which part of the filename gives that information?

fin swimmer

written 2.0 years ago by finswimmer

P001_WB01 represents plate 1, well B1.

written 2.0 years ago by zhou_1228
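
Given that naming convention, the plate/well token can be extracted from each filename and used as the key for grouping files by sample. The following is a minimal sketch, assuming every filename contains a token of the form P<3 digits>_W<row letter A-H><2-digit column>. The helper name sample_id and the example filenames are hypothetical, since the actual file list is not shown in this thread.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the plate/well token (e.g. P001_WB01) found in a filename.
# Assumed pattern: P + 3 digits, underscore, W + row letter (A-H) + 2-digit column.
# Prints nothing if the filename contains no such token.
sample_id() {
    basename "$1" | sed -nE 's/.*(P[0-9]{3}_W[A-H][0-9]{2}).*/\1/p'
}

# Made-up filenames, for illustration only.
for f in run1_P001_WB01_L001_R1_001.fastq \
         run1_P001_WB01_L002_R1_001.fastq \
         run1_P002_WA12_L001_R2_001.fastq
do
    printf '%s -> %s\n' "$f" "$(sample_id "$f")"
done
```

Files that map to the same token belong to the same sample, so the token can serve as the basename of the merged output.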

convert one fastq file to one bwa file

There is no bwa file format; bwa outputs alignments in the SAM format. For this reason, I would write:

modified 2.0 years ago • written 2.0 years ago by h.mon
2.0 years ago, Malcolm.Cook (Kansas, USA) wrote:

Install and use GNU Parallel.

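Then use the following model, shown with its dry-run output for nPlates=3 and nWells=4. Remove --dry (short for --dry-run) when you're ready to run. Note that bwa aln itself writes .sai output, which bwa samse or sampe then converts to SAM, so the .sam suffix here is a loose label: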

parallel -k --dry 'bwa aln -t 4 GMbwaidx <(cat NGS/*{1}_{2}*.fastq) > {1}_{2}.sam' :::: <( seq -f 'P%03g' ${nPlates} ) <(seq -f 'WB%02g' ${nWells} )
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB01*.fastq) > P001_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB02*.fastq) > P001_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB03*.fastq) > P001_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB04*.fastq) > P001_WB04.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB01*.fastq) > P002_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB02*.fastq) > P002_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB03*.fastq) > P002_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB04*.fastq) > P002_WB04.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB01*.fastq) > P003_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB02*.fastq) > P003_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB03*.fastq) > P003_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB04*.fastq) > P003_WB04.sam


The above approach

  • depends upon bwa's ability to stream input
  • works with any number of fastq files per plate_well combination.
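  • does not rely on parallel's ability to run multiple jobs at once, since presumably you have -t 4 threads available to you; note that parallel defaults to one job per CPU core, so add -j 1 to run the bwa jobs strictly one at a time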
  • assumes your shell is bash, and depends upon its capability for Process Substitution
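
To see what process substitution does independently of bwa, here is a toy sketch. The file names and contents are invented, and awk stands in for bwa as "a program that reads the file it is given". It requires bash.

```bash
#!/usr/bin/env bash
# Toy demonstration of bash process substitution; file names and reads are invented.
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$tmp/x_P001_WB01_L001.fastq"
printf '@read2\nTTGA\n+\nIIII\n' > "$tmp/x_P001_WB01_L002.fastq"

# <(cat ...) is replaced by a path (such as /dev/fd/63). A program that opens
# that path reads the concatenated stream; no merged file is written to disk.
# awk counts the lines it receives, standing in for bwa here.
n_lines=$(awk 'END { print NR }' <(cat "$tmp"/*P001_WB01*.fastq))
echo "lines streamed: $n_lines"

rm -rf "$tmp"
```

Because the stream can only be read once, front to back, this works only with tools that read their input sequentially, which is the first point in the list above.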
modified 2.0 years ago • written 2.0 years ago by Malcolm.Cook

From my understanding, you run two commands, cat and bwa, together in your model. But in my case, for each sample, I first need to merge the three R1.fastq files into one -F.fastq and the three R2.fastq files into one -R.fastq. Then I run the command "bwa mem GMbwaidx -F.fastq -R.fastq > *.sam" to generate the SAM file. Do you have any suggestions for running these two steps automatically?

written 2.0 years ago by zhou_1228

Sure. The approach is the same; you just need two calls to cat, using slightly different file wildcarding (aka globbing) in each. Also, I now realize your well identifier has a row and a column component. In this updated example, for brevity, I limit to the first three rows, A through C, and the first two zero-padded columns, 01 through 02:

plate=$(seq -f 'P%03g' 3)
row=$(echo {A..C})
col=$(seq -f '%02g' 2 )
parallel -k --dry 'bwa mem GMbwaidx <(cat NGS/*{1}_W{2}{3}*_R1_*.fastq) <(cat NGS/*{1}_W{2}{3}*_R2_*.fastq) > {1}_W{2}{3}.sam' ::: $plate ::: $row ::: $col
bwa mem GMbwaidx <(cat NGS/*P001_WA01*_R1_*.fastq) <(cat NGS/*P001_WA01*_R2_*.fastq) > P001_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WA02*_R1_*.fastq) <(cat NGS/*P001_WA02*_R2_*.fastq) > P001_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P001_WB01*_R1_*.fastq) <(cat NGS/*P001_WB01*_R2_*.fastq) > P001_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WB02*_R1_*.fastq) <(cat NGS/*P001_WB02*_R2_*.fastq) > P001_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P001_WC01*_R1_*.fastq) <(cat NGS/*P001_WC01*_R2_*.fastq) > P001_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WC02*_R1_*.fastq) <(cat NGS/*P001_WC02*_R2_*.fastq) > P001_WC02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WA01*_R1_*.fastq) <(cat NGS/*P002_WA01*_R2_*.fastq) > P002_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WA02*_R1_*.fastq) <(cat NGS/*P002_WA02*_R2_*.fastq) > P002_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WB01*_R1_*.fastq) <(cat NGS/*P002_WB01*_R2_*.fastq) > P002_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WB02*_R1_*.fastq) <(cat NGS/*P002_WB02*_R2_*.fastq) > P002_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WC01*_R1_*.fastq) <(cat NGS/*P002_WC01*_R2_*.fastq) > P002_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WC02*_R1_*.fastq) <(cat NGS/*P002_WC02*_R2_*.fastq) > P002_WC02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WA01*_R1_*.fastq) <(cat NGS/*P003_WA01*_R2_*.fastq) > P003_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WA02*_R1_*.fastq) <(cat NGS/*P003_WA02*_R2_*.fastq) > P003_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WB01*_R1_*.fastq) <(cat NGS/*P003_WB01*_R2_*.fastq) > P003_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WB02*_R1_*.fastq) <(cat NGS/*P003_WB02*_R2_*.fastq) > P003_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WC01*_R1_*.fastq) <(cat NGS/*P003_WC01*_R2_*.fastq) > P003_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WC02*_R1_*.fastq) <(cat NGS/*P003_WC02*_R2_*.fastq) > P003_WC02.sam

As rewritten, the approach

modified 2.0 years ago • written 2.0 years ago by Malcolm.Cook
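
If merged files on disk are wanted, as in the two-step workflow described in the question, a plain shell loop is a possible alternative to streaming. This is a sketch under stated assumptions, not the method from the answer above: it reuses the NGS/*<sample>*_R1_*.fastq and *_R2_*.fastq globs from that answer, takes the -F.fastq and -R.fastq output names from the question, and introduces a hypothetical merged/ output directory, a hypothetical dry_run switch, and a hypothetical function name merge_and_align.

```bash
#!/usr/bin/env bash
# Sketch: merge each sample's R1 and R2 files on disk, then align the pair.
# indir, outdir, dry_run and merge_and_align are illustrative names (assumptions).
indir=${indir:-NGS}
outdir=${outdir:-merged}
dry_run=${dry_run:-yes}   # "yes" only prints the bwa command; set dry_run=no to run it

merge_and_align() {
    s=$1
    mkdir -p "$outdir"
    # The shell expands globs in sorted order, so the R1 and R2 parts are
    # concatenated in matching order as long as their names differ only in R1/R2.
    cat "$indir"/*"$s"*_R1_*.fastq > "$outdir/$s-F.fastq" || return 1
    cat "$indir"/*"$s"*_R2_*.fastq > "$outdir/$s-R.fastq" || return 1
    if [ "$dry_run" = yes ]; then
        echo "bwa mem GMbwaidx $outdir/$s-F.fastq $outdir/$s-R.fastq > $outdir/$s.sam"
    else
        bwa mem GMbwaidx "$outdir/$s-F.fastq" "$outdir/$s-R.fastq" > "$outdir/$s.sam"
    fi
}

# Same plates, rows and columns as in the parallel example above.
for plate in P001 P002 P003; do
    for row in A B C; do
        for col in 01 02; do
            sample="${plate}_W${row}${col}"
            # Skip plate/well combinations that have no input files.
            ls "$indir"/*"$sample"*_R1_*.fastq >/dev/null 2>&1 || continue
            merge_and_align "$sample"
        done
    done
done
```

Compared with the process-substitution version, this costs extra disk space for the merged copies but keeps them available for other tools. For paired-end data the forward and reverse files must list reads in the same order, which is why the concatenation order of the R1 and R2 parts matters.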

Thank you so much for your reply. I found that there are many GNU Parallel packages available for download. My OS is Linux Mint 18.1, so which one should I download? Thank you.

written 2.0 years ago by zhou_1228

I cannot help much beyond suggesting that you install the latest version of GNU Parallel that is packaged for your operating system distribution.

Probably install with:

sudo apt-get install parallel

but it is best to follow your OS documentation, possibly such as: Installing softwares

or, for hints,

written 24 months ago by Malcolm.Cook

I got it. Thank you so much.

written 24 months ago by zhou_1228

Great - glad to help - please upvote and accept the answer!

written 23 months ago by Malcolm.Cook