Question: Automating the merging of multiple fastq files into one fastq file
3 months ago by
zhou_12280 wrote:

I have six fastq files (three forward and three reverse) for every sample in a 96-well plate from NGS. As the first step of SNP calling, I need to merge these six files into two files (one forward and one reverse fastq file) for each sample. I am now trying to write a shell script that automatically merges every three forward (or reverse) files into one for multiple samples. Below is the script I wrote to automatically convert one fastq file into one bwa file for many samples, but I need the script to merge three files into one. Thank you.

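for fq in ~/NGS/*.fastq
do
    echo "working with file $fq"

    base=$(basename "$fq" .fastq)
    echo "base name is $base"

    # output file name derived from the base name (assumed; $bwa was never set)
    bwa="$base.bwa"
    bwa aln -t 4 GMbwaidx "$fq" > "$bwa"
done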

My six files for one sample look like this:


Hello zhou_1228,


How do you know that these files belong to the same sample? Which part of the filename gives that information?

fin swimmer

P001_WB01 represents plate 1, well B1.
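
Since the sample is identified by the plate and well part of the name, a small helper can pull that ID out of each file name. This is only a minimal sketch: it assumes IDs always look like P001_WB01 (three-digit plate, row letter A to H, two-digit column), and the function name sample_id and the example file name are hypothetical.

```shell
#!/usr/bin/env bash
# Extract the plate/well sample ID (e.g. P001_WB01) from a fastq file name.
# Assumed ID layout: P<3 digits>_W<row A-H><2-digit column>.
sample_id() {
    local name re
    name=$(basename "$1")
    re='(P[0-9]{3})_W([A-H])([0-9]{2})'
    if [[ $name =~ $re ]]; then
        # BASH_REMATCH[1] = plate, [2] = row, [3] = column
        printf '%s_W%s%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
    else
        return 1    # no plate/well ID found in this name
    fi
}

# Hypothetical file name, for illustration only
sample_id "NGS/run1_P001_WB01_L001_R1_001.fastq"    # prints P001_WB01
```

Piping the IDs of all files through sort -u would then give the list of distinct samples.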

convert one fastq file to one bwa file

There is no bwa file format. bwa mem writes alignments in the SAM format, while bwa aln writes an intermediate .sai file that bwa samse or bwa sampe then converts to SAM. For this reason, I would write:

3 months ago by
kansas, usa
Malcolm.Cook970 wrote:

Install and use GNU Parallel.

Then use the following model, shown here with its dry-run output for 3 plates and 4 wells. Remove --dry when you're ready to run:

nPlates=3
nWells=4
parallel -k --dry 'bwa aln -t 4 GMbwaidx <(cat NGS/*{1}_{2}*.fastq) > {1}_{2}.sam' :::: <( seq -f 'P%03g' ${nPlates} ) <(seq -f 'WB%02g' ${nWells} )
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB01*.fastq) > P001_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB02*.fastq) > P001_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB03*.fastq) > P001_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB04*.fastq) > P001_WB04.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB01*.fastq) > P002_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB02*.fastq) > P002_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB03*.fastq) > P002_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB04*.fastq) > P002_WB04.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB01*.fastq) > P003_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB02*.fastq) > P003_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB03*.fastq) > P003_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB04*.fastq) > P003_WB04.sam


The above approach

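  • depends upon bwa's ability to stream input
  • works with any number of fastq files per plate_well combination
  • is not using parallel's ability to run multiple jobs, since presumably you have -t 4 threads available to you
  • assumes your shell is bash, and depends upon its capability for Process Substitution
  • keeps the bwa aln call from the question; bwa aln actually writes .sai data that bwa samse or bwa sampe must convert to SAM, so the .sam names are nominal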
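
The process substitution that the model relies on can be tried without bwa. The sketch below uses two tiny made-up fastq files in a temporary directory (their names are hypothetical) to show that <(cat ...) presents several files as one readable stream, and that seq -f produces the zero-padded plate IDs handed to parallel.

```shell
#!/usr/bin/env bash
# Two one-read fastq files for the same (hypothetical) sample
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/x_P001_WB01_L001.fastq"
printf '@r2\nTTGA\n+\nIIII\n' > "$tmp/x_P001_WB01_L002.fastq"

# <(cat ...) expands to a path (such as /dev/fd/63) that a program can read
# like a file; here wc reads the concatenation of both files through it
n_lines=$(wc -l < <(cat "$tmp"/*P001_WB01*.fastq))
echo "merged stream has $n_lines lines"    # 8 lines: 2 reads x 4 lines each

# seq -f generates the zero-padded IDs, one per line: P001 P002 P003
seq -f 'P%03g' 3

rm -r "$tmp"
```

Nothing is written to disk for the merged data, which is why the approach needs a program that reads its input sequentially.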

From my understanding, your model runs two commands, cat and bwa, together. But in my case, for each sample, I first need to merge the three R1.fastq files into one -F.fastq and the three R2.fastq files into one -R.fastq, separately, and then run the command "bwa mem GMbwaidx -F.fastq -R.fastq > *.sam" to generate the sam file. Do you have any suggestions for running these two steps automatically?

Sure. The approach is the same; you just need two calls to cat, each with a slightly different file wildcard (aka glob). Also, I now realize your well identifier has a row and a column component. In this updated example, for brevity, I limit the run to the first three rows, A through C, and the first two zero-padded columns, 01 and 02. The dry-run output follows the command:

plate=$(seq -f 'P%03g' 3)
row=$(echo {A..C})
col=$(seq -f '%02g' 2 )
parallel -k --dry 'bwa mem GMbwaidx <(cat NGS/*{1}_W{2}{3}*_R1_*.fastq) <(cat NGS/*{1}_W{2}{3}*_R2_*.fastq) > {1}_W{2}{3}.sam' ::: $plate ::: $row ::: $col
bwa mem GMbwaidx <(cat NGS/*P001_WA01*_R1_*.fastq) <(cat NGS/*P001_WA01*_R2_*.fastq) > P001_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WA02*_R1_*.fastq) <(cat NGS/*P001_WA02*_R2_*.fastq) > P001_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P001_WB01*_R1_*.fastq) <(cat NGS/*P001_WB01*_R2_*.fastq) > P001_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WB02*_R1_*.fastq) <(cat NGS/*P001_WB02*_R2_*.fastq) > P001_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P001_WC01*_R1_*.fastq) <(cat NGS/*P001_WC01*_R2_*.fastq) > P001_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WC02*_R1_*.fastq) <(cat NGS/*P001_WC02*_R2_*.fastq) > P001_WC02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WA01*_R1_*.fastq) <(cat NGS/*P002_WA01*_R2_*.fastq) > P002_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WA02*_R1_*.fastq) <(cat NGS/*P002_WA02*_R2_*.fastq) > P002_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WB01*_R1_*.fastq) <(cat NGS/*P002_WB01*_R2_*.fastq) > P002_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WB02*_R1_*.fastq) <(cat NGS/*P002_WB02*_R2_*.fastq) > P002_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WC01*_R1_*.fastq) <(cat NGS/*P002_WC01*_R2_*.fastq) > P002_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WC02*_R1_*.fastq) <(cat NGS/*P002_WC02*_R2_*.fastq) > P002_WC02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WA01*_R1_*.fastq) <(cat NGS/*P003_WA01*_R2_*.fastq) > P003_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WA02*_R1_*.fastq) <(cat NGS/*P003_WA02*_R2_*.fastq) > P003_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WB01*_R1_*.fastq) <(cat NGS/*P003_WB01*_R2_*.fastq) > P003_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WB02*_R1_*.fastq) <(cat NGS/*P003_WB02*_R2_*.fastq) > P003_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WC01*_R1_*.fastq) <(cat NGS/*P003_WC01*_R2_*.fastq) > P003_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WC02*_R1_*.fastq) <(cat NGS/*P003_WC02*_R2_*.fastq) > P003_WC02.sam
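
If merged files on disk are preferred over process substitution, the two steps can also be written out explicitly. This is a sketch under assumptions: the helper name merge_reads and the output directory are hypothetical, the -F.fastq and -R.fastq suffixes follow the question, and the _R1_ and _R2_ parts of the names are taken from the globs above.

```shell
#!/usr/bin/env bash
# Merge all R1 files of one sample into <sample>-F.fastq and all R2 files
# into <sample>-R.fastq inside an existing output directory.
merge_reads() {
    local indir=$1 sample=$2 outdir=$3
    # Globs expand in sorted order, so the R1 and R2 parts are concatenated
    # in matching order as long as the names differ only in the R1/R2 field.
    cat "$indir"/*"${sample}"*_R1_*.fastq > "$outdir/${sample}-F.fastq" &&
    cat "$indir"/*"${sample}"*_R2_*.fastq > "$outdir/${sample}-R.fastq"
}

# Example use (not run here; needs bwa and the GMbwaidx index):
#   mkdir -p merged
#   merge_reads NGS P001_WB01 merged &&
#   bwa mem GMbwaidx merged/P001_WB01-F.fastq merged/P001_WB01-R.fastq > P001_WB01.sam
```

Keeping the read order identical in both merged files matters because bwa mem pairs the reads by position.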

As rewritten, the approach

Thank you so much for your reply. I found that there are many GNU parallel packages available for download. My OS is Linux Mint 18.1, so which one should I download? Thank you.

I cannot help you much more than to say: install the latest version of GNU parallel that is packaged for your operating system distribution.
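
Before installing anything, it may be worth checking whether GNU parallel is already on the PATH. The following is a minimal sketch; the helper name have_cmd is hypothetical. It also checks the version banner because, as far as I know, some distributions have shipped an unrelated tool that is also named parallel (in the moreutils package).

```shell
#!/usr/bin/env bash
# Succeeds if the named command can be found on PATH
have_cmd() { command -v "$1" >/dev/null 2>&1; }

# GNU parallel identifies itself in the first line of its --version output
if have_cmd parallel && parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    parallel --version | head -n 1
else
    echo "GNU parallel not found" >&2
fi
```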

probably install with:

sudo apt-get install parallel

but it is best to follow your OS documentation, possibly such as: Installing softwares

or, for hints,

I got it. Thank you so much.

Great - glad to help - please upvote and accept the answer!

