Question: Automating the merging of multiple fastq files into one fastq file
zhou_1228 wrote:

I got six fastq files (three forward and three reverse) for every sample in a 96-well plate via NGS. As the first step of SNP calling, I need to convert these six files into two files (one forward and one reverse fastq file) for each sample. I am trying to write a shell script that automatically merges every three forward (or reverse) files into one, for multiple samples. Below is the script I wrote to automatically convert one fastq file to one bwa file for many samples, but I need the script to merge three files into one. Thank you.

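for fq in ~/NGS/*.fastq
do
    echo "working with file $fq"

    # strip the directory and the .fastq suffix to get the sample base name
    base=$(basename "$fq" .fastq)
    echo "base name is $base"

    bwa=~/results/bwa/${base}.bwa

    # quote the variables so paths containing spaces do not break the command
    bwa aln -t 4 GMbwaidx "$fq" > "$bwa"
done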

My six files for one sample look like this:

142_P001_WB01_S1751_L008_R1_001.fastq
143_P001_WB01_S13_L001_R1_001.fastq
143_P001_WB01_S13_L002_R1_001.fastq

142_P001_WB01_S1751_L008_R2_001.fastq
143_P001_WB01_S13_L001_R2_001.fastq
143_P001_WB01_S13_L002_R2_001.fastq
sequencing snp • 224 views
modified 27 days ago by Malcolm.Cook • written 29 days ago by zhou_1228

Hello zhou_1228,

142_P001_WB01_S1751_L008_R1_001.fastq
143_P001_WB01_S13_L001_R1_001.fastq

How do you know that these files belong to the same sample? Which part of the filename gives that information?

fin swimmer

written 29 days ago by finswimmer

P001_WB01 represents plate 1, well No. B1.
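
Since the plate and well are encoded in the file name, the sample key can be pulled out with a pattern. A minimal sketch, assuming every key has the form P<3 digits>_W<row letter A-H><2 digits>; the function names sample_key and list_samples are made up for illustration:

```shell
# Print the plate_well key (e.g. P001_WB01) found in a fastq file name.
# Prints nothing if the name does not contain such a key.
sample_key() {
    basename "$1" |
        sed -n 's/.*\(P[0-9][0-9][0-9]_W[A-H][0-9][0-9]\).*/\1/p'
}

# Print each distinct key found among the .fastq files in directory $1.
list_samples() {
    for f in "$1"/*.fastq; do
        sample_key "$f"
    done | sort -u
}

sample_key 142_P001_WB01_S1751_L008_R1_001.fastq   # prints P001_WB01
```

Running list_samples ~/NGS would then print one line per sample, which can drive a loop over samples.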

written 29 days ago by zhou_1228

convert one fastq file to one bwa file

There is no bwa file format; bwa outputs alignments in the SAM format. For this reason, I would write:

bwa=~/results/bwa/${base}.sam
modified 29 days ago • written 29 days ago by h.mon
Malcolm.Cook (Kansas, USA) wrote:

Install and use GNU Parallel.

Then use the following model. The lines after the parallel command are what the dry run prints. Remove --dry when you're ready to run:

nPlates=3
nWells=4
parallel -k --dry 'bwa aln -t 4 GMbwaidx <(cat NGS/*{1}_{2}*.fastq) > {1}_{2}.sam' :::: <( seq -f 'P%03g' ${nPlates} ) <(seq -f 'WB%02g' ${nWells} )
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB01*.fastq) > P001_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB02*.fastq) > P001_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB03*.fastq) > P001_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB04*.fastq) > P001_WB04.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB01*.fastq) > P002_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB02*.fastq) > P002_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB03*.fastq) > P002_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB04*.fastq) > P002_WB04.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB01*.fastq) > P003_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB02*.fastq) > P003_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB03*.fastq) > P003_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB04*.fastq) > P003_WB04.sam

Notes:

The above approach

  • depends upon bwa's ability to stream input
  • works with any number of fastq files per plate_well combination.
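  • does not rely on parallel's ability to run several jobs at once, since each bwa job already uses 4 threads (-t 4); note that parallel starts one job per CPU core by default, so add -j 1 to run the samples strictly one at a time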
  • assumes your shell is bash, and depends upon its capability for Process Substitution
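
To see what process substitution does without running bwa, here is a small sketch with two made-up fastq files. awk stands in for bwa and simply counts the reads in the stream it is handed; the file names and contents are invented for the demo, and bash is required:

```shell
#!/usr/bin/env bash
# Create two tiny demo fastq files (one read each) in a scratch directory.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/demo_P001_WB01_L001_R1_001.fastq"
printf '@r2\nTTGA\n+\nIIII\n' > "$tmp/demo_P001_WB01_L002_R1_001.fastq"

# <(cat ...) is replaced by a path that the command reads like a file;
# the concatenation is streamed, so no merged file is written to disk.
reads=$(awk 'END { print NR / 4 }' <(cat "$tmp"/*P001_WB01*_R1_*.fastq))
echo "merged stream has $reads reads"

rm -r "$tmp"
```

The same <(cat ...) argument is what bwa receives in the model above.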
modified 24 days ago • written 27 days ago by Malcolm.Cook

From my understanding, you run two commands, cat and bwa, together in your model. But in my case, for each sample, I first need to merge the three R1.fastq files into one -F.fastq and the three R2.fastq files into one -R.fastq, separately, and then run the command "bwa mem GMbwaidx -F.fastq -R.fastq > *.sam" to generate the sam file. Do you have any suggestions for running these two steps automatically?

written 26 days ago by zhou_1228

Sure. The approach is the same; you just need two calls to cat, using slightly different file wildcarding (aka globbing) in each. Also, I now realize your well identifier has a row and a column component. In this updated example, for brevity, I limit it to the first three rows, A through C, and the first two zero-padded columns, 01 and 02. As before, the lines after the parallel command are the dry-run output:

plate=$(seq -f 'P%03g' 3)
row=$(echo {A..C})
col=$(seq -f '%02g' 2 )
parallel -k --dry 'bwa mem GMbwaidx <(cat NGS/*{1}_W{2}{3}*_R1_*.fastq) <(cat NGS/*{1}_W{2}{3}*_R2_*.fastq) > {1}_W{2}{3}.sam' ::: $plate ::: $row ::: $col
bwa mem GMbwaidx <(cat NGS/*P001_WA01*_R1_*.fastq) <(cat NGS/*P001_WA01*_R2_*.fastq) > P001_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WA02*_R1_*.fastq) <(cat NGS/*P001_WA02*_R2_*.fastq) > P001_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P001_WB01*_R1_*.fastq) <(cat NGS/*P001_WB01*_R2_*.fastq) > P001_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WB02*_R1_*.fastq) <(cat NGS/*P001_WB02*_R2_*.fastq) > P001_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P001_WC01*_R1_*.fastq) <(cat NGS/*P001_WC01*_R2_*.fastq) > P001_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WC02*_R1_*.fastq) <(cat NGS/*P001_WC02*_R2_*.fastq) > P001_WC02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WA01*_R1_*.fastq) <(cat NGS/*P002_WA01*_R2_*.fastq) > P002_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WA02*_R1_*.fastq) <(cat NGS/*P002_WA02*_R2_*.fastq) > P002_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WB01*_R1_*.fastq) <(cat NGS/*P002_WB01*_R2_*.fastq) > P002_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WB02*_R1_*.fastq) <(cat NGS/*P002_WB02*_R2_*.fastq) > P002_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WC01*_R1_*.fastq) <(cat NGS/*P002_WC01*_R2_*.fastq) > P002_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WC02*_R1_*.fastq) <(cat NGS/*P002_WC02*_R2_*.fastq) > P002_WC02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WA01*_R1_*.fastq) <(cat NGS/*P003_WA01*_R2_*.fastq) > P003_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WA02*_R1_*.fastq) <(cat NGS/*P003_WA02*_R2_*.fastq) > P003_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WB01*_R1_*.fastq) <(cat NGS/*P003_WB01*_R2_*.fastq) > P003_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WB02*_R1_*.fastq) <(cat NGS/*P003_WB02*_R2_*.fastq) > P003_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WC01*_R1_*.fastq) <(cat NGS/*P003_WC01*_R2_*.fastq) > P003_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WC02*_R1_*.fastq) <(cat NGS/*P003_WC02*_R2_*.fastq) > P003_WC02.sam
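
If you would rather keep the merged files on disk, as described in the question, a plain shell loop also works. This is only a sketch: the function names are made up, the -F.fastq and -R.fastq suffixes and the GMbwaidx index are taken from the thread, and the plate_well keys are expected to be passed in explicitly:

```shell
# Merge all R1 files and all R2 files of one sample into two files.
# $1 = input dir, $2 = output dir, $3 = plate_well key such as P001_WB01
merge_sample() {
    cat "$1"/*"$3"*_R1_*.fastq > "$2/$3-F.fastq"
    cat "$1"/*"$3"*_R2_*.fastq > "$2/$3-R.fastq"
}

# Merge, then align, every key given after the two directories.
# Requires bwa and the GMbwaidx index to be available.
merge_and_align() {
    in_dir=$1
    out_dir=$2
    shift 2
    for key in "$@"; do
        merge_sample "$in_dir" "$out_dir" "$key"
        bwa mem GMbwaidx "$out_dir/$key-F.fastq" "$out_dir/$key-R.fastq" \
            > "$out_dir/$key.sam"
    done
}

# Example call, left commented out so nothing runs by accident:
# merge_and_align ~/NGS ~/results/bwa P001_WB01 P001_WB02
```

Because the shell expands both globs in the same sorted order, the forward and reverse reads stay paired as long as each R1 file has a matching R2 file.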

As rewritten, the approach

modified 24 days ago • written 24 days ago by Malcolm.Cook

Thank you so much for your reply. I found that there are many GNU Parallel packages available for download. My OS is Linux Mint 18.1, so which one should I download? Thank you.

written 22 days ago by zhou_1228

I cannot help much beyond suggesting that you install the latest version of GNU Parallel that is packaged for your operating system distribution.

You can probably install it with:

sudo apt-get install parallel

but it is best to follow your OS documentation, possibly such as: Installing softwares

or, for hints, https://www.gnu.org/software/parallel/
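
Before installing anything, it may be worth checking whether a parallel command is already on your PATH. A small sketch; the helper name have_cmd is made up, and the apt-get hint only applies to Debian-based systems such as Linux Mint:

```shell
# Succeed if the named command can be found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd parallel; then
    # Show the first line of the version banner of the installed parallel.
    parallel --version | head -n 1
else
    echo "parallel not found; try: sudo apt-get install parallel"
fi
```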

written 17 days ago by Malcolm.Cook

I got it. Thank you so much.

written 15 days ago by zhou_1228

Great - glad to help - please upvote and accept the answer!

written 10 days ago by Malcolm.Cook