Question: How to align 100 samples with BWA?
0
gravatar for madkitty
2.9 years ago by
madkitty580
Canada
madkitty580 wrote:

We have 100 exomes in one folder on our server, all are stored as a fastq file. What I did until now, was creating index and get ready for the alignment with the following lines:

bwa index -a hg38.fa
samtools faidx hg38.fa
java -jar $PICARD CreateSequenceDictionary REFERENCE=hg38.fa OUTPUT=hg38.dict

Now, the alignment of each exome is the next step, though I am looking for a way to align align all 100 exoms in one command, I don't want to copy/paste the same thing a hundred times. Our exome files have a very generic name, named from 1 to 100.fq

Any clue how I can do that?

bwa alignment next-gen exome • 2.5k views
ADD COMMENTlink modified 2.9 years ago by Zev.Kronenberg11k • written 2.9 years ago by madkitty580
4

Have a look at this for inspiration: bash loop for alignment RNA-seq data
Example is for TopHat but you will get the idea of how to apply this to bwa.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax62k
2
gravatar for Sukhdeep Singh
2.9 years ago by
Sukhdeep Singh9.6k
Netherlands
Sukhdeep Singh9.6k wrote:

It's a much better practice, to spread the load to different processors using the parallel version of mapper and practices from high performance computing (HPC) rather than manually running several files on same processors (single or multiple but same) using a bash loop. This will generate a massive load on the machine, no or poor task management and highly likely with that number of files system might crash. Depending on what system are you using, I highly recommend using a grid scheduler like SGE, read this for a comparison or some introduction from this study.

Once, you have let's say SGE installed, then wrap up your BWA or pBWA code in a qsub shell script and then you can run a bash loop to feed all the files to this qsub or scheduler script and go for a beer or jog. Benefits:

  1. You will spread the load properly and efficiently.

  2. Internal load management and assigning of jobs is taken care of automatically

  3. Proper logs per sample will be generated for later scrutiny and core dumps will be generated with if something fails.

  4. Your system will be much happier.

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by Sukhdeep Singh9.6k

Thank you, Our files have a very generic name from 1 to 100, my question is how do I write these 3 lines in a batch loop?

ADD REPLYlink written 2.9 years ago by madkitty580
for i in $(ls *.fastq)

do
bwa aln -f input.sai -t 3 hg38.fa ${i%.fastq}
done

Would that work?

ADD REPLYlink written 2.9 years ago by madkitty580
1

This would not work as typed.
Solution below is only as an academic exercise. You probably don't want to run this analysis this way.

A working loop can be (assuming PE file names for each sample are 1read1.fq and 1read2.fq etc)

for i in $(ls -1 *read1*.fastq);
do
bwa mem bwa_index_name $i\read1.fq $i\read2.fq > $i\.sam;
done

As @Sukhdeep pointed out above this loop would submit all 100 jobs at the same time without waiting for the first one to complete. Which would either jam your machine solid or at worst make it crash. If you wish to follow the simple bash script route you may wish to submit small number of jobs each time, let them complete and then start the next set (vary values 1 3 in example below like 4 6, 7 9 etc).

for i in $(seq 1 3);
do
bwa mem bwa_index_name $i\_read1.fq $i\_read2.fq > $i\.sam ;
done
ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax62k

Thanks! Can you explain what do values do in (seq 1 3)? From my understanding it will be aligning from exome 1 to 3. Since we have 100 exomes in total, the goal here is to launch a script that will do the job by itself so I don't need to generate 33 different script, I'm not sure the job will be done much faster, ideally I'd like to tell the server to split the job on multiple cores or do the alignment one by one. Does the script above execute alignment one by one or all at once for i within range x to y?

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by madkitty580

What kind of hardware do you have access to (cores/RAM)? Is this a standalone server or a cluster?
If this is a standalone machine then one could write a script that waits for the first three jobs to finish before launching the next three but it will be significantly more complex.
Above script launches all jobs (within range of X Y) at the same time.

ADD REPLYlink written 2.9 years ago by genomax62k

It's a hospital server with lots of cores available, that's all I know. What does "1 3" mean in this example? Does it execute file 1 to file 3 then stops or does it tell it to do 3 files at a time???

I usually write a script.sh file and qsub it to the server, smth that looks like

#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l walltime=5:00:00
#PBS -N kimyr_g
cd $PBS_O_WORKDIR


script here ...

Thanks

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by madkitty580

Now you say!
You have access to a proper compute cluster (considering the PBS job submission script). You can start 100 separate jobs for the 100 exomes at the same time. Write a loop to write those script files which you can then submit to the cluster.

ADD REPLYlink written 2.9 years ago by genomax62k

Oh good to know that I can start aligning 100 exoms at once ! So would my original script work?

for i in $(ls *.fastq)

do
bwa aln -f input.sai -t 3 hg38.fa ${i%.fastq}
done
ADD REPLYlink written 2.9 years ago by madkitty580

You will have to wrap the PBS related bits around each job so not just as you have typed it above but it will have to be extended.
You can submit all 100 jobs at the same time but depending on resources allocated to your account not all may run at the same time. At least they will be in the queue and will eventually get done.

ADD REPLYlink written 2.9 years ago by genomax62k

Something like this (test with a 2 or 3 files before submitting for all).

for i in $(ls *.fastq)
do
qsub -l nodes=2:pp -l walltime=5:00:00 -N kimyr_g n=8 {any additional things} bwa aln -f input.sai -t 8 hg38.fa ${i%.fastq}
done
ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax62k
0
gravatar for Zev.Kronenberg
2.9 years ago by
United States
Zev.Kronenberg11k wrote:

Snakemake all the way:

SGE mode:

snakemake -c "qsub {params.sge_opts}" -s Snakefile -j 100 -w 50

https://bitbucket.org/snakemake/snakemake/wiki/Getting%20Started%20with%20Snakemake%20and%20Qsub

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by Zev.Kronenberg11k

Thanks though I am not a computer person and really I don't understand what Snakemake does nor how it does it. :(( How can I apply this to my 100 exoms?

ADD REPLYlink written 2.9 years ago by madkitty580
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1267 users visited in the last hour