Question

How to align 100 samples with BWA?

0

Entering edit mode

8.1 years ago

madkitty ▴ 690

We have 100 exomes in one folder on our server, all are stored as a fastq file. What I did until now, was creating index and get ready for the alignment with the following lines:

bwa index -a hg38.fa
samtools faidx hg38.fa
java -jar $PICARD CreateSequenceDictionary REFERENCE=hg38.fa OUTPUT=hg38.dict

Now, the alignment of each exome is the next step, though I am looking for a way to align align all 100 exoms in one command, I don't want to copy/paste the same thing a hundred times. Our exome files have a very generic name, named from 1 to 100.fq

Any clue how I can do that?

next-gen exome alignment bwa • 7.1k views

ADD COMMENT • link updated 8.1 years ago by Zev.Kronenberg 12k • written 8.1 years ago by madkitty ▴ 690

4

Entering edit mode

Have a look at this for inspiration: bash loop for alignment RNA-seq data
Example is for TopHat but you will get the idea of how to apply this to bwa.

ADD REPLY • link 8.1 years ago by GenoMax 141k

score 2 · Answer 1 · 2016-04-01

2

Entering edit mode

8.1 years ago

Sukhi Singh 11k

It's a much better practice, to spread the load to different processors using the parallel version of mapper and practices from high performance computing (HPC) rather than manually running several files on same processors (single or multiple but same) using a bash loop. This will generate a massive load on the machine, no or poor task management and highly likely with that number of files system might crash. Depending on what system are you using, I highly recommend using a grid scheduler like SGE, read this for a comparison or some introduction from this study.

Once, you have let's say SGE installed, then wrap up your BWA or pBWA code in a qsub shell script and then you can run a bash loop to feed all the files to this qsub or scheduler script and go for a beer or jog. Benefits:

You will spread the load properly and efficiently.
Internal load management and assigning of jobs is taken care of automatically
Proper logs per sample will be generated for later scrutiny and core dumps will be generated with if something fails.
Your system will be much happier.

ADD COMMENT • link 8.1 years ago by Sukhi Singh 11k

0

Entering edit mode

Thank you, Our files have a very generic name from 1 to 100, my question is how do I write these 3 lines in a batch loop?

ADD REPLY • link 8.1 years ago by madkitty ▴ 690

0

Entering edit mode

for i in $(ls *.fastq)

do
bwa aln -f input.sai -t 3 hg38.fa ${i%.fastq}
done

Would that work?

ADD REPLY • link 8.1 years ago by madkitty ▴ 690

1

Entering edit mode

This would not work as typed.
Solution below is only as an academic exercise. You probably don't want to run this analysis this way.

A working loop can be (assuming PE file names for each sample are 1read1.fq and 1read2.fq etc)

for i in $(ls -1 *read1*.fastq);
do
bwa mem bwa_index_name $i\read1.fq $i\read2.fq > $i\.sam;
done

As @Sukhdeep pointed out above this loop would submit all 100 jobs at the same time without waiting for the first one to complete. Which would either jam your machine solid or at worst make it crash. If you wish to follow the simple bash script route you may wish to submit small number of jobs each time, let them complete and then start the next set (vary values 1 3 in example below like 4 6, 7 9 etc).

for i in $(seq 1 3);
do
bwa mem bwa_index_name $i\_read1.fq $i\_read2.fq > $i\.sam ;
done

ADD REPLY • link 8.1 years ago by GenoMax 141k

0

Entering edit mode

Thanks! Can you explain what do values do in (seq 1 3)? From my understanding it will be aligning from exome 1 to 3. Since we have 100 exomes in total, the goal here is to launch a script that will do the job by itself so I don't need to generate 33 different script, I'm not sure the job will be done much faster, ideally I'd like to tell the server to split the job on multiple cores or do the alignment one by one. Does the script above execute alignment one by one or all at once for i within range x to y?

ADD REPLY • link 8.0 years ago by madkitty ▴ 690

0

Entering edit mode

What kind of hardware do you have access to (cores/RAM)? Is this a standalone server or a cluster?
If this is a standalone machine then one could write a script that waits for the first three jobs to finish before launching the next three but it will be significantly more complex.
Above script launches all jobs (within range of X Y) at the same time.

ADD REPLY • link 8.0 years ago by GenoMax 141k

0

Entering edit mode

It's a hospital server with lots of cores available, that's all I know. What does "1 3" mean in this example? Does it execute file 1 to file 3 then stops or does it tell it to do 3 files at a time???

I usually write a script.sh file and qsub it to the server, smth that looks like

#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l walltime=5:00:00
#PBS -N kimyr_g
cd $PBS_O_WORKDIR


script here ...

Thanks

ADD REPLY • link 8.0 years ago by madkitty ▴ 690

0

Entering edit mode

Now you say!
You have access to a proper compute cluster (considering the PBS job submission script). You can start 100 separate jobs for the 100 exomes at the same time. Write a loop to write those script files which you can then submit to the cluster.

ADD REPLY • link 8.0 years ago by GenoMax 141k

0

Entering edit mode

Oh good to know that I can start aligning 100 exoms at once ! So would my original script work?

for i in $(ls *.fastq)

do
bwa aln -f input.sai -t 3 hg38.fa ${i%.fastq}
done

ADD REPLY • link 8.0 years ago by madkitty ▴ 690

0

Entering edit mode

You will have to wrap the PBS related bits around each job so not just as you have typed it above but it will have to be extended.
You can submit all 100 jobs at the same time but depending on resources allocated to your account not all may run at the same time. At least they will be in the queue and will eventually get done.

ADD REPLY • link 8.0 years ago by GenoMax 141k

0

Entering edit mode

Something like this (test with a 2 or 3 files before submitting for all).

for i in $(ls *.fastq)
do
qsub -l nodes=2:pp -l walltime=5:00:00 -N kimyr_g n=8 {any additional things} bwa aln -f input.sai -t 8 hg38.fa ${i%.fastq}
done

ADD REPLY • link 8.0 years ago by GenoMax 141k

score 0 · Answer 2 · 2016-04-01

0

Entering edit mode

8.1 years ago

Zev.Kronenberg 12k

Snakemake all the way:

SGE mode:

snakemake -c "qsub {params.sge_opts}" -s Snakefile -j 100 -w 50

https://bitbucket.org/snakemake/snakemake/wiki/Getting%20Started%20with%20Snakemake%20and%20Qsub

ADD COMMENT • link 8.1 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Thanks though I am not a computer person and really I don't understand what Snakemake does nor how it does it. :(( How can I apply this to my 100 exoms?

ADD REPLY • link 8.1 years ago by madkitty ▴ 690