Question: Run Hundreds Of BWA Commands Without Waiting
Bioscientist wrote:

Hi guys, I'm analyzing some high-coverage trio data, so I need to run BWA on hundreds of fastq.gz files. Obviously I should write a script for this instead of waiting and typing in hundreds of commands one by one. But as a beginner without coding experience, I don't know how to do it.

For example, I just put

bwa aln -t 24 index file1>1.sam
bwa aln -t 24 index file2>2.sam
bwa aln -t 24 index file3>3.sam
...............
..............

into the script and ran it, and it doesn't work at all. I know I must be missing something, say, the path to the fastq files.

Can anyone give me a pattern for a script that runs multiple jobs like this? Thanks.

modified 7.4 years ago by Sean Davis • written 7.4 years ago by Bioscientist
Farhat wrote:

GNU parallel could also help with something like this. It can handle the whole thing in a single line, with multiprocessing, using something like this (untested):

parallel bwa aln -t 24 index {} ">" {.}.sam ::: file*
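
One thing to watch with the one-liner above: by default GNU parallel starts one job per CPU core, so combined with -t 24 it can oversubscribe the machine. A minimal sketch, assuming GNU parallel is installed and a 24-core machine (both are assumptions; the 4 x 6 split and the function name preview_bwa_jobs are purely illustrative), caps the number of jobs with -j and uses --dry-run to print the commands instead of running them:

```shell
# Print (but do not run) the bwa commands that parallel would launch.
# Assumes GNU parallel; 4 jobs x 6 threads is an illustrative split for 24 cores.
preview_bwa_jobs() {
    parallel --dry-run -j 4 bwa aln -t 6 index {} ">" {.}.sam ::: "$@"
}

# Hypothetical usage: preview_bwa_jobs file*
```

Once the printed commands look right, dropping --dry-run runs them for real.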
written 7.4 years ago by Farhat

Aleksandr Levchuk wrote:

Programming boils down to 2 organization things: variables and functions; and 2 action things: if-statements and for-loops.

All you need here is a for-loop and a variable (let's name it "i").

In Bash the code would be:

for i in `seq -w 3 111`; do
   echo "bwa aln -t 24 index file${i} > ${i}.sam"
done

Output:

bwa aln -t 24 index file003 > 003.sam
bwa aln -t 24 index file004 > 004.sam
bwa aln -t 24 index file005 > 005.sam
...
bwa aln -t 24 index file109 > 109.sam
bwa aln -t 24 index file110 > 110.sam
bwa aln -t 24 index file111 > 111.sam

To put this into a script and run it, do this:

# Generate script
for i in `seq -w 3 111`; do
   echo "bwa aln -t 24 index file${i} > ${i}.sam"
done > my_script.sh

# Make script executable
chmod +x my_script.sh

# Run script
./my_script.sh
modified 7.4 years ago • written 7.4 years ago by Aleksandr Levchuk

Why not then use seq -w 0010022 0010077 and "SRR$i"? I can tell that you haven't tried answering my "what happens when" questions.

written 7.4 years ago by Aleksandr Levchuk

What happens when you run seq -w 3 111 in Bash by itself? What happens when you run it without the -w?

written 7.4 years ago by Aleksandr Levchuk

Thanks, guys. But actually the names of the fastq files here are not numbers like 1, 2, 3, 4... but names like SRR0010022, so seq -w 3 111 doesn't really work...

written 7.4 years ago by Bioscientist
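
For names like SRR0010022 that don't follow a simple counter, one option is to loop over the files that actually exist instead of generating numbers. A minimal sketch, assuming the reads are named SRR*.fastq.gz and sit in one directory (both are assumptions, and print_bwa_commands is a hypothetical helper name). Like the answer above, it only prints the commands so they can be checked first:

```shell
# Print one bwa command per FASTQ file found in the directory given as $1.
# Assumes files named SRR*.fastq.gz; adjust the pattern to match your data.
print_bwa_commands() {
    for fq in "$1"/SRR*.fastq.gz; do
        [ -e "$fq" ] || continue            # skip if the pattern matched nothing
        name=$(basename "$fq" .fastq.gz)    # e.g. SRR0010022
        echo "bwa aln -t 24 index $fq > $name.sam"
    done
}

# Hypothetical usage: print_bwa_commands /path/to/fastq_dir
```

Redirecting the output to a file gives a script that can be made executable and run in the same way as the generated script above.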
Ketil wrote:

You probably don't want to run hundreds of alignment jobs serially on a single-CPU machine, but neither do you want to launch hundreds of simultaneous jobs which will compete for CPU, and worse, exhaust memory. I tend to use a makefile to control parallelism. That is, I'd have a Makefile with a target like

%.sam: %.txt
     bwa aln -t 24 index $< > $@

I can then do something like

make -j 8 file-{3..111}.sam

which will issue the relevant bwa commands in parallel, limited to eight simultaneous jobs. Another advantage is of course that make won't re-create existing files.

If you want to do this in shell, it's a bit less flexible, but you can get away with

for a in {0..9}; do
  for b in {0..9}; do
     bwa aln -t 24 index file-$a$b.txt > file-$a$b.sam &
  done
  wait
done

where the inner loop will spawn jobs in the background (due to &) and wait will pause until each batch of ten jobs is finished, before launching the next ten.
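
The same batching idea can be applied to arbitrary filenames, not just two-digit counters. A minimal sketch, assuming a POSIX shell; run_in_batches and align_one are hypothetical helper names, and the batch size is a number you would tune to your machine's cores and memory:

```shell
# Run "$cmd file" for each file, with at most $batch jobs running at a time.
run_in_batches() {
    batch=$1; cmd=$2; shift 2
    running=0
    for f in "$@"; do
        "$cmd" "$f" &                       # launch in the background
        running=$((running + 1))
        if [ "$running" -ge "$batch" ]; then
            wait                            # block until the whole batch is done
            running=0
        fi
    done
    wait                                    # catch the final, partial batch
}

# One alignment job; same command as in the loop above, input name passed as $1.
align_one() {
    bwa aln -t 24 index "$1" > "${1%.txt}.sam"
}

# Hypothetical usage: run_in_batches 8 align_one file-*.txt
```

As with the nested loop, one slow job holds up its whole batch; make -j keeps all slots busy, which is one reason to prefer it.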

written 7.4 years ago by Ketil

Thanks, Ketil. I've been too busy these days to read all your kind and excellent posts here. I actually run my jobs on clusters, so I can submit many jobs at the same time; each job will automatically be assigned to one machine.

written 7.4 years ago by Bioscientist

Jeremy Leipzig wrote:

A variation on Aleksandr's approach:

for f in {3..111}; do
  i=$(printf "%03d" "$f")
  bwa aln -t 24 index file$i > $i.sam
done

If you add & to the end of the command, all the processes will run in the background simultaneously, which may overwhelm your server:

for f in {3..111}; do
  i=$(printf "%03d" "$f")
  bwa aln -t 24 index file$i > $i.sam &
done
modified 7.4 years ago • written 7.4 years ago by Jeremy Leipzig

Sean Davis wrote:

If you are getting into second-gen sequencing analysis, think about using a simple batching system such as SLURM, or the slightly more complicated Sun Grid Engine, for your machine(s), even if you only have a single machine. It is quite liberating to simply throw jobs into a queue and let the batch system deal with the consequences. Naming jobs, deleting them, controlling resource utilization (fewer jobs running during the day, for example), and tracking job progress are all benefits. Of course, you pay a price in added complexity, but we have found it to be worth it for our small group.
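
As one illustration of what throwing jobs into a queue can look like, here is a minimal sketch of a SLURM job array in which each array task aligns one file. The list file fastq_list.txt (one path per line), the array range 1-100, the thread count, and the helper name nth_line are all assumptions for illustration; check your site's SLURM setup before relying on any of it:

```shell
#!/bin/sh
# One array task per input file (illustrative range).
#SBATCH --array=1-100
# Threads given to each bwa job (illustrative value).
#SBATCH --cpus-per-task=8

# Print line N of a file: nth_line N FILE
nth_line() {
    sed -n "${1}p" "$2"
}

# Only do work when running inside a SLURM array task.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    fq=$(nth_line "$SLURM_ARRAY_TASK_ID" fastq_list.txt)
    bwa aln -t "${SLURM_CPUS_PER_TASK:-1}" index "$fq" > "$(basename "$fq").sam"
fi
```

Submitting the script once with sbatch queues all the tasks, and the scheduler decides how many run at the same time.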

written 7.4 years ago by Sean Davis

+1 for the SGE recommendation, it's very useful even for a small group.

written 7.2 years ago by Vitis