Run Hundreds of BWA Commands Without Waiting
12.9 years ago
Bioscientist ★ 1.7k

Hi guys, I'm analyzing some high-coverage trio data, so I need to run BWA on hundreds of fastq.gz files. Obviously I should write a script for this instead of typing in hundreds of commands one by one and waiting for each to finish. But as a beginner without coding experience, I don't know how to do it.

For example, I just put

bwa aln -t 24 index file1>1.sam
bwa aln -t 24 index file2>2.sam
bwa aln -t 24 index file3>3.sam
...............
..............

into a script and ran it, and it doesn't work at all. I know I must be missing something, such as the path to the fastq files.

Can anyone give me a pattern for a script that executes multiple jobs like this? Thanks.

bwa • 7.4k views
12.9 years ago
Farhat ★ 2.9k

GNU parallel could also help with something like this. It handles the whole task in a single line and runs several jobs at once, with something like (untested):

parallel bwa aln -t 24 index {} ">" {.}.sam ::: file*
12.9 years ago

Programming boils down to two organizational things (variables and functions) and two action things (if-statements and for-loops).

All you need here is a for-loop and a variable (let's name it "i").

In Bash the code would be:

for i in `seq -w 3 111`; do
   echo "bwa aln -t 24 index file${i} > ${i}.sam"
done

Output:

bwa aln -t 24 index file003 > 003.sam
bwa aln -t 24 index file004 > 004.sam
bwa aln -t 24 index file005 > 005.sam
...
bwa aln -t 24 index file109 > 109.sam
bwa aln -t 24 index file110 > 110.sam
bwa aln -t 24 index file111 > 111.sam

To put this into a script and run it, do this:

# Generate script
for i in `seq -w 3 111`; do
   echo "bwa aln -t 24 index file${i} > ${i}.sam"
done > my_script.sh

# Make script executable
chmod +x my_script.sh

# Run script
./my_script.sh

Why not then use seq -w 0010022 0010077 and "SRR$i"? I can tell that you haven't tried answering my "what happens when" questions.


What happens when you run seq -w 3 111 in Bash by itself? What happens when you run it without the -w?


Thanks, guys. But the fastq files here are not actually named with numbers like 1, 2, 3, 4... but with accessions like SRR0010022, so seq -w 3 111 doesn't really work.
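
When the names are not consecutive numbers, looping over a glob avoids seq altogether. A minimal sketch, assuming the reads end in .fastq.gz and that "index" is the bwa index prefix (both are placeholders, not details confirmed in this thread). It only prints the commands, in the same generate-a-script style as the answer above, so they can be inspected first:

```shell
#!/usr/bin/env bash
# Print one bwa command per FASTQ file found in a directory.
# Assumptions: files end in .fastq.gz; "index" is the bwa index prefix.
# Note: bwa aln writes suffix-array (.sai) data that bwa samse/sampe
# later turns into SAM; the .sam name is kept from the thread.

print_bwa_commands() {
    local dir=$1 fq sample
    for fq in "$dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue             # glob matched nothing
        sample=$(basename "$fq" .fastq.gz)   # SRR0010022.fastq.gz -> SRR0010022
        echo "bwa aln -t 24 index $fq > $sample.sam"
    done
}

print_bwa_commands .
```

Redirecting the output to a file gives a script that can be made executable and run like the generated script above.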

12.9 years ago
Ketil 4.1k

You probably don't want to run hundreds of alignment jobs serially on a single-CPU machine, but neither do you want to launch hundreds of simultaneous jobs which will compete for CPU, and worse, exhaust memory. I tend to use a makefile to control parallelism. That is, I'd have a Makefile with a target like

%.sam: %.txt
	bwa aln -t 24 index $< > $@

I can then do something like

make -j 8 file-{3..111}.sam

which will issue the relevant bwa commands in parallel, limited to eight simultaneous jobs. Another advantage is of course that make won't re-create existing files.

If you want to do this in shell, it's a bit less flexible, but you can get away with

for a in {0..9}; do
  for b in {0..9}; do
     bwa aln -t 24 index file-$a$b.txt > file-$a$b.sam &
  done
  wait
done

where the inner loop spawns jobs in the background (due to &), and wait pauses until each batch of ten jobs has finished before the next ten are launched.
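
The same batching idea carries over to inputs that are not numbered. A sketch, assuming a hypothetical run_one function that stands in for the real bwa call and a batch size passed as the first argument. It is probably wise to keep batch size times the -t value at or below the number of cores:

```shell
#!/usr/bin/env bash
# Run one job per argument, in batches of a fixed size.
# run_one is a hypothetical stand-in; replace its body with the real
# command, e.g.  bwa aln -t 24 index "$1" > "${1%.txt}.sam"
run_one() {
    echo "aligning $1"
}

run_in_batches() {
    local batch_size=$1; shift
    local n=0 f
    for f in "$@"; do
        run_one "$f" &                  # start the job in the background
        n=$((n + 1))
        if [ $((n % batch_size)) -eq 0 ]; then
            wait                        # block until the whole batch is done
        fi
    done
    wait                                # catch the final, partial batch
}

# Example with made-up names: two jobs at a time, then the remaining one.
run_in_batches 2 a.txt b.txt c.txt
```

In practice the file list would come from a glob, as in run_in_batches 10 file-*.txt. Each batch lasts as long as its slowest job.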


Thanks, Ketil. I've been too busy these days to read all your kind and excellent posts here. I actually run my jobs on a cluster, so I can submit many jobs at the same time; each job is automatically assigned to a machine.

12.9 years ago

A variation on Aleksandr's approach:

for f in {3..111}; do
  i=$(printf "%03d" "$f")
  bwa aln -t 24 index "file$i" > "$i.sam"
done

If you end the command with &, every process starts in the background at once, which may overwhelm your server:

for f in {3..111}; do
  i=$(printf "%03d" "$f")
  bwa aln -t 24 index "file$i" > "$i.sam" &
done
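
One way to keep the jobs in the background while capping the load is xargs -P. This is a different technique from the loop above: it keeps a fixed number of jobs running and starts the next one as soon as a slot frees up. A sketch, assuming an xargs that supports -P (GNU and BSD versions do; it is not in POSIX), four concurrent jobs, and a reduced -t value so the total thread count stays the same. The numbers are placeholders:

```shell
#!/usr/bin/env bash
# Keep at most $max_jobs commands running at any time.
# Assumptions: xargs supports -P; file names contain no whitespace or quotes.
max_jobs=4

# Dry run: each job only prints its command.
align_one='echo "bwa aln -t 6 index $1 > ${1#file}.sam"'
# For a real run, use this definition instead:
# align_one='bwa aln -t 6 index "$1" > "${1#file}.sam"'

run_throttled() {
    local first=$1 last=$2 n
    for n in $(seq "$first" "$last"); do
        printf 'file%03d\n' "$n"        # file003, file004, ...
    done | xargs -n 1 -P "$max_jobs" sh -c "$align_one" _
}

run_throttled 3 111
```

Because the jobs finish in no fixed order, the printed lines can appear out of sequence.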
12.9 years ago

Think about using a simple batch system such as SLURM, or the slightly more complicated Sun Grid Engine, for your machine(s) if you are getting into second-generation sequencing analysis, even on a single machine. It is quite liberating to simply throw jobs into a queue and let the batch system deal with the consequences. Naming jobs, deleting them, controlling resource utilization (fewer jobs running during the day, for example), and tracking job progress are all benefits. Of course, you pay a price in added complexity, but we have found it to be worth it for our small group.
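
With a scheduler, the hundreds of commands can collapse into a single array job. A sketch of a SLURM submission script, assuming a hypothetical files.txt with one input name per line and placeholder resource numbers. Each array task picks the line matching its SLURM_ARRAY_TASK_ID. The fallback values and the echo are only there so the script can be tried outside SLURM:

```shell
#!/usr/bin/env bash
#SBATCH --job-name=bwa_aln
#SBATCH --array=1-200%20      # 200 tasks, at most 20 running at once (placeholders)
#SBATCH --cpus-per-task=24

# files.txt is a hypothetical list with one input file per line.
list=files.txt
task=${SLURM_ARRAY_TASK_ID:-1}        # set by SLURM for each array task
threads=${SLURM_CPUS_PER_TASK:-24}    # set by SLURM when --cpus-per-task is given

pick_input() {                        # print line number $2 of file $1
    sed -n "${2}p" "$1"
}

if [ -f "$list" ]; then
    fq=$(pick_input "$list" "$task")
    # Dry run: prints the command this task would execute.
    echo "bwa aln -t $threads index $fq > $(basename "$fq" .fastq.gz).sam"
fi
```

The script would be submitted once with sbatch, and the scheduler then starts the individual tasks as resources become available.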


+1 for the SGE recommendation, it's very useful even for a small group.
