Question: BWA Option That Stops After N Minutes
2184687-1231-83- wrote, 7.6 years ago (3 votes):

I would like to know whether it is possible (and if not, to propose it) to run a bwa alignment for N minutes over a list of input sequences, producing output up to the point where N minutes have been spent aligning the initial fraction of the query sequences. For example:

bwa bwasw -s 10 -N 120 target.prefix query.fa

This command would process query.fa 10 sequences at a time and produce their output alignments until about 120 minutes had been spent on the computation. If that isn't long enough for the whole file, only a fraction of the sequences at the beginning of query.fa would be processed, and the rest would be skipped.

It may be that this can already be achieved by clever piping of the input data in and out. Any ideas?

Edit: the doalarm solution doesn't seem to work for me, since it returns no sequences if it's stopped before finishing. See below:

I ran it here with different $sec values; from 190 seconds onwards it gives all results for the 556 query sequences, and before that, nothing:

~/src/doalarm/doalarm-0.1.7/doalarm $sec \
~/src/bwa/latest/bwa/bwa bwasw -z 1000 -s 10 \
reference_genome.fa.gz querysequences.fa > /tmp/ex.$sec.sam

  0  /tmp/ex.100.sam
  0  /tmp/ex.110.sam
  0  /tmp/ex.120.sam
  0  /tmp/ex.130.sam
  0  /tmp/ex.140.sam
  0  /tmp/ex.150.sam
  0  /tmp/ex.160.sam
  0  /tmp/ex.170.sam
  0  /tmp/ex.180.sam
  0  /tmp/ex.182.sam
  0  /tmp/ex.184.sam
  0  /tmp/ex.186.sam
 142765  /tmp/ex.190.sam
 142765  /tmp/ex.200.sam
Tags: feature, bwa • 2.8k views
modified 7.6 years ago by Damian Kao • written 7.6 years ago by 2184687-1231-83-

I am confused: why did you put a bounty? Did the alarm version by DK not work? Did you test it?

written 7.6 years ago by Michael Dondrup

@Michael Dondrup: It doesn't seem to work for me, see edit.

written 7.6 years ago by 2184687-1231-83-

I think what we have seen so far is that bwa (unlike, e.g., samtools) is not Unix-pipe (|) friendly, in the sense that you cannot really chunk the processing.

written 7.6 years ago by Michael Dondrup

avilella, how variable is the time needed to map a sequence? Can you make an empirical table that tells you that, say, mapping x sequences takes less than y minutes 95% of the time?

written 7.6 years ago by xenophiliuslovegood
iw9oel_ad wrote, 7.6 years ago (6 votes):

I would not try to interrupt BWA, but rather create a script to feed the sequences to BWA for n minutes:

Create a Perl script that writes sequences to STDOUT for n minutes, e.g. using Perl's alarm as other answers have suggested. When you do this, catch the signal (or use some other mechanism) to ensure that the last sequence is written out completely.

Create a FIFO using mkfifo as a query input for BWA.

mkfifo /tmp/my_query_fifo
./my_query_producer.pl -sec 7200 > /tmp/my_query_fifo

Elsewhere

bwa bwasw my_reference_prefix /tmp/my_query_fifo

When the producer exits, BWA will exit cleanly after writing its results.
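A minimal sketch of such a producer in plain shell, stopping only at a record boundary so the last sequence is always complete. GNU awk is assumed for systime(), and all file names are placeholders:

```shell
# time_limited_cat: copy FASTA records to stdout until the budget
# (in seconds) is spent, stopping ONLY on a header line so the last
# record written is complete. GNU awk is assumed for systime().
time_limited_cat() {    # usage: time_limited_cat BUDGET_SECONDS FILE
    gawk -v budget="$1" -v start="$(date +%s)" '
        /^>/ && (systime() - start) > budget { exit }  # stop between records
        { print }
    ' "$2"
}

# Used with the FIFO from the recipe above (placeholder names):
#   mkfifo /tmp/my_query_fifo
#   time_limited_cat 7200 query.fa > /tmp/my_query_fifo &
#   bwa bwasw my_reference_prefix /tmp/my_query_fifo
```

For FASTQ input the boundary test would check every fourth line rather than matching /^>/.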


Okay, here are my tests. First, note that the bwasw chunk-size parameter is not settable from the command line in bwa-0.6.1. Where you see chunks smaller than 10^7 in the following output, it's because I've added this line to bwtsw2main.c:

case 'C': opt->chunk_size = atoi(optarg); break;

This gives an extra option -C <INT> to set the chunk size. This is just for my convenience; the test still works on vanilla bwa, but you will see different numbers.

In a shell:

mkfifo wibble
bwa bwasw -C 10000 c1215.fasta wibble > out.sam

In my Emacs running sbcl under Slime:

(asdf:load-system :cl-genomic)
(in-package :bio-sequence-user)

(defparameter *f* (open "/Users/keith/wibble" :direction :output
                        :if-exists :append))

(with-seq-input (seqi "/Users/keith/c1215_reads_1.fastq" :fastq)
  (loop
     while (has-more-p seqi)
     do (write-fastq-sequence (next seqi) *f*)))

In the first shell I now see:

[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...

and so on. I can repeat the Lisp command several times, each time getting more alignments. Finally I can do

(close *f*)

and in the first shell I see

[bsw2_aln] read 276 sequences/pairs (9660 bp)...
[main] Version: 0.6.1-r104
[main] CMD: bwa bwasw -C 10000 c1215.fasta wibble
[main] Real time: 691.224 sec; CPU: 1.208 sec

that is, the incomplete last chunk being processed.

modified 7.6 years ago • written 7.6 years ago by iw9oel_ad

This looks good too, but it depends on the implementation. If the program starts processing while reading chunks of the input, this can work. If, however, the program tries to slurp in all input data (reading from the FIFO until EOF is encountered) before starting to process, it will not have the desired effect. I think this approach is good for a test.

written 7.6 years ago by Michael Dondrup

Good point. I tested this, and bwa is sensible about the way it processes the queries: bsw2_aln is called approx. every 10 Mb of queries (I used a Lisp process on the other end of the FIFO to send blocks of FASTQ data interactively and observed the results).

written 7.6 years ago by iw9oel_ad

@Keith James: can you post the tests you've done? Also, regarding the "approx every 10 Mb of queries": does that change with a smaller -s option?

written 7.6 years ago by 2184687-1231-83-

What version of bwa are you using? On 0.6.1 the -s flag is "maximum seeding interval size".

written 7.6 years ago by iw9oel_ad

Okay, tests added...

written 7.6 years ago by iw9oel_ad

I'd give another +1 for impressive use of emacs-lisp!

written 7.6 years ago by Michael Dondrup

Most of the work is done in sbcl (Steel Bank Common Lisp - http://www.sbcl.org). Slime is an environment that lets Emacs talk to external Lisps, including Common Lisps like sbcl and other Lisps like Clojure.

written 7.6 years ago by iw9oel_ad
Damian Kao (USA) wrote, 7.6 years ago (3 votes):

You could do it with a shell command. Here is a shell script that will send a kill signal after a set amount of time: http://www.cyberciti.biz/faq/shell-scripting-run-command-under-alarmclock/

Why exactly do you want to do this, though? Are you renting time on someone else's server? There is really no guarantee that the sequences that take longest to process will be at the beginning. You could try sorting your sequences so the longest ones come first in the file, but even that is no guarantee, depending on the complexity of the sequences.

modified 6.8 years ago by Istvan Albert • written 7.6 years ago by Damian Kao

I was first thinking about kill, but that is not a good idea. Killing the process might result in incomplete write operations on the output, e.g. files not closed properly, missing output, etc.

written 7.6 years ago by Michael Dondrup

No, it might be worth a try (+1). alarm in Perl actually sends a SIGALRM after the timeout, which is not a kill signal, so everything depends on bwa handling it correctly (closing files, flushing buffers). One could achieve the same with kill -14 <procid> on the command line, I suppose.

written 7.6 years ago by Michael Dondrup
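For what it's worth, that timed SIGALRM can also be sent with the GNU coreutils timeout command, which accepts a signal name instead of its default SIGTERM. Here sleep stands in for the long bwa run; whether bwa flushes its buffers on SIGALRM is exactly the open question above:

```shell
# timeout sends the given signal when the limit expires; ALRM is signal 14.
# `sleep 10` is a stand-in for e.g. `bwa bwasw target.prefix query.fa`.
timeout --signal=ALRM 2 sleep 10
echo "exit status: $?"    # 124 means the time limit was reached
```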

Yeah, that's true. Perhaps bin the reads into smaller files and run them one after another, killing the entire process after 120 minutes. Then discard the last bin of reads, because its output can be incomplete, and merge the other alignment files.

written 7.6 years ago by Damian Kao
Michael Dondrup (Bergen, Norway) wrote, 7.6 years ago (1 vote):

I could imagine the following script:

  • split the input file into chunks of appropriate size (e.g. 1,000-10,000 sequences)
  • run the chunks through bwa in a loop
  • if the time is used up, exit and concatenate the output

This is not very efficient, I think, and it might have other side effects (e.g. multiple SAM headers that need to be removed). So I would also question why you need this feature. Could using the -t parameter, which allows parallel threads to decrease wall-clock time, be an alternative?
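A rough, self-contained sketch of that loop. The align function is a hypothetical stand-in for the real `bwa bwasw target.prefix chunk` call (it emits fake SAM so the control flow can be run end to end), and the toy query file is a placeholder:

```shell
#!/bin/sh
cd "$(mktemp -d)"
printf '@r1\nACGT\n+\nIIII\n' > query.fq     # toy input (one FASTQ record)

# Stand-in for: bwa bwasw target.prefix "$1"  -- emits fake SAM
align() {
    printf '@HD\tVN:1.4\n'
    awk 'NR % 4 == 1 { sub(/^@/, ""); print $0 "\tfake-alignment" }' "$1"
}

budget=7200                          # wall-clock seconds allowed
start=$(date +%s)
split -l 4000 query.fq chunk_        # 1000 FASTQ records per chunk

for c in chunk_??; do                # stop aligning once the budget is spent
    [ $(( $(date +%s) - start )) -ge "$budget" ] && break
    align "$c" > "$c.sam"
done

# Merge, keeping the SAM header (@ lines) from the first chunk only.
first=1
for s in chunk_??.sam; do
    if [ "$first" -eq 1 ]; then cat "$s"; first=0
    else grep -v '^@' "$s"; fi
done > merged.sam
```

With real data you would drop the toy input and swap align for the actual bwa call; note that every chunk then re-loads the reference index, which is the inefficiency mentioned above.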

written 7.6 years ago by Michael Dondrup

Agreed, try the alarm way then. If bwa is written in C and handles POSIX signals, it could work.

written 7.6 years ago by Michael Dondrup

Thanks @Michael Dondrup for the suggestion. I don't have hard data on this, but I think bwa bwasw would spend some initial time loading the indexed target genome on every run, so, as you say, it would not be efficient to do this over and over for each chunk. Maybe someone has numbers on how long it takes to load the human genome, up to the point where the first sequence is queried...

written 7.6 years ago by 2184687-1231-83-
ALchEmiXt (The Netherlands) wrote, 7.6 years ago (1 vote):

Reading the other people's comments, I agree that the reason WHY you want to do this is rather unclear. A proper, better-fitting solution would come up if you could elaborate a bit more on its exact purpose.

My approach (assuming FASTQ, but it would work for FASTA with some modifications) would be to chunk the input FASTQ file into groups of 4 lines (one sequence plus quality), pipe each group to bwa for alignment, and capture its output under an incremental file name. Loop over the complete input file and merge all SAM files afterwards. If one alignment fails, halts, or takes forever, you only lose the sequence currently being processed. You could even use the bash timing functions to impose a time limit.

Setting a 'trap { cmd } EXIT' in bash, with the subsequent command/script (cmd) merging and cleaning up/moving files, will help in any (remote) recovery. If wanted, you can regain speed using GNU parallel (you can get it from gnu.org, and it's really powerful). Parallelisation is what you lose by processing all sequences sequentially, I guess.
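A minimal sketch of that trap idea, where merge_and_cleanup is a hypothetical stand-in for the real merge/cleanup step:

```shell
#!/bin/sh
# Run the merge/cleanup step no matter how the script exits
# (normal end, Ctrl-C, or a time limit killing the loop).
merge_and_cleanup() {
    echo "merging partial results and cleaning up..."   # stand-in
}
trap merge_and_cleanup EXIT

echo "aligning chunks..."   # the chunked alignment loop would run here
```

Because the trap fires on EXIT, any partial per-chunk output produced before an interruption still gets merged.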

My 2ct

written 7.6 years ago by ALchEmiXt
earonesty (United States) wrote, 6.8 years ago (1 vote):

If you're piping output to bwasw ... in a read-producer/consumer model, you should also use the Unix "buffer" command.

We ran into this issue while developing a pipeline in which we were generating a continuous stream of long reads that we wanted alignment-filtered in real time. bwasw is great for this, but it has to be buffered for paired-end reads, since it reads a "chunk" of reads from read 1 and then another "chunk" from read 2.

EXAMPLE of using a buffer to accelerate pipes:

mkfifo buffer.in buffer.out
buffer -i buffer.in -o buffer.out &
head -1000 query.fq > buffer.in
bwa bwasw -s 10 -N 120 hg19.prefix buffer.out > output.250.sequences
modified 6.3 years ago • written 6.8 years ago by earonesty
Powered by Biostar version 2.3.0