bwa aln calculate SA coordinate issue
9.2 years ago
Crystal ▴ 70

Hi All,

I'm using bwa to align my metagenomic data to a single bovine DNA sequence. I indexed the bovine sequence and ran

bwa aln index_file input.fastq >output.fastq

It did give me the sai file, but it took a really long time to finish the whole process (~8 hours per fastq file).

I have used bwa aln several times before, and it never took this long.

This is what I saw:

[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 310.66 sec
[bwa_aln_core] write to the disk... 0.06 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 294.22 sec
[bwa_aln_core] write to the disk... 0.05 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 289.02 sec
[bwa_aln_core] write to the disk... 0.05 sec

It seems most of the time was spent on calculating SA coordinates (~5 minutes per batch of 262144 reads), but in the past it only took ~0.3 sec per batch.
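
To put a number on where the time goes, the per-batch timings can be summed from the log. This is only a minimal sketch: it assumes the log lines look exactly like the ones above and that bwa's progress messages (which go to stderr) were saved to a file, e.g. with `2> aln.log`. Here a temporary file is filled with the three timing lines from this post instead.

```shell
# Temporary stand-in for a saved bwa log (hypothetical file; in practice
# capture stderr with something like: bwa aln ... 2> aln.log).
log=$(mktemp)
cat > "$log" <<'EOF'
[bwa_aln_core] calculate SA coordinate... 310.66 sec
[bwa_aln_core] write to the disk... 0.06 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 294.22 sec
[bwa_aln_core] write to the disk... 0.05 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 289.02 sec
[bwa_aln_core] write to the disk... 0.05 sec
EOF

# The seconds value is the second-to-last field on each "calculate SA coordinate" line.
summary=$(awk '/calculate SA coordinate/ { total += $(NF-1); n++ }
               END { printf "%d batches, %.2f sec total, %.2f sec per batch", n, total, total / n }' "$log")
echo "$summary"
rm -f "$log"
```

For the three batches shown, this comes to about 298 sec per batch, so essentially all of the run time is in the SA coordinate step rather than in writing to disk.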

What could be causing this (memory? the computer's RAM size? the internet connection?)


Have you tried using more than one thread?

Hi Devon,

No, I never tried that before because it worked pretty fast in the past.

How can I specify this on the command line?


9.1 years ago
mark.ziemann ★ 1.9k

It might be running slower because you're querying a larger genome.

The bwa aln help page says that the -t option lets you use more than one processor.

On a 4 core machine:

bwa aln -t 4 index_file input.fastq >output.fastq

This should run roughly 3-4 times faster than with a single thread.
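
If you don't want to hard-code the number of cores, something like the following might work. It is only a sketch: index_file and input.fastq are the placeholder names from this thread, output.sai is a name I picked, and it assumes `nproc` (or `getconf` as a fallback) is available on your system. The command is printed rather than run so you can check it first.

```shell
# Use every available core; fall back to 1 if the count cannot be determined.
threads=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# -t sets the thread count and -f names the output file (both listed in the usage text).
# Remove the leading "echo" to actually run the alignment.
echo bwa aln -t "$threads" -f output.sai index_file input.fastq
```

Using -f instead of redirecting stdout just keeps the .sai output separate from anything printed to the terminal; the redirect form works as well.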

Usage:   bwa aln [options] <prefix> <in.fq>

Options: -n NUM    max #diff (int) or missing prob under 0.02 err rate (float) [0.04]
         -o INT    maximum number or fraction of gap opens [1]
         -e INT    maximum number of gap extensions, -1 for disabling long gaps [-1]
         -i INT    do not put an indel within INT bp towards the ends [5]
         -d INT    maximum occurrences for extending a long deletion [10]
         -l INT    seed length [32]
         -k INT    maximum differences in the seed [2]
         -m INT    maximum entries in the queue [2000000]
         -t INT    number of threads [1]
         -M INT    mismatch penalty [3]
         -O INT    gap open penalty [11]
         -E INT    gap extension penalty [4]
         -R INT    stop searching when there are >INT equally best hits [30]
         -q INT    quality threshold for read trimming down to 35bp [0]
         -f FILE   file to write output to instead of stdout
         -B INT    length of barcode
         -L        log-scaled gap penalty for long deletions
         -N        non-iterative mode: search for all n-difference hits (slooow)
         -I        the input is in the Illumina 1.3+ FASTQ-like format
         -b        the input read file is in the BAM format
         -0        use single-end reads only (effective with -b)
         -1        use the 1st read in a pair (effective with -b)
         -2        use the 2nd read in a pair (effective with -b)
         -Y        filter Casava-filtered sequences

