bwa mem runs slowly the first time
7.5 years ago

Hi there,

I'm using bwa mem for alignment but I noticed this behavior which I don't fully understand:

I created large AWS EC2 instances (60 GB RAM and 32 cores), installed bwa, downloaded the human reference (about 3 GB), and indexed it. The bwa command is very straightforward: bwa mem human_ref.fasta input.fq

What happens is that the first time I run the command, it takes a long time. It doesn't write anything to stdout or stderr, but I noticed that it is loading something into RAM (I think it's the reference index). After it loads 5 GB or so, bwa runs fast enough given the small input size (on the order of megabytes). This only happens the first time I run the command on the machine; later runs don't take as long and finish reasonably fast.

Is this normal? Why doesn't bwa take as long after the first run?

Any help is appreciated.

Thanks,
Shazly

bwa mem slow ec2 alignment • 5.5k views

I think you are right that the reference index is being loaded into memory. Also, I seem to recall that the way bwa memory-maps the index allows it to be reused in subsequent runs. That's why things get faster after the first run. If you fill your memory with something else between two runs, your second run should be slow, too.

7.5 years ago
donfreed ★ 1.6k

Before aligning reads, bwa must generate an index file (an FMD-index of the reference genome). The first time you run the command, the index is generated but subsequent runs can use the previously generated index file.


Thanks for the reply, but is this file saved in a tmp directory or something? I didn't find it in either the reference directory or the working directory.


The index files should be in the same directory as the reference. They should have the same base as the reference, but should also have additional extensions.

For example, if your genome is human_g1k_v37.fasta, bwa will generate human_g1k_v37.fasta.bwt and additional files.
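To check whether the index is already in place, you can look for the files that `bwa index` writes next to the reference. This is only a minimal sketch: it assumes the five usual extensions (.amb, .ann, .bwt, .pac, .sa), and the function name is made up for illustration.

```shell
# Hypothetical helper: print the bwa index files that are missing (or empty)
# for a given reference. Assumes the usual `bwa index` outputs:
# <ref>.amb, <ref>.ann, <ref>.bwt, <ref>.pac, <ref>.sa
missing_bwa_index_files() {
    ref="$1"
    missing=""
    for ext in amb ann bwt pac sa; do
        # -s is true only if the file exists and is non-empty
        if [ ! -s "${ref}.${ext}" ]; then
            missing="${missing}${missing:+ }${ref}.${ext}"
        fi
    done
    # Prints an empty line when nothing is missing
    printf '%s\n' "$missing"
}
```

Calling `missing_bwa_index_files human_g1k_v37.fasta` prints a space-separated list of the missing files, or an empty line if all five are present.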


Actually, I incorrectly assumed BWA would generate the index files automatically if they are not present. I just checked and it will not, so I have no idea why BWA would run more slowly for the first run and more quickly on subsequent runs.


Your data is probably cached somewhere after it is read for the first time. The second time you access the data, it is retrieved from the cache rather than from your hard disk, and is thus faster.
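If caching is indeed the explanation, reading the index files once before the first alignment should move the waiting time out of bwa. The sketch below is an assumption-laden illustration, not a tested recipe for any particular setup: it assumes the index files sit next to the reference with the usual extensions, that the operating system has enough free memory to keep them cached, and the function name is hypothetical.

```shell
# Hypothetical helper: read each bwa index file once so the operating
# system's file cache can hold it before the first bwa run.
# Assumes index files named <ref>.amb, .ann, .bwt, .pac, .sa.
warm_bwa_index() {
    ref="$1"
    count=0
    for ext in amb ann bwt pac sa; do
        f="${ref}.${ext}"
        if [ -f "$f" ]; then
            # Reading to /dev/null discards the data; the point is only
            # that the file contents pass through the cache.
            cat "$f" > /dev/null
            count=$((count + 1))
        fi
    done
    # Report how many index files were read
    echo "$count"
}
```

For example, `warm_bwa_index human_ref.fasta` followed by `bwa mem human_ref.fasta input.fq`. Whether the files actually stay cached depends on available memory and on the file system, so the effect may differ on network or cluster storage.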

7.5 years ago

Hi, I can confirm what the OP describes and what @thackl suggests in his/her comment:

Aligning a dummy sequence file with one read to the mouse reference genome:

# First run:
time bwa mem /lustre/.../Mus_musculus_NCBI_v37/mmu.fa test.fa
...
real    0m8.466s
user    0m0.127s
sys    0m6.410s

# Second run
time bwa mem /lustre/.../Mus_musculus_NCBI_v37/mmu.fa test.fa
...
real    0m2.282s
user    0m0.116s
sys    0m2.129s

# Third run:
time bwa mem /lustre/.../Mus_musculus_NCBI_v37/mmu.fa test.fa
...
real    0m2.169s
user    0m0.100s
sys    0m2.041s

I tried on a couple of different nodes and the picture stays the same: the first run is ~4x slower than the following runs.
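The repeated timing above can be scripted. The following is a rough sketch rather than a proper benchmark: the function name is made up, and it uses `date +%s`, so it only has one-second resolution (the shell's `time`, as used above, is more precise).

```shell
# Hypothetical helper: run a command N times and print the wall-clock
# seconds each run took. Output of the command itself is discarded.
time_runs() {
    n="$1"
    shift
    i=1
    while [ "$i" -le "$n" ]; do
        start=$(date +%s)
        "$@" > /dev/null 2>&1
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        i=$((i + 1))
    done
}
```

Usage would look like `time_runs 3 bwa mem human_ref.fasta input.fq`, with the reference and read file names taken from the question; if caching is the cause, the first line should show the largest number.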
