Reproducibility problem for BBduk between local server and HPC
20 months ago
DriesB ▴ 90

Hi all,

Can anyone explain the differences in output I am getting when running BBduk on my local server and on the High Performance Computer?

See my comparative shell script below, including observations.

# These tests were run in directory $trimming/bbduk/cluster, where results from HPC are compared to local server.
# In "{,../}", "" thus stands for HPC and "../" stands for local server.

## Compare file sizes

ls -l $(realpath ../test_paired_R[12].fq.gz) > compare_ll.txt

ssh user@HPC "ls -l $trimming/bbduk/test_paired_R[12].fq.gz" >> compare_ll.txt

#> Unexpectedly, file sizes differ.

## Compare logs

scp user@HPC:$trimming/bbduk/test_paired.log ./

diff --side-by-side --width="$COLUMNS" {,../}test_paired.log
#> No change in Results.

## Perhaps due to zipping?

scp user@HPC:$trimming/bbduk/test_paired_R[12].fq.gz ./

for gz in {,../}test_paired*.fq.gz; do gunzip -v -c "$gz" > "${gz%.gz}" & done; wait
#> Quick unzipping script; "wait" blocks until all background gunzip jobs have finished.

ll {,../}test_paired_R[12].fq > compare_ll_unzipped.txt
wc -c {,../}test_paired_R[12].fq
#> Now all have same size?!

less {,../}test_paired*.fq
#> R1s look identical, but differ from R2s.

cmp {,../}test_paired_R1.fq
cmp {,../}test_paired_R2.fq
#> Both R1s and R2s differ within group around line 800.

#>> Why do they have the same size?!
#>> Why do they differ, but differ so subtly?!

My colleagues have proposed some explanations for the compressed file sizes differing between the local server and the HPC, such as different block sizes and different compression options. However, I'm mostly interested in the difference between the outputs themselves. Does BBduk incorporate a non-deterministic component, or is there something else at play here?

If someone has a solution or a proposed analysis for answering this question, please let me know!

reproducibility HPC bbmap • 432 views

Brian Bushnell and genomax, do you maybe have suggestions for solving this?


Exactly the same version of bbtools on both machines?


Yes, 38.79 I think. However, the conda environments differed in other respects. So same software version, but not identical conda environments.

20 months ago
GenoMax 109k

File sizes are never a good measure for comparing datasets. I suggest that you compare the statistics generated in both instances, which look to be identical, correct?

Most NGS programs have a non-deterministic component, especially if you are using multiple cores: reads are processed in parallel and may be written out in a different order from run to run. I would not expect the content of the bbduk output to be affected by this, though. You could also add

ordered=t           Set to true to output reads in same order as input.

This will ensure that the reads in the output files are always in the same order, so the files should also compress to the same extent.
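To check whether two uncompressed outputs really contain the same reads in a different order, something along these lines may help. It is only a sketch: the helper name `fastq_sorted_md5` and the demo files are made up for illustration, it assumes plain 4-line FASTQ records, and it relies on GNU coreutils (`md5sum`).

```shell
#!/usr/bin/env bash
# Sketch: do two FASTQ files hold the same records, just in a different order?
# Assumes 4-line FASTQ records (no wrapped sequence lines).
set -euo pipefail

# Hypothetical helper: checksum of the records that ignores record order.
fastq_sorted_md5() {
    # paste joins each 4-line record into one tab-separated line; sort fixes the order.
    paste - - - - < "$1" | LC_ALL=C sort | md5sum | cut -d' ' -f1
}

# Demo data (made up): the same two reads written in opposite order.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$tmp/hpc.fq"
printf '@r2\nTTGA\n+\nIIII\n@r1\nACGT\n+\nIIII\n' > "$tmp/local.fq"

# Byte-for-byte comparison, as cmp does in the question.
if cmp -s "$tmp/hpc.fq" "$tmp/local.fq"; then
    byte_identical=yes
else
    byte_identical=no
fi

# Order-independent comparison.
if [ "$(fastq_sorted_md5 "$tmp/hpc.fq")" = "$(fastq_sorted_md5 "$tmp/local.fq")" ]; then
    same_records=yes
else
    same_records=no
fi

echo "byte-identical: $byte_identical"
echo "same records:   $same_records"
rm -rf "$tmp"
```

If the files are not byte-identical but do hold the same records, the difference is one of read order only, which would also explain files of equal size that fail cmp.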

Finally, this may simply be how data is stored on particular storage systems. Files of the same size may occupy different amounts of storage because of differences in sector size, overheads, etc. Using du --apparent-size filename is one way to get around that.
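A small illustration of the difference between allocated space and apparent size, assuming GNU du (the `--apparent-size` and `--block-size` options are GNU-specific) and a throwaway demo file:

```shell
#!/usr/bin/env bash
# Sketch: allocated disk space vs. apparent size of the same file (GNU du).
set -euo pipefail

tmp=$(mktemp -d)
printf 'ACGT\n' > "$tmp/tiny.fq"   # demo file with 5 bytes of content

# Space allocated on disk; depends on the filesystem's block size and overheads.
on_disk=$(du --block-size=1 "$tmp/tiny.fq" | cut -f1)

# Apparent size: the number of bytes in the file, independent of the filesystem.
apparent=$(du --apparent-size --block-size=1 "$tmp/tiny.fq" | cut -f1)

# wc -c counts the content bytes and should agree with the apparent size.
bytes=$(wc -c < "$tmp/tiny.fq")

echo "allocated on disk: $on_disk bytes"
echo "apparent size:     $apparent bytes"
echo "wc -c:             $bytes bytes"
rm -rf "$tmp"
```

The allocated figure can differ between the local server and the HPC even for identical files, while the apparent size should not.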


Thank you! Adding ordered=t to the next test gave identical results on the local server and the HPC.

The compressed file sizes are now also identical. I know that identical sizes are no guarantee of identical files, but the size difference was the first thing that caught my attention.
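For a stronger check than file size, the decompressed content can be checksummed, which sidesteps differences in compression settings. A minimal sketch with made-up demo files (the names hpc.fq.gz and local.fq.gz are placeholders, not the real outputs), assuming gzip and GNU md5sum are available:

```shell
#!/usr/bin/env bash
# Sketch: compare gzipped files by the checksum of their decompressed content.
set -euo pipefail

tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/reads.fq"

# Compress the same reads with different compression levels (demo only).
gzip -1 -c "$tmp/reads.fq" > "$tmp/hpc.fq.gz"
gzip -9 -c "$tmp/reads.fq" > "$tmp/local.fq.gz"

# Checksum what comes out of decompression, not the .gz files themselves.
sum_hpc=$(gzip -dc "$tmp/hpc.fq.gz" | md5sum | cut -d' ' -f1)
sum_local=$(gzip -dc "$tmp/local.fq.gz" | md5sum | cut -d' ' -f1)

if [ "$sum_hpc" = "$sum_local" ]; then
    echo "decompressed content is identical"
else
    echo "decompressed content differs"
fi
rm -rf "$tmp"
```

Matching checksums here mean the reads and their order are the same, whatever the compressed sizes are.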

