Question: Reproducibility problem for BBduk between local server and HPC
DriesB wrote, 8 weeks ago (Leiden, The Netherlands):

Hi all,

Can anyone explain the differences in output I am getting when running BBduk on my local server and on the High Performance Computing (HPC) cluster?

See my comparative shell script below, including observations.


# These tests were run in the directory $trimming/bbduk/cluster, where results from the HPC are compared with those from the local server.
# In the brace expansion "{,../}", "" thus stands for the HPC copy and "../" for the local-server copy.

## Compare file sizes

ls -l $(realpath ../test_paired_R[12].fq.gz) > compare_ll.txt

ssh user@HPC "ls -l $trimming/bbduk/test_paired_R[12].fq.gz" >> compare_ll.txt

#> Unexpectedly, file sizes differ.
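
# File sizes alone are a weak signal; comparing checksums of the
# decompressed streams sidesteps compression differences entirely
# (a sketch, assuming zcat and md5sum are available on both hosts):
zcat ../test_paired_R1.fq.gz | md5sum > compare_md5.txt
ssh user@HPC "zcat $trimming/bbduk/test_paired_R1.fq.gz | md5sum" >> compare_md5.txt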


## Compare logs

scp user@HPC:$trimming/bbduk/test_paired.log ./

diff --side-by-side --width=$COLUMNS {,../}test_paired.log > diff_logs.txt
#> No differences in the Results section of the logs.



## Perhaps due to zipping?

scp user@HPC:$trimming/bbduk/test_paired_R[12].fq.gz ./

for gz in {,../}test_paired*.fq.gz; do gunzip -v -c "$gz" > "${gz%.gz}" & done; wait
#> Quick parallel decompression; "wait" blocks until all background jobs finish.

ls -l {,../}test_paired_R[12].fq > compare_ll_unzipped.txt
wc -c {,../}test_paired_R[12].fq
#> Now the local and HPC files have the same sizes?!
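
# Equal byte counts do not imply equal content; checksums make any
# difference explicit (a sketch, assuming GNU coreutils md5sum):
md5sum {,../}test_paired_R[12].fq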

less {,../}test_paired*.fq
#> Paging through, the two R1 files look identical to each other, as do the two R2 files (R1 and R2 of course differ from each other).

cmp {,../}test_paired_R1.fq
cmp {,../}test_paired_R2.fq
#> Both the R1 pair and the R2 pair differ, in each case around line 800.

#>> Why do the unzipped files have the same size?!
#>> Why do they differ, and why only so subtly?!
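
# One hypothesis worth testing (a sketch, assuming GNU paste and sort):
# the two files may contain the same reads in a different order.
# Linearise each FASTQ to one read per line, sort, then compare.
for fq in {,../}test_paired_R2.fq; do
    paste - - - - < "$fq" | sort > "${fq%.fq}.sorted.tsv"
done
cmp {,../}test_paired_R2.sorted.tsv
#> If cmp stays silent here, the difference is read order, not read content.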

My colleagues have proposed some explanations for why the gzipped sizes differ between the local server and the HPC, such as different block sizes and different compression options. I'm mostly interested in the difference between the outputs, however. Does BBduk incorporate a non-deterministic component, or is there something else at play here?

If someone has a solution or a proposed analysis for answering this question, please let me know!

Tags: reproducibility bbmap hpc

DriesB replied, 8 weeks ago:

Brian Bushnell and genomax, do you maybe have suggestions for solving this?


ATpoint replied, 8 weeks ago:

Exactly the same version of bbtools on both machines?


DriesB replied, 8 weeks ago:

Yes, 38.79 I think. However, the conda environments differed in other aspects. So the same software version, but not identical conda environments.

genomax answered, 8 weeks ago (United States):

File sizes are never a good measure for comparing datasets. I suggest that you compare the statistics generated by bbduk.sh in both instances; those look to be identical, correct?

Most NGS programs have a non-deterministic component, especially if you are using multiple cores. I would not expect bbduk output to be affected by this, though. You could also add

ordered=t           Set to true to output reads in same order as input.

This will ensure that the output data files are always in the same order and should get compressed to the same extent.
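
For example (a sketch only; the file names and trimming parameters below are placeholders, not your actual command):

bbduk.sh in1=raw_R1.fq.gz in2=raw_R2.fq.gz \
    out1=test_paired_R1.fq.gz out2=test_paired_R2.fq.gz \
    ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo \
    ordered=t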

Finally, this may simply be how data is stored on particular storage systems. Files of the same size may occupy different amounts of storage because of differences in sector size, overheads, etc. Using du --apparent-size filename is one way to get around that.
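
For example:

du -h test_paired_R1.fq.gz                  # blocks allocated on disk
du --apparent-size -h test_paired_R1.fq.gz  # the file's true byte length

The two can legitimately differ between filesystems.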


DriesB replied, 8 weeks ago:

Thank you! Adding ordered=t to the next test run gave identical results on the local server and the HPC.

The compressed file sizes are now also identical. I know identical sizes are no guarantee of identical files, but they were the first thing that caught my attention.
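
A byte-for-byte comparison of the decompressed streams (a sketch, using the same {,../} layout as above with bash process substitution) would confirm it beyond sizes:

cmp <(zcat test_paired_R1.fq.gz) <(zcat ../test_paired_R1.fq.gz)
cmp <(zcat test_paired_R2.fq.gz) <(zcat ../test_paired_R2.fq.gz)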
