Ensure identical output between platforms and operating systems
3
1
Entering edit mode
4 months ago
ATpoint 55k

tl;dr...

This questions is a generalisation of this one and the related GitHub issue:

I am looking for suggestions on what I could investigate to find out what causes differences in R function outputs when running identical code on identical input data with identical software versions on two different machine. Specifically, I am looking for general factors that are not specific to the code of the function I ran.

More specifically:
I was running the R function (sctransform::vst) which runs a variance stabilizing transformation and then reports a per-gene residual variance. I run it on two machines. The first one a Macbook Pro with Mojave and the relevant R package installed into a local user library, and the second one a Skylake node using a Ubuntu-based Singularity image in which I installed the R packages via the renv lock file created from the Macbook user library. Version of R is the same as well. Afaict both input data, software versions and code are identical.

Still I get different outputs differing in the decimal place which have impact on downstream analysis that are based on this.

Please throw me buzzwords on what I could check and investigate to make outputs 100% identical, related to rounding and handling of decimals.

What I checked:

  • I use set.seed() before running the function
  • Machine$double.eps is identical
  • options()$digits is 7 on both machine
  • I set options(scipen=999) on both machine
  • I disabled BLAS and OpenMP implicit multithreading on the Linux node via RhpcBLASctl package

...and please lets not discuss whether decimal differences are important or not etc, this is not the point here ;-)

reproducibility • 708 views
ADD COMMENT
0
Entering edit mode

Maybe the source of the issue(s) is Singularity? So something like what's described in this thread.

ADD REPLY
0
Entering edit mode

I could build the packages outside of the container directly on the remote system to check that, will try.

ADD REPLY
0
Entering edit mode

If you are using R, you have access the source code for the function being run (and any descendant functions called downstream), so it should be straightforward (with some digging) to check if any functions related to random sampling are not using the seed value you are manually applying.

ADD REPLY
4
Entering edit mode
4 months ago
Mensur Dlakic ★ 15k

I don't know exactly about the kind of analysis you are doing, but in many machine learning applications setting the random seed once will not do the trick. For example, in Keras setting the NumPy random seed is not enough, as it uses many other libraries which have their own random seeds. I don't remember exactly since it has been several years, but for Keras one has to set a NumPy, a TensorFlow and at least one other random seed to even hope for reproducibility. Not to mention that this has to be run using a single thread, which is not worth it for most ML applications.

Below is an example of a t-SNE plot that was done on the same dataset and on the same machine, maybe a month apart. By design they do not use the same random seed. It is normal in all embedding runs that their plots will not be identical, but usually they are related by some kind of simple rotation. The two plots were aligned afterwards according to the global transformation, but there are still parts that need to be rotated or flipped locally in order to overlap. For example, the part above 150 on the Y-axis needs to rotate by ~30 degrees clock-wise to overlap, and the part between 100-150 on the Y-axis needs to flip 180 around the vertical axis. Yet there is no doubt that these two embeddings are identical when it comes to subsequent clustering, which is all that matters to me.

There are too many packages involved in calculation of these embeddings, and some may use their own random seeds explicitly. C'est la vie.

enter image description here

ADD COMMENT
3
Entering edit mode
4 months ago
Mensur Dlakic ★ 15k

Random seeds are different?

I have two identical Linux computers that were purchased and put into production a week apart. I saved all commands used to install python, R and ML packages on the first machine, and repeated them verbatim on the second. I update them in parallel, and reboot as well. Whenever I install a package one one of them, I run the same commands on the other. For 3-4 months they appeared identical in all respects, but after a while a package would install without a hiccup on one but not the other. A year-plus into their existence I know for a fact that they are no longer identical even though I still update them simultaneously. The reason I am writing all this is to emphasize that it is almost impossible that two different computers will have identical configurations unless they are literally booted for the first time. In your case that's almost a guarantee because yours could only be identical in terms of software configuration and not in any other way. I have accepted this as a fact: no matter how hard I try to keep the two configurations identical, I have to accept that they are only near-identical.

ADD COMMENT
3
Entering edit mode

Are the servers using the same time server/NTP to keep their time in sync? RTC/System clocks in sync?

ADD REPLY
1
Entering edit mode

Yes, the clock sync happens at the same time in both of them as specified in root's cron file.

00 12 * * 1 ntpdate pool.ntp.org >& /dev/null
ADD REPLY
0
Entering edit mode

Thanks, I can rule out seeds as I use a fixed sed.seed() at least before running the function, just edited that to the question. The function itself calls several subfunctions including some written in C which is then beyond my skill to check the source code. Maybe in the C part there is seeding I cannot control explicitely. Yes, what you say is what I also heard from others in the chat, probably this is what we have to accept. That should probably lead to the imperative to run a given analysis with different parameters and then come up with some kind of method to create a "consensus call", e.g. for clustering single-cell (or similar) data that are strongly influenced by all kinds of parameters, to be sure results are robust.

ADD REPLY
0
Entering edit mode
4 months ago
ATpoint 55k

Ok, I followed the advise of @Alex Reynolds and started dissecting the function and indeed the first float/decimal discrepancy is right at the beginning when the function calculates geometric means from a sparse matrix.

=> this is the relevant line in the R function: https://github.com/ChristophH/sctransform/blob/4342164248424d923a910b5628533db91eed8c17/R/vst.R#L180-L187 (eps is hardcoded 1)

I did use identical set.seed() before calling this function on both machines.

=> this then calls the function row_gmean here: https://github.com/ChristophH/sctransform/blob/4342164248424d923a910b5628533db91eed8c17/R/utils.R#L73-L83

=> and since input is a sparse matrix this then eventually calls this Rcpp function: https://github.com/ChristophH/sctransform/blob/4342164248424d923a910b5628533db91eed8c17/src/RcppExports.cpp#L35-L44

// row_gmean_dgcmatrix
NumericVector row_gmean_dgcmatrix(S4 matrix, double eps);
RcppExport SEXP _sctransform_row_gmean_dgcmatrix(SEXP matrixSEXP, SEXP epsSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< S4 >::type matrix(matrixSEXP);
    Rcpp::traits::input_parameter< double >::type eps(epsSEXP);
    rcpp_result_gen = Rcpp::wrap(row_gmean_dgcmatrix(matrix, eps));
    return rcpp_result_gen;
END_RCPP
}

As my Cpp knowledge is non-existing, can someone tell me whether in this function there are elements that could explain differences in decimals when calculating these geomeans on different machines? Or is this maybe a difference in decimal precision I cannot do anything about?

ADD COMMENT
0
Entering edit mode

At this early stage, to what degree of precision are the results on the two hosts different?

Are you getting entirely different answers, or are differences at the least significant bits of the mantissa?

If you extract and break down arithmetic operations in row_gmean step by step, what operation causes noticeable differences?

Is R being compiled the same way on both hosts? My expectation would be no, especially if using package managers to install R.

Differences in compiler flags at compile-time could potentially result in differences in the intermediate precision of operations on floating-point values.

For example, when one host adds two double values, the underlying code could temporarily store the addition result in a representation with different precision or application of different rounding behavior. That same operation on another host could use other precision or rounding behavior.

Some general discussion here: https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/ and here: https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/

Floating-point arithmetic can be difficult. As other differences go, operations on very small and very large floating-point values can quickly lead to errors. John Cook discusses this problem here, in the context of calculating variance by three different methods: https://www.johndcook.com/blog/2008/09/28/theoretical-explanation-for-numerical-results/

I don't know how deep you would want to dig, but looking at the assembly code level for the operations underlying this should show if and where there are differences, on a per-host basis. If you are compiling R by hand, you can export assembly to inspect manually.

ADD REPLY

Login before adding your answer.

Traffic: 2416 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6