Cromwell slurm singularity call caching
asmariyaz23 • 4 weeks ago

Hello

I have been using Singularity within my Cromwell configuration and now want to enable call caching. None of the cluster nodes has internet access. However, call caching still appears to be turned off. Could this be because Cromwell cannot pull the Docker image (since there is no internet on any of the nodes)?
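
As far as I can tell, these are the two pieces of the config involved (just the combination I currently have, copied from the full conf below, not a known-good recipe):

    call-caching {
      enabled = true
      invalidate-bad-cache-results = true
    }

    docker {
      hash-lookup {
        # "remote" looks up hashes in a registry and "local" asks the local docker daemon;
        # neither is possible on the air-gapped nodes, so I disabled the lookup.
        enabled = "false"
      }
    }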

Here is my full cromwell.conf:

# This line is required. It pulls in default overrides from the embedded cromwell
# `reference.conf` (in core/src/main/resources) needed for proper performance of cromwell.
include required(classpath("application"))

# Cromwell HTTP server settings
webservice {
  #port = 8000
  #interface = 0.0.0.0
  #binding-timeout = 5s
  #instance.name = "reference"
}

# Cromwell "system" settings
system {
  # If 'true', a SIGINT will trigger Cromwell to attempt to abort all currently running jobs before exiting
  #abort-jobs-on-terminate = false

  # If 'true', a SIGTERM or SIGINT will trigger Cromwell to attempt to gracefully shutdown in server mode,
  # in particular clearing up all queued database writes before letting the JVM shut down.
  # The shutdown is a multi-phase process, each phase having its own configurable timeout. See the Dev Wiki for more details.
  #graceful-server-shutdown = true

  # Cromwell will cap the number of running workflows at N
  #max-concurrent-workflows = 5000

  # Cromwell will launch up to N submitted workflows at a time, regardless of how many open workflow slots exist
  #max-workflow-launch-count = 50

  # Number of seconds between workflow launches
  #new-workflow-poll-rate = 20

  # Since the WorkflowLogCopyRouter is initialized in code, this is the number of workers
  #number-of-workflow-log-copy-workers = 10

  # Default number of cache read workers
  #number-of-cache-read-workers = 25

  io {
    # throttle {
    # # Global Throttling - This is mostly useful for GCS and can be adjusted to match
    # # the quota available on the GCS API
    # #number-of-requests = 100000
    # #per = 100 seconds
    # }

    # Number of times an I/O operation should be attempted before giving up and failing it.
    #number-of-attempts = 5
  }

  # Maximum number of input file bytes allowed in order to read each type.
  # If exceeded a FileSizeTooBig exception will be thrown.
  input-read-limits {
    #lines = 128000
    #bool = 7
    #int = 19
    #float = 50
    #string = 128000
    #json = 128000
    #tsv = 128000
    #map = 128000
    #object = 128000
  }

  abort {
    # These are the default values in Cromwell, in most circumstances there should not be a need to change them.

    # How frequently Cromwell should scan for aborts.
    scan-frequency: 30 seconds

    # The cache of in-progress aborts. Cromwell will add entries to this cache once a WorkflowActor has been messaged to abort.
    # If on the next scan an 'Aborting' status is found for a workflow that has an entry in this cache, Cromwell will not ask
    # the associated WorkflowActor to abort again.
    cache {
      enabled: true
      # Guava cache concurrency.
      concurrency: 1
      # How long entries in the cache should live from the time they are added to the cache.
      ttl: 20 minutes
      # Maximum number of entries in the cache.
      size: 100000
    }
  }

  # Cromwell reads this value into the JVM's `networkaddress.cache.ttl` setting to control DNS cache expiration
  dns-cache-ttl: 3 minutes
}

docker {
  hash-lookup {
    # Set this to match your available quota against the Google Container Engine API
    #gcr-api-queries-per-100-seconds = 1000

    # Time in minutes before an entry expires from the docker hashes cache and needs to be fetched again
    #cache-entry-ttl = "20 minutes"

    # Maximum number of elements to be kept in the cache. If the limit is reached, old elements will be removed from the cache
    #cache-size = 200

    # How should docker hashes be looked up. Possible values are "local" and "remote"
    # "local": Lookup hashes on the local docker daemon using the cli
    # "remote": Lookup hashes on docker hub, gcr, gar, quay
    #method = "remote"
    enabled = "false"
  }
}

# Here is where you can define the backend providers that Cromwell understands.
# The default is a local provider.
# To add additional backend providers, you should copy paste additional backends
# of interest that you can find in the cromwell.example.backends folder
# folder at https://www.github.com/broadinstitute/cromwell
# Other backend providers include SGE, SLURM, Docker, udocker, Singularity, etc.
# Don't forget you will need to customize them for your particular use case.
backend {
  # Override the default backend.
  default = slurm

  # The list of providers.
  providers {
    # Copy paste the contents of a backend provider in this section
    # Examples in cromwell.example.backends include:
    # LocalExample: What you should use if you want to define a new backend provider
    # AWS: Amazon Web Services
    # BCS: Alibaba Cloud Batch Compute
    # TES: protocol defined by GA4GH
    # TESK: the same, with kubernetes support
    # Google Pipelines, v2 (PAPIv2)
    # Docker
    # Singularity: a container safe for HPC
    # Singularity+Slurm: and an example on Slurm
    # udocker: another rootless container solution
    # udocker+slurm: also exemplified on slurm
    # HTCondor: workload manager at UW-Madison
    # LSF: the Platform Load Sharing Facility backend
    # SGE: Sun Grid Engine
    # SLURM: workload manager

    # Note that these other backend examples will need tweaking and configuration.
    # Please open an issue https://www.github.com/broadinstitute/cromwell if you have any questions
    slurm {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        # Root directory where Cromwell writes job results in the container. This value
        # can be used to specify where the execution folder is mounted in the container.
        # It is used for the construction of the docker_cwd string in the submit-docker
        # value below.
        dockerRoot = "/cromwell-executions"

        concurrent-job-limit = 10
        # If an 'exit-code-timeout-seconds' value is specified:
        #     - check-alive will be run at this interval for every job
        #     - if a job is found to be not alive, and no RC file appears after this interval
        #     - Then it will be marked as Failed.
        ## Warning: If set, Cromwell will run 'check-alive' for every job at this interval
        exit-code-timeout-seconds = 360
        filesystems {
          local {
            localization: [
              # Soft links do not work for docker with --contain. Hard links won't work
              # across file systems.
              "copy", "hard-link", "soft-link"
            ]
            caching {
              duplication-strategy: ["copy", "hard-link", "soft-link"]
              hashing-strategy: "xxh64"
            }
          }
        }
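        # (Side note on the caching block above: I believe the local filesystem hashing-strategy
        # also accepts values such as "file", "path" and "path+modtime"; I picked "xxh64" but have
        # not confirmed which strategies my Cromwell version actually supports.)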



        runtime-attributes = """
        Int runtime_minutes = 600
        Int cpus = 2
        Int requested_memory_mb_per_core = 30000
        String? docker
        String? account
        String? IMAGE
        """

        submit = """
            sbatch \
              --wait \
              --job-name=${job_name} \
              --chdir=${cwd} \
              --output=${out} \
              --error=${err} \
              --time=${runtime_minutes} \
              ${"--cpus-per-task=" + cpus} \
              --mem-per-cpu=${requested_memory_mb_per_core} \
              --account=${account} \
              --wrap "/bin/bash ${script}"
        """

        submit-docker = """
            # SINGULARITY_CACHEDIR needs to point to a directory accessible by
            # the jobs (i.e. not lscratch). Might want to use a workflow local
            # cache dir like in run.sh
            source /scratch/asmab/set_singularity_cachedir.sh
            SINGULARITY_CACHEDIR=/scratch/asmab/singularity-cache
            echo "SINGULARITY_CACHEDIR $SINGULARITY_CACHEDIR"
            if [ -z "$SINGULARITY_CACHEDIR" ]; then
                CACHE_DIR=$HOME/.singularity
            else
                CACHE_DIR=$SINGULARITY_CACHEDIR
            fi
            mkdir -p "$CACHE_DIR"
            LOCK_FILE=$CACHE_DIR/singularity_pull_flock

            # we want to avoid all the cromwell tasks hammering each other trying
            # to pull the container into the cache for the first time. flock works
            # on GPFS, netapp, and vast (of course only for processes on the same
            # machine which is the case here since we're pulling it in the master
            # process before submitting).
            #flock --exclusive --timeout 1200 $LOCK_FILE \
            #    singularity exec --containall docker://${docker} \
            #    echo "successfully pulled ${docker}!" &> /dev/null

            # Ensure singularity is loaded if it's installed as a module
            #module load Singularity/3.0.1
            module load apptainer/1.2.4
            # Build the Docker image into a singularity image
            IMAGE=${docker}.sif
            apptainer build $IMAGE docker://${docker}
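            # NOTE to self: this build step still pulls from a registry (docker://), so it also
            # needs network access unless the image is already in the apptainer cache; and because
            # docker image names can contain "/" and ":", those end up in the .sif filename as well.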

            # Submit the script to SLURM
            sbatch \
              --wait \
              --job-name=${job_name} \
              --chdir=${cwd} \
              --output=${cwd}/execution/stdout \
              --error=${cwd}/execution/stderr \
              --time=${runtime_minutes} \
              ${"--cpus-per-task=" + cpus} \
              --mem-per-cpu=${requested_memory_mb_per_core} \
              --account=${account} \
              --wrap "apptainer exec --containall --bind ${cwd}:${docker_cwd} ${IMAGE} ${job_shell} ${docker_script}"
        """

        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }

  }
}

call-caching {
  enabled = true
  invalidate-bad-cache-results = true
}
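
One more thing that may or may not matter: I have not set up a database section, so Cromwell is on its default in-memory database, and as far as I understand call-cache entries are stored in that database, so they would not survive a restart anyway. If a persistent database is needed for cache hits across runs, something like the sketch below is what I would try (host name and credentials are placeholders):

    database {
      profile = "slick.jdbc.MySQLProfile$"
      db {
        driver = "com.mysql.cj.jdbc.Driver"
        url = "jdbc:mysql://dbhost/cromwell_db?rewriteBatchedStatements=true"
        user = "cromwell"
        password = "..."
        connectionTimeout = 5000
      }
    }

Or is the docker hash lookup the actual blocker here?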
cromwell singularity wdl