I've been trying to download and index databases into a GCP bucket that I will then query through the Fusion file system using Nextflow.
However, I keep getting errors during the download process.
Here is the filesystem layout inside the process:
Filesystem Type Size Used Avail Use% Mounted on
overlay overlay 2.1T 23G 2.1T 2% /
tmpfs tmpfs 64M 0 64M 0% /dev
shm tmpfs 64M 0 64M 0% /dev/shm
/dev/nvme0n1 ext4 369G 80K 350G 1% /tmp
/dev/sda1 ext4 2.1T 23G 2.1T 2% /etc/hosts
fusion fuse.fusion 8.0P 4.0P 4.0P 50% /fusion
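A quick way to check which device a given path actually lands on (the path argument is illustrative; inside the task it would be the work dir or /tmp):

```shell
#!/bin/sh
# Illustrative check: which filesystem backs a given directory?
dir="${1:-.}"               # default to the current directory
df -hT "$dir" | tail -n 1   # last line names the device, type, and mount point
```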
But after downloading and decompressing the first database (77 GB compressed), indexing fails with this output:
tee: .command.err: No space left on device
cp: error writing '/fusion/gs/colabfoldlocal-db/workDir/5d/1b84eafba70b82d61c2438a05b120f/.command.out': Input/output error
/fusion/gs/colabfoldlocal-db/workDir/5d/1b84eafba70b82d61c2438a05b120f/.command.run: line 269: printf: write error: Input/output error
By this point there should be less than 500 GB of data on the drive, so my guess is that everything is being written to the nvme0n1 drive. So I tried moving to the root directory and rerunning the script. But now I don't even get an error: the log files aren't sent to the workDir and are lost when the process dies. The stdout is long enough that the message is truncated in the .nextflow.log file.
Ultimately, I'm trying to adapt this repo into a Nextflow workflow on GCP. I know nf-core/proteinfold exists, but we've had a lot of bad experiences with managed workflows so far, so it's time to set up our own. I have both Wave and Fusion enabled in the config.
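Roughly, the relevant part of the config looks like this (a sketch; the work-dir bucket path is inferred from the error messages above):

```groovy
// nextflow.config -- sketch of the relevant settings
wave {
    enabled = true
}
fusion {
    enabled = true
}
workDir = 'gs://colabfoldlocal-db/workDir'
```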
Here is the process:
#!/usr/bin/env nextflow

process DOWNLOAD {
    maxForks 1
    container 'dthorbur1990/localcolabfold:v0.4'

    cpus   params.DBD_cpus
    memory params.DBD_memory
    disk   params.DBD_disk

    publishDir(
        path: "${params.DB_bucket}",
        mode: 'move',
    )

    input:
    // path(setup_databases_script)
    // path(DB_bucket)

    output:
    path("${params.DB_downdate}/*")

    script:
    """
    df -Th

    wd1=${params.DB_downdate}
    mkdir \$wd1

    ## Just checking the command works.
    mmseqs -h

    # Set up everything for using mmseqs locally.
    ARIA_NUM_CONN=8
    WORKDIR="\${wd1:-\$(pwd)}"
    # \$2 and \$3 are never set inside a Nextflow script block, so the
    # defaults from the upstream script always apply here.
    PDB_SERVER="\${2:-"rsync.wwpdb.org::ftp"}"
    PDB_PORT="\${3:-"33444"}"
    cd "\${WORKDIR}"
    echo "\${WORKDIR}"

    fail() {
        # Called below but missing from this copy of the upstream script;
        # added so a download failure exits cleanly instead of erroring
        # on an undefined command.
        echo "\$1" >&2
        exit 1
    }

    hasCommand () {
        command -v "\$1" >/dev/null 2>&1
    }

    STRATEGY=""
    if hasCommand aria2c; then STRATEGY="\$STRATEGY ARIA"; fi
    if hasCommand curl; then STRATEGY="\$STRATEGY CURL"; fi
    if hasCommand wget; then STRATEGY="\$STRATEGY WGET"; fi
    if [ "\$STRATEGY" = "" ]; then
        fail "No download tool found in PATH. Please install aria2c, curl or wget."
    fi

    downloadFile() {
        URL="\$1"
        OUTPUT="\$2"
        set +e
        for i in \$STRATEGY; do
            case "\$i" in
            ARIA)
                FILENAME=\$(basename "\${OUTPUT}")
                DIR=\$(dirname "\${OUTPUT}")
                aria2c --max-connection-per-server="\$ARIA_NUM_CONN" --allow-overwrite=true -o "\$FILENAME" -d "\$DIR" "\$URL" && set -e && return 0
                ;;
            CURL)
                curl -L -o "\$OUTPUT" "\$URL" && set -e && return 0
                ;;
            WGET)
                wget -O "\$OUTPUT" "\$URL" && set -e && return 0
                ;;
            esac
        done
        set -e
        fail "Could not download \$URL to \$OUTPUT"
    }

    echo "Starting DB download: `date`"

    # Make MMseqs2 merge the databases to avoid spamming the folder with files.
    export MMSEQS_FORCE_MERGE=1

    if [ ! -f UNIREF30_READY ]; then
        downloadFile "https://wwwuser.gwdg.de/~compbiol/colabfold/uniref30_2202.tar.gz" "uniref30_2202.tar.gz"
        tar xzvf "uniref30_2202.tar.gz"
        mmseqs tsv2exprofiledb "uniref30_2202" "uniref30_2202_db"
        mmseqs createindex "uniref30_2202_db" tmp1 --remove-tmp-files 1
        if [ -e uniref30_2202_db_mapping ]; then
            ln -sf uniref30_2202_db_mapping uniref30_2202_db.idx_mapping
        fi
        if [ -e uniref30_2202_db_taxonomy ]; then
            ln -sf uniref30_2202_db_taxonomy uniref30_2202_db.idx_taxonomy
        fi
        touch UNIREF30_READY
    fi

    if [ ! -f COLABDB_READY ]; then
        downloadFile "https://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz" "colabfold_envdb_202108.tar.gz"
        tar xzvf "colabfold_envdb_202108.tar.gz"
        mmseqs tsv2exprofiledb "colabfold_envdb_202108" "colabfold_envdb_202108_db"
        # TODO: split memory value for createindex?
        mmseqs createindex "colabfold_envdb_202108_db" tmp2 --remove-tmp-files 1
        touch COLABDB_READY
    fi

    if [ ! -f PDB_READY ]; then
        downloadFile "https://wwwuser.gwdg.de/~compbiol/colabfold/pdb70_220313.fasta.gz" "pdb70_220313.fasta.gz"
        mmseqs createdb pdb70_220313.fasta.gz pdb70_220313
        mmseqs createindex pdb70_220313 tmp3 --remove-tmp-files 1
        touch PDB_READY
    fi

    if [ ! -f PDB70_READY ]; then
        downloadFile "https://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pdb70_from_mmcif_220313.tar.gz" "pdb70_from_mmcif_220313.tar.gz"
        tar xzvf pdb70_from_mmcif_220313.tar.gz pdb70_a3m.ffdata pdb70_a3m.ffindex
        touch PDB70_READY
    fi

    if [ ! -f PDB_MMCIF_READY ]; then
        mkdir -p pdb/divided
        mkdir -p pdb/obsolete
        rsync -rlpt -v -z --delete --port=\${PDB_PORT} \${PDB_SERVER}/data/structures/divided/mmCIF/ pdb/divided
        rsync -rlpt -v -z --delete --port=\${PDB_PORT} \${PDB_SERVER}/data/structures/obsolete/mmCIF/ pdb/obsolete
        touch PDB_MMCIF_READY
    fi
    """
}
So my question is: how do I force everything to run on the sda1 drive instead?
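For context, the knob I'd expect to control this is Nextflow's scratch setting, which points task staging at a local directory; a sketch of what I mean (untested with Fusion, and the path is a placeholder for wherever the sda1 filesystem is mounted):

```groovy
// nextflow.config -- sketch, not verified against Fusion
process {
    scratch = '/sda1-mount/scratch'   // placeholder path on the sda1-backed filesystem
}
```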
What is the value of params.DBD_disk? Could you please add a set -x to see what's happening? Also, why the set +e?

The value of params.DBD_disk is 2000 GB, and everything in the script below the first few lines is taken from the localcolabfold repo; I have only added \ to the $ characters. I will add set -x and report back.
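As for the set +e question: as far as I can tell from the upstream script, it exists so a failed download attempt doesn't abort the whole run under abort-on-error; downloadFile disables -e while it tries each tool in turn, then restores it. A minimal sketch of that pattern (the tools here are stand-ins, not real downloaders):

```shell
#!/bin/sh
set -e                             # abort on any error by default
try_each() {
    set +e                         # allow individual attempts to fail
    for tool in false true; do     # 'false' stands in for a downloader that fails
        if "$tool"; then
            set -e                 # restore abort-on-error once one succeeds
            echo "succeeded with: $tool"
            return 0
        fi
    done
    set -e
    return 1
}
try_each
```

Without the set +e, the first failing tool would kill the script instead of falling through to the next one.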