I've been trying to download and index databases into a GCP bucket that I will then query through the Fusion file system using Nextflow.
However, I keep getting errors during the download process.
Here is the filesystem layout inside the process:
Filesystem Type Size Used Avail Use% Mounted on
overlay overlay 2.1T 23G 2.1T 2% /
tmpfs tmpfs 64M 0 64M 0% /dev
shm tmpfs 64M 0 64M 0% /dev/shm
/dev/nvme0n1 ext4 369G 80K 350G 1% /tmp
/dev/sda1 ext4 2.1T 23G 2.1T 2% /etc/hosts
fusion fuse.fusion 8.0P 4.0P 4.0P 50% /fusion
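A quick way to check which device a given path actually lands on (the path argument is illustrative; inside the task it would be the work dir or /tmp):

```shell
#!/bin/sh
# Illustrative check: which filesystem backs a given directory?
dir="${1:-.}"               # default to the current directory
df -hT "$dir" | tail -n 1   # last line names the device, type, and mount point
```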
But after downloading and decompressing the first database (77 GB compressed), indexing fails with this output:
tee: .command.err: No space left on device
cp: error writing '/fusion/gs/colabfoldlocal-db/workDir/5d/1b84eafba70b82d61c2438a05b120f/.command.out': Input/output error
/fusion/gs/colabfoldlocal-db/workDir/5d/1b84eafba70b82d61c2438a05b120f/.command.run: line 269: printf: write error: Input/output error
By this point there should be less than 500 GB of data on the drive, so my guess is that everything is being written to the nvme0n1 drive. So I tried moving to the root directory and rerunning the script. But now I don't even get an error: the log files aren't sent to the workDir and are lost when the process dies. The stdout is long enough that the message is truncated in the .nextflow.log file.
Ultimately, I'm trying to adapt this repo into a Nextflow workflow on GCP. I know nf-core/proteinfold exists, but we've had a lot of bad experiences with managed workflows so far, so it's time to set up our own. I have both Wave and Fusion enabled in the config.
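Roughly, the relevant part of the config looks like this (a sketch; the work-dir bucket path is inferred from the error messages above):

```groovy
// nextflow.config -- sketch of the relevant settings
wave {
    enabled = true
}
fusion {
    enabled = true
}
workDir = 'gs://colabfoldlocal-db/workDir'
```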
Here is the process:
#!/usr/bin/env nextflow

process DOWNLOAD {
    maxForks 1
    container 'dthorbur1990/localcolabfold:v0.4'

    cpus   params.DBD_cpus
    memory params.DBD_memory
    disk   params.DBD_disk

    publishDir(
        path: "${params.DB_bucket}",
        mode: 'move',
    )

    input:
    // path(setup_databases_script)
    // path(DB_bucket)

    output:
    path("${params.DB_downdate}/*")

    script:
    """
    df -Th

    wd1=${params.DB_downdate}
    mkdir \$wd1

    ## Just checking the command works.
    mmseqs -h

    # Set up everything for using mmseqs locally.
    ARIA_NUM_CONN=8
    WORKDIR="\${wd1:-\$(pwd)}"
    # \$2 and \$3 are never set inside a Nextflow script block, so the
    # defaults from the upstream script always apply here.
    PDB_SERVER="\${2:-"rsync.wwpdb.org::ftp"}"
    PDB_PORT="\${3:-"33444"}"
    cd "\${WORKDIR}"
    echo "\${WORKDIR}"

    fail() {
        # Called below but missing from this copy of the upstream script;
        # added so a download failure exits cleanly instead of erroring
        # on an undefined command.
        echo "\$1" >&2
        exit 1
    }

    hasCommand () {
        command -v "\$1" >/dev/null 2>&1
    }

    STRATEGY=""
    if hasCommand aria2c; then STRATEGY="\$STRATEGY ARIA"; fi
    if hasCommand curl; then STRATEGY="\$STRATEGY CURL"; fi
    if hasCommand wget; then STRATEGY="\$STRATEGY WGET"; fi
    if [ "\$STRATEGY" = "" ]; then
        fail "No download tool found in PATH. Please install aria2c, curl or wget."
    fi

    downloadFile() {
        URL="\$1"
        OUTPUT="\$2"
        set +e
        for i in \$STRATEGY; do
            case "\$i" in
            ARIA)
                FILENAME=\$(basename "\${OUTPUT}")
                DIR=\$(dirname "\${OUTPUT}")
                aria2c --max-connection-per-server="\$ARIA_NUM_CONN" --allow-overwrite=true -o "\$FILENAME" -d "\$DIR" "\$URL" && set -e && return 0
                ;;
            CURL)
                curl -L -o "\$OUTPUT" "\$URL" && set -e && return 0
                ;;
            WGET)
                wget -O "\$OUTPUT" "\$URL" && set -e && return 0
                ;;
            esac
        done
        set -e
        fail "Could not download \$URL to \$OUTPUT"
    }

    echo "Starting DB download: `date`"

    # Make MMseqs2 merge the databases to avoid spamming the folder with files.
    export MMSEQS_FORCE_MERGE=1

    if [ ! -f UNIREF30_READY ]; then
        downloadFile "https://wwwuser.gwdg.de/~compbiol/colabfold/uniref30_2202.tar.gz" "uniref30_2202.tar.gz"
        tar xzvf "uniref30_2202.tar.gz"
        mmseqs tsv2exprofiledb "uniref30_2202" "uniref30_2202_db"
        mmseqs createindex "uniref30_2202_db" tmp1 --remove-tmp-files 1
        if [ -e uniref30_2202_db_mapping ]; then
            ln -sf uniref30_2202_db_mapping uniref30_2202_db.idx_mapping
        fi
        if [ -e uniref30_2202_db_taxonomy ]; then
            ln -sf uniref30_2202_db_taxonomy uniref30_2202_db.idx_taxonomy
        fi
        touch UNIREF30_READY
    fi

    if [ ! -f COLABDB_READY ]; then
        downloadFile "https://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz" "colabfold_envdb_202108.tar.gz"
        tar xzvf "colabfold_envdb_202108.tar.gz"
        mmseqs tsv2exprofiledb "colabfold_envdb_202108" "colabfold_envdb_202108_db"
        # TODO: split memory value for createindex?
        mmseqs createindex "colabfold_envdb_202108_db" tmp2 --remove-tmp-files 1
        touch COLABDB_READY
    fi

    if [ ! -f PDB_READY ]; then
        downloadFile "https://wwwuser.gwdg.de/~compbiol/colabfold/pdb70_220313.fasta.gz" "pdb70_220313.fasta.gz"
        mmseqs createdb pdb70_220313.fasta.gz pdb70_220313
        mmseqs createindex pdb70_220313 tmp3 --remove-tmp-files 1
        touch PDB_READY
    fi

    if [ ! -f PDB70_READY ]; then
        downloadFile "https://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pdb70_from_mmcif_220313.tar.gz" "pdb70_from_mmcif_220313.tar.gz"
        tar xzvf pdb70_from_mmcif_220313.tar.gz pdb70_a3m.ffdata pdb70_a3m.ffindex
        touch PDB70_READY
    fi

    if [ ! -f PDB_MMCIF_READY ]; then
        mkdir -p pdb/divided
        mkdir -p pdb/obsolete
        rsync -rlpt -v -z --delete --port=\${PDB_PORT} \${PDB_SERVER}/data/structures/divided/mmCIF/ pdb/divided
        rsync -rlpt -v -z --delete --port=\${PDB_PORT} \${PDB_SERVER}/data/structures/obsolete/mmCIF/ pdb/obsolete
        touch PDB_MMCIF_READY
    fi
    """
}
So my question is: how do I force everything to run on the sda1 drive instead?
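For context, the knob I'd expect to control this is Nextflow's scratch setting, which points task staging at a local directory; a sketch of what I mean (untested with Fusion, and the path is a placeholder for wherever the sda1 filesystem is mounted):

```groovy
// nextflow.config -- sketch, not verified against Fusion
process {
    scratch = '/sda1-mount/scratch'   // placeholder path on the sda1-backed filesystem
}
```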
What is the value of params.DBD_disk? Could you please add a set -x to see what's happening? Also, why the set +e?

The value of params.DBD_disk is 2000 GB, and everything in the script below the first few lines is taken from the localcolabfold repo; I have only added \ to the $ characters. I will add set -x and report back.
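As for the set +e question: as far as I can tell from the upstream script, it exists so a failed download attempt doesn't abort the whole run under abort-on-error; downloadFile disables -e while it tries each tool in turn, then restores it. A minimal sketch of that pattern (the tools here are stand-ins, not real downloaders):

```shell
#!/bin/sh
set -e                             # abort on any error by default
try_each() {
    set +e                         # allow individual attempts to fail
    for tool in false true; do     # 'false' stands in for a downloader that fails
        if "$tool"; then
            set -e                 # restore abort-on-error once one succeeds
            echo "succeeded with: $tool"
            return 0
        fi
    done
    set -e
    return 1
}
try_each
```

Without the set +e, the first failing tool would kill the script instead of falling through to the next one.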