Building a BLAST database with the python subprocess module
2.3 years ago

Hi All!

I am trying to build a BLAST database from an input FASTA file. I am doing this with subprocess, rather than from the command line, because I'm making a reciprocal best hit pipeline. I have a pre-computed database (built with BLAST+ makeblastdb) against which the user's input sequences are BLASTed. I then need to turn all of the input sequences into a database in turn, against which the best hits of the previous BLAST run can be BLASTed (i.e. it needs to take whatever the user inputs, on a case-by-case basis, programmatically). Sequences that do or don't 'hit' each other are then labelled appropriately.

Subprocess code:

import subprocess

def run_process(cmd):
    # A list is passed straight to the executable; a string goes through the shell.
    if type(cmd)==list:
        shell_bool=False
    elif type(cmd)==str:
        shell_bool=True
    else:
        raise TypeError('cmd must be a list or a str')
    try:
        cmd = subprocess.run(cmd, check=True, capture_output=True, shell=shell_bool)
        return cmd
    except FileNotFoundError as e1:
        print ('FileNotFoundError:')
        print ('\n', e1)
        return e1
    except subprocess.CalledProcessError as e2:
        print ('CalledProcessError:')
        print ('\n',e2)
        print ('\n',e2.stderr)
        print ('\n',e2.stdout)
        return e2

db_type_str=' -dbtype prot'
makeblastdb_path= r'"C:\Users\u03132tk\.spyder-py3\modulesData\NCBI\blast-2.10.1+\bin\makeblastdb.exe"'
fasta_db_path= r' -in "C:\Users\u03132tk\.spyder-py3\modulesData\fasta_sequences_SMCOG_efetch_only.txt"'

cmd_str=makeblastdb_path + fasta_db_path + db_type_str
#std_err = Error: mdb_env_open: There is not enough space on the disk.

cmd1=[makeblastdb_path, fasta_db_path, db_type_str]
#-->[WinError 2] The system cannot find the file specified

cmd2=[makeblastdb_path + fasta_db_path + db_type_str]
#-->PermissionError: [WinError 5] Access is denied

test=run_process(cmd_str)
#or cmd1 or cmd2


I've been looking into subprocess and think my code should work. I've tried passing the command as a string (cmd_str) and as a list, with the list holding either separate arguments (cmd1, [makeblastdb_path, fasta_db_path, db_type_str]) or one big string (cmd2, [makeblastdb_path + fasta_db_path + db_type_str]), but none of them work.

I've made the most progress using cmd_str - it does seem to start running makeblastdb. However, I get a CalledProcessError with return code 255, and stderr is Error: mdb_env_open: There is not enough space on the disk. Whilst this seems like a clear-cut issue, when I copy cmd_str into the command line it works fine, suggesting disk space isn't the problem. Any ideas why this is?

I have already set the environment variable BLASTDB_LMDB_MAP_SIZE=1000000, as discussed in the thread "makeblastdb Fasta file with 25 sequences gives Error: mdb_env_open: There is not enough space on the disk", but I don't think this is the issue, given my successful command line runs and the different names involved (mdb_env_open vs BLASTDB_LMDB_MAP_SIZE).

Tim

EDIT 1 - Working code:

import subprocess
import os

def run_process(cmd):
    # A list is passed straight to the executable; a string goes through the shell.
    if type(cmd)==list:
        shell_bool=False
    elif type(cmd)==str:
        shell_bool=True
    else:
        raise TypeError('cmd must be a list or a str')
    # Copy the current environment and set the LMDB map size for the child process.
    envp = {
        **os.environ,
        'BLASTDB_LMDB_MAP_SIZE':'1000000',
    }
    print (envp)
    try:
        cmd = subprocess.run(cmd, check=True, capture_output=True, shell=shell_bool, env=envp)
        return cmd
    except FileNotFoundError as e1:
        print ('FileNotFoundError:')
        print ('\n', e1)
        return e1
    except subprocess.CalledProcessError as e2:
        print ('CalledProcessError:')
        print ('\n',e2)
        print ('\n',e2.stderr)
        print ('\n',e2.stdout)
        return e2

db_type_str='prot'
makeblastdb_path= r'C:\Users\u03132tk\.spyder-py3\modulesData\NCBI\blast-2.10.1+\bin\makeblastdb.exe'
fasta_db_path= r'C:\Users\u03132tk\.spyder-py3\modulesData\fasta_sequences_SMCOG_efetch_only.txt'

cmd1=[makeblastdb_path, '-in', fasta_db_path, '-dbtype', db_type_str]
test=run_process(cmd1)

python3.7 subprocess makeblastdb • 1.1k views
2.3 years ago

Hi, I'm not an expert on the Python subprocess module on Windows, so take my suggestions with a grain of salt. However, it seems to me that your usage of the subprocess module is weird. There are two modes:

1) you build a string and pass shell=True

2) you build a list of parameters like this [exepath, var1, val_var1, var2 ....] and pass shell=False (default)
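A minimal sketch of the two modes, using the current Python interpreter (sys.executable) as a stand-in for makeblastdb so that it runs anywhere. The helper names run_as_list and run_as_string are made up for illustration.

```python
import subprocess
import sys


def run_as_list(args):
    # Mode 2: one list item per argument, no shell involved (shell=False is the default).
    return subprocess.run(args, check=True, capture_output=True, text=True)


def run_as_string(command):
    # Mode 1: a single string that the shell parses, so quoting is up to you.
    return subprocess.run(command, shell=True, check=True,
                          capture_output=True, text=True)


list_result = run_as_list([sys.executable, "-c", "print('from list')"])
string_result = run_as_string(f'"{sys.executable}" -c "print(42)"')

print(list_result.stdout.strip())
print(string_result.stdout.strip())
```

Note that in the list form the executable path is passed without any extra quotes, while in the string form the quotes are part of what the shell parses.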

Either option can cause problems. First, I would start by removing capture_output=True, as that leads to opening a write buffer, which may itself cause problems.

You talk about the BLASTDB_LMDB_MAP_SIZE env var, but have you verified that the variable is actually set in the environment in which the subprocess runs (this may not be the same one as for your terminal)? Note that a custom environment can be passed to subprocess.run with the env parameter.
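One way to check this, sketched under the assumption that a small child Python process sees the same environment makeblastdb would: ask the child to report the variable, first with the inherited environment and then with an explicit env. The helper child_env_value is a made-up name for illustration.

```python
import os
import subprocess
import sys


def child_env_value(name, env=None):
    """Return the value of `name` as seen by a child process ('<unset>' if missing)."""
    code = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], '<unset>'))"
    result = subprocess.run([sys.executable, "-c", code, name],
                            env=env, check=True, capture_output=True, text=True)
    return result.stdout


# Copy the parent's environment and add the variable for the child only.
custom_env = {**os.environ, "BLASTDB_LMDB_MAP_SIZE": "1000000"}

inherited = child_env_value("BLASTDB_LMDB_MAP_SIZE")                # whatever the parent has
explicit = child_env_value("BLASTDB_LMDB_MAP_SIZE", env=custom_env)

print("inherited:", inherited)
print("explicit: ", explicit)
```

If the inherited value prints as unset while your terminal has it set, the Python process was probably started with a different environment than the terminal.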

I usually use the option 2) because you bypass the shell (safer, more reproducible).

I've used makeblastdb many times on Linux, even from Python subprocess, and I didn't have problems with insufficient space. You could also try setting the output path to a drive that definitely has enough space (are you by any chance writing to some limited logical volume that happens to be your cwd?).
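A rough sketch of that check: report the free space where the database would be written, and build the argument list with makeblastdb's -out option so the output does not depend on the cwd. The paths are hypothetical placeholders and the helper names are made up; the command is only built here, not run.

```python
import os
import shutil


def free_gib(directory):
    # Free space on the volume that holds `directory`, in GiB.
    return shutil.disk_usage(directory).free / 1024 ** 3


def build_makeblastdb_cmd(exe, fasta, dbtype, out_prefix):
    # One list item per argument; -out sets where the database files are written.
    return [exe, "-in", fasta, "-dbtype", dbtype, "-out", out_prefix]


out_dir = os.getcwd()  # hypothetical output directory; replace with your own
print(f"free space in {out_dir}: {free_gib(out_dir):.1f} GiB")

cmd = build_makeblastdb_cmd(
    "makeblastdb",                       # assumes makeblastdb is on PATH
    "input_sequences.fasta",             # hypothetical input file
    "prot",
    os.path.join(out_dir, "user_db"),    # hypothetical database prefix
)
print(cmd)
```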


Hi Massa (you legend, this has been bugging me for hours),

Thanks for your reply! I looked into environments (didn't really know what these were) and you're right - the BLASTDB_LMDB_MAP_SIZE environment variable wasn't updated in the environment the subprocess ran in. I updated the code above to pass the environment variable and all seems well. I then sorted out the formatting of cmd1 (I had spaces and speech marks that weren't necessary, as subprocess does that formatting itself) and the list input without shell also works. The working code has been added as an edit for anyone with similar issues.
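For anyone curious why the extra speech marks break the list form, here is a small sketch using subprocess.list2cmdline, the helper the module uses internally on Windows to join a list into a command line (it is not a prominently documented API, so treat this as illustrative). The paths are hypothetical.

```python
import subprocess

# Hypothetical paths, used only for illustration.
exe = r"C:\tools\blast\bin\makeblastdb.exe"
fasta = r"C:\data\my sequences.fasta"

# One list item per argument: quotes are added only where needed (the path with a space).
clean = subprocess.list2cmdline([exe, "-in", fasta, "-dbtype", "prot"])

# Pre-quoted item: the quotes are treated as literal characters and get escaped,
# so Windows looks for a file whose name includes the quote marks.
prequoted = subprocess.list2cmdline(['"' + exe + '"'])

print(clean)
print(prequoted)
```

The same reasoning applies to an item like ' -in "path"': it is passed as one single argument with a leading space, not as a flag followed by a value.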

Cheers! Tim