I'm trying to load NCBI reference sequence data (from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.nonredundant_protein.1.protein.gpff.gz) into an SQL server for further processing. I'm using the straightforward BioPython approach: parse the text file with SeqIO.parse() and then load the records into SQL with BioSeqDatabase.load(). This converts data at around 10 entries/second when the SQL server is on the same computer, and slightly slower when it is remote, which seems too slow. I was not able to run multiple Python scripts in parallel: they fail with error 1205, "Lock wait timeout exceeded", so it looks like only one BioSeqDatabase.load() call at a time can access the SQL tables. I have access to a massive cluster, but I don't see a way of using it for this task.
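For reference, here is a minimal sketch of the single-process loading approach described above. The connection settings, the sub-database name, the batch size, and the helper names `batched`, `load_in_batches`, and `main` are placeholders for illustration, not the exact script used. Records are fed to `load()` in batches with a commit after each batch; whether that changes throughput has not been measured here.

```python
import gzip
from itertools import islice


def batched(records, size):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_in_batches(records, server, db, batch_size=1000):
    """Load records into a BioSQL sub-database, committing once per batch.

    `server` needs a commit() method and `db` a load(iterable) method that
    returns the number of records loaded, as the BioSQL objects provide.
    """
    total = 0
    for chunk in batched(records, batch_size):
        total += db.load(chunk)
        server.commit()
    return total


def main():
    # Requires Biopython and a database with the BioSQL schema installed.
    from Bio import SeqIO
    from BioSQL import BioSeqDatabase

    # Placeholder connection settings (assumptions, adjust to your setup).
    server = BioSeqDatabase.open_database(
        driver="MySQLdb", user="user", passwd="password",
        host="localhost", db="bioseqdb",
    )
    # Hypothetical sub-database name; use server["refseq_protein"] instead
    # if it already exists.
    db = server.new_database("refseq_protein")

    path = "complete.nonredundant_protein.1.protein.gpff.gz"
    with gzip.open(path, "rt") as handle:
        # GenPept flat files are parsed with the "genbank" format.
        count = load_in_batches(SeqIO.parse(handle, "genbank"), server, db)
    print(f"Loaded {count} records")
```

Calling `main()` runs the load against a real server; `load_in_batches` itself only relies on `db.load()` and `server.commit()`.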
What is a good way to speed up the process, by parallelizing or otherwise?
Works like a charm! Thanks Pierre