Question: Load data from NCBI .gpff files into a SQL database. Looking for a quicker way.
8 months ago by
UAB, Birmingham AL, USA
azotov10 wrote:

I'm trying to load NCBI reference sequence data (from f into an SQL server for further processing. I'm using the straightforward Biopython approach: parsing the text file with SeqIO.parse() and then loading it into SQL with BioSeqDatabase.load(). This converts data at around 10 entries/second when the SQL server is on the same computer, and slightly slower when it is remote, which seems too slow. I was not able to run multiple Python scripts in parallel: they fail with error 1205, "Lock wait timeout exceeded", so it looks like only one BioSeqDatabase.load() call at a time can access the SQL tables. I have access to a massive cluster, but I don't see a way of using it for this task.

What is a good way to speed up the process, through parallelization or some other approach?

bioseqdatabase sql biopython • 402 views
8 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum (109k) wrote:

Normalize your data into a set of flat files (assigning your primary keys by hand), then load them with mysqlimport.
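A minimal sketch of the flat-file step, under stated assumptions: the two table names (entry, entry_seq), their column layout, and the function name write_flat_files are hypothetical and not part of the BioSQL schema, and the input is taken to be simple (accession, description, sequence) tuples that a parser such as SeqIO.parse() could supply. The point is only to show keys being assigned in the script rather than by the database.

```python
import os


def clean(value):
    # Tabs and newlines would break the column layout, so flatten them to spaces.
    return str(value).replace("\t", " ").replace("\n", " ")


def write_flat_files(records, out_dir, first_entry_id=1):
    """Write two tab-delimited files whose rows share a hand-assigned key.

    records: iterable of (accession, description, sequence) tuples.
    first_entry_id: first primary key to hand out; giving each job its own
    disjoint range lets several jobs write files independently.
    Table names `entry` and `entry_seq` are hypothetical.
    """
    entry_path = os.path.join(out_dir, "entry.txt")
    seq_path = os.path.join(out_dir, "entry_seq.txt")
    count = 0
    with open(entry_path, "w", newline="\n") as entry_file, \
            open(seq_path, "w", newline="\n") as seq_file:
        for entry_id, (accession, description, sequence) in enumerate(
                records, start=first_entry_id):
            # Parent row: the key is chosen here, not by AUTO_INCREMENT.
            entry_row = [entry_id, accession, description]
            # Child row: reuses the same key as its foreign key.
            seq_row = [entry_id, len(sequence), sequence]
            entry_file.write("\t".join(clean(v) for v in entry_row) + "\n")
            seq_file.write("\t".join(clean(v) for v in seq_row) + "\n")
            count += 1
    return count
```

mysqlimport takes the table name from each file name with the extension stripped, and by default expects tab-separated fields and newline-terminated rows, so files written this way could then be loaded with something like: mysqlimport --local mydb entry.txt entry_seq.txt (mydb is a placeholder database name). Because the keys are fixed before loading, the parsing can be split across cluster jobs, each with its own first_entry_id range, while the bulk load itself remains a single fast step.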


Works like a charm! Thanks, Pierre.

Reply written 8 months ago by azotov10
