Question: Load data from NCBI .gpff files into a SQL database. Looking for a quicker way.
azotov (UAB, Birmingham AL, USA) wrote, 10 days ago:

I'm trying to load NCBI reference sequence data (from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.nonredundant_protein.1.protein.gpff.gz) into an SQL server for further processing. I'm using a straightforward approach with Biopython: parsing the text file with SeqIO.parse() and then loading the records into SQL with BioSeqDatabase.load(). That converts data at a rate of around 10 entries/second with the SQL server on the same machine, and slightly slower when the server is remote, which seems too slow. I was also not able to run multiple Python scripts in parallel: they fail with error 1205, "Lock wait timeout exceeded", so it looks like only one BioSeqDatabase.load() call at a time can access the SQL tables. I have access to a massive cluster, but I don't see a way of using it for this task.
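
For reference, this is roughly the loading script I'm running (a minimal sketch; the driver, credentials, and database names below are placeholders, not my real setup):

    from Bio import SeqIO
    from BioSQL import BioSeqDatabase

    # Connect to an existing BioSQL schema (connection details are placeholders).
    server = BioSeqDatabase.open_database(
        driver="MySQLdb", user="user", passwd="secret",
        host="localhost", db="bioseqdb",
    )
    db = server.new_database("refseq_nr", description="RefSeq non-redundant proteins")

    # GenPept (.gpff) files use the GenBank flat-file format, so "genbank" works here.
    with open("complete.nonredundant_protein.1.protein.gpff") as handle:
        count = db.load(SeqIO.parse(handle, "genbank"))

    server.commit()
    print("Loaded %d records" % count)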

What is a good way to speed up the process, by parallelizing or otherwise?

bioseqdatabase sql biopython
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) answered, 10 days ago:

Normalize your data into a set of flat files (creating your primary keys by hand), then bulk-load them with mysqlimport: https://dev.mysql.com/doc/refman/5.7/en/mysqlimport.html
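
A minimal sketch of this idea, assuming a single simplified target table named `protein` with columns (protein_id, accession, description, sequence) rather than the full BioSQL schema; table and column names here are illustrative only:

    from Bio import SeqIO

    # Dump the GenPept records to a tab-delimited flat file, assigning the
    # primary key by hand. In practice you would write one file per target
    # table; this example uses a hypothetical `protein` table.
    with open("complete.nonredundant_protein.1.protein.gpff") as handle, \
            open("protein.txt", "w") as out:
        for pk, record in enumerate(SeqIO.parse(handle, "genbank"), start=1):
            fields = (str(pk), record.id, record.description, str(record.seq))
            # Tabs or newlines inside a field would corrupt the flat file.
            out.write("\t".join(f.replace("\t", " ").replace("\n", " ") for f in fields))
            out.write("\n")

    # mysqlimport derives the table name from the file name ("protein") and
    # wraps LOAD DATA INFILE, which is far faster than row-by-row INSERTs:
    #
    #   mysqlimport --local --user=USER --password DBNAME protein.txt

Since writing the flat files needs no database connection, the .gpff release files can be converted in parallel across cluster nodes and then loaded with a single mysqlimport per table, avoiding the lock contention seen with concurrent BioSeqDatabase.load() calls.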


azotov replied, 6 days ago:

Works like a charm! Thanks Pierre.
