Question

Load data from NCBI .gpff files into a SQL database. Looking for a quicker way.


6.5 years ago

azotov ▴ 10

I'm trying to load NCBI reference sequence data (from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.nonredundant_protein.1.protein.gpff.gz) into an SQL server for further processing. I'm using the straightforward Biopython approach: parsing the text file with SeqIO.parse() and then loading it into SQL with BioSeqDatabase.load(). This converts data at around 10 entries/second when the SQL server is on the same computer, or slightly slower when it is remote, which seems too slow. I was not able to run multiple Python scripts in parallel: they fail with error 1205, "Lock wait timeout exceeded", so it looks like only one BioSeqDatabase.load() call at a time can access the SQL tables. I have access to a massive cluster, but I don't see a way of using it for this task.

What is a good way to speed this up, by parallelizing or otherwise?

biopython SQL bioseqdatabase • 2.1k views


score 3 · Answer 1 · 2017-11-10


6.5 years ago

Pierre Lindenbaum 161k

Normalize your data into a set of flat files (generating the primary keys yourself), then bulk-load those files with mysqlimport: https://dev.mysql.com/doc/refman/5.7/en/mysqlimport.html
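
As a rough illustration of that idea, here is a minimal Python sketch that turns parsed records into two tab-delimited files with hand-assigned integer keys. The table layout (organism.tsv, protein.tsv), the column order, and the helper names clean() and write_flat_files() are assumptions made up for this example, not the BioSQL schema; the sketch only relies on each record having id, description, seq and annotations attributes, as the SeqRecord objects produced by SeqIO.parse() do.

```python
import os
from itertools import count


def clean(value):
    """Make a value safe for one tab-delimited field.

    Backslashes are doubled, and tabs/newlines are collapsed to single spaces.
    """
    return " ".join(str(value).replace("\\", "\\\\").split())


def write_flat_files(records, out_dir):
    """Write organism.tsv and protein.tsv (hypothetical layout) into out_dir.

    Returns the number of protein rows written.
    """
    os.makedirs(out_dir, exist_ok=True)
    organism_ids = {}        # organism name -> primary key assigned by hand
    protein_ids = count(1)   # running primary key for the protein table
    written = 0
    org_path = os.path.join(out_dir, "organism.tsv")
    prot_path = os.path.join(out_dir, "protein.tsv")
    with open(org_path, "w") as org_out, open(prot_path, "w") as prot_out:
        for rec in records:
            name = clean(rec.annotations.get("organism", "unknown"))
            if name not in organism_ids:
                # First time this organism is seen: assign the next key.
                organism_ids[name] = len(organism_ids) + 1
                org_out.write(f"{organism_ids[name]}\t{name}\n")
            row = (
                next(protein_ids),     # protein primary key
                organism_ids[name],    # foreign key into organism.tsv
                clean(rec.id),
                clean(rec.description),
                str(rec.seq),
            )
            prot_out.write("\t".join(map(str, row)) + "\n")
            written += 1
    return written
```

With Biopython installed, the records would presumably come from something like SeqIO.parse(handle, "genbank") on the decompressed .gpff file. mysqlimport derives the table name from the file name with its extension stripped, so protein.tsv is loaded into a table called protein, which has to exist already with matching columns.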