Importing UniProt into BioSQL using Biopython takes years (nearly literally!)
3.8 years ago
Jan ▴ 20

Dear all,

I'm trying to import the current TrEMBL database into BioSQL for further analysis. The current TrEMBL release is about 0.5 TB and contains 107M entries. So far I have downloaded the XML and split it up into files of roughly 20,000 entries each (= 6400 files of about 90 MB). That went reasonably fast. Now I want to import these files into BioSQL. Reading one file with Biopython takes less than a minute, but exporting it to SQL takes more than 17 minutes. With 6400 files × 17 min, I am looking at around 75 days before the insert is complete! Is it my hardware (a standard desktop i5 with 4 GB RAM), which is just not capable of handling this kind of data, or am I doing something seriously wrong on the programming side?
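For reference, the splitting step can be done in a streaming fashion with `xml.etree.ElementTree.iterparse`, so the 0.5 TB file never has to fit in memory. A minimal sketch over a tiny in-memory stand-in (the namespace is UniProt's real one; the accessions and the chunk size of 2 are invented for illustration):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

TAG = "{http://uniprot.org/uniprot}entry"

# Stand-in for the huge TrEMBL dump; in reality pass a file path to iterparse.
data = BytesIO(b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry><accession>A0A001</accession></entry>
  <entry><accession>A0A002</accession></entry>
  <entry><accession>A0A003</accession></entry>
</uniprot>""")

chunks = []
current = []
for event, elem in ET.iterparse(data, events=("end",)):
    if elem.tag == TAG:
        current.append(ET.tostring(elem))
        elem.clear()            # free memory for entries already streamed
        if len(current) == 2:   # chunk size; the real split used 20,000
            chunks.append(current)
            current = []
if current:                     # flush the last, possibly smaller chunk
    chunks.append(current)

print(len(chunks))   # 2 (one chunk of 2 entries, one of 1)
```

Each chunk could then be wrapped in a `<uniprot>` root element and written to its own `workfile_N.xml`.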

Thanks for any performance tips! Best, Jan

Here is the little script I use for the import:

from Bio import SeqIO
from BioSQL import BioSeqDatabase
import sys
import time

wait_f_min = 6*60*0   # optional start-up delay in minutes (0 = disabled)

for i in range(1, wait_f_min):
    sys.stdout.write("Sleeping for {:5} minutes...\r".format(wait_f_min - i))
    sys.stdout.flush()
    time.sleep(60)

print("Connect to server")
server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd="nytech",
                                      host="localhost", db="bioseqdb")

if "TrEMBL" in server:
    print("Deleting old database")
    del server["TrEMBL"]
    server.commit()

db = server.new_database("TrEMBL", description="all TrEMBL proteins Jan/2018")
server.commit()

count = 0   # incremented before each read, so workfile_1.xml is read first

print("Starting parsing files")

start_time = time.time()
total = 6400   # total number of files

print("Start {}".format(time.strftime("%H:%M:%S")))
while True:
    loop_start = time.time()
    count = count + 1
    print("File # {} reading ".format(count), end='')
    sys.stdout.flush()
    try:
        handle = open(r'g:\workfile_' + str(count) + '.xml', "r")
    except OSError:
        print("Error opening file!")   # probably finished with all files
        sys.exit()
    records = list(SeqIO.parse(handle, "uniprot-xml"))
    handle.close()
    print(" R{:5.2f} ".format((time.time() - loop_start) / 60), end='')   # read time
    sys.stdout.flush()
    write_start = time.time()
    db.load(records)             # write records to DB
    server.commit()
    loop_time = time.time() - loop_start
    print(" W {:5.2f}".format((time.time() - write_start) / 60), end='')
    print(" T {:5.2f}".format(loop_time / 60), end='')
    to_go = loop_time * (total - count) / 3600
    print(" to go: {:3.2f}h ".format(to_go))

print("End {}".format(time.strftime("%H:%M:%S")))

Biopython BioSQL UniProt

This may be a long shot... Try to move the server.commit() outside the while loop, i.e. load everything first and then commit in bulk. By the way, it seems that most of the code of BioSQL is 8-10 years old. Maybe it is not tuned to cope with the amount of data you have.
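The effect of commit frequency can be illustrated with a toy SQLite example (a stand-in for MySQL; table name and row counts are invented, and the real cost of a commit depends on the backend's durability settings):

```python
import sqlite3
import time

rows = [(i, "entry_{}".format(i)) for i in range(5000)]

def insert(commit_every_row):
    """Insert all rows, committing either per row or once at the end."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bioentry (id INTEGER PRIMARY KEY, name TEXT)")
    t0 = time.time()
    for row in rows:
        conn.execute("INSERT INTO bioentry VALUES (?, ?)", row)
        if commit_every_row:
            conn.commit()
    conn.commit()   # final commit covers the bulk case too
    n = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
    return n, time.time() - t0

n_bulk, t_bulk = insert(commit_every_row=False)
n_slow, t_slow = insert(commit_every_row=True)
print(n_bulk, n_slow)   # 5000 5000 -- same data, different transaction batching
```

On a disk-backed database each commit forces a flush, so fewer, larger transactions are generally faster; the trade-off (as the thread notes below) is that an aborted run rolls back everything since the last commit.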


This is a comment because it does not really answer your question. Instead of building your own database, why not use our SPARQL.uniprot.org? It is likely to be better modelled for the UniProt data model than BioSQL is.

Otherwise I suspect your hardware is not at the level one would use for a database of that size.

If you just need the sequence and organism, use the tab-delimited download options on www.uniprot.org and MySQL's tab/comma-delimited file import functionality.


My final goal is to select certain protein sequences (like "all mammalian proteins") and use a regular expression to search for certain motifs... This might take a while, so I am unsure whether a remote database will be able to handle that... am I wrong?
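The motif search itself is straightforward once the sequences are extracted. A minimal sketch (the motif, accessions and sequences here are invented; the pattern mimics the N-glycosylation-style motif N-{P}-[ST]-{P} as a regex):

```python
import re

# Hypothetical motif: N, then anything but P, then S or T, then anything but P.
motif = re.compile(r"N[^P][ST][^P]")

# Made-up accession -> sequence pairs standing in for the extracted data.
sequences = {
    "tr|A0A001": "MKTNASAIAKQR",
    "tr|A0A002": "MGGPPLLKWE",
}

hits = {}
for acc, seq in sequences.items():
    # Record (position, matched substring) for every occurrence.
    hits[acc] = [(m.start(), m.group()) for m in motif.finditer(seq)]

print(hits["tr|A0A001"])   # [(3, 'NASA')]
```

Regular expressions scan linearly, so this works just as well over flat files as over database rows; the database only helps with the "all mammalian proteins" selection step.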


It would probably work OK on our endpoint. But I think we have just the tool you want, namely PROSITE. In that case I would also have a look at not using a database at all, as databases will not perform optimally for your use case.


Dear me, thanks for your reply! I know PROSITE, but after a quick glance at its "pattern language", it does not look powerful enough for my search requests.

I would also have a look at not using a database at all in that case as they will not perform optimally for your use case.

What do you mean by that? Can you give a hint as to what kind of data storage would be better in this context?

I actually wonder if I should just use a taxonomy tree (potentially stored in MySQL) and split all TrEMBL entries into "single-species files"...
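The per-species split amounts to a group-by over the extracted records. A small sketch with invented `(accession, organism, sequence)` tuples, as they could come out of `record.annotations['organism']` during parsing:

```python
from collections import defaultdict

# Hypothetical records standing in for parsed TrEMBL entries.
records = [
    ("A0A001", "Homo sapiens", "MKTNASAI"),
    ("A0A002", "Mus musculus", "MGGPPLLK"),
    ("A0A003", "Homo sapiens", "MVLSPADK"),
]

# Group accession/sequence pairs by organism name.
by_species = defaultdict(list)
for acc, organism, seq in records:
    by_species[organism].append((acc, seq))

# Each group could then be written to its own FASTA file, e.g.
# with open(organism.replace(" ", "_") + ".fasta", "w") as out: ...
print(sorted(by_species))   # ['Homo sapiens', 'Mus musculus']
```

Mapping species names to taxonomy-tree nodes (so "all mammalian sequences" expands to a set of files) would then be a separate, much smaller lookup table.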

Regards, Jan


Dear dariober,

thanks for your answer. Taking the commit out of the loop was my first iteration of this script. The only major difference is that whenever the script exits for any reason, the database does a full roll-back, which can actually take longer than the original queueing. However, to put numbers to that, I added timestamps before and after the commit. Here are the results for 20,000 entries:

File # 1 reading R 0.17 W 7.08 C 0.00 T 7.25 to go: 773.24h

On a different aspect, I noticed that due to the structure of BioSQL, more than XXXX rows are created across the various tables for "only" 20,000 entries. Maybe it is just typical of MySQL (actually MariaDB in my case) that this number of queries takes so long? Would another backend (PostgreSQL or anything else) be faster?

One further thing I noticed. My files should have 20,000 entries each, but the first one only has 10,000 (that's okay, I split the XML mostly by size). However, if I look at my database after committing just the first file, I have the following NEW rows:

• bioentry 10,744
• [..].
• dbxref 101,971
• bioentry_dbxref 281,767
• bioentry_qualifier_value 410,702

So for roughly 10,000 entries there are nearly 1,000,000 rows in my database. Maybe this is really the problem and I should NOT import all the data (in principle I only need the sequence and the organism...). I will try that and keep this thread updated.

However, I am still interested in advice and further tips!

Best, Jan


Sorry, I can't be more specific as I'm not familiar with BioSQL and, of course, with your task. But...

in principle I only need the sequence and the organism...

I wanted to say this in my first comment... Maybe you should move away from BioSQL altogether and parse the XML files to extract what you need. I think having data in a database is useful if you regularly need to do complex queries. For a one-off data extraction step, just parse the XML.
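Pulling just accession, organism and sequence out of the UniProt XML needs nothing beyond the standard library. A minimal sketch over a single invented stand-in entry (the namespace is UniProt's real one; a full dump would be read incrementally with `ET.iterparse` instead of `fromstring`):

```python
import xml.etree.ElementTree as ET

NS = "{http://uniprot.org/uniprot}"

# Minimal stand-in for one TrEMBL entry; accession and sequence are invented.
xml_chunk = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>A0A000TEST</accession>
    <organism><name type="scientific">Homo sapiens</name></organism>
    <sequence>MKTAYIAKQR</sequence>
  </entry>
</uniprot>"""

rows = []
root = ET.fromstring(xml_chunk)
for entry in root.iter(NS + "entry"):
    acc = entry.findtext(NS + "accession")
    org = entry.findtext("{0}organism/{0}name".format(NS))
    seq = entry.findtext(NS + "sequence").strip()
    rows.append((acc, org, seq))   # or write out a TSV line per entry

print(rows[0])   # ('A0A000TEST', 'Homo sapiens', 'MKTAYIAKQR')
```

The resulting TSV (accession, organism, sequence) is exactly the shape MySQL's delimited-file import expects, and is also trivial to grep or regex-scan directly.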


I really would like to have the sequences in a taxonomy tree, so I can ask for "all mammalian sequences" or "all nematode species" etc. This is difficult if you always have to parse the terabyte-sized XML file...

3.8 years ago
Jan ▴ 20

Thanks for all the tips and advice up to now.

My current solution is to radically cut down the information from the TrEMBL XML.

I only need the organism (in a taxonomy tree) and the most basic entry data (sequence and ID), so I deleted everything else.

Instead of inserting roughly 100 rows per entry into MySQL, I now insert only 5 per entry. This speeds up the transaction dramatically, from an estimated 75 days to 4. Four is still a lot, but more or less only a looong weekend. I will see how usable the database will be later on...

Here is my current program code, for what it's worth:

from Bio import SeqIO
from BioSQL import BioSeqDatabase
import sys
import time

wait_f_min = 6*60*0   # optional start-up delay in minutes (0 = disabled)

for i in range(1, wait_f_min):
    sys.stdout.write("Sleeping for {:5} minutes...\r".format(wait_f_min - i))
    sys.stdout.flush()
    time.sleep(60)

print("Connect to server")
server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd="nytech",
                                      host="localhost", db="bioseqdb")

if "TrEMBL" in server:
    print("Deleting old database")
    del server["TrEMBL"]
    server.commit()

if "TrEMBL" not in server:
    db = server.new_database("TrEMBL", description="all TrEMBL proteins Jan/2018")

db = server["TrEMBL"]
server.commit()

count = 0   # incremented before each read, so workfile_1.xml is read first

print("Starting parsing files")

start_time = time.time()
total = 6400   # total number of files

print("Start {}".format(time.strftime("%H:%M:%S")))
while True:
    loop_start = time.time()
    count = count + 1
    print("File # {} reading ".format(count), end='')
    sys.stdout.flush()
    try:
        handle = open(r'g:\workfile_' + str(count) + '.xml', "r")
    except OSError:
        print("Error opening file!")   # probably finished with all files
        sys.exit()
    records = list(SeqIO.parse(handle, "uniprot-xml"))
    handle.close()
    # Strip everything except organism and recommended name to cut DB rows.
    for record in records:
        del record.dbxrefs[:]
        del record.features[:]
        record.letter_annotations.clear()
        organism = record.annotations.get('organism')
        recommendedName_fullName = record.annotations.get('recommendedName_fullName')
        record.annotations.clear()
        if organism:
            record.annotations['organism'] = organism
        if recommendedName_fullName:
            record.annotations['recommendedName_fullName'] = recommendedName_fullName
    print(" R{:5.2f} ".format((time.time() - loop_start) / 60), end='')   # read time
    sys.stdout.flush()
    write_start = time.time()
    db.load(records)             # write records to DB
    print(" W {:5.2f}".format((time.time() - write_start) / 60), end='')
    sys.stdout.flush()
    commit_start = time.time()
    server.commit()
    print(" C {:5.2f}".format((time.time() - commit_start) / 60), end='')
    loop_time = time.time() - loop_start
    print(" T {:5.2f}".format(loop_time / 60), end='')
    to_go = (loop_time * (total - count)) / 3600
    print(" to go: {:5.2f}h ".format(to_go))

print("End {}".format(time.strftime("%H:%M:%S")))