I've been setting up a data management system using Berkeley DB and the python bindings (bsddb3) to run parallel tasks for ~ 80 million RNA sequences that have properties similar to the "Target RNA" schema below.
length, sequence, expression, mismatchPositions, etc.
10 Parallel processes that access separate records ("rows") and write to the database run very slowly. Each job seems to run slower than the previous, perhaps indicating some sort of locking issue? For instance, if I do a test to update the length of 10 million sequences I'll run 10 separate jobs that access different records (0-1,000,000 for the first job, 1,000,001-2,000,000 for the second, etc.) like so:
#n is job number, dbFN is the filelocation of database import bsddb3 db = bsddb3.db.DB() db.set_cachesize(1,0) db.open(dbFN, None, bsddb3.db.DB_HASH, bsddb3.db.DB_CREATE) for i in range(n*1,000,000, (n+1)*1,000,000): db[str(i)] = str(int(db[str(i)]) + 1) db.close()
And the runtime (s) for each job finishing are something like: 30, 50, 60, 70, etc. Running only one job to update the whole 10 million sequences will take ~ 120s. It seems to slow down the cluster even though the database is 200 MB total. Running 100 jobs that are read only works fine. I'd like to be able to run 30+ jobs concurrently so not being able to run 10 worries me.
- Each Processs is accessing different records, why would the speed of the nth job depend on the previous jobs? Doesn't bdb lock on records as opposed to entire DBs? Will manually using transactions speed this up?
- In my previous post there was mention of using redis to run 100 jobs running in parallel all accessing the same memory file, is this something berkeley db is not capable of doing because the records aren't stored in ram like redis?
- How can I tell which part of the system is causing the problem? Is it a bdb problem or a network/filesystem problem. I'm fairly ignorant about filesystems/networks and don't know how to troubleshoot this type of problem. Any tips would be appreciated.
- Is there flags I can pass to allow easier concurrency or db parameters I can change like page sizes to make it more efficient?
- If the problem is due to my network/filesystem setup, how is your system set up to all for this type of concurrency?
- bdb 4.8 with bsddb3
- 5 nodes (w/ Gigabit ethernet connections) using NFS
- "dd throughput": 60MB/s
- cache: 1GB for each db connection
- each db is a HASH type.
If any more info is needed I'll update ASAP.
FYI, Berkeley DB does not support synchronous database access by multiple processes FOR NFS (as mentioned in the comments) http://www.oracle.com/technetwork/database/berkeleydb/db-faq-095848.html
Also, Tokyo/Kyoto cabinet do not support simultaneous database access via multiple processes at all - Tokyo/Kyoto Tyrant/Tycoon have to be used instead, although I'm not sure if even these are safe with NFS?