Your SQL and Python will not improve if you use someone else's implementation ;)
sqlite3 is well supported in python and is really fast despite it's "lite" namesake. You should probably go with it, but remember it does have some drawbacks:
- Cannot rename a column without copying/deleting the existing table.
- No support for MERGE or UPDATE FROM (copying data from one place into only 1 column of an existing table. you need to 'update' the whole row)
- Cannot specify field length (eg. INT 6 digits) - which makes databases much smaller and faster. sqlite has a 'feature' that allows you to put any data type into any field, regardless of what you set the data type of the column to in the first place.
- Despite what you might think, SQL commands often can NOT be used in all SQL implementations. Learn them as if they are implementation-specific, and then be pleasantly surprised if they happen to work on another implementations :)
- There are many ways to import SQL data right into R. The best that i've ever seen is with the RSQLite package. Three lines and you've got your dataframe
- the "... WHERE column BETWEEN x AND y" is awesomely quick in SQL. If your BETWEEN'ing on an indexed column, it'll be faster than bedtools intersect. Operators (< == > !=, etc) are all also highly optimised in low-level code in most SQL implementations. The JOIN command is a pain in the ass to learn/understand - particularly as it always comes bundled with aliasing in examples, but LEFT JOIN, RIGHT JOIN, UNION, etc, basically replicates all the types of intersect you could possibly want to do, and they're mega fast. It's worth getting all that down. SQLite3 doesn't support right join, but MySQL does. Whatever.
- If you're just starting out and SQL-injection isn't a problem (because you're the only one running your python scripts), then I wouldn't bother with variable substitution like: c.execute('insert into tablename values (?,?,?)', item)
It becomes confusing when you have the wrong number of variables, or you forgot you need to supply a tuple, or whatever. It's so much clearer to say: c.execute(' INSERT into " + str(whatever) + " values (" + whatever + ") ')
It looks ugly, you could inject SQL code in there, but it's so much easier to debug when you're starting out and unsure if it's the SQL or the Python that you're having problems with.
modified 5.4 years ago
5.5 years ago by
John ♦ 12k