Question: Custom database for genomic data
gravatar for A. Domingues
5.5 years ago by
A. Domingues2.2k
Dresden, Germany
A. Domingues2.2k wrote:

I am wondering whether there is some tool/scripts available to create a database of genomic data from a set of files? These files are mostly in bed format, but there are also results files from peaks called with MACS and gene expression data from cufflinks.


Say I need a list of genes ( and coordinates) that have a peak for a certain factor, but not another, and genes genes must be expressed. At the moment, I use custom R scripts, which do a series of merges, annotations, and filtering steps to get the final table. However, this is not very elegant or efficient, and to me it seems like a classical problem that could be solved by implementing a database.

As it is often the case in bioinformatics, it is likely that someone already came up with a solution. Trouble is, I can't find it. So before I start implementing something like that myself, does anyone know of an existing solution?



rna-seq chip-seq database • 2.4k views
ADD COMMENTlink modified 5.5 years ago by Pierre Lindenbaum125k • written 5.5 years ago by A. Domingues2.2k

how about storing your files as bzgip+tabix and querying everything using tabix or bedtools ?

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum125k

I have not thought of tabix - well, I never really used it - but you got me curious.

From what I just read about it, each bed file would still be separate and the query would be done independently right? Let's say each bed represent different factor's binding sites, and I would like to know to which of my genes is bound by a factor but not another?

ADD REPLYlink written 5.5 years ago by A. Domingues2.2k

Building on what Pierre suggested... since you're already using R, are you using the genomeIntervals package? It has some of the functionality of bedtools and might be easy to integrate to your existing scripts. Although, it sounds like you're interested in using SQL and Python partly for the learning experience, so in that case go for it (if you have the time)!

ADD REPLYlink written 5.5 years ago by Katie D'Aco1000

I am using R, but not that package.

No one has the time :) But you are right. For me this would be a nice (and useful) side project to make something more large scale in python (already in use for smaller stuff), and learn SQL in the process (it would be implemented in sqlite). As mentioned, my current approach is good enough, but it somehow feels rough.

ADD REPLYlink written 5.5 years ago by A. Domingues2.2k

Note that the GenomicRanges package is also a good resource and is a bit more mainstream than genomeIntervals.  Your mileage may vary, though.  In python, consider the bx-python library.

ADD REPLYlink written 5.5 years ago by Sean Davis25k

In R I already use GRanges, but in Python so far it has been pybedtools. Very good package btw.

ADD REPLYlink written 5.5 years ago by A. Domingues2.2k

There is also the pygenes package in python.

ADD REPLYlink written 5.5 years ago by Sean Davis25k
gravatar for Pierre Lindenbaum
5.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum125k wrote:

just one tip If you want to use a whatever-sql-database; use a bin column: The 'Bin' Column Used By Sam, Ucsc...

ADD COMMENTlink written 5.5 years ago by Pierre Lindenbaum125k

Or use a database that supports range queries:

ADD REPLYlink written 5.5 years ago by Sean Davis25k

That is an awesome trick :) I'm going to have to read your blog daily from now on :D

ADD REPLYlink written 5.5 years ago by John12k
gravatar for John
5.5 years ago by
John12k wrote:

Your SQL and Python will not improve if you use someone else's implementation ;)

Some tips:
sqlite3 is well supported in python and is really fast despite it's "lite" namesake. You should probably go with it, but remember it does have some drawbacks:
- Cannot rename a column without copying/deleting the existing table.
- No support for MERGE or UPDATE FROM (copying data from one place into only 1 column of an existing table. you need to 'update' the whole row)
- Cannot specify field length (eg. INT 6 digits) - which makes databases much smaller and faster. sqlite has a 'feature' that allows you to put any data type into any field, regardless of what you set the data type of the column to in the first place.
- Despite what you might think, SQL commands often can NOT be used in all SQL implementations. Learn them as if they are implementation-specific, and then be pleasantly surprised if they happen to work on another implementations :)
- There are many ways to import SQL data right into R. The best that i've ever seen is with the RSQLite package. Three lines and you've got your dataframe
- the "... WHERE column BETWEEN x AND y" is awesomely quick in SQL. If your BETWEEN'ing on an indexed column, it'll be faster than bedtools intersect. Operators (< == > !=, etc) are all also highly optimised in low-level code in most SQL implementations. The JOIN command is a pain in the ass to learn/understand - particularly as it always comes bundled with aliasing in examples, but LEFT JOIN, RIGHT JOIN, UNION, etc, basically replicates all the types of intersect you could possibly want to do, and they're mega fast. It's worth getting all that down. SQLite3 doesn't support right join, but MySQL does. Whatever.

- If you're just starting out and SQL-injection isn't a problem (because you're the only one running your python scripts), then I wouldn't bother with variable substitution like: c.execute('insert into tablename values (?,?,?)', item)
It becomes confusing when you have the wrong number of variables, or you forgot you need to supply a tuple, or whatever. It's so much clearer to say: c.execute(' INSERT into " + str(whatever) + " values (" + whatever + ") ')
It looks ugly, you could inject SQL code in there, but it's so much easier to debug when you're starting out and unsure if it's the SQL or the Python that you're having problems with.

ADD COMMENTlink modified 5.4 years ago • written 5.5 years ago by John12k

Brilliant advice, specially the first line :D

As mentioned in my comment to Katie, sqlite is indeed my preferred implementation. I am aware that in some instances the entire database needs to be loaded, but for the type of data and number of tables, this is something I can live with. Recreating the db after peaks for a new TF have been called, will take little time, and this time will be recovered later with the queries - or so I expect.

Thanks a lot for the feedback.

ADD REPLYlink written 5.5 years ago by A. Domingues2.2k
gravatar for pld
5.5 years ago by
United States
pld4.8k wrote:

Chances are a preexisting schema will not fit the data you have or the structure you want, meaning any scripts built on that schema will not work with the data you have or result in the capabilities you desire. You might be able to retool something that currently exists, but this would probably be at least as much work as doing it on your own.

You're best bet would be to learn some SQL or something similar so you can first design your database. From there you can write your own scripts for uploading, or use whatever integrated tools come with the RDBMS/DBMS. It is a skill that is very useful and there is a huge degree of online support and tutorials for whatever database platform you choose to go with.

ADD COMMENTlink written 5.5 years ago by pld4.8k

1, Yes, existing schemas might not fit my requirements exactly, but something based on bed files and gene tables would be a start.

2. I forgot to mention that I already started working on such a db with the long view of improving my SQL and python skills (I already know a bit but this would a good project to hone them). But since I have barely started, it would be nice to know if there was something out there that I could use either on the short-term, or as inspiration.


ADD REPLYlink written 5.5 years ago by A. Domingues2.2k

I suppose you can hunt around for publicly available schema, but you should use this as an opportunity to gain experience rather than borrow as much code as you can (especially if you don't understand how that code works).

Lastly, you might want to consider how frequently the database will be updated/altered. If you're just uploading data in bulk, your support scripts should be focused on getting your data from the raw source to a format that can be uploaded in bulk (e.g. the MySQL bulk uploader). You only want to have scripts altering data on a per-row basis if you really need such fine grain control, otherwise you'll waste a ton of time.

ADD REPLYlink written 5.5 years ago by pld4.8k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2478 users visited in the last hour