Database-Backed Program For Management Of SNP Data?
12.6 years ago
Faheemmitha ▴ 210

I'm writing to ask if anyone is aware of database-backed programs for management of SNP data, besides the ones that I list below. To a first approximation, the purpose of such a tool is to import data (which often requires validation) from upstream source files into a database, and then export it in a form usable by analysis tools.

As discussed in the paper, tools like PLINK expect the data to already be in a format they can use. However, getting data into that format can be a nightmare, especially if the data is dirty. So, such a system would be a supplement to existing tools. Of course, once the data is in the database, it can be used for other things.
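To make the import/validate/export idea concrete, here is a minimal sketch using Python's standard `sqlite3` module. It is only an illustration of the general approach, not the schema or code of SNPpy or of any other tool mentioned in this thread. The table names, column names, allowed allele codes and the helper functions `open_db`, `load_genotypes` and `export_map_ped` are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: the constraints do the validation during import.
SCHEMA = """
CREATE TABLE snp (
    rsid  TEXT PRIMARY KEY,
    chrom TEXT NOT NULL,
    pos   INTEGER NOT NULL CHECK (pos > 0)
);
CREATE TABLE sample (
    sample_id TEXT PRIMARY KEY
);
CREATE TABLE genotype (
    sample_id TEXT NOT NULL REFERENCES sample(sample_id),
    rsid      TEXT NOT NULL REFERENCES snp(rsid),
    allele1   TEXT NOT NULL CHECK (allele1 IN ('A','C','G','T','0')),
    allele2   TEXT NOT NULL CHECK (allele2 IN ('A','C','G','T','0')),
    PRIMARY KEY (sample_id, rsid)
);
"""


def open_db():
    """Create an in-memory database with foreign key checks switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def load_genotypes(conn, rows):
    """Insert (sample_id, rsid, allele1, allele2) rows.

    Rows that violate a constraint are not stored; they are returned
    together with the database error message so they can be inspected.
    """
    rejected = []
    for row in rows:
        try:
            conn.execute("INSERT INTO genotype VALUES (?, ?, ?, ?)", row)
        except sqlite3.IntegrityError as exc:
            rejected.append((row, str(exc)))
    conn.commit()
    return rejected


def export_map_ped(conn):
    """Return (map_lines, ped_lines) in MAP/PED layout.

    Family ID is set to the sample ID; parents and sex are '0' and
    phenotype is '-9' (unknown/missing). Missing calls become '0 0'.
    Chromosomes are sorted as text, which is good enough for a sketch.
    """
    snps = conn.execute(
        "SELECT chrom, rsid, pos FROM snp ORDER BY chrom, pos"
    ).fetchall()
    map_lines = [f"{chrom} {rsid} 0 {pos}" for chrom, rsid, pos in snps]

    samples = conn.execute(
        "SELECT sample_id FROM sample ORDER BY sample_id"
    ).fetchall()
    ped_lines = []
    for (sample_id,) in samples:
        calls = {
            rsid: (a1, a2)
            for rsid, a1, a2 in conn.execute(
                "SELECT rsid, allele1, allele2 FROM genotype "
                "WHERE sample_id = ?",
                (sample_id,),
            )
        }
        fields = [sample_id, sample_id, "0", "0", "0", "-9"]
        for _chrom, rsid, _pos in snps:
            fields.extend(calls.get(rsid, ("0", "0")))
        ped_lines.append(" ".join(fields))
    return map_lines, ped_lines
```

With this layout, a row with an unknown allele code, a duplicate (sample, SNP) pair, or a sample that was never registered is rejected by the database itself rather than by hand-written checks, and the export step only ever sees rows that passed.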

I started writing one of these in 2008 in desperation when dealing with some particularly dirty data. This program is called SNPpy. See the PLoS ONE paper online.

I have looked at other software that does this, but only found two, namely SNPLims: a data management system for genome wide association studies, and GWAS Analyzer: integrating genotype, phenotype and public annotation data for genome-wide association study analysis. However, the lead author of SNPLims told me the source code is unavailable, and GWAS Analyzer has (in my opinion) major usability issues. I'm using the source code available here.

I am not aware of any other systems. I find it hard to believe a system like this is not in standard use - perhaps I am missing something. It seems entirely possible that other systems have been created but are proprietary or have simply not been written about.

So, I'm writing to ask if anyone has written or is otherwise using a system like this, aside from those listed here, or if not, is aware of one. Thanks.

EDIT: Updated with the recently published PLoS ONE paper. Note: I'm also trying to upload my SNPpy paper to arXiv, but they have some annoying endorsing procedure, where someone has to endorse me who has recently (at least 2 papers in the last 5 years) uploaded papers to the Quantitative Biology section in arXiv. If you can help, please add a comment. Thanks.

gwas database snp • 4.9k views
ADD COMMENT

What paper discusses PLINK? And what is wrong with storing your data in this format? No matter what system you use, you still need a format to input.

ADD REPLY

This is a great question and has inspired me to ask around to see how others approach the storage of SNP-based information, especially beyond the phenotype association stuff.

ADD REPLY

@Adrian: My paper talks about PLINK briefly. I'm not sure what you mean by "what is wrong with storing your data in this format?". What format is that?

ADD REPLY

@Larry: I would be interested in your comments on my paper, if you care to look at it.

ADD REPLY

What I meant to ask you was, what is wrong with storing your data in PLINK?

ADD REPLY

You mention that getting data into PLINK format is a nightmare. Wouldn't this be the case with other storage formats as well?

ADD REPLY

@Adrian: I'm not sure what you mean by storing your data in PLINK. Perhaps you mean MAP/PED format (and similar formats like the transposed/long-format/binary ones)? I don't think these are specific to PLINK, though. In any case, there is nothing wrong with these formats, and, yes, getting the data into any file format is problematic, especially if the data is dirty and one is converting directly from source files. Hence the motivation for using the database (which can do validation) as an intermediate format. Have you looked at my paper? It goes into the motivation at length.

ADD REPLY
12.6 years ago

For this kind of task (e.g. storing genotypes) I use a key/value datastore like Berkeley DB, with the following tables:

  • key = (chrom, position), value = rs marker
  • key = (index-of-sample), value = sample
  • key = (chrom, position), value = array of N genotypes: array[0] = genotype for sample[0], etc.

This structure allows quick retrieval of the genotype for a given individual and a given rs#.
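The three tables above can be sketched roughly as follows. This is not actual Berkeley DB code: plain Python dicts with byte-string keys and values stand in for the key/value store, and the class `GenotypeStore`, the helper `pos_key`, the `chrom:position` key encoding and the two-characters-per-genotype packing are all assumptions made for illustration.

```python
def pos_key(chrom, position):
    """Encode (chrom, position) as a bytes key, as a byte-oriented store needs."""
    return f"{chrom}:{position}".encode("ascii")


class GenotypeStore:
    """Three key/value tables, mimicked here with in-memory dicts."""

    def __init__(self, samples):
        # key: index-of-sample, value: sample name
        self.sample_by_index = {i: name for i, name in enumerate(samples)}
        # key: (chrom, position), value: rs marker
        self.marker_by_pos = {}
        # key: (chrom, position), value: packed genotypes for all samples
        self.genotypes_by_pos = {}

    def put_marker(self, chrom, position, rsid, genotypes):
        """Store one marker and its genotypes, one two-letter call per sample."""
        if len(genotypes) != len(self.sample_by_index):
            raise ValueError("need exactly one genotype per sample")
        if any(len(g) != 2 for g in genotypes):
            raise ValueError("each genotype must be two allele characters")
        key = pos_key(chrom, position)
        self.marker_by_pos[key] = rsid.encode("ascii")
        # Fixed width of 2 bytes per sample, so sample i sits at offset 2*i.
        self.genotypes_by_pos[key] = "".join(genotypes).encode("ascii")

    def genotype(self, chrom, position, sample_index):
        """Return the genotype of one sample at one position."""
        packed = self.genotypes_by_pos[pos_key(chrom, position)]
        start = 2 * sample_index
        return packed[start:start + 2].decode("ascii")
```

For example, after `store = GenotypeStore(["S1", "S2", "S3"])` and `store.put_marker("1", 100, "rs1", ["AG", "GG", "AA"])`, the call `store.genotype("1", 100, 1)` reads the second sample's genotype with one key lookup and one slice. Looking up by rs# directly would presumably need an additional table mapping rs marker to (chrom, position); that table is not part of the layout listed above.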

ADD COMMENT

Thanks for your answer, Pierre. Do you have an implementation of this approach? Then I could test it. I'm not familiar with BDB or the non-relational dbs at all, but may I ask a few lazy questions? 1) Can you do db constraints as done with relational dbs? 2) What kind of speed timings do you get for import/export? 3) Can you use SQL or similar to export the data into e.g. MAP/PED files? 4) The Wikipedia BDB page implies it is ACID. Is that correct?

ADD REPLY

1) Quick answer: no. 2) Faster, because you know how your data is structured and there is no need to translate an SQL query. 3) You have to write every part of the code. 4) Yes, there is a transaction API for BDB.

ADD REPLY

I don't have simple code available to post on Biostar, but you can find a similar approach in this post: http://plindenbaum.blogspot.com/2010/04/short-post-plain-text-vs-binary-data.html

ADD REPLY
12.6 years ago

While not a "database", R has a lot of features that make it similar to columnar databases. More to the point, you might take a look at:

http://www.bioconductor.org/packages/2.8/bioc/html/snpMatrix.html

ADD COMMENT

Since this isn't db backed, it is not the kind of thing I was looking for, but thanks for the pointer.

ADD REPLY
12.5 years ago

Magnus Nordborg's lab has developed a set of resources for warehousing and accessing SNP and GWAS data in Arabidopsis:

I can't vouch for the utility of these tools but I know that the rationale for their design was close to your remit.

ADD COMMENT

Thanks, this looks interesting.

ADD REPLY
12.6 years ago

Have you considered tools like MEGA and Arlequin? Both are GUI-based programs aimed at lab geneticists, but they can carry out a wide range of analysis and management tasks.

By the way, your SQLAlchemy software is cool :-)

ADD COMMENT

@Giovanni: I looked at MEGA and Arlequin. Both of these seem to be analysis toolkits, though I did not try to use them. Though they both do many things, neither of them appears to do what the tools I listed above do, namely import/export of data. I did look at a section in the Arlequin manual about Import, but I'm not clear what it is doing when importing, if anything. And it doesn't look like a db is involved. (Thanks for your kind words. Have you tried using my project?)

ADD REPLY
