How Can I Unify Genome Data From Different Sources?
13.4 years ago

I'm trying to work out the best way to retrieve data from many different genome databases and unify it into a common format I can work with locally.

I'm doing comparative genomics analysis and have been using EnsEMBL as my primary source to date. It has excellent APIs for accessing the various annotation features, but I would now like to retrieve sequences from other databases such as FlyBase and GenBank, and I'm not sure of the best way to approach it.

Perhaps I should write separate wrappers/parsers and pull the data into a SQLite database, a GFF file or a Chaos-XML file? I'd also like the result to be portable if possible.

python comparative database genome • 4.0k views
ADD COMMENT
13.4 years ago

Unifying the different genome databases? Isn't that one of the "Holy Grails" of bioinformatics? :-)

Just as a suggestion, you could have a look at RDF (Resource Description Framework) as a format to unify those data (see also bio2rdf).

ADD COMMENT

Lol! Thanks Pierre, I'll take a look at that :-)

ADD REPLY
13.4 years ago
Neilfws 49k

As Pierre says, you've identified the biggest problem in bioinformatics: data integration.

I think GFF is a good option. There are plenty of libraries around to read, write and convert to/from GFF. See, for example, this Bioperl page. Using Bioperl also has the advantage that GFF can be loaded quickly into a MySQL database. See these load scripts, described as part of the GMOD GBrowse project.

For another SQL-based option, I would look at BioSQL. In its own words: "a generic relational model covering sequences, features, sequence and feature annotation, a reference taxonomy, and ontologies." There are bindings for the major Bio* projects (Bioperl, BioPython, BioRuby, BioJava). People have also written load scripts; for example, using Perl:

bpload_seqdatabase.pl -dbuser USER -dbname biosql -namespace swissprot -format swiss sprot40.dat

to load SwissProt into a BioSQL database. The Perl loaders will handle any sequence format recognised by Bio::SeqIO.

ADD COMMENT

I'm not sure whether there is a SQLite schema. And performance would not be great for anything more than rather small amounts of data.

ADD REPLY

I think either way would work. GFF is quite easy to work with, since it's practically a tab-delimited format, so all the usual shell tools for text processing apply. Or you can take advantage of the Bio* methods. Adding annotation is easy with something like Bioperl's feature annotation.
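To illustrate how little is needed to read the tab-delimited layout, here is a minimal sketch using only the Python standard library. It assumes well-formed GFF3 with nine columns and `key=value` attributes separated by `;`. The function name `parse_gff3_line` and the example record are made up for illustration; a Bio* parser would handle edge cases this sketch ignores.

```python
from urllib.parse import unquote

# Names of the first eight GFF3 columns; the ninth holds the attributes.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict.

    Returns None for blank lines and for comment/directive lines ('#').
    Hypothetical helper, not part of any Bio* library.
    """
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    # GFF coordinates are 1-based and inclusive at both ends.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    if fields[8] != ".":
        for pair in fields[8].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            # Values may be comma-separated lists and are percent-encoded.
            attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Invented example record, not taken from any real database.
example = "chr1\texample_source\tgene\t100\t200\t.\t+\t.\tID=gene0001;Name=demo%20gene"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```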

ADD REPLY

Interesting, thanks! I'm actually now thinking maybe BioSQL with SQLite might suit me? I want something portable, which means MySQL would be unfeasible really, but using BioSQL with SQLite drivers might just work?

ADD REPLY

http://biopython.org/SRC/biopython/NEWS has some information under the 1.53 release: it states that SQLite support has been built into BioSQL, with a draft schema, as of Dec 2009. I'm going to have to read up on RDF and GFF. Perhaps simple flat-file FASTAs on the local filesystem are the way to go, as long as retrieval times are acceptable? I know it can take a couple of hours to download all exon sequences for an organism using the EnsEMBL Perl API, for example!

ADD REPLY

Flat files can be fine for retrieval if they're indexed. The Bio* projects have tools to do that.
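As a rough sketch of what "indexed" means here: record the byte offset of each FASTA header once, then seek straight to a record instead of scanning the whole file. The helper names `build_fasta_index` and `fetch_sequence` and the demo sequences are invented for this example. In practice the Bio* tools (Biopython's `SeqIO.index`, for instance) do this more robustly.

```python
import os
import tempfile


def build_fasta_index(path):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    # The ID is the first whitespace-delimited token.
                    index[fields[0].decode()] = offset
    return index


def fetch_sequence(path, index, record_id):
    """Seek to one record and return its sequence as a single string."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(index[record_id])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            chunks.append(line.strip())
    return b"".join(chunks).decode()


# Demo on a throwaway file with invented sequences.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "demo.fa")
    with open(fasta_path, "w") as out:
        out.write(">seqA first record\nACGTACGT\nTTGA\n>seqB\nGGGCCC\n")
    index = build_fasta_index(fasta_path)
    seq_a = fetch_sequence(fasta_path, index, "seqA")
    seq_b = fetch_sequence(fasta_path, index, "seqB")

print(seq_a, seq_b)
```

The index is a plain dict, so it could be saved next to the FASTA file (as JSON, say) and reused between runs.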

ADD REPLY

I'm just reading through the GFF3 specification now. It looks like an excellent format, and portable too. Parsing should be relatively quick given the Bio* wrappers? I'm wondering about ad hoc sequence access, however: can you append annotations and sequences to the file, or would it be best to import the whole genomic sequences into GFF first?

ADD REPLY
13.4 years ago
lh3 33k

Researchers have been aware of the format issues for many years. GFF and BED are "common formats" for annotations, Fasta/q for sequences, Wiggle for signals, SAM for read alignments, NH/NHX/PhyloXML for trees and VCF for SNPs/indels. There is a common format in almost every field, and each sensible database already exports its data in one of these formats, or a service/library is available to convert formats for you. You do not need to do much unless you are fond of programming.
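Converting between these formats is mostly a matter of column order and coordinate conventions. As a minimal sketch, the function below turns one GFF3 feature line into a BED6 line: GFF coordinates are 1-based and inclusive, BED coordinates are 0-based and half-open, so only the start shifts by one. The function name `gff3_to_bed6`, the use of the `ID` attribute as the BED name and the constant score of 0 are choices made for this illustration, not a standard.

```python
def gff3_to_bed6(line):
    """Convert one 9-column GFF3 feature line to a BED6 line.

    Illustrative helper: the name comes from the ID attribute when
    present, otherwise from the feature type; the score is set to 0.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, _source, feature_type, start, end, _score, strand = fields[:7]

    name = feature_type
    for pair in fields[8].split(";"):
        key, _, value = pair.partition("=")
        if key == "ID" and value:
            name = value
            break

    bed_start = int(start) - 1  # 1-based inclusive -> 0-based
    bed_end = int(end)          # inclusive end equals the half-open end
    if strand not in ("+", "-"):
        strand = "."            # BED has no equivalent of GFF's '?'
    return "\t".join([seqid, str(bed_start), str(bed_end), name, "0", strand])


# Invented example record.
gff_line = "chr1\texample_source\tgene\t100\t200\t.\t+\t.\tID=gene0001;Name=demo"
bed_line = gff3_to_bed6(gff_line)
print(bed_line)
```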

BTW, if you work with gigabytes of data, do not use XML. XML is great for many purposes, but it is evil for that amount of data.

ADD COMMENT

My experience is that SQL is overcomplicated for small projects. Simply using the filesystem to organize your data is sufficient and saves you time. SQL is only useful when you work with different types of data and complex relationships over a fairly long time. In that case you may spend more time at the beginning, but the effort will pay off eventually. Whether to use SQL depends on your project.

ADD REPLY

I'd like to be able to store annotations, sequences, perhaps even trees in a common format I can work with locally, but that can also be portable! Perhaps SQLite will be most useful? With parsers to retrieve and convert the data?

ADD REPLY
13.4 years ago
Mndoci ★ 1.2k

One approach, which works especially well if you are pulling data from different domains, is to write a core schema for the critical data types, plus extensions that let you add new sources in the future. It's much easier to write parsers once you've designed a schema.

I wouldn't worry about trying to get data into a common format. That's never going to work with all the change that happens. You're much better off building a SQL system, or some other queryable environment, and writing appropriate wrappers. It's more consistent and less error-prone, and you can then build on top of it.
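One way to picture the "core schema plus extensions" idea is the sketch below, which uses Python's built-in `sqlite3` module. Every table name, column name and example value is an assumption made for illustration; this is not BioSQL or any existing schema. The core `feature` table holds what every source provides, while `feature_attribute` is a key/value extension table for source-specific fields.

```python
import sqlite3

# An in-memory database for the demo; a file path would give a single
# portable database file instead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (
    source_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
-- Core table: fields every source is assumed to provide.
CREATE TABLE feature (
    feature_id   INTEGER PRIMARY KEY,
    source_id    INTEGER NOT NULL REFERENCES source(source_id),
    seqid        TEXT NOT NULL,
    feature_type TEXT NOT NULL,
    start_pos    INTEGER NOT NULL,
    end_pos      INTEGER NOT NULL,
    strand       TEXT
);
-- Extension table: source-specific key/value pairs.
CREATE TABLE feature_attribute (
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    attr_key   TEXT NOT NULL,
    attr_value TEXT NOT NULL
);
""")


def add_feature(conn, source_name, seqid, feature_type, start, end,
                strand, extra=None):
    """Insert one feature; a per-source wrapper/parser would call this."""
    conn.execute("INSERT OR IGNORE INTO source (name) VALUES (?)",
                 (source_name,))
    source_id = conn.execute(
        "SELECT source_id FROM source WHERE name = ?", (source_name,)
    ).fetchone()[0]
    cursor = conn.execute(
        "INSERT INTO feature (source_id, seqid, feature_type, start_pos,"
        " end_pos, strand) VALUES (?, ?, ?, ?, ?, ?)",
        (source_id, seqid, feature_type, start, end, strand),
    )
    feature_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO feature_attribute VALUES (?, ?, ?)",
        [(feature_id, key, value) for key, value in (extra or {}).items()],
    )
    return feature_id


# Invented records standing in for the output of two source-specific parsers.
add_feature(conn, "Ensembl", "chr1", "gene", 100, 200, "+",
            {"biotype": "protein_coding"})
add_feature(conn, "FlyBase", "2L", "gene", 500, 900, "-",
            {"symbol": "demo"})
conn.commit()

genes_per_source = conn.execute(
    "SELECT s.name, COUNT(*) FROM feature f"
    " JOIN source s USING (source_id)"
    " WHERE f.feature_type = 'gene'"
    " GROUP BY s.name ORDER BY s.name"
).fetchall()
print(genes_per_source)
```

Adding a new source then means writing one more parser that calls `add_feature`, without touching the core tables.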

ADD COMMENT
