Mapping Data In Disparate Text Files Into Database/Xml Schema
3
2
Entering edit mode
13.0 years ago
Pi ▴ 520

Greeting

Many of the analyses bioinformatics yield disparate datasets in text files such as csv or a specific file format where each column heading represents a different class of information.

I was wondering if there exists a tool whereby you could map the column headings of text files like this to a database schema/xsd and the tool could extract the data and input into your database or create xml formatted data. The tool would have to understand all the relationships in your data. For example one text file might contain lots of rows from a parent table in a 1:M relationship and another file might contain related child rows, and the tool would have to link back to parent and child via a shared identifier.

I have seen a commerical tool Altova MapForce but its not cheap. Are there any bioinformatics tools for this sort of things as it must be a daily requirement for many programmers.

Thank you for you time

mapping • 4.9k views
ADD COMMENT
1
Entering edit mode
13.0 years ago

Mapping the relationships between the tables can be described using XML by the some ORMs tools (like Hibernate), or by the XMI specification.

Adding semantics to the tables would be a very good idea. I think it is what the
information-artifact-ontology tries to achieve.

ADD COMMENT
0
Entering edit mode

thank you kindly for your answer. However I was expecting there to be existing tools to do this sort of things like MapForce as it is such a common issue in this field

ADD REPLY
1
Entering edit mode
13.0 years ago
Joachim ★ 2.9k

You can import TSV- and CSV-files directly into MySQL: http://dev.mysql.com/doc/refman/5.6/en/load-data.html

On the command line, you can use mysqlimport: http://dev.mysql.com/doc/refman/5.6/en/mysqlimport.html

You can explicitly address the 1:M relationships as you described them by setting the appropriate columns as primary keys/foreign keys, but you do not have to do that.

You can then use BioMart, http://www.biomart.org/, to make the data accessible through various interfaces / web-services. BioMart picks up on the column names of your tables and will automatically link your data if columns are named alike. It also comes with a configuration tool that renders your tables and the relationships between them graphically. You can make adjustments as needed (for example, when BioMart thinks it sees a "N:M" relationship, but you know it is a "1:M" relationship, really).

ADD COMMENT
1
Entering edit mode

I see. Have a look at the command line tools 'cut' and/or 'awk'/'gawk'. You should be able to quickly rearrange the columns of your data-files using these tools. Once the data has been rearranged, you can use 'LOAD DATA' or 'mysqlimport' to get it into your tables.

ADD REPLY
0
Entering edit mode

Hello. I have looked at biomart but it isn't quite what I am after I don't think. I already have a defined database schema which i am happy with. I was hoping i could map fields from lots of different text files and some sort of tool could populate my database with the data based on a mapping from the text file columns to database fields

ADD REPLY
0
Entering edit mode

Hi, some of the relationships aren't quite as simple as that. For example some of the rows may not have explicit ids and i was hoping the wonder tool I am after could create the ids and manage them for me. I was also hoping it would generate the LOAD DATA statements based on a selection of column names from the text files and corresponding database column names. I was hoping for a nice visual interface to do this with too. Perhaps I am just too lazy but I have used things to do this in the past in other fields but they weren't free.

ADD REPLY
0
Entering edit mode

It is very hard to judge what would be the right solution for you without seeing the actual data, but I understand your point better now. This might be completely off-topic, but have you had a look at this list? http://en.wikipedia.org/wiki/Comparison_of_business_integration_software

ADD REPLY
0
Entering edit mode

Yes actually. Altova MapForce on that list is one I've looked at but its not free

ADD REPLY
1
Entering edit mode
13.0 years ago
Neilfws 49k

One solution to this type of problem is to get the data into a database first, then worry about the schema later.

A useful tool could be one of the so-called "NoSQL" databases. For example, MongoDB has import/export tools which allow import of CSV or TSV files. This gets you quickly to structured data; then you can put a schema on top using an ORM. For Ruby users, Mongoid and MongoMapper are both excellent options.

MongoDB does not export XML, but it will export JSON. It's relatively easy to either convert JSON to XML or export XML using a library. Again, for Ruby, Crack is a useful gem for XML/JSON parsing, interconversion and export from a hash structure.

In general, no tool will "understand your data"; you have to provide some definitions.

ADD COMMENT
0
Entering edit mode

Thank you for your answer. I was expecting to provide definitions but i was hoping some tool would exist whereby i could supply a database schema or xsd and then map columns in my text files to these schema definitions then the tool could automagically populate a database or xml file with the data from my text files.

ADD REPLY

Login before adding your answer.

Traffic: 2494 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6