Question

Mapping Data In Disparate Text Files Into Database/Xml Schema

2

13.0 years ago

Pi ▴ 520

Greetings,

Many bioinformatics analyses yield disparate datasets in text files, such as CSV or a specific file format, where each column heading represents a different class of information.

I was wondering whether a tool exists that lets you map the column headings of text files like these to a database schema or XSD, and then extracts the data and either loads it into your database or creates XML-formatted data. The tool would have to understand all the relationships in your data. For example, one text file might contain many rows from a parent table in a 1:M relationship and another file might contain the related child rows, and the tool would have to link parent and child via a shared identifier.

I have seen a commercial tool, Altova MapForce, but it's not cheap. Are there any bioinformatics tools for this sort of thing? It must be a daily requirement for many programmers.

Thank you for your time

mapping • 4.9k views

ADD COMMENT • link updated 13.0 years ago by Neilfws 49k • written 13.0 years ago by Pi ▴ 520

score 1 · Answer 1 · 2011-04-17

1

13.0 years ago

Pierre Lindenbaum 161k

The relationships between the tables can be described in XML by some ORM tools (like Hibernate), or by using the XMI specification.

Adding semantics to the tables would be a very good idea. I think it is what the information-artifact-ontology tries to achieve.

ADD COMMENT • link 13.0 years ago by Pierre Lindenbaum 161k

0

Thank you kindly for your answer. However, I was expecting there to be existing tools like MapForce for this sort of thing, as it is such a common issue in this field.

ADD REPLY • link 13.0 years ago by Pi ▴ 520

Ram · Answer 2 · 2011-04-17

1

13.0 years ago

Joachim ★ 2.9k

You can import TSV and CSV files directly into MySQL: http://dev.mysql.com/doc/refman/5.6/en/load-data.html

On the command line, you can use mysqlimport: http://dev.mysql.com/doc/refman/5.6/en/mysqlimport.html

You can explicitly address the 1:M relationships as you described them by setting the appropriate columns as primary keys/foreign keys, but you do not have to do that.
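As a rough illustration of the primary key/foreign key idea, here is a minimal sketch in Python. It swaps in SQLite (via the standard `sqlite3` module) for MySQL so that it runs without a server, and the table names (`sample`, `measurement`), column names, and example rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical data standing in for two text files: a parent file and a child file.
PARENT_CSV = "sample_id,organism\nS1,human\nS2,mouse\n"
CHILD_CSV = "sample_id,gene,expression\nS1,TP53,4.2\nS1,BRCA1,1.7\nS2,Trp53,3.9\n"


def load_parent_child(parent_text, child_text):
    """Load parent and child rows into two tables linked by sample_id."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE sample (sample_id TEXT PRIMARY KEY, organism TEXT)"
    )
    conn.execute(
        "CREATE TABLE measurement ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # generated id for child rows
        " sample_id TEXT NOT NULL REFERENCES sample(sample_id),"
        " gene TEXT,"
        " expression REAL)"
    )
    # Parents must be inserted first so the child foreign keys can resolve.
    parents = list(csv.DictReader(io.StringIO(parent_text)))
    conn.executemany(
        "INSERT INTO sample (sample_id, organism)"
        " VALUES (:sample_id, :organism)",
        parents,
    )
    children = list(csv.DictReader(io.StringIO(child_text)))
    conn.executemany(
        "INSERT INTO measurement (sample_id, gene, expression)"
        " VALUES (:sample_id, :gene, :expression)",
        children,
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = load_parent_child(PARENT_CSV, CHILD_CSV)
    query = (
        "SELECT s.organism, COUNT(*) FROM sample s"
        " JOIN measurement m ON s.sample_id = m.sample_id"
        " GROUP BY s.sample_id ORDER BY s.sample_id"
    )
    for organism, n_rows in conn.execute(query):
        print(organism, n_rows)
```

With the foreign key declared, a child row that points at an unknown `sample_id` is rejected instead of being silently loaded, which is the main practical benefit of declaring the relationship.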

You can then use BioMart, http://www.biomart.org/, to make the data accessible through various interfaces / web-services. BioMart picks up on the column names of your tables and will automatically link your data if columns are named alike. It also comes with a configuration tool that renders your tables and the relationships between them graphically. You can make adjustments as needed (for example, when BioMart thinks it sees a "N:M" relationship, but you know it is a "1:M" relationship, really).

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.0 years ago by Joachim ★ 2.9k

1

I see. Have a look at the command line tools 'cut' and/or 'awk'/'gawk'. You should be able to quickly rearrange the columns of your data-files using these tools. Once the data has been rearranged, you can use 'LOAD DATA' or 'mysqlimport' to get it into your tables.
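If the rearrangement should be driven by column names rather than column positions, the same step can be sketched in Python instead of 'cut'/'awk'. This is only an illustration: the headings in `COLUMN_MAP` and the order in `TABLE_COLUMNS` are made up, and the output is a tab-separated file without a header, which is one layout that 'LOAD DATA' can read.

```python
import csv
import io

# Hypothetical mapping: text-file column heading -> database column name.
COLUMN_MAP = {"Gene Symbol": "gene", "Sample": "sample_id", "Expr": "expression"}
# Hypothetical column order of the target table.
TABLE_COLUMNS = ["sample_id", "gene", "expression"]


def remap_columns(in_handle, out_handle, column_map, table_columns):
    """Rewrite a CSV with a header as header-less TSV in table column order."""
    reader = csv.DictReader(in_handle)
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    # Invert the mapping so each table column knows its source heading.
    source_for = {db_col: heading for heading, db_col in column_map.items()}
    for row in reader:
        writer.writerow([row[source_for[col]] for col in table_columns])


if __name__ == "__main__":
    source = io.StringIO("Gene Symbol,Expr,Sample\nTP53,4.2,S1\nBRCA1,1.7,S1\n")
    target = io.StringIO()
    remap_columns(source, target, COLUMN_MAP, TABLE_COLUMNS)
    print(target.getvalue(), end="")
```

Columns in the input that are not listed in the mapping are simply dropped, and a heading missing from the file raises a `KeyError`, so mismatches show up early.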

ADD REPLY • link 13.0 years ago by Joachim ★ 2.9k

0

Hello. I have looked at BioMart, but I don't think it is quite what I am after. I already have a defined database schema which I am happy with. I was hoping I could map fields from lots of different text files, and that some sort of tool could populate my database with the data based on a mapping from the text file columns to database fields.

ADD REPLY • link 13.0 years ago by Pi ▴ 520

0

Hi, some of the relationships aren't quite as simple as that. For example, some of the rows may not have explicit IDs, and I was hoping the wonder tool I am after could create the IDs and manage them for me. I was also hoping it would generate the LOAD DATA statements based on a selection of column names from the text files and the corresponding database column names. I was hoping for a nice visual interface to do this with, too. Perhaps I am just too lazy, but I have used tools like this in other fields in the past; they weren't free.

ADD REPLY • link 13.0 years ago by Pi ▴ 520

0

It is very hard to judge what would be the right solution for you without seeing the actual data, but I understand your point better now. This might be completely off-topic, but have you had a look at this list? http://en.wikipedia.org/wiki/Comparison_of_business_integration_software

ADD REPLY • link 13.0 years ago by Joachim ★ 2.9k

0

Yes, actually. Altova MapForce, which is on that list, is one I've looked at, but it's not free.

ADD REPLY • link 13.0 years ago by Pi ▴ 520

score 1 · Answer 3 · 2011-04-17

One solution to this type of problem is to get the data into a database first, then worry about the schema later.

A useful tool could be one of the so-called "NoSQL" databases. For example, MongoDB has import/export tools which allow import of CSV or TSV files. This gets you quickly to structured data; then you can put a schema on top using an ORM. For Ruby users, Mongoid and MongoMapper are both excellent options.

MongoDB does not export XML, but it will export JSON. It's relatively easy to either convert JSON to XML or export XML using a library. Again, for Ruby, Crack is a useful gem for XML/JSON parsing, interconversion and export from a hash structure.
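For readers not using Ruby, the JSON-to-XML step can be sketched with the Python standard library alone. The sketch assumes the export contains one flat JSON document per line and that every key is a valid XML element name; nested values (such as MongoDB's `_id` objects) would need extra handling. The tag names `records` and `record` are arbitrary choices for the example.

```python
import json
import xml.etree.ElementTree as ET


def json_lines_to_xml(lines, root_tag="records", row_tag="record"):
    """Convert an iterable of JSON lines (flat objects) into an XML string."""
    root = ET.Element(root_tag)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        document = json.loads(line)
        row = ET.SubElement(root, row_tag)
        for key, value in document.items():
            # Each key becomes a child element; values are stored as text.
            ET.SubElement(row, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    exported = [
        '{"gene": "TP53", "expression": 4.2}',
        '{"gene": "BRCA1", "expression": 1.7}',
    ]
    print(json_lines_to_xml(exported))
```

Whether elements or attributes are the better target depends on the XSD you need to satisfy; this sketch uses elements only to keep the conversion obvious.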

In general, no tool will "understand your data"; you have to provide some definitions.
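Such definitions can be quite small. As a hedged sketch, the Python function below takes a hand-written mapping from file headings to table columns and builds a MySQL `LOAD DATA` statement from it. The file name, table name, and headings are hypothetical, unmapped columns are sent to a throwaway user variable (`@dummy`), and nothing is escaped or quoted, so it is only suitable for trusted names and simple delimiters.

```python
def build_load_data(path, table, header, column_map, delimiter=","):
    """Build a LOAD DATA statement from a heading -> column mapping.

    header is the list of column headings in file order; headings missing
    from column_map are read into @dummy and thereby ignored.
    """
    targets = [column_map.get(name, "@dummy") for name in header]
    columns = ", ".join(targets)
    return (
        f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE {table} "
        f"FIELDS TERMINATED BY '{delimiter}' IGNORE 1 LINES ({columns});"
    )


if __name__ == "__main__":
    # Hypothetical file layout and mapping, written by hand.
    header = ["Gene Symbol", "Notes", "Sample"]
    column_map = {"Gene Symbol": "gene", "Sample": "sample_id"}
    print(build_load_data("genes.csv", "measurement", header, column_map))
```

The mapping is the part that has to come from you; generating the statement from it is mechanical.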