Question: Mapping Data In Disparate Text Files Into Database/Xml Schema
gravatar for Pi
9.5 years ago by
Pi520 wrote:


Many of the analyses bioinformatics yield disparate datasets in text files such as csv or a specific file format where each column heading represents a different class of information.

I was wondering if there exists a tool whereby you could map the column headings of text files like this to a database schema/xsd and the tool could extract the data and input into your database or create xml formatted data. The tool would have to understand all the relationships in your data. For example one text file might contain lots of rows from a parent table in a 1:M relationship and another file might contain related child rows, and the tool would have to link back to parent and child via a shared identifier.

I have seen a commerical tool Altova MapForce but its not cheap. Are there any bioinformatics tools for this sort of things as it must be a daily requirement for many programmers.

Thank you for you time

mapping • 3.9k views
ADD COMMENTlink written 9.5 years ago by Pi520
gravatar for Pierre Lindenbaum
9.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

Mapping the relationships between the tables can be described using XML by the some ORMs tools (like Hibernate), or by the XMI specification.

Adding semantics to the tables would be a very good idea. I think it is what the
information-artifact-ontology tries to achieve.

ADD COMMENTlink written 9.5 years ago by Pierre Lindenbaum131k

thank you kindly for your answer. However I was expecting there to be existing tools to do this sort of things like MapForce as it is such a common issue in this field

ADD REPLYlink written 9.5 years ago by Pi520
gravatar for Joachim
9.5 years ago by
San Francisco, California
Joachim2.9k wrote:

You can import TSV- and CSV-files directly into MySQL:

On the command line, you can use mysqlimport:

You can explicitly address the 1:M relationships as you described them by setting the appropriate columns as primary keys/foreign keys, but you do not have to do that.

You can then use BioMart,, to make the data accessible through various interfaces / web-services. BioMart picks up on the column names of your tables and will automatically link your data if columns are named alike. It also comes with a configuration tool that renders your tables and the relationships between them graphically. You can make adjustments as needed (for example, when BioMart thinks it sees a "N:M" relationship, but you know it is a "1:M" relationship, really).

ADD COMMENTlink modified 13 months ago by RamRS30k • written 9.5 years ago by Joachim2.9k

I see. Have a look at the command line tools 'cut' and/or 'awk'/'gawk'. You should be able to quickly rearrange the columns of your data-files using these tools. Once the data has been rearranged, you can use 'LOAD DATA' or 'mysqlimport' to get it into your tables.

ADD REPLYlink written 9.5 years ago by Joachim2.9k

Hello. I have looked at biomart but it isn't quite what I am after I don't think. I already have a defined database schema which i am happy with. I was hoping i could map fields from lots of different text files and some sort of tool could populate my database with the data based on a mapping from the text file columns to database fields

ADD REPLYlink written 9.5 years ago by Pi520

Hi, some of the relationships aren't quite as simple as that. For example some of the rows may not have explicit ids and i was hoping the wonder tool I am after could create the ids and manage them for me. I was also hoping it would generate the LOAD DATA statements based on a selection of column names from the text files and corresponding database column names. I was hoping for a nice visual interface to do this with too. Perhaps I am just too lazy but I have used things to do this in the past in other fields but they weren't free.

ADD REPLYlink written 9.5 years ago by Pi520

It is very hard to judge what would be the right solution for you without seeing the actual data, but I understand your point better now. This might be completely off-topic, but have you had a look at this list?

ADD REPLYlink written 9.5 years ago by Joachim2.9k

Yes actually. Altova MapForce on that list is one I've looked at but its not free

ADD REPLYlink written 9.5 years ago by Pi520
gravatar for Neilfws
9.5 years ago by
Sydney, Australia
Neilfws49k wrote:

One solution to this type of problem is to get the data into a database first, then worry about the schema later.

A useful tool could be one of the so-called "NoSQL" databases. For example, MongoDB has import/export tools which allow import of CSV or TSV files. This gets you quickly to structured data; then you can put a schema on top using an ORM. For Ruby users, Mongoid and MongoMapper are both excellent options.

MongoDB does not export XML, but it will export JSON. It's relatively easy to either convert JSON to XML or export XML using a library. Again, for Ruby, Crack is a useful gem for XML/JSON parsing, interconversion and export from a hash structure.

In general, no tool will "understand your data"; you have to provide some definitions.

ADD COMMENTlink written 9.5 years ago by Neilfws49k

Thank you for your answer. I was expecting to provide definitions but i was hoping some tool would exist whereby i could supply a database schema or xsd and then map columns in my text files to these schema definitions then the tool could automagically populate a database or xml file with the data from my text files.

ADD REPLYlink written 9.5 years ago by Pi520
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1646 users visited in the last hour