Looking for sample fixed column width data files
1
0
3.6 years ago
andy • 0

I understand that fixed column width data is quite common in bioinformatics. We are currently adding support for fixed column width data files to Easy Data Transform. One of the things we are trying to do is have the software automatically detect where the column boundaries are, so you don't have to set them manually. We need some good sample data for that. Is there somewhere I can download some sample fixed column width data files? Or can someone send me some examples? Anything from a few hundred to a few million rows would be good.
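To make the detection problem concrete, here is a minimal Python sketch of one naive approach: treat any character position that is a space in every row as a separator, and take the runs between separators as columns. This is only an illustration; it is not Easy Data Transform's method, and the function names and sample rows are made up for the example.

```python
def detect_fields(lines):
    """Return (start, end) spans of columns, assuming a column boundary is
    any character position that is a space in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # True where the position is blank in all rows.
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    spans = []
    start = None
    for i, is_blank in enumerate(blank):
        if not is_blank and start is None:
            start = i                   # a column begins
        elif is_blank and start is not None:
            spans.append((start, i))    # a column ends
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def split_row(line, spans):
    """Slice one line into trimmed field values using the detected spans."""
    return [line[a:b].strip() for a, b in spans]


# Hypothetical sample rows, built with format specs so the widths are exact.
records = [("chr1", 1000, 2000, "geneA"),
           ("chr2", 15000, 16500, "geneB"),
           ("chrX", 300, 450, "geneC")]
sample = [f"{c:<4}{s:>8}{e:>7}  {n}" for c, s, e, n in records]

spans = detect_fields(sample)
rows = [split_row(line, spans) for line in sample]
```

A sketch like this breaks as soon as a field contains embedded spaces or a record does not follow the column layout, which is why varied real-world sample files matter.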

data • 756 views
ADD COMMENT
0

Thanks for the link.

I've used awk/sed/grep/perl in the past.

  • There is quite a learning curve.
  • They aren't very visual.
  • It is a lot more time-consuming to write Awk than to snap together a few transforms ("Bailing out at line 1"!).
  • Some things probably aren't practical in Awk or similar tools, e.g. parsing JSON or XML or doing a pivot table (unless Awk has changed a lot since I last used it!).

So Easy Data Transform might be a better choice for some people in some situations.

ADD REPLY
0

Please don't add answers unless you're answering the principal question. Use Add Comment or Add Reply instead.

ADD REPLY
0

If you're going to call yourself a bioinformaticist, you're probably going to be familiar with one or more of those tools. If I have to open a GUI to transform my data, that's going to be a manual process in my pipeline every time. I have to figure out a given awk/sed/grep/cut/paste munge once and I can run it on my 250 samples no problem. I don't see why these transformations have to be visual; pipes and previewing output are both pretty easy.

As for XML/JSON, they are indeed annoying to work with, but there are numerous dedicated, free CLI parsers that handle them just fine.

I suppose you're right about more Excel-like functions, but there is also Knime, which covers most of that functionality if really needed. For most of the fixed formats I linked, there exist other very popular tools to munge them in common ways, e.g. bedtools, bedops, and bcftools.

Regardless, I wish you luck. There may very well be folks who find a GUI for such transformations useful in certain situations.

ADD REPLY
0

Pipes work fine for a linear pipeline. But a visual layout works better for a graph, IMHO. For example, blending multiple input sources and then creating multiple outputs.

ADD REPLY
2
3.6 years ago

See here for many examples. Don't see you getting much play around here though, when awk/sed/grep and friends are nearly infinitely flexible. And free.

ADD COMMENT
0

Based on those examples, I tweaked the heuristic we use to try to automatically work out where the column boundaries are. It seems to work pretty well, even when there are repeating header and footer records. Thanks!
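For readers wondering how repeating header and footer records can be tolerated at all, one possible approach is to let rows vote: a position counts as a separator if it is blank in most rows rather than in all of them. The Python sketch below assumes that approach. It is not the actual Easy Data Transform heuristic, and `detect_separators`, `min_fraction`, and the sample data are hypothetical.

```python
def detect_separators(lines, min_fraction=0.8):
    """Return the character positions that are blank in at least
    min_fraction of the lines, so a few odd records cannot veto a boundary."""
    width = max(len(line) for line in lines)
    counts = [0] * width
    for line in lines:
        for i, ch in enumerate(line.ljust(width)):
            if ch == " ":
                counts[i] += 1
    threshold = min_fraction * len(lines)
    return [i for i, count in enumerate(counts) if count >= threshold]


# Hypothetical file: 18 data rows (name left-aligned in 6 characters,
# value right-aligned in 6) with a page header record repeated twice.
header = "#PAGE-HEADER"
data = [f"{'id' + str(i):<6}{i * 1000:>6}" for i in range(18)]
lines = [header] + data[:9] + [header] + data[9:]

tolerant = detect_separators(lines)                  # majority vote
strict = detect_separators(lines, min_fraction=1.0)  # every row must agree
```

With the strict setting the header records fill every position and no boundary is found; with the tolerant setting the gap between the two columns is still detected. Choosing the threshold is a trade-off: too low and sparse columns are split incorrectly, too high and a handful of odd records hide real boundaries.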

ADD REPLY
