Data Manipulation Of Large Text Datasets
4
5
Entering edit mode
10.7 years ago
ff.cc.cc ★ 1.3k

It is hard to convince a whole lab to share data in standard formats, so I still spend too much time on data-conversion work.

Can anyone suggest a definitive choice of script, language, or tool to adopt for data manipulation (adding/deleting/splitting/merging rows or columns) on large (~GB) text files?

I would also appreciate links to related books or online tutorials. Thanks.

• 4.6k views
6
Entering edit mode
10.7 years ago
• Plain text tab/comma delimited files for large data sets.
• Get your lab members into the habit of saving their excel spreadsheets as csv or tab delimited.
• It's fine for them to copy and paste sequences into Word and highlight regions if they are more comfortable with that, but get them to also save the sequence as a plain-text FASTA file.
• Simple awk, sed, and cut commands can be used to manipulate the tab-delimited files.
• I write Python or bash scripts for frequently used file manipulations and store them on my system path: finding the intersection of two columns, getting the lines of a file where a column matches another file, transposing a matrix, sorting by column...
• If you are a mixed Mac/PC/Linux lab, be aware that line-break characters can differ. OS X TextEdit will sometimes use carriage returns ("\r"), depending on the encoding used.
• Don't bother investing in a laboratory management system if you know the people in the lab won't use it.
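To make the awk/sed/cut point concrete, here is a minimal sketch on a tiny, hypothetical tab-delimited file (the file name, columns, and values are made up for illustration); the same commands stream through multi-GB files without loading them into memory:

```shell
# Create a small tab-delimited example file (hypothetical data).
printf 'id\tgene\tcount\ns1\tBRCA1\t12\ns2\tTP53\t7\ns3\tBRCA1\t31\n' > samples.tsv

# cut: extract the gene column (column 2).
cut -f2 samples.tsv

# awk: keep rows where the count column (column 3) exceeds 10.
awk -F'\t' '$3 > 10' samples.tsv

# sed: a quick substitution on the stream (the file itself is unchanged).
sed 's/BRCA1/BRCA1_variant/' samples.tsv

# tr: strip stray carriage returns left by Windows or TextEdit edits.
tr -d '\r' < samples.tsv > samples.unix.tsv
```

Because each tool reads line by line, they compose well in pipes and their memory use stays constant regardless of file size.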
1
Entering edit mode

So true: don't bother investing in a laboratory management system. It always turns into a mess.

0
Entering edit mode

Thanks (+1). Any links to suggest?

4
Entering edit mode
10.7 years ago

Perl and Python seem to be the natural choices for text manipulation, though of course many other languages can do the job. "Learning Perl" and the "Perl Cookbook" are two really good books for learning Perl.

Well-written scripts are usually fast enough to handle even the largest files.
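The key to keeping such scripts fast on large files is to stream line by line rather than read the whole file into memory. A minimal sketch, assuming `python3` is on the PATH (the file name and contents here are hypothetical):

```shell
# A hypothetical comma-delimited input file.
printf 'foo,1\nbar,2\n' > data.csv

# Python one-liner: convert commas to tabs, reading line by line so
# memory use stays constant even on multi-GB files.
# (A Perl equivalent would be: perl -pe 's/,/\t/g' data.csv > data.tsv)
python3 -c '
import sys
for line in open("data.csv"):
    sys.stdout.write(line.replace(",", "\t"))
' > data.tsv
```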

4
Entering edit mode
10.7 years ago
Joachim ★ 2.9k

The UNIX command-line tools grep, cut, sort and join are very powerful when it comes to rewriting large text files. A good overview of these tools, as well as some other handy commands, can be found here: http://rous.mit.edu/index.php/Unixcommandsappliedtobioinformatics

I did some benchmarking a while ago, and it turns out that sort + join on the command line can be faster than the equivalent join operation in MySQL/PostgreSQL. That is quite something.

The command line tools are also available on both Linux and Mac OS X systems (with slight differences in their parameters -- sometimes).
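A minimal sketch of the sort + join pattern, on two tiny, hypothetical files keyed on a shared id column (join requires both inputs to be sorted on the join field first):

```shell
# Two hypothetical tab-delimited files sharing an id column.
printf 'g1\t12\ng2\t7\ng3\t31\n' > counts.txt
printf 'g1\tBRCA1\ng3\tTP53\n' > names.txt

# join needs both inputs sorted on the join field (column 1 here).
sort -k1,1 counts.txt > counts.sorted
sort -k1,1 names.txt > names.sorted

# Inner join on column 1; TAB holds a literal tab for portability
# (join's default separator is whitespace).
TAB=$(printf '\t')
join -t "$TAB" counts.sorted names.sorted > joined.txt
```

Rows without a partner in the other file (g2 above) are dropped, as in an SQL inner join; GNU join's `-a` option keeps them, giving an outer join.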

0
Entering edit mode

Good! (+1) I have just bookmarked the link.

1
Entering edit mode
10.6 years ago
gui11aume ▴ 10

I very much agree with @Dk. For simple reformatting, a sed/awk combination will do the job perfectly. I used this book to learn the syntax. There are perhaps better references, but this one was good to get me started.

Now, on a side note, you should pay attention to documenting your data reformatting. This is usually where enormous mistakes are made (see Keith Baggerly's work, for instance). If you opt for sed/awk one-liners, you should write a commented line in your files recording the command that generated them.
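One way to follow that advice is to keep the one-liner in a variable and write it into a comment line at the top of the output file. A minimal sketch with hypothetical file names:

```shell
# A hypothetical input file.
printf 'a\t1\nb\t2\n' > input.tsv

# Keep the one-liner in a variable so the exact command can be recorded
# in a comment line at the top of the file it generates.
cmd="awk -F'\t' '\$2 > 1' input.tsv"
{ printf '# generated by: %s\n' "$cmd"; awk -F'\t' '$2 > 1' input.tsv; } > filtered.tsv
```

Months later, `head -1 filtered.tsv` tells you exactly how the file was produced, which is often all the provenance you need for simple reformatting steps.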

For more complicated operations, I moved away from Perl and I use only Python nowadays. Python is much more readable than Perl, and even though I don't like it as much, this makes documenting a lot easier.

Finally, even if it is hard to convince a whole lab to use the same standards, that is the right answer to your question. For now the only cost is time, but down the line you will have space issues (you need to keep the original data plus the data you create) and traceability problems. When you have 5 different versions of a file floating around, there is no hope of properly documenting your work.