Question

Memory-Efficient Way To Transpose Csv/Text File?

0

Entering edit mode

10.8 years ago

Click downvote ▴ 720

I've got a few giant text files I'd like to transpose and was wondering whether there was any simple way of doing this. They are in various formats, but "matrixy" - that is same number of columns, no missing values.

If I do not want to keep the whole thing in memory I can't think of any way to do it without n*k complexity, where n is the number of lines and k is the number of columns. As the files contain floating point values of varying length the lines aren't the same length I don't think I can read line by line by choosing a specific number of bytes/characters.

Would be nice if a standard tool like plink could do it, but it only seems to handle ped files.

Ps. there are so many columns that awk is unlikely to be up to the task - has a max NF of 32k IIRC.

matrix scripting • 11k views

ADD COMMENT • link updated 5.2 years ago by Biostar 20 • written 10.8 years ago by Click downvote ▴ 720

0

Entering edit mode

R has a transpose function, have you tried that? Don't re-invent the wheel if you don't have to.

ADD REPLY • link 6.6 years ago by swbarnes2 14k

score 4 · Answer 1 · 2013-10-28

4

Entering edit mode

10.8 years ago

Chris Miller 22k

If you want to limit yourself based on I/O, rather than memory, you can do the following (really ugly pseudocode that should give you the general idea):

#split each column into it's own file
for each line in matrix
     array = split line
     for i in 1:length(array)
          print file$i array[$i] . "\t";

#open each file sequentially, output its data as a row in the output file:
foreach file (glob(files))
    open file
    line = getline(file)
    print finalfile line . "\n"

#clean up temp files, etc

ADD COMMENT • link 10.8 years ago by Chris Miller 22k

1

Entering edit mode

If your matrix has more than 1021 columns on Linux or 253 columns on Mac OS X, you may run out of file handles to allocate to columns. For more details, see results from: https://www.google.com/search?q=file+descriptor+limit

ADD REPLY • link 10.8 years ago by Alex Reynolds 35k

0

Entering edit mode

One way to deal with that is to change kernel parameters. Another way would be to operate on smaller groups of columns, always working with fewer columns than available file descriptors. You'd need to walk through the input file however many column-multiples of the file descriptor limit n, appending columns 1..n to the output rows, then columns (n+1)..2n, etc.

ADD REPLY • link 10.7 years ago by Alex Reynolds 35k

0

Entering edit mode

Will try this and test against my version. Upvote.

ADD REPLY • link 10.8 years ago by Click downvote ▴ 720

score 3 · Answer 2 · 2013-10-28

Can you use disk as intermediate storage?

If you don't want to keep the entire matrix in (system) memory before writing it out, perhaps you can write placeholder bytes to disk directly, and copy over one cell at a time:

Walk through your input nxm matrix to determine maximum precision.
Write out an intermediate mxn CSV matrix file, containing zero-padded elements to the precision found in step 1.

This creates an empty CSV matrix with equally-sized cells. In C, you could use one-byte char elements for commas and numbers. This is sized in such a way that you can write a function to directly translate or map a position in your input matrix to a byte offset in the empty CSV matrix.
Walk through the input matrix again, writing cell ij from your input matrix over the cell ji in your empty CSV output matrix.

In C, use fseek(), for example, to jump to the calculated byte offset for cell ji. Then fwrite() a zero-padded string from the original matrix's cell ij to this stream's offset (leaving commas in place).
Clean things up: Walk through the intermediate CSV matrix one row at a time, stripping zero-pad characters and writing this stripped matrix to the final, transposed output.

The time complexity is O(nm) from walking through your nxm matrix four times. Since cells in the intermediate output matrix are equally spaced, mapping cells from the input to output matrix is O(1). This should be very (system) memory efficient as you're only storing one cell at a time in memory, reading/writing that cell from disk. You could need a fair amount of disk storage, however, for caching the intermediate matrix, depending on the desired precision.

score 1 · Answer 3 · 2013-10-28

I quiclky wrote a tool to transpose a big text. It uses the class SortingCollection of the picard java API. https://github.com/lindenb/jvarkit/wiki/Biostar84786

First the data are read and inserted in the SortingCollection

                            for(;;)
                                {
                                int c=in.read();
                                if(c=='\n' || c==-1)
                                        {
                                        sorter.add(new Cell(row,col,b));
                                        row++;
                                        col=0;
                                        b.setLength(0);
                                        if(c==-1) break;
                                        if(row%10000==0) LOG.info("row:"+row);
                                        }
                                else if(c==delimiter)
                                        {
                                        sorter.add(new Cell(row,col,b));
                                        b.setLength(0);
                                        col++;
                                        }
                                else
                                        {
                                        b.append((char)c);
                                        }
                                }

then we ouput the data using the new order:

                        long x=0L;
                        for(;;)
                                {

                                if(!iter.hasNext())
                                        {
                                        System.out.println();
                                        break;
                                        }
                                Cell c=iter.next();
                                if(c.col!=curr_col)
                                        {
                                        if(curr_col!=-1L) System.out.println();
                                        x=0L;
                                        curr_col=c.col;
                                        }
                                if(x>0L) System.out.print(DELIM);
                                System.out.print(c.content);
                                x++;
                                }

score 0 · Answer 4 · 2013-10-28

This seems to work, but feels kinda stupid and is slower than slow:

def transpose_matrix_memory_efficiently(input_file, output_delimiter="\t"):

    with open(input_file) as original_file, open(input_file+"_t","w+") as transposed_file:

        number_of_columns = len(original_file.readline().split())

        original_file.seek(0)

        for current_column in range(number_of_columns):

           temporary_line = []

           for line in original_file:

               value = line.split()[current_column]

           temporary_line.append(value)

           transposed_line = output_delimiter.join(temporary_line) + "\n"

           transposed_file.write(transposed_line)

           original_file.seek(0)

score 0 · Answer 5 · 2018-01-01

0

Entering edit mode

6.6 years ago

Naren ▴ 1000

Try simple perl script here : https://github.com/9aren/Transpose_Matrix/blob/master/Transpose_tsv.pl
(No idea if it is memory efficient or not)

ADD COMMENT • link 6.6 years ago by Naren ▴ 1000

0

Entering edit mode

That script explicitly reads the entire file into an array up front, so no, it is not memory-efficient and doesn't answer the question.

ADD REPLY • link 6.6 years ago by Chris Miller 22k