Question: Memory-Efficient Way To Transpose Csv/Text File?
0
gravatar for Click downvote
4.5 years ago by
Germany
Click downvote600 wrote:

I've got a few giant text files I'd like to transpose and was wondering whether there was any simple way of doing this. They are in various formats, but "matrixy" - that is same number of columns, no missing values.

If I do not want to keep the whole thing in memory I can't think of any way to do it without n*k complexity, where n is the number of lines and k is the number of columns. As the files contain floating point values of varying length the lines aren't the same length I don't think I can read line by line by choosing a specific number of bytes/characters.

Would be nice if a standard tool like plink could do it, but it only seems to handle ped files.

Ps. there are so many columns that awk is unlikely to be up to the task - has a max NF of 32k IIRC.

matrix scripting • 4.5k views
ADD COMMENTlink modified 3 months ago by Nari840 • written 4.5 years ago by Click downvote600

R has a transpose function, have you tried that? Don't re-invent the wheel if you don't have to.

ADD REPLYlink written 3 months ago by swbarnes23.4k
4
gravatar for Chris Miller
4.5 years ago by
Chris Miller19k
Washington University in St. Louis, MO
Chris Miller19k wrote:

If you want to limit yourself based on I/O, rather than memory, you can do the following (really ugly pseudocode that should give you the general idea):

#split each column into it's own file
for each line in matrix
     array = split line
     for i in 1:length(array)
          print file$i array[$i] . "\t";

#open each file sequentially, output its data as a row in the output file:
foreach file (glob(files))
    open file
    line = getline(file)
    print finalfile line . "\n"

#clean up temp files, etc
ADD COMMENTlink written 4.5 years ago by Chris Miller19k
1

If your matrix has more than 1021 columns on Linux or 253 columns on Mac OS X, you may run out of file handles to allocate to columns. For more details, see results from: https://www.google.com/search?q=file+descriptor+limit

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by Alex Reynolds23k

One way to deal with that is to change kernel parameters. Another way would be to operate on smaller groups of columns, always working with fewer columns than available file descriptors. You'd need to walk through the input file however many column-multiples of the file descriptor limit n, appending columns 1..n to the output rows, then columns (n+1)..2n, etc.

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by Alex Reynolds23k

Will try this and test against my version. Upvote.

ADD REPLYlink written 4.5 years ago by Click downvote600
3
gravatar for Alex Reynolds
4.5 years ago by
Alex Reynolds23k
Seattle, WA USA
Alex Reynolds23k wrote:

Can you use disk as intermediate storage?

If you don't want to keep the entire matrix in (system) memory before writing it out, perhaps you can write placeholder bytes to disk directly, and copy over one cell at a time:

  1. Walk through your input nxm matrix to determine maximum precision.
  2. Write out an intermediate mxn CSV matrix file, containing zero-padded elements to the precision found in step 1.

    This creates an empty CSV matrix with equally-sized cells. In C, you could use one-byte char elements for commas and numbers. This is sized in such a way that you can write a function to directly translate or map a position in your input matrix to a byte offset in the empty CSV matrix.

  3. Walk through the input matrix again, writing cell ij from your input matrix over the cell ji in your empty CSV output matrix.

    In C, use fseek(), for example, to jump to the calculated byte offset for cell ji. Then fwrite() a zero-padded string from the original matrix's cell ij to this stream's offset (leaving commas in place).

  4. Clean things up: Walk through the intermediate CSV matrix one row at a time, stripping zero-pad characters and writing this stripped matrix to the final, transposed output.

The time complexity is O(nm) from walking through your nxm matrix four times. Since cells in the intermediate output matrix are equally spaced, mapping cells from the input to output matrix is O(1). This should be very (system) memory efficient as you're only storing one cell at a time in memory, reading/writing that cell from disk. You could need a fair amount of disk storage, however, for caching the intermediate matrix, depending on the desired precision.

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Alex Reynolds23k
1
gravatar for Pierre Lindenbaum
4.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum106k wrote:

I quiclky wrote a tool to transpose a big text. It uses the class SortingCollection of the picard java API. https://github.com/lindenb/jvarkit/wiki/Biostar84786

First the data are read and inserted in the SortingCollection

                            for(;;)
                                {
                                int c=in.read();
                                if(c=='\n' || c==-1)
                                        {
                                        sorter.add(new Cell(row,col,b));
                                        row++;
                                        col=0;
                                        b.setLength(0);
                                        if(c==-1) break;
                                        if(row%10000==0) LOG.info("row:"+row);
                                        }
                                else if(c==delimiter)
                                        {
                                        sorter.add(new Cell(row,col,b));
                                        b.setLength(0);
                                        col++;
                                        }
                                else
                                        {
                                        b.append((char)c);
                                        }
                                }

then we ouput the data using the new order:

                        long x=0L;
                        for(;;)
                                {

                                if(!iter.hasNext())
                                        {
                                        System.out.println();
                                        break;
                                        }
                                Cell c=iter.next();
                                if(c.col!=curr_col)
                                        {
                                        if(curr_col!=-1L) System.out.println();
                                        x=0L;
                                        curr_col=c.col;
                                        }
                                if(x>0L) System.out.print(DELIM);
                                System.out.print(c.content);
                                x++;
                                }
ADD COMMENTlink written 4.5 years ago by Pierre Lindenbaum106k
0
gravatar for Click downvote
4.5 years ago by
Germany
Click downvote600 wrote:

This seems to work, but feels kinda stupid and is slower than slow:

def transpose_matrix_memory_efficiently(input_file, output_delimiter="\t"):

    with open(input_file) as original_file, open(input_file+"_t","w+") as transposed_file:

        number_of_columns = len(original_file.readline().split())

        original_file.seek(0)

        for current_column in range(number_of_columns):

           temporary_line = []

           for line in original_file:

               value = line.split()[current_column]

           temporary_line.append(value)

           transposed_line = output_delimiter.join(temporary_line) + "\n"

           transposed_file.write(transposed_line)

           original_file.seek(0)
ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Click downvote600
0
gravatar for Nari
3 months ago by
Nari840
United States
Nari840 wrote:

Try simple perl script here : https://github.com/9aren/Transpose_Matrix/blob/master/Transpose_tsv.pl
(No idea if it is memory efficient or not)

ADD COMMENTlink written 3 months ago by Nari840

That script explicitly reads the entire file into an array up front, so no, it is not memory-efficient and doesn't answer the question.

ADD REPLYlink written 3 months ago by Chris Miller19k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 750 users visited in the last hour