Question: Best Way To Concatenate Columns In Multiple Csv Files Of Differing Row Lengths?

Steve Moss (United Kingdom), 9.1 years ago, wrote:

I am trying to think of the best (most efficient) way to concatenate multiple CSV files into one file using Perl. The files contain frequency distributions for EnsEMBL sequence data. I aim to pass the concatenated CSV file to R for plotting.

The input CSV files have two columns, for size and frequency, e.g.

name1.size,name1.frequency
1,10
2,30
3,20
4,70
5,500

I need to concatenate these into one CSV file, so all the columns from all the files are maintained. However, the columns can differ in row length. I require the following output for example:

name1.size,name1.frequency,name2.size,name2.frequency,name3.size,name3.frequency
1,10,1,20,1,30
2,30,2,10,2,50
3,20,4,70,3,10
4,70,,,5,300
5,500

I'm struggling to think of the best solution to achieve this. I was thinking perhaps a hash of arrays? I've been using Text::CSV_XS, but I can't see a way to write columns, only rows?

use Text::CSV_XS;
use File::Spec;

for my $organism (@organisms) {
    # set up the CSV parser
    my $csv = Text::CSV_XS->new({ binary => 1 });

    # open the input file for this organism
    my $in_file = File::Spec->catfile($path, $organism, $feature . "_freqs.csv");
    open my $fh, "<", $in_file or die "$in_file: $!";

    # traverse the file and push each row onto an array
    my $rows = [];
    while (my $row = $csv->getline($fh)) {
        push @$rows, $row;
    }

    # store this organism's rows in the hash of arrays
    $csv_hash{$organism} = $rows;

    # check for parse errors and close the CSV
    $csv->eof or $csv->error_diag;
    close $fh or die "$in_file: $!";
}

I'm doing the above at the moment, and then thought about iterating over the hash keys and writing out the columns. Any other ideas?

Update:

I'm doing this to order the hash keys by the size (descending) of their arrays:

# sort the hash keys by the size of their arrays, in descending order
foreach my $k (sort { scalar(@{$csv_hash{$b}}) <=> scalar(@{$csv_hash{$a}}) } keys %csv_hash) {
    push @order, $k;
}

I could then iterate through and build each line of the CSV from the csv_hash?

perl array parsing
modified 5.3 years ago by pld • written 9.1 years ago by Steve Moss

Be careful with the word "merge". It implies the need to deal with situations such as redundancy (duplicate rows) or rows that share some kind of key (e.g. a row name). It sounds to me as though you just want to concatenate the files (= cbind in R)?

written 9.1 years ago by Neilfws

R is much better at handling tabular data anyway. Stick with R for this step.

written 9.1 years ago by Jeremy Leipzig

Yup, there's also smartbind() in the gtools package.
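
For reference, smartbind() works row-wise rather than column-wise: it rbinds data frames whose columns differ, filling the missing cells with NA. A minimal sketch (column names are hypothetical):

library(gtools)

d1 <- data.frame(size = 1:3, frequency = c(10, 30, 20))
d2 <- data.frame(size = 1:2, count = c(5, 8))

# rbind despite the differing column sets; missing cells become NA
smartbind(d1, d2)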

modified 5 months ago by RamRS • written 9.1 years ago by Neilfws

I think perhaps the best way would be to iterate through the hash keys and output to CSV in descending order? That way I would ensure that there are no column alignment issues.

written 9.1 years ago by Steve Moss

I'm not sure that your question is really related to bioinformatics... Nevertheless, I don't understand why you want to use Perl instead of a simple 'sort | uniq' pipeline?

written 9.1 years ago by Pierre Lindenbaum

Hi Pierre, I'm working with frequency distributions from bioinformatics data, so although I suppose it isn't pure bioinformatics, it is related!

I'm open to suggestions! I need to merge the CSV files to pass them to R for analysis, so whichever works best?

written 9.1 years ago by Steve Moss

Updated the question with that information!

written 9.1 years ago by Steve Moss

Yes, I had thought about this, but if I have 100 CSV files and need to share the data, or perform statistical analysis on the data as a whole, that makes it more difficult!

written 9.1 years ago by Steve Moss

Actually, I think you might be right... I can pass the files to R, get it to use cbind, write a CSV from the columns, and then pull the filename for that back into Perl!

written 9.1 years ago by Steve Moss

This looks perfect: http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/gdata/html/cbindX.html

written 9.1 years ago by Steve Moss

My apologies, I've edited the content to reflect this! cbind, however, requires the columns to be of equal row length. I therefore found a package called cbindX that takes columns of unequal row length!

written 9.1 years ago by Steve Moss

Giovanni M Dall'Olio (London, UK), 9.1 years ago, wrote:

If you want to merge two files to pass them to R, I suggest you merge them in R directly.

R is a lot better than Perl for handling tabular data: for example, Perl has no built-in equivalent of the data.frame object or of the read.csv function. The same is true of Python and other languages; the fact is that R is a language for analyzing and visualizing data, while the others are general-purpose languages.

In any case, you can easily merge two tabular files in R with the merge function from the base package:

> data1 <- read.csv('file1.txt')
> data2 <- read.csv('file2.txt')
> merged_data <- merge(data1, data2, by.x="column_in_data1", by.y="column_in_data2")

Have a look at ?merge for further information. You can also define other types of JOIN and select only some columns.
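
For example, passing all = TRUE gives a full outer join, which keeps rows present in only one of the two files and fills the gaps with NA. A minimal sketch, assuming the column headers from the question (name1.size, etc.):

> merged_all <- merge(data1, data2, by.x = "name1.size", by.y = "name2.size", all = TRUE)
> merged_all[, c("name1.size", "name1.frequency", "name2.frequency")]  # keep selected columns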

An alternative is to upload the data into a database and use a SQL call.

modified 5 months ago by RamRS • written 9.1 years ago by Giovanni M Dall'Olio

Thanks gawbul! base::merge should handle records with different lengths as well, but if you say the other is faster, I will try it.

written 9.1 years ago by Giovanni M Dall'Olio

Another alternative is to use a library like sqldf and write SQL instructions.
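
A minimal sqldf sketch (column names are hypothetical): the data frames are queried with plain SQL, here a LEFT JOIN on the size column.

library(sqldf)

d1 <- data.frame(size = c(1, 2, 3), frequency = c(10, 30, 20))
d2 <- data.frame(size = c(1, 2, 4), frequency = c(20, 10, 70))

# LEFT JOIN keeps every row of d1; sizes absent from d2 get NA
sqldf("SELECT d1.size, d1.frequency AS freq1, d2.frequency AS freq2
       FROM d1 LEFT JOIN d2 ON d1.size = d2.size")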

written 9.1 years ago by Giovanni M Dall'Olio

I found a great little library called gdata that has a function called cbindX that merges columns with differing row lengths. It seems to work faster than merge! I'm working on it at the moment!

written 9.1 years ago by Steve Moss

Link here: http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/gdata/html/cbindX.html

written 9.1 years ago by Steve Moss

brentp (Salt Lake City, UT), 9.1 years ago, wrote:

If you sort your files, you can join them using the Linux command join.

Given files a.txt:

a 10
b 22
d 33
f 44

and b.txt:

a 10
b 22
c 33
e 44

where the first column is the id to join on, you can do:

$ join -a 1 -a 2 -e " " a.txt b.txt

to get:

a 10 10
b 22 22
c 33
d 33
e 44
f 44

From there, you can easily add all columns after the first in your language of choice.

NOTE: the files have to be sorted on the column you're joining on.

NOTE: the -e flag sets the string used to fill empty (missing) fields; the field separator can be changed with -t.

See the man page for details.

modified 5 months ago by RamRS • written 9.1 years ago by brentp

Hi brentp, thanks for the reply. I would need to maintain the columns in parallel, rather than merge on the alphabetic ID column as in your example. So, the pairs of columns from a.txt and b.txt would be appended horizontally. Your example is useful, however, for something else I am planning to do later in the project.

written 9.1 years ago by Steve Moss

My apologies, as neilfws pointed out, my question was confusing! I need to concatenate the columns, as opposed to merging them.

written 9.1 years ago by Steve Moss

Steve Moss (United Kingdom), 8.8 years ago, wrote:

I found the R package gdata on CRAN.

This provides the cbindX function.

I can do this:

library(gdata)
data1 <- read.csv('file1.csv')
data2 <- read.csv('file2.csv')
data3 <- read.csv('file3.csv')
concat_data <- cbindX(data1, data2, data3)
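
For many input files, the same idea can be applied in a loop; a sketch assuming placeholder paths (cbindX pads shorter columns with NA):

library(gdata)

# read every frequency CSV in a directory (path is a placeholder)
files <- list.files("path/to/freqs", pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv)

# column-bind all of them at once and write the combined table out
concat_data <- do.call(cbindX, tables)
write.csv(concat_data, "all_freqs.csv", row.names = FALSE)
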
modified 5 months ago by RamRS • written 8.8 years ago by Steve Moss

Even though you think this might be the right answer, it is not advisable to use this function in most cases; rather, it is a road to disaster. It is unsafe because it assumes either that the order of rows doesn't matter or that the rows left out occur only at the end. However, in bioinformatics, rows often correspond to identifiable entities (e.g. genes) and are better merged on an identifier column, as suggested by Giovanni. In fact, imho, it would be better if the cbindX function did not exist at all.
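
With many files, that identifier-based merging can be chained over a list using Reduce() and base::merge; a minimal sketch, assuming each file's first column is the identifier and using a placeholder path:

files <- list.files("path/to/freqs", pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv)

# merge pairwise across the whole list on the first (identifier) column;
# all = TRUE keeps rows that appear in only some of the files
merged <- Reduce(function(x, y) merge(x, y, by = 1, all = TRUE), tables)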

written 8.8 years ago by Michael Dondrup

Well, you edited your question to fit your answer; that's nice. I would take the -1 back now, but I can't unless you edit your question too. The -1 was because you accepted your own answer, which was problematic (or wrong, before the edit), as pointed out in my first comment prior to your edit replacing 'merge' with 'concatenate'. So, the -1 was meant as a reward for boldness ;)

written 8.8 years ago by Michael Dondrup

The order makes no difference in this case, so it is not an issue for me! Perhaps this may not be applicable for some others, but it is relevant within the context that I require it. Thanks for pointing that out though :-)

written 8.8 years ago by Steve Moss

Also, perhaps this was partly confusion on my part with my original question, but I'm not merging columns; I'm concatenating them!

written 8.8 years ago by Steve Moss

Also, merge only takes two data frames at a time!

written 8.8 years ago by Steve Moss

Not sure why the -1 here, as it is relevant to my question?

written 8.8 years ago by Steve Moss