Best Way To Concatenate Columns In Multiple Csv Files Of Differing Row Lengths?
3
3
Entering edit mode
11.3 years ago

I am trying to think of the best (most efficient) way to concatenate multiple CSV files into one file using Perl? The files contain frequency distributions for EnsEMBL sequence data. I aim to pass the merged CSV file to R for plotting.

The input CSV files have two columns for size and freqency e.g.

name1.size,name1.frequency
1,10
2,30
3,20
4,70
5,500


I need to concatenate these into one CSV file, so all the columns from all the files are maintained. However, the columns can differ in row length. I require the following output for example:

name1.size,name1.frequency,name2.size,name2.frequency,name3.size,name3.frequency
1,10,1,20,1,30
2,30,2,10,2,50
3,20,4,70,3,10
4,70,,,5,300
5,500


I'm struggling to think of the best solution to achieve this. I was thinking perhaps a hash of arrays? I've been using Text::CSV_XS, but I can't see a way to write columns, only rows?

for my $organism (@organisms) { # setup CSV my$csv = Text::CSV_XS->new ({ binary => 1 });

# open file
my $in_file = File::Spec->catfile($path, $organism,$feature . "_freqs.csv");
open my $fh, "<", "$in_file" or die "$in_file:$!";

# traverse file and push to
my $rows = []; while (my$row = $csv->getline($fh)) {
push(@$rows,$row);
}

$csv_hash{$organism} = $rows; # close the CSV$csv->eof or $csv->error_diag; close$fh or die "$in_file:$!";
}


I'm doing the above at the moment and then thought about iterating over the hash keys and writing out the columns? Any other ideas?

Update:

I'm doing this to order by size (descending) of hash contents:

# get the size of the array in each hash array and sort in descending size order
foreach my $k (sort {scalar(@{$csv_hash{$b}}) <=> scalar(@{$csv_hash{$a}})} keys %csv_hash) { push(@order,$k);
}


I could then iterate through and build each line of the CSV from the csv_hash?

perl array parsing • 20k views
2
Entering edit mode

Be careful with the word "merge". It implies the need to deal with situations such as: redundancy (duplicate rows) or rows that share some kind of key (e.g. row name). It sounds to me as though you just want to concatenate the files (= cbind in R) ?

1
Entering edit mode

R is much better at handling tabular data anyway. stick with R for this step.

1
Entering edit mode
0
Entering edit mode

I think perhaps the best way would be to iterate through the hash keys and output to CSV in descending order? This way I will ensure that there are no column alignment issues?

0
Entering edit mode

I'm not sure that your question is really related to bioinformatics.... Nevertheless, I don't understand why you want to use perl instead of a simple 'sort| uniq' pipeline ?

0
Entering edit mode

Hi Pierre, I'm working with frequency distributions from bioinformatics data, although I suppose it isn't pure bioinformatics, it is related!

I'm open to suggestions! I need to merge the CSV files to pass them to R for analysis, so whichever works best?

0
Entering edit mode

Updated the question with that information!

0
Entering edit mode

Yes, I had thought about this, but then if I have 100 CSV files and I need to share this data, or perform statistical analysis on the data as a whole, then it makes it more difficult!

0
Entering edit mode

Actually, I think you might be right... I can pass the files to R, get it to use cbind and then write a CSV from the columns? Then pull the filename for that back to Perl!

0
Entering edit mode
0
Entering edit mode

My apologise, edited the content to reflect this! cbind however, requires the columns to be of equal row length. I there found a package called cbindX that takes columns of unequal row length!

0
Entering edit mode

My apologies, edited the content to reflect this! cbind however, requires the columns to be of equal row length. I there found a package called cbindX that takes columns of unequal row length!

7
Entering edit mode
11.3 years ago

If you want to merge two files to pass them to R, I suggest you to merge them in R directly.

R is a lot better than perl for handling tabular data: for example, in perl there is not a builtin equivalent of the data.frame object, or of the read.csv functions. The same is valid for python and other languages; the fact is that R is a language for analyzing and visualizing data, while the others are general purpose languages.

In any case, you can easily merge two tabular files in R with the merge function from the base package:

> data1 <- read.csv('file1.txt')
> merged_data <- merge(data1, data2, by.x="column_in_data1", by.y="column_in_data2")


Have a look at ?merge for further information. You can also define other types of JOIN and select only some columns.

An alternative is to upload the data into a database and use a SQL call.

1
Entering edit mode

thanks gawbul! base::merge should handle records with different lenght as well, but if you say that the other is faster, I will try it.

1
Entering edit mode

Another alternative is to use libraries like sqldf and use SQL instructions.

0
Entering edit mode

I found a great little library called gdata that has a function called cbindX that merges columns with differing row lengths. It seems to work faster the merge! I'm working on it at the moment!

0
Entering edit mode
3
Entering edit mode
11.3 years ago
brentp 24k

If you sort your files, you can join them using the linux command join.

Given files a.txt:

a 10
b 22
d 33
f 44


and b.txt:

a 10
b 22
c 33
e 44


where the first column is the id to join on, you can do:

\$ join -a 1 -a 2 -e " " a.txt b.txt


to get:

a 10 10
b 22 22
c 33
d 33
e 44
f 44


From there, you can easily add all columns after the first in your language of choice.

NOTE: the files have to be sorted on the column you're joining on.

NOTE: you can adjust the separator with the -e flag.

see the man page

0
Entering edit mode

Hi brentp, thanks for the reply. I would need to maintain the columns in parallel, rather than merge the alpha column in your example. So, the pairs of columns for a.txt and b.txt would be appended horizontally. Your example, is useful however, for something else I am planning to do later in the project.

0
Entering edit mode

My apologies, as neilfws pointed out, my question was confusing! I need to concatenate the columns, as opposed to merge.

0
Entering edit mode
11.0 years ago

I found the R package gdata on CRAN.

This provides the cbindX function.

I can do this:

library(gdata)
concat_data <- cbindX(data1, data2, data3)

2
Entering edit mode

Even though you think this might be the right answer, it is not advisable to use this function in most cases, instead it's the road into disaster in most cases. It is unsafe because it assumes either the order of rows doesn't matter or the rows left out occur only at the end. However, in bioinformatics, rows often correspond to identifiable entities (aka genes) and are better merged on an identifier column as suggested by Giovanni. In fact, imho, it would be better the cbindX function would not exist at all.

1
Entering edit mode

Well, you edited your question to fit your answer, that's nice, I would take it back now, but now I can't, unless you edit your question, too. The -1 was because your accepted your own (problematic (or wrong, before edit) answer as pointed out in my first comment prior to your edit which replaced 'merge' with 'concatenate'). So, the -1 was meant as a reward for boldness ;)

0
Entering edit mode

The order makes no difference in this case, so is not an issue to me! Perhaps this may not be applicable to some others, but is relevant within the context that I require it. Thanks for pointing that out though :-)

0
Entering edit mode

Also, perhaps this was partly confusion on my part with my original question, but I aren't merging columns, and am instead concatenating them!

0
Entering edit mode

Also, merge only takes two data frames at a time!

0
Entering edit mode

Not sure why -1 here, as it is relevant to my question!?