Merging lots of CSV files into one master file
3.4 years ago
psschlogl ▴ 50

Hi guys, I have a directory with a bunch of subdirectories containing lots of CSVs. Each CSV has two columns (kmer, counts). For the CSVs in each subdirectory I want to keep the first column (which is shared by all files) and merge the second columns (counts). Ex:

cut -d , -f 2  sorted_2.csv | paste -d , sorted_1.csv > combo_2.csv

k1,cnt1, cnt2, cnt3...
k2,cnt1, cnt2, cnt3...
k3,cnt1, cnt2, cnt3...

It works fine with the toy test files. I tried to make a script like this:

input="csv_list.txt"

while IFS= read -r line
do
  paste -d, combo_files.csv <(cut -d, -f2 $line)
done < "$input"

But no luck yet, because it still pastes only one column.

What can I improve in this script?

Thanks

bash csv

I'd recommend using R. You can use list.files() to generate a list of the CSV files, then lapply() the function read.table() over that list to get a list of data frames, and finally use Reduce(merge, list_of_data_frames) to collapse them into a single data frame.


I was just trying to avoid using lots of memory by loading all that data on my PC. I could try Python, but I want to do these steps in the shell. I appreciate your time. Thank you very much. paulo


In that case, you can split the list into 3-4 chunks, but I don't think you'll use a lot of memory. If you're still particular about using bash, try join instead of paste.
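Following the join suggestion, one way to fix the original loop is to accumulate the combined file iteratively. This is a minimal bash sketch (the merge_counts function and file names are illustrative, not from the thread; join assumes each CSV is already sorted on the kmer column, and paste cannot write its own input in place, so each step goes through a temp file):

```shell
#!/usr/bin/env bash
# Sketch: merge the counts columns of many two-column (kmer,counts)
# CSVs into one combined file, matching rows on the kmer key.
# merge_counts and the file names are hypothetical examples.
merge_counts() {
    local list="$1" combo="$2"
    # Seed the combined file with the first CSV (kmer + first counts).
    cp "$(head -n 1 "$list")" "$combo"
    # Fold in each remaining file. join matches rows on column 1,
    # so a row-order mismatch fails loudly instead of silently
    # misaligning counts (unlike paste).
    while IFS= read -r f; do
        join -t, "$combo" "$f" > "$combo.tmp"
        mv "$combo.tmp" "$combo"
    done < <(tail -n +2 "$list")
}
```

Each iteration reads only two files, so memory stays flat no matter how many CSVs are listed.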


I will try it. Thank you

3.4 years ago
steve ★ 3.5k

You can do it easily in Python:

#!/usr/bin/env python3
"""
$ cat file1.csv
k1,c1
k2,c1
k3,c1

$ cat file2.csv
k1,c2
k2,c2
k3,c2

$ cat file3.csv
k1,c3
k2,c3
k3,c3

$ ./script.py file1.csv file2.csv file3.csv
k1,c1,c2,c3
k2,c1,c2,c3
k3,c1,c2,c3
"""
import sys
import csv

files = sys.argv[1:]
file_handles = [open(f) for f in files]
readers = [csv.reader(f) for f in file_handles]

# Read one row from every file in lockstep; stop at the end of the
# shortest file.
while True:
    try:
        rows = [next(r) for r in readers]
    except StopIteration:
        break
    # Keep the shared kmer key from the first file, then join all
    # the counts columns.
    counts = ','.join(row[1] for row in rows)
    print(','.join([rows[0][0], counts]))

for f in file_handles:
    f.close()

Thanks for your attention and time, Steve. 8)
