How can I do an inner_join with multiple .csv files on the linux command line?
3
1
Entering edit mode
2.1 years ago
pavelasquezv ▴ 50

Hi all!

I have hundreds of .csv files with the same dimensions and the same column 1 (gene_name). I would like to do an inner_join in the terminal (linux command line) in order to have a single matrix.

This is the .csv files head SRR*_matrix_tpm.csv

==> SRR5488793_matrix_tpm.csv <==
gene_name       SRR5488793
NM_001042449.1  3752.38
NM_001043315.1  0
NM_001043316.2  1.31546
NM_001043358.1  0
NM_001043359.1  9.34679
NM_001043360.1  0.459982
NM_001043361.1  4.52898
NM_001043362.1  396.222
NM_001043363.1  101.929

==> SRR5488794_matrix_tpm.csv <==
gene_name       SRR5488794
NM_001042449.1  5352.09
NM_001043315.1  0.0205977
NM_001043316.2  3.23229
NM_001043358.1  0
NM_001043359.1  15.0625
NM_001043360.1  0.833219
NM_001043361.1  11.0039
NM_001043362.1  147.638
NM_001043363.1  22.1844

==> SRR5570692_matrix_tpm.csv <==
gene_name       SRR5570692
NM_001042449.1  5155.04
NM_001043315.1  1.74624
NM_001043316.2  18.6743
NM_001043358.1  0
NM_001043359.1  81.4916
NM_001043360.1  0
NM_001043361.1  16.1719
NM_001043362.1  34391.8
NM_001043363.1  3315.22

I would like to have something like this:

==> matrix.csv <==
gene_name   SRR5488793  SRR5488794  SRR5570692
NM_001042449.1  3752.38 5352.09 5155.04
NM_001043315.1  0,0 0.0205977   174624,0
NM_001043316.2  131546,0    323229,0    186743,0
NM_001043358.1  0,0 0,0 0,0
NM_001043359.1  934679,0    150625,0    814916,0
NM_001043360.1  0.459982    0.833219    0,0
NM_001043361.1  452898,0    110039,0    161719,0
NM_001043362.1  396222,0    147638,0    34391.8
NM_001043363.1  101929,0    221844,0    3315.22

Many thanks!

linux inner_join • 1.7k views
ADD COMMENT
3
Entering edit mode
2.1 years ago
mark.ziemann ★ 1.9k

Unix join command can only be used to join 2 tables together at a time. You will need a few piped commands to get the joined result.

$ join -1 1 -2 1 SRR5488793_matrix_tpm.csv SRR5488794_matrix_tpm.csv | tr ' ' '\t' | join -1 1 -2 1 - SRR5570692_matrix_tpm.csv | tr ' ' '\t' > joined.tsv`
$ cat joined.tsv 
gene_name   SRR5488793  SRR5488794  SRR5570692
NM_001042449.1  3752.38 5352.09 5155.04
NM_001043315.1  0   0.0205977   1.74624
NM_001043316.2  1.31546 3.23229 18.6743
NM_001043358.1  0   0   0
NM_001043359.1  9.34679 15.0625 81.4916
NM_001043360.1  0.459982    0.833219    0
NM_001043361.1  4.52898 11.0039 16.1719
NM_001043362.1  396.222 147.638 34391.8
NM_001043363.1  101.929 22.1844 3315.22

EDIT: the above only works if the files are sorted for the 'key', in this case column 1

If you have hundreds of files with the same length and order, you can use the cut command to create

cut -f1 SRR5488793_matrix_tpm.csv > rownames.txt
for CSV in *csv ; do
  cut -f2 $CSV > $CSV.cut
done
paste rownames.txt *cut > joined.tsv

Alternatively in R there is a join_all() function in the plyr package.

ADD COMMENT
0
Entering edit mode

Hi mark.ziemann, many thanks for your reply. However, I have hundreds of SRR*.csv files. Is it the same code? Many thanks!

ADD REPLY
0
Entering edit mode

I've edited my original post with another solution

ADD REPLY
0
Entering edit mode

Thanks a lot Mark Zeimann! All the best my friend!

ADD REPLY
2
Entering edit mode
2.1 years ago
Malcolm.Cook ★ 1.5k

I take

the same dimensions and the same column 1 (gene_name)

to mean that the first column is identical in all your files, as it is in your example data.

If indeed this is guaranteed, you really don't need to join on the first column. You know they are all the same! You can simply paste them together and cut out every other column.

I observe that there are no commas in your .csv files. Assuming they are tabs, the following works for 3 files:

 paste *.csv | cut -f 1,2,4,6
gene_name   SRR5488793  SRR5488794  SRR5570692
NM_001042449.1  3752.38 5352.09 5155.04
NM_001043315.1  0   0.0205977   1.74624
NM_001043316.2  1.31546 3.23229 18.6743
NM_001043358.1  0   0   0
NM_001043359.1  9.34679 15.0625 81.4916
NM_001043360.1  0.459982    0.833219    0
NM_001043361.1  4.52898 11.0039 16.1719
NM_001043362.1  396.222 147.638 34391.8
NM_001043363.1  101.929 22.1844 3315.22
`

or more generally (assuming your shell is bash on Gnu linux), use find, wc and seq to generate the sequence of indices you want to cut:

 paste *.csv | cut -f 1,$(seq -s , 2 2 $(( 2 * $(find *.csv | wc -l) )) ) 
gene_name   SRR5488793  SRR5488794  SRR5570692
NM_001042449.1  3752.38 5352.09 5155.04
NM_001043315.1  0   0.0205977   1.74624
NM_001043316.2  1.31546 3.23229 18.6743
NM_001043358.1  0   0   0
NM_001043359.1  9.34679 15.0625 81.4916
NM_001043360.1  0.459982    0.833219    0
NM_001043361.1  4.52898 11.0039 16.1719
NM_001043362.1  396.222 147.638 34391.8
NM_001043363.1  101.929 22.1844 3315.22
ADD COMMENT
0
Entering edit mode

Hi Malcolm Cook, many thanks for your reply. I'm going to try it right now! All the best!

ADD REPLY
1
Entering edit mode
2.1 years ago
Jesse ▴ 830

There's a CSV-finagling tool called XSV that supports inner joins between files (among many other things) though I haven't tried this program for myself yet: https://github.com/BurntSushi/xsv

ADD COMMENT
0
Entering edit mode

Many thanks for your reply Jesse! I will try!

ADD REPLY

Login before adding your answer.

Traffic: 1648 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6