Question

Find Matched records in Multiple Files based on first column

0

Entering edit mode

7.5 years ago

khanirfanbioinformatics • 0

I have three files with multiple columns but the I want to find the common first column entries. The desired output is at the end of the post. The data entries in files are displayed below.

First file
"OG0000000"
"OG0000001"
"OG0000003"
"OG0000004"
"OG0000005"

second file
"OG0000000"
"OG0000003"
"OG0000004"
"OG0000005"
"OG0000006"


Third file
"OG0000000"
"OG0000004"
"OG0000005"
"OG0000006"
"OG0000007"

output file
"OG0000000"
"OG0000004"
"OG0000005"

awk • 1.4k views

ADD COMMENT • link updated 7.1 years ago by shenwei356 8.5k • written 7.5 years ago by khanirfanbioinformatics • 0

0

Entering edit mode

I want to find the common first column entries

That means there are more columns, not shown in your example?

ADD REPLY • link 7.5 years ago by WouterDeCoster 47k

0

Entering edit mode

Yes there are multiple columns but I want to take out the list of the first column in a separate file so that I can grep them from individual files later.

ADD REPLY • link 7.5 years ago by khanirfanbioinformatics • 0

0

Entering edit mode

So it's not a problem to lose the other columns now? In that case, something like this: (not tested)

cat <(cut -f1 file1.txt) <(cut -f1 file2.txt) | sort | uniq -d > file12.common.txt
cat file12.common.txt <(cut -f1 file3.txt) | sort | uniq -d > file123.common.txt

But this assumes your file is tab delimited.

ADD REPLY • link 7.5 years ago by WouterDeCoster 47k

score 2 · Answer 1 · 2016-11-21

2

Entering edit mode

7.5 years ago

Santosh Anand 5.7k

I'll give you a quick but 'dirty' solution:

cat file1 file2 fil3 | cut -f1 | sort |  uniq -c | grep "3 " |  cut -d "\"" -f2

ADD COMMENT • link 7.4 years ago by Santosh Anand 5.7k

0

Entering edit mode

If you sort uniq first column from each file separetely first, then this is great idea, otherwise it assumes each file has only single entry for every element in the first columns. So if the first file was:

"OG0000000"
"OG0000001"
"OG0000003"
"OG0000003"
"OG0000005"

Instead. Your command line would output the result below, right?

"OG0000000"
"OG0000003"
"OG0000005"

ADD REPLY • link 7.1 years ago by Petr Ponomarenko ★ 2.8k

1

Entering edit mode

Sure. That's why it's a quick n "dirty" solution. No bioinformatician will hard code grep "3 " in his/her program :)

ADD REPLY • link 7.1 years ago by Santosh Anand 5.7k

score 0 · Answer 2 · 2017-04-09

0

Entering edit mode

7.1 years ago

shenwei356 8.5k

One command using csvtk for tab-delimited files. Usage of csvtk inter.

$ csvtk inter --no-header-row --tabs file*
OG0000000
OG0000004
OG0000005

ADD COMMENT • link 7.1 years ago by shenwei356 8.5k