Question: Intersection of several files
27 days ago by
mxlsherry199210 wrote:

Dear all,

I have 6 files, each with a single column of gene IDs, and I want to extract the intersection of all 6 files. Which command should I use? I know the command below works for 2 files, but not for 6:

cat file1 file2 | sort | uniq -d > intersection.out

Thanks and have a great day!

sequencing rna-seq script • 120 views
ADD COMMENTlink modified 27 days ago by manuel.belmadani840 • written 27 days ago by mxlsherry199210

what about:

cat file1 file2 file3 file4 file5 file6 | sort | uniq -c | perl -lae 'print "$F[1]" if ($F[0] == 6)' > intersection.out

ADD REPLYlink written 27 days ago by JC7.9k
27 days ago by
manuel.belmadani840 wrote:

You pretty much had it, cat file* | sort | uniq -d > intersection.out will work for all files starting with file in the current directory.

If you need to find files spread across different subdirectories, replace the cat command with find /path/to/files -name "file*" -exec cat {} \;.

ADD COMMENTlink written 27 days ago by manuel.belmadani840

If by intersection the OP means that a line has to be in every file, then the above does not work.

However (really the same as JC's answer):

cat file* | sort | uniq -c | awk '{if($1==6){print $2}}'
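
To make the difference concrete, here is a small sketch on made-up data (the temporary directory, the three toy files and the gene IDs are all assumptions for illustration, not the OP's files). It assumes no ID is repeated within a single file. uniq -d reports anything seen at least twice, while the count filter only keeps IDs seen in every file:

```shell
# Hypothetical toy data: three files, one gene ID per line,
# no duplicates within any single file.
tmpdir=$(mktemp -d)
printf 'geneA\ngeneB\ngeneC\n' > "$tmpdir/file1"
printf 'geneA\ngeneB\n'        > "$tmpdir/file2"
printf 'geneA\ngeneD\n'        > "$tmpdir/file3"

# uniq -d keeps every ID that occurs more than once in the combined list,
# so geneB (present in only two of the three files) is reported too.
in_two_or_more=$(cat "$tmpdir"/file* | sort | uniq -d)

# Requiring the count to equal the number of files (3 here) keeps only
# IDs present in all of them.
in_all=$(cat "$tmpdir"/file* | sort | uniq -c | awk '$1 == 3 {print $2}')

echo "uniq -d gives:"
echo "$in_two_or_more"
echo "count == 3 gives:"
echo "$in_all"

rm -rf "$tmpdir"
```

With only two input files the two approaches agree, which is why the original two-file command worked.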
ADD REPLYlink written 27 days ago by 5heikki8.4k

It worked! Thank you

ADD REPLYlink written 26 days ago by mxlsherry199210

One caveat: if you have duplicates in your input files (i.e. if one file can list the same gene more than once), then you need to do something like sort | uniq on each file before combining them; otherwise you'll get false positives/negatives by looking for exactly 6 matches.

find /path/to/files -name "file*" -type f -exec bash -c 'sort "$1" | uniq' _ {} \; | sort | uniq -c | awk '{if($1==6){print $2}}'

Also, the awk part explicitly looks for 6 matches, so this won't work for an arbitrary number of files, only when there are exactly 6. Not a big deal if this is just a one-off, though.
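
As a rough illustration of the duplicate problem, the sketch below uses three made-up files (the names, the temporary directory and the gene IDs are assumptions for the example). geneX is listed twice in one file and is missing from another, yet the naive count still reaches 3:

```shell
# Hypothetical toy data: geneX appears twice in file1, once in file2,
# and not at all in file3.
tmpdir=$(mktemp -d)
printf 'geneA\ngeneX\ngeneX\n' > "$tmpdir/file1"
printf 'geneA\ngeneX\n'        > "$tmpdir/file2"
printf 'geneA\n'               > "$tmpdir/file3"

# Naive count: geneX reaches 3 occurrences without being in every file
# (a false positive).
naive=$(cat "$tmpdir"/file* | sort | uniq -c | awk '$1 == 3 {print $2}')

# Deduplicate each file first (sort -u), then count across files.
deduped=$(for f in "$tmpdir"/file*; do sort -u "$f"; done \
    | sort | uniq -c | awk '$1 == 3 {print $2}')

echo "naive:"
echo "$naive"
echo "deduplicated first:"
echo "$deduped"

rm -rf "$tmpdir"
```

The opposite failure is also possible: an ID that is in every file but repeated in one of them ends up with a count above the number of files and is dropped by the == test.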

ADD REPLYlink modified 24 days ago • written 25 days ago by manuel.belmadani840

Thanks for your kind suggestions! I checked the 6 files just now and they don't contain duplicates, so does that mean the result of "cat file* | sort | uniq -c | awk '{if($1==6){print $2}}'" is reliable? :) I also have a small question: if we want to use 4 files, for example, can we simply change awk '{if($1==6){print $2}}' to awk '{if($1==4){print $2}}'?

Thanks and have a great day!!

ADD REPLYlink written 24 days ago by mxlsherry199210

Yes, that should be good then, as far as I can see.

And yes, changing 6 to 4 would work if you have 4 files. If you think you'll have to do this again in the future, you might want to consider writing a short bash script. Something like this works for an arbitrary number of matching files:

set -eu

## Generate test data with:
# yes | head -n100  | xargs -I@  bash -c 'echo $RANDOM'  | grep -o . | head -n100 | split -l25 - file 

PREFIX=$1 # For example, files starting with "file"

NFILES=$(find . -name "${PREFIX}*" -type f | wc -l) # Get the number of files

find . -name "${PREFIX}*" -type f -exec bash -c 'sort "$1" | uniq' _ {} \; \
    | sort \
    | uniq -c \
    | awk -v nfiles="$NFILES" '{if($1==nfiles){print $2}}'

That way you can add your own sanity checks to make sure things like duplicates, the number of files, etc. are all as expected, and it limits the risk of mistyping a command.
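
One possible shape for such sanity checks is sketched below. It is only an illustration: the function name intersect_ids is made up, it takes the files as explicit arguments instead of a prefix, and the checks it performs (at least two files, each readable, a warning on duplicate IDs) are assumptions about what "as expected" might mean:

```shell
# Hypothetical helper: intersect_ids FILE1 FILE2 [...]
# Prints the IDs (one per line) that occur in every file given.
intersect_ids() {
    # Sanity check: an intersection needs at least two files.
    if [ "$#" -lt 2 ]; then
        echo "intersect_ids: need at least two files" >&2
        return 1
    fi
    for f in "$@"; do
        # Sanity check: every input must be readable.
        if [ ! -r "$f" ]; then
            echo "intersect_ids: cannot read $f" >&2
            return 1
        fi
        # Warn (but carry on) if a file lists the same ID more than once.
        if [ -n "$(sort "$f" | uniq -d)" ]; then
            echo "intersect_ids: warning: duplicate IDs in $f" >&2
        fi
    done
    # Deduplicate each file, then keep IDs whose count equals the
    # number of files passed in.
    for f in "$@"; do sort -u "$f"; done \
        | sort \
        | uniq -c \
        | awk -v n="$#" '$1 == n {print $2}'
}

# Usage on made-up data: geneB is the only ID present in all three files.
tmpdir=$(mktemp -d)
printf 'geneA\ngeneB\ngeneB\n' > "$tmpdir/s1"
printf 'geneB\ngeneC\n'        > "$tmpdir/s2"
printf 'geneB\ngeneA\n'        > "$tmpdir/s3"
shared=$(intersect_ids "$tmpdir/s1" "$tmpdir/s2" "$tmpdir/s3" 2>/dev/null)
echo "$shared"
rm -rf "$tmpdir"
```

Passing the files explicitly avoids accidentally picking up unrelated files that happen to share the prefix, at the cost of typing them out (or using a glob such as file*).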

ADD REPLYlink modified 24 days ago • written 24 days ago by manuel.belmadani840