Question: Unique lines among 200 files
User000380 wrote (6 months ago):

I have 200 txt files in which each line contains the name of a read. The command line below finds the intersection:

cat *.mapped.txt | sort | uniq -d > intersection.out

How do I find the reads that are unique among these 200 files?

My files are called:

accepted.name.mapped.txt
...

The reads are like this:

HISEQ1:105:C0A57ACXX:2:1105:12172:84568
HISEQ1:105:C0A57ACXX:2:1108:17762:41110
HISEQ1:105:C0A57ACXX:2:1204:3007:9349
HISEQ1:105:C0A57ACXX:2:1204:11087:160507
HISEQ1:105:C0A57ACXX:2:1301:18982:79651
HISEQ1:105:C0A57ACXX:2:1307:3766:23853
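For context, a minimal sketch of what the pipeline above does, using hypothetical tiny files (the names and contents are made up, following the `*.mapped.txt` pattern). Note that `uniq -d` prints each line that occurs more than once in the sorted stream, i.e. reads present in at least two files, not necessarily in all 200:

```shell
# Tiny demo of the intersection pipeline (hypothetical file names/contents)
mkdir -p demo_d && cd demo_d
printf 'readA\nreadB\n' > a.mapped.txt
printf 'readB\nreadC\n' > b.mapped.txt
printf 'readC\nreadD\n' > c.mapped.txt

# sort groups identical read names; uniq -d keeps every line that occurs
# more than once, i.e. reads that appear in at least two files
cat *.mapped.txt | sort | uniq -d > intersection.out
```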
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (6 months ago):
cat *.mapped.txt | sort | uniq -u

??
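A minimal sketch of this answer, again with hypothetical tiny files: `uniq -u` is the mirror image of `uniq -d` and keeps only lines that occur exactly once in the sorted stream, i.e. reads found in a single file (assuming no read is duplicated within one file):

```shell
# Tiny demo of the uniq -u approach (hypothetical file names/contents)
mkdir -p demo_u && cd demo_u
printf 'readA\nreadB\n' > a.mapped.txt
printf 'readB\nreadC\n' > b.mapped.txt
printf 'readC\nreadD\n' > c.mapped.txt

# uniq -u keeps only lines occurring exactly once in the sorted stream,
# i.e. reads present in just one of the files
cat *.mapped.txt | sort | uniq -u > uniques.txt
```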


User000380 replied:

Yes, I was thinking of this, but I wasn't sure. Will it be very slow on my 200 files of 800 MB each?
Pierre Lindenbaum replied:

This should be faster:

cat *.mapped.txt | LC_ALL=C sort -T .  --buffer-size=5G | LC_ALL=C uniq -u
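A scaled-down sketch of this variant (the original uses a 5G buffer; the demo below uses 8M so it runs anywhere, and the file names/contents are hypothetical). `LC_ALL=C` switches to plain byte-wise comparison, which is much faster than locale-aware sorting; `-T .` keeps sort's temporary spill files in the current directory; `--buffer-size` sets how much RAM sort may use before spilling to disk:

```shell
# Same pipeline with the speed-ups from the reply, scaled down for a demo
mkdir -p demo_fast && cd demo_fast
printf 'readA\nreadB\n' > a.mapped.txt
printf 'readB\nreadC\n' > b.mapped.txt

# LC_ALL=C      -> byte-wise comparison, faster than locale-aware sorting
# -T .          -> put sort's temporary files in the current directory
# --buffer-size -> RAM sort may use before spilling to disk (5G in the reply)
cat *.mapped.txt | LC_ALL=C sort -T . --buffer-size=8M | LC_ALL=C uniq -u > uniques.txt
```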

User000380 replied:

Maybe this one works as well:

sort -m *.mapped.txt | uniq -u > uniques.txt
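One caveat worth flagging here: `sort -m` only merges and assumes every input file is already sorted, so each file would need to be sorted first for this to be correct. A sketch under that assumption, with hypothetical tiny files:

```shell
# sort -m merges ALREADY-sorted inputs, so sort each file in place first
mkdir -p demo_m && cd demo_m
printf 'readB\nreadA\n' > x.mapped.txt
printf 'readC\nreadB\n' > y.mapped.txt

# sort each file in place (-o lets sort write back to its own input)
for f in *.mapped.txt; do sort -o "$f" "$f"; done

# now the merge is a valid globally sorted stream for uniq -u
sort -m *.mapped.txt | uniq -u > uniques.txt
```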